From chaos to clarity: schema-constrained AI for auditable biomedical evidence extraction from full-text PDFs
| dc.contributor.author | Mortezaagha, Pouria | |
| dc.contributor.author | Shaw, Joseph | |
| dc.contributor.author | Sun, Bowen | |
| dc.contributor.author | Rahgozar, Arya | |
| dc.date.accessioned | 2026-05-26T03:50:49Z | |
| dc.date.available | 2026-05-26T03:50:49Z | |
| dc.date.issued | 2026-04-14 | |
| dc.date.updated | 2026-05-26T03:50:49Z | |
| dc.description.abstract | Background: Biomedical evidence synthesis depends on accurate extraction of methodological, laboratory, and outcome variables from full-text research articles. These variables are predominantly embedded in complex scientific PDFs that interleave multi-column text, tables, figures, and captions, making manual abstraction time-intensive, error-prone, and increasingly impractical at the scale of contemporary systematic reviews. Despite advances in layout-aware and multimodal document models, end-to-end extraction systems suitable for evidence synthesis remain constrained by limited throughput, OCR error propagation, and insufficient auditability. Methods: We propose a schema-constrained AI extraction system that transforms full-text biomedical PDFs into structured, analysis-ready records by explicitly restricting model inference through typed schemas, controlled vocabularies, and evidence-gated decisions. Documents are ingested using resume-aware hashing, partitioned into page-level and caption-aware chunks, and processed asynchronously under explicit concurrency and rate-limiting controls. A high-accuracy OCR model is guided by multiple domain-specific schemas covering bibliographic metadata, study design, populations, laboratory assays, timing and thresholds, clinical outcomes, and diagnostic performance. Chunk-level outputs are deterministically merged into study-level records using controlled vocabularies, conflict-aware handling of scalar fields, set-based aggregation of list-valued fields, and sentence-level evidence capture to enable traceability and post-hoc audit. Results: Applied to a corpus of 734 biomedical articles on direct oral anticoagulant (DOAC) level measurement, the pipeline processed all documents without manual intervention while maintaining stable throughput. Schema-constrained extraction exhibited strong internal consistency, with sentence-level provenance populated for nearly all supported decisions. Iterative schema and prompt refinement yielded substantial improvements in extraction fidelity, particularly for outcome definitions, assay classification, and global coagulation testing. Outputs included reproducible CSV/Parquet datasets and caption-aware multimodal markdown reconstructions supporting efficient expert review. Conclusions: Schema-constrained AI extraction enables scalable and auditable extraction of structured evidence from heterogeneous scientific PDFs. By combining deterministic chunking, asynchronous orchestration, controlled vocabularies, sentence-level provenance, and aggregated analytical outputs, the proposed pipeline aligns modern document understanding capabilities with the transparency, reproducibility, and reliability demands of biomedical evidence synthesis. | |
| dc.identifier.citation | BMC Medical Research Methodology. 2026 Apr 14;26(1):119 | |
| dc.identifier.uri | https://doi.org/10.1186/s12874-026-02847-8 | |
| dc.identifier.uri | http://hdl.handle.net/10393/51701 | |
| dc.language.rfc3066 | en | |
| dc.rights.holder | The Author(s) | |
| dc.title | From chaos to clarity: schema-constrained AI for auditable biomedical evidence extraction from full-text PDFs | |
| dc.type | Journal Article | |
