From chaos to clarity: schema-constrained AI for auditable biomedical evidence extraction from full-text PDFs

Mortezaagha, Pouria; Shaw, Joseph; Sun, Bowen; Rahgozar, Arya

dc.contributor.author	Mortezaagha, Pouria
dc.contributor.author	Shaw, Joseph
dc.contributor.author	Sun, Bowen
dc.contributor.author	Rahgozar, Arya
dc.date.accessioned	2026-05-26T03:50:49Z
dc.date.available	2026-05-26T03:50:49Z
dc.date.issued	2026-04-14
dc.date.updated	2026-05-26T03:50:49Z
dc.description.abstract	Background: Biomedical evidence synthesis depends on accurate extraction of methodological, laboratory, and outcome variables from full-text research articles. These variables are predominantly embedded in complex scientific PDFs that interleave multi-column text, tables, figures, and captions, making manual abstraction time-intensive, error-prone, and increasingly impractical at the scale of contemporary systematic reviews. Despite advances in layout-aware and multimodal document models, end-to-end extraction systems suitable for evidence synthesis remain constrained by limited throughput, OCR error propagation, and insufficient auditability. Methods: We propose a schema-constrained AI extraction system that transforms full-text biomedical PDFs into structured, analysis-ready records by explicitly restricting model inference through typed schemas, controlled vocabularies, and evidence-gated decisions. Documents are ingested using resume-aware hashing, partitioned into page-level and caption-aware chunks, and processed asynchronously under explicit concurrency and rate-limiting controls. A high-accuracy OCR model is guided by multiple domain-specific schemas covering bibliographic metadata, study design, populations, laboratory assays, timing and thresholds, clinical outcomes, and diagnostic performance. Chunk-level outputs are deterministically merged into study-level records using controlled vocabularies, conflict-aware handling of scalar fields, set-based aggregation of list-valued fields, and sentence-level evidence capture to enable traceability and post-hoc audit. Results: Applied to a corpus of 734 biomedical articles on direct oral anticoagulant (DOAC) level measurement, the pipeline processed all documents without manual intervention while maintaining stable throughput. Schema-constrained extraction exhibited strong internal consistency, with sentence-level provenance populated for nearly all supported decisions. Iterative schema and prompt refinement yielded substantial improvements in extraction fidelity, particularly for outcome definitions, assay classification, and global coagulation testing. Outputs included reproducible CSV/Parquet datasets and caption-aware multimodal markdown reconstructions supporting efficient expert review. Conclusions: Schema-constrained AI extraction enables scalable and auditable extraction of structured evidence from heterogeneous scientific PDFs. By combining deterministic chunking, asynchronous orchestration, controlled vocabularies, sentence-level provenance, and aggregated analytical outputs, the proposed pipeline aligns modern document understanding capabilities with the transparency, reproducibility, and reliability demands of biomedical evidence synthesis.
dc.identifier.citation	BMC Medical Research Methodology. 2026 Apr 14;26(1):119
dc.identifier.uri	https://doi.org/10.1186/s12874-026-02847-8
dc.identifier.uri	http://hdl.handle.net/10393/51701
dc.language.rfc3066	en
dc.rights.holder	The Author(s)
dc.title	From chaos to clarity: schema-constrained AI for auditable biomedical evidence extraction from full-text PDFs
dc.type	Journal Article
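
Illustrative note: the abstract describes merging chunk-level outputs into study-level records through conflict-aware handling of scalar fields, set-based aggregation of list-valued fields, and sentence-level evidence capture. The Python sketch below shows one way such a merge could look. It is not the authors' implementation: the field names (`study_design`, `assays`), the chunk layout (`value` and `evidence` keys), the `StudyRecord` type, and the rule that values lacking a supporting sentence are dropped are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical schema: which fields hold one value and which hold lists.
SCALAR_FIELDS = {"study_design"}
LIST_FIELDS = {"assays"}


@dataclass
class StudyRecord:
    scalars: dict[str, Any] = field(default_factory=dict)
    lists: dict[str, set[str]] = field(default_factory=dict)
    conflicts: dict[str, set[Any]] = field(default_factory=dict)
    evidence: dict[str, list[str]] = field(default_factory=dict)


def merge_chunks(chunks: list[dict]) -> StudyRecord:
    """Merge chunk-level extractions (assumed to be in page order)."""
    record = StudyRecord()
    for chunk in chunks:
        for name, item in chunk.items():
            value, sentence = item.get("value"), item.get("evidence")
            # Assumed evidence gate: skip empty or unsupported values.
            if value in (None, []) or not sentence:
                continue
            if name in LIST_FIELDS:
                # Set-based aggregation: union across chunks, no duplicates.
                record.lists.setdefault(name, set()).update(value)
            elif name in SCALAR_FIELDS:
                current = record.scalars.get(name)
                if current is None:
                    record.scalars[name] = value  # first supported value wins
                elif current != value:
                    # Conflict-aware: keep the first value, log every candidate.
                    record.conflicts.setdefault(name, {current}).add(value)
            else:
                continue  # field is outside the schema; ignore it
            record.evidence.setdefault(name, []).append(sentence)
    return record
```

Because the merge depends only on the chunk order and the schema, rerunning it on the same chunk outputs gives the same record, and the `conflicts` and `evidence` maps give a reviewer the sentences behind each value.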

Files

Original bundle

Name: 12874_2026_Article_2847.pdf
Size: 3.55 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.51 KB
Format: Item-specific license agreed upon to submission

Collections

Publications par les auteurs d'uOttawa publiés par BioMed Central // uOttawa authored publications from BioMed Central