feat(preprints): add preprint reference support with EPMC full-text route #54
…oute

Make preprints first-class references with the same anti-hallucination snippet-validation guarantees as published papers.

- Resolve preprints by DOI (Crossref `posted-content` detected as preprint) and by Europe PMC `SRC:PPR` id via a new `PPR:` source.
- Fetch preprint body text through the Europe PMC `fulltextRepo` PDF route (`epmc_preprint` provider), added to the default full-text chain after `pmc`. PDF `%PDF-` magic is validated by the existing acquire/extract machinery, so stale-filename error blobs are rejected and the chain falls through.
- Surface `is_preprint` / `peer_review_status` on ReferenceContent (persisted in the cache frontmatter) so downstream KBs can apply "not sole support" policies; peer-reviewed papers are left unannotated.
- Crosswalk the preprint id (`pprid`) in build_identifiers. NIH Preprint Pilot preprints carry real PMIDs and continue to resolve through the existing PMID path unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL
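The magic-byte check and chain fall-through described in this commit can be illustrated with a minimal sketch. The names `looks_like_pdf` and `first_valid_pdf` are hypothetical and are not taken from the PR; the real acquire/extract machinery may check the header differently (for example, by tolerating leading bytes before `%PDF-`).

```python
from typing import Callable, Iterable, Optional

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(payload: bytes) -> bool:
    """Return True when the payload begins with the PDF magic bytes."""
    return payload.startswith(PDF_MAGIC)


def first_valid_pdf(
    providers: Iterable[Callable[[], Optional[bytes]]],
) -> Optional[bytes]:
    """Try each provider in order and return the first payload that is a PDF.

    A provider that returns None, or a non-PDF body such as an HTML error
    page, is skipped so the chain falls through to the next provider.
    """
    for fetch in providers:
        payload = fetch()
        if payload is not None and looks_like_pdf(payload):
            return payload
    return None


if __name__ == "__main__":
    chain = [
        lambda: b"<html>file not found</html>",  # stale-filename error blob
        lambda: b"%PDF-1.7 example body",
    ]
    print(first_valid_pdf(chain))
```

Checking the payload rather than the HTTP status or content type is what lets an error page served with a 200 response be rejected.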
Claude finished @cmungall's task in 2m 13s
PR Review: preprint reference support (#54)
This is a clean, well-structured PR. The preprint signal is sourced from authoritative fields (Crossref
🟡 Performance: extra EPMC round-trip on the common (non-preprint) path
The provider only sees
Not blocking, just flagging the latency impact on the hot path.
🟡 Duplicated EPMC search logic between the two new files
🟢 DataCite preprints are not flagged
🟢 Minor style / robustness
✅ Strengths
Note: I was unable to run the test suite in this environment (sandbox blocked
…qualify id queries

Verified against the live Europe PMC API:

- Remove the reconstructed `fulltextRepo` fallback URL: the working endpoint requires a per-record `fileName` query parameter that cannot be derived from the preprint id (a fileName-less request 500s, and the guessed ebi.ac.uk host 404s). In practice `hasPDF=Y` always ships the PDF entry in `fullTextUrlList`, so a record with no usable PDF entry now yields no location instead of a URL that always fails.
- Field-qualify preprint-id lookups with `EXT_ID:` in both the epmc_preprint provider and the PPR source (confirmed to return the exact record), matching the already-qualified `DOI:"..."` branch.
- Route `peer_review_status` through `_quote_yaml_value` when writing the cache frontmatter, for parity with the other string fields.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL
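The two lookup changes in this commit can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `build_epmc_query` and `pdf_url_from_record` are hypothetical names, combining `EXT_ID:` with `SRC:PPR` in one query is an assumption, and the record shape (a `fullTextUrlList.fullTextUrl` list whose entries carry `documentStyle` and `url`) is assumed from typical Europe PMC search responses.

```python
from typing import Any, Dict, Optional


def build_epmc_query(pprid: Optional[str] = None, doi: Optional[str] = None) -> str:
    """Build a field-qualified Europe PMC search query for one record."""
    if pprid:
        # Qualify the id so a free-text match on other fields cannot win.
        # Pairing it with SRC:PPR is an assumption of this sketch.
        return f"EXT_ID:{pprid} AND SRC:PPR"
    if doi:
        return f'DOI:"{doi}"'
    raise ValueError("either pprid or doi is required")


def pdf_url_from_record(record: Dict[str, Any]) -> Optional[str]:
    """Return the PDF URL listed on the record, or None when there is none.

    No fallback URL is reconstructed: a record without a usable PDF entry
    yields no location at all.
    """
    entries = (record.get("fullTextUrlList") or {}).get("fullTextUrl") or []
    for entry in entries:
        if str(entry.get("documentStyle", "")).lower() == "pdf" and entry.get("url"):
            return entry["url"]
    return None


if __name__ == "__main__":
    print(build_epmc_query(pprid="PPR123456"))
    print(pdf_url_from_record({"hasPDF": "N"}))
```

Returning `None` rather than a guessed URL keeps the full-text chain moving to the next provider instead of spending a request on an address that is known to fail.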
Claude finished @cmungall's task in 1m 40s
PR Review: re-review after
…non-preprints

- PPRSource.fetch now catches network/JSON errors at the external boundary and returns None (graceful skip), matching the PMID and ClinicalTrials sources, so a single transient failure can't abort a whole validation run.
- Skip the Europe PMC round-trip in the epmc_preprint provider for records a metadata source confirmed are peer-reviewed: thread is_preprint through ReferenceIdentifiers, have DOISource record is_preprint=False for Crossref works with a known non-preprint type (the type is authoritative), and short-circuit locate() when is_preprint is False. Unknown status (PMID / DataCite) is still attempted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL
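Both changes in this commit turn on small control-flow decisions, sketched below. The `Identifiers` dataclass, `should_query_epmc`, and `safe_fetch` are hypothetical stand-ins for `ReferenceIdentifiers`, the `locate()` short-circuit, and `PPRSource.fetch`. The commit message does not say which exception types the real code catches, so the set used here is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Identifiers:
    """Hypothetical stand-in for ReferenceIdentifiers."""

    doi: Optional[str] = None
    pprid: Optional[str] = None
    # Tri-state: True = preprint, False = confirmed peer-reviewed, None = unknown.
    is_preprint: Optional[bool] = None


def should_query_epmc(ids: Identifiers) -> bool:
    """Skip the round-trip only when a source confirmed the work is not a preprint."""
    return ids.is_preprint is not False


def safe_fetch(fetch: Callable[[], Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Run a fetch at the external boundary and return None on transient failure."""
    try:
        return fetch()
    except (OSError, ValueError):
        # OSError covers connection and timeout errors;
        # json.JSONDecodeError is a subclass of ValueError.
        return None


if __name__ == "__main__":
    print(should_query_epmc(Identifiers(doi="10.1000/example", is_preprint=False)))
    print(should_query_epmc(Identifiers(pprid="PPR123456")))
    print(safe_fetch(lambda: json.loads("{not json")))
```

Testing `is not False` rather than truthiness is the point of the tri-state: an unknown status (`None`) still reaches Europe PMC, and only an explicit `False` skips the request.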
Claude finished @cmungall's task in 1m 40s
Re-review after
Make preprints first-class references with the same anti-hallucination
snippet-validation guarantees as published papers.

- Resolve preprints by DOI (Crossref `posted-content` detected as preprint) and by Europe PMC `SRC:PPR` id via a new `PPR:` source.
- Fetch preprint body text through the Europe PMC `fulltextRepo` PDF route (`epmc_preprint` provider), added to the default full-text chain after `pmc`. PDF `%PDF-` magic is validated by the existing acquire/extract machinery, so stale-filename error blobs are rejected and the chain falls through.
- Surface `is_preprint` / `peer_review_status` on ReferenceContent (persisted in the cache frontmatter) so downstream KBs can apply "not sole support" policies; peer-reviewed papers are left unannotated.
- Crosswalk the preprint id (`pprid`) in build_identifiers. NIH Preprint Pilot preprints carry real PMIDs and continue to resolve through the existing PMID path unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL