feat(preprints): add preprint reference support with EPMC full-text route by cmungall · Pull Request #54 · linkml/linkml-reference-validator

cmungall · 2026-06-28T17:31:49Z

Make preprints first-class references with the same anti-hallucination
snippet-validation guarantees as published papers.

Resolve preprints by DOI (Crossref posted-content detected as preprint)
and by Europe PMC SRC:PPR id via a new PPR: source.
Fetch preprint body text through the Europe PMC fulltextRepo PDF route
(epmc_preprint provider), added to the default full-text chain after pmc.
PDF %PDF- magic is validated by the existing acquire/extract machinery, so
stale-filename error blobs are rejected and the chain falls through.
Surface is_preprint / peer_review_status on ReferenceContent (persisted
in the cache frontmatter) so downstream KBs can apply "not sole support"
policies; peer-reviewed papers are left unannotated.
Crosswalk the preprint id (pprid) in build_identifiers.

NIH Preprint Pilot preprints carry real PMIDs and continue to resolve through
the existing PMID path unchanged.

Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com
Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL

…oute Make preprints first-class references with the same anti-hallucination snippet-validation guarantees as published papers. - Resolve preprints by DOI (Crossref `posted-content` detected as preprint) and by Europe PMC `SRC:PPR` id via a new `PPR:` source. - Fetch preprint body text through the Europe PMC `fulltextRepo` PDF route (`epmc_preprint` provider), added to the default full-text chain after `pmc`. PDF `%PDF-` magic is validated by the existing acquire/extract machinery, so stale-filename error blobs are rejected and the chain falls through. - Surface `is_preprint` / `peer_review_status` on ReferenceContent (persisted in the cache frontmatter) so downstream KBs can apply "not sole support" policies; peer-reviewed papers are left unannotated. - Crosswalk the preprint id (`pprid`) in build_identifiers. NIH Preprint Pilot preprints carry real PMIDs and continue to resolve through the existing PMID path unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL

claude · 2026-06-28T17:32:04Z

Claude finished @cmungall's task in 2m 13s —— View job

PR Review: preprint reference support (#54)

Read changed source files
Review new sources (doi.py, ppr.py)
Review full-text provider (epmc_preprint.py)
Review model/identifier changes
Review test coverage
Post review feedback

This is a clean, well-structured PR. The preprint signal is sourced from authoritative fields (Crossref posted-content, EPMC SRC:PPR) rather than DOI-prefix heuristics, the %PDF- magic validation correctly guards the fulltextRepo route, and the is_preprint/peer_review_status fields round-trip through the cache. Test coverage is strong and the HTTP mocking is the legitimate external-system exception to the repo's "no mock tests" rule. Below are findings, ordered by importance.

🟡 Performance — extra EPMC round-trip on the common (non-preprint) path

epmc_preprint is inserted into the default chain right after pmc (models.py:465). For every peer-reviewed paper that misses PMC full text but has a DOI, the chain now issues an extra Europe PMC search restricted to SRC:PPR, which returns nothing, plus a time.sleep(config.rate_limit_delay) (epmc_preprint.py:57). That's one added API call + delay per paper on the most common path.

The provider only sees ReferenceIdentifiers, not the ReferenceContent, so it can't cheaply consult is_preprint to short-circuit. Worth considering one of:

Documenting that this cost is accepted (the SRC:PPR filter keeps it correct, just slower), or
Threading the is_preprint hint into the provider so a known non-preprint skips the lookup.

Not blocking — just flagging the latency impact on the hot path.

🟡 Duplicated EPMC search logic between the two new files

_first_ppr_result() and the _EPMC_SEARCH_URL constant are copy-pasted in both ppr.py (:35, :124) and epmc_preprint.py (:33, :105), and both build nearly identical search params (format/resultType/pageSize/email). A small shared helper (e.g. etl/sources/epmc_common.py) would keep the source == "PPR" guard and query shape in one place. Minor maintainability point.

🟢 DataCite preprints are not flagged

_is_crossref_preprint is only consulted on the Crossref path; _fetch_from_datacite (doi.py:244) never sets is_preprint. Preprints whose DOIs resolve only via DataCite won't get the "not sole support" annotation. Likely out of scope (most bio preprints are in Crossref), but worth a comment noting the gap so it isn't mistaken for full coverage.

🟢 Minor style / robustness

doi.py:156-157: True if is_preprint else None — since is_preprint is already a bool, is_preprint or None is equivalent and reads cleaner.
epmc_preprint.py:100 / ppr.py:74: the pprid is interpolated unquoted into the Lucene query (f"{ids.pprid} AND SRC:PPR"). The source == "PPR" post-filter makes this safe in practice, but a value containing Lucene operators would alter the query shape. Low risk given the filter; a brief normalize/validate of pprid to ^PPR\d+$ would make intent explicit.
epmc_preprint.py:135: the constructed fallback URL uses type=FILE&mimeType=application/pdf while the real records (per the test fixture) also carry fileName=.... Since %PDF- validation gates it downstream, a stale/incorrect fallback URL just falls through harmlessly — good defensive design, no change needed.

✅ Strengths

_is_crossref_preprint correctly avoids the 10.1101 prefix trap (CSH journals) — documented in the docstring with a doctest.
The redundant SRC:PPR query restriction and source == "PPR" post-filter is good belt-and-suspenders against DOI-namespace collisions.
_maybe_retry_full_text / full_text_attempted semantics are respected so a preprint that transiently fails the fulltextRepo route stays retryable.
Cache frontmatter round-trip is tested in both directions (set and unset), incl. the important "non-preprint must not gain a status" case.

Note

I was unable to run the test suite in this environment (sandbox blocked uv run pytest), so I reviewed by reading. The new tests look correct and self-consistent; please confirm just test is green in CI before merge.
• branch claude/preprint-reference-support-qnc600

…qualify id queries Verified against the live Europe PMC API: - Remove the reconstructed `fulltextRepo` fallback URL: the working endpoint requires a per-record `fileName` query parameter that cannot be derived from the preprint id (a fileName-less request 500s, and the guessed ebi.ac.uk host 404s). In practice `hasPDF=Y` always ships the PDF entry in `fullTextUrlList`, so a record with no usable PDF entry now yields no location instead of a URL that always fails. - Field-qualify preprint-id lookups with `EXT_ID:` in both the epmc_preprint provider and the PPR source (confirmed to return the exact record), matching the already-qualified `DOI:"..."` branch. - Route `peer_review_status` through `_quote_yaml_value` when writing the cache frontmatter, for parity with the other string fields. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL

claude · 2026-06-28T18:20:41Z

Claude finished @cmungall's task in 1m 40s —— View job

PR Review — re-review after `245bb0c` ✅

Read latest changes (commit 245bb0c)
Check prior review items were addressed
Review for remaining bugs / quality issues
Post review feedback

The follow-up commit cleanly resolves the substantive findings from the first pass. Verifying each:

Addressed

🟢 Broken fallback URL removed — the reconstructed fulltextRepo?pprId=… guess is gone (epmc_preprint.py:116-136). Confirming the endpoint needs a per-record fileName that can't be derived from the id, and returning None rather than a URL that always fails, is the right call. The test was correctly inverted (test_locate_no_pdf_entry_returns_none) — a record with hasPDF=Y but no usable entry now asserts loc is None.
🟢 Lucene id queries qualified — both the provider (EXT_ID:{pprid} AND SRC:PPR, epmc_preprint.py:101) and the PPR source (ppr.py:74) now field-qualify, matching the already-quoted DOI:"…" branch. This addresses the unquoted-interpolation concern; the doctest was updated to match.
🟢 YAML parity — peer_review_status now routes through _quote_yaml_value on write (reference_fetcher.py:498-500).

Verified correct elsewhere

Cache round-trip: is_preprint is written as is_preprint: true and parsed back via ruamel.yaml, so it reloads as a real bool (not the string "true"). test_save_and_load_preprint_metadata asserts loaded.is_preprint is True, and the companion test confirms a non-preprint never gains a status. Round-trip is sound.
build_identifiers crosswalk (PPR: → pprid) and the SRC:PPR + source == "PPR" belt-and-suspenders guard remain intact.

Remaining minor items (all non-blocking, carried from the first review)

🟡 Extra EPMC round-trip on the common path still stands: epmc_preprint runs for every non-PMC paper with a DOI, issuing a SRC:PPR search (+rate_limit_delay) that returns nothing. Correct, just slower on the hot path. Fine to accept; worth a one-line doc note if not.
🟡 Duplication of _EPMC_SEARCH_URL + _first_ppr_result across ppr.py and epmc_preprint.py — a shared epmc_common.py would centralize the query shape. Cosmetic.
🟢 doi.py:156 — True if is_preprint else None could read as is_preprint or None (equivalent since it's already a bool). Trivial.
🟢 DataCite-only preprints (_fetch_from_datacite, doi.py:244) still aren't flagged. Likely out of scope for bio preprints (Crossref-resolved), but the gap is worth a comment so it isn't read as full coverage.
🟢 Module docstring (epmc_preprint.py:5-7) still says "every … record carries a direct … PDF URL"; the code now correctly handles records that don't. A small wording softening would match the new behavior.

Testing note: I couldn't run just test / uv run pytest in this sandbox (command blocked). The "Build and test" CI job (28331630660) is in progress — please confirm it's green before merge. The preprint test additions read as correct and self-consistent.

Net: the fixes are solid and address everything that mattered. Remaining items are all optional polish. 👍
• branch claude/preprint-reference-support-qnc600

…non-preprints - PPRSource.fetch now catches network/JSON errors at the external boundary and returns None (graceful skip), matching the PMID and ClinicalTrials sources, so a single transient failure can't abort a whole validation run. - Skip the Europe PMC round-trip in the epmc_preprint provider for records a metadata source confirmed are peer-reviewed: thread is_preprint through ReferenceIdentifiers, have DOISource record is_preprint=False for Crossref works with a known non-preprint type (the type is authoritative), and short-circuit locate() when is_preprint is False. Unknown status (PMID / DataCite) is still attempted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL

claude · 2026-06-28T19:00:20Z

Claude finished @cmungall's task in 1m 40s —— View job

Re-review after `2cdbc1f` ✅

Read latest changes (commit 2cdbc1f)
Check prior review items addressed
Review for remaining bugs / quality issues
Post review feedback

This commit closes the two most important items carried from the earlier passes. Verifying each:

Addressed

🟡→✅ Extra EPMC round-trip on the hot path is now skipped for DOI-resolved papers. is_preprint is threaded through ReferenceIdentifiers (models.py:638, identifiers.py:47) and locate() short-circuits when it's False (epmc_preprint.py:55-56) before any network call. DOISource now records is_preprint=False for any Crossref work with a known non-preprint type (doi.py:_crossref_preprint_status), treating Crossref's type as authoritative. So every peer-reviewed paper resolved via Crossref now skips the SRC:PPR search + rate_limit_delay entirely. The None ("unknown") case — PMID / DataCite records — is still attempted, which is correct (NIH Preprint Pilot preprints carry real PMIDs, so PMID can't be blanket-classified). Nicely scoped fix.
🟢→✅ True if is_preprint else None cleaned up. Now returns the tri-state Optional[bool] directly. peer_review_status="preprint" if is_preprint else None still maps both False and None to no status, so a non-preprint is recorded as is_preprint=False without being positively asserted as peer-reviewed — matching the doc table update.
🟢→✅ PPR source now degrades gracefully at the external boundary. PPRSource.fetch wraps the requests.get/.json() in a try/except returning None (ppr.py:80-93), matching the PMID/ClinicalTrials sources so one transient EPMC failure can't abort a whole run. This is the legitimate external-system exception to the repo's "avoid try/except" rule, and the comment says so.

Test coverage for the new behavior is solid

test_locate_skips_confirmed_non_preprint_without_network asserts mock_get.assert_not_called() — proves the short-circuit avoids the round-trip, not just the result.
test_locate_attempts_when_preprint_status_unknown covers the None path still hitting the network.
test_journal_article_is_marked_not_preprint (is_preprint is False, status None) and test_typeless_crossref_record_leaves_preprint_unknown (is_preprint is None) cover the new tri-state cleanly, plus the new _crossref_preprint_status({}) is None doctest.
test_build_carries_is_preprint_flag covers True/False/None propagation through build_identifiers.
test_fetch_network_exception_returns_none covers the graceful-skip path.

Remaining minor items (all non-blocking, carried forward)

🟡 Duplication of _EPMC_SEARCH_URL + _first_ppr_result across ppr.py and epmc_preprint.py is still present — a shared epmc_common.py would centralize the query shape. Purely cosmetic. Fix this →
🟢 Module docstring in epmc_preprint.py:5-7 still asserts "every SRC:PPR AND HAS_FT:Y core record carries a direct … PDF URL", but the code (and test_locate_no_pdf_entry_returns_none) now correctly handles records that don't. A one-line softening would match the actual behavior. Fix this →
🟢 DataCite-only preprints (_fetch_from_datacite) still aren't flagged — they stay is_preprint=None, so they're attempted by the provider rather than mislabeled, which is the safe default. Worth a one-line code comment noting the gap so it isn't read as full coverage. Out of scope for Crossref-resolved bio preprints.

Note: I still couldn't run just test / uv run pytest in this sandbox (command blocked by the environment). The added tests read as correct and self-consistent — please confirm the CI "Build and test" job is green before merge.

Net: the substantive performance and robustness concerns are now resolved, with good targeted tests. Only optional polish remains. 👍
• branch claude/preprint-reference-support-qnc600

cmungall merged commit 1b1c750 into main Jun 28, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(preprints): add preprint reference support with EPMC full-text route#54

feat(preprints): add preprint reference support with EPMC full-text route#54
cmungall merged 3 commits into
mainfrom
claude/preprint-reference-support-qnc600

cmungall commented Jun 28, 2026

Uh oh!

claude Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

cmungall commented Jun 28, 2026

Uh oh!

claude Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: preprint reference support (#54)

🟡 Performance — extra EPMC round-trip on the common (non-preprint) path

🟡 Duplicated EPMC search logic between the two new files

🟢 DataCite preprints are not flagged

🟢 Minor style / robustness

✅ Strengths

Note

Uh oh!

claude Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review — re-review after 245bb0c ✅

Uh oh!

claude Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review after 2cdbc1f ✅

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Jun 28, 2026 •

edited

Loading

claude Bot commented Jun 28, 2026 •

edited

Loading

PR Review — re-review after `245bb0c` ✅

claude Bot commented Jun 28, 2026 •

edited

Loading

Re-review after `2cdbc1f` ✅