Skip to content

feat(preprints): add preprint reference support with EPMC full-text route#54

Merged
cmungall merged 3 commits into
mainfrom
claude/preprint-reference-support-qnc600
Jun 28, 2026
Merged

feat(preprints): add preprint reference support with EPMC full-text route#54
cmungall merged 3 commits into
mainfrom
claude/preprint-reference-support-qnc600

Conversation

@cmungall

Copy link
Copy Markdown
Member

Make preprints first-class references with the same anti-hallucination
snippet-validation guarantees as published papers.

  • Resolve preprints by DOI (Crossref posted-content detected as preprint)
    and by Europe PMC SRC:PPR id via a new PPR: source.
  • Fetch preprint body text through the Europe PMC fulltextRepo PDF route
    (epmc_preprint provider), added to the default full-text chain after pmc.
    PDF %PDF- magic is validated by the existing acquire/extract machinery, so
    stale-filename error blobs are rejected and the chain falls through.
  • Surface is_preprint / peer_review_status on ReferenceContent (persisted
    in the cache frontmatter) so downstream KBs can apply "not sole support"
    policies; peer-reviewed papers are left unannotated.
  • Crosswalk the preprint id (pprid) in build_identifiers.

NIH Preprint Pilot preprints carry real PMIDs and continue to resolve through
the existing PMID path unchanged.

Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com
Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL

…oute

Make preprints first-class references with the same anti-hallucination
snippet-validation guarantees as published papers.

- Resolve preprints by DOI (Crossref `posted-content` detected as preprint)
  and by Europe PMC `SRC:PPR` id via a new `PPR:` source.
- Fetch preprint body text through the Europe PMC `fulltextRepo` PDF route
  (`epmc_preprint` provider), added to the default full-text chain after `pmc`.
  PDF `%PDF-` magic is validated by the existing acquire/extract machinery, so
  stale-filename error blobs are rejected and the chain falls through.
- Surface `is_preprint` / `peer_review_status` on ReferenceContent (persisted
  in the cache frontmatter) so downstream KBs can apply "not sole support"
  policies; peer-reviewed papers are left unannotated.
- Crosswalk the preprint id (`pprid`) in build_identifiers.

NIH Preprint Pilot preprints carry real PMIDs and continue to resolve through
the existing PMID path unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Claude finished @cmungall's task in 2m 13s —— View job


PR Review: preprint reference support (#54)

  • Read changed source files
  • Review new sources (doi.py, ppr.py)
  • Review full-text provider (epmc_preprint.py)
  • Review model/identifier changes
  • Review test coverage
  • Post review feedback

This is a clean, well-structured PR. The preprint signal is sourced from authoritative fields (Crossref posted-content, EPMC SRC:PPR) rather than DOI-prefix heuristics, the %PDF- magic validation correctly guards the fulltextRepo route, and the is_preprint/peer_review_status fields round-trip through the cache. Test coverage is strong and the HTTP mocking is the legitimate external-system exception to the repo's "no mock tests" rule. Below are findings, ordered by importance.

🟡 Performance — extra EPMC round-trip on the common (non-preprint) path

epmc_preprint is inserted into the default chain right after pmc (models.py:465). For every peer-reviewed paper that misses PMC full text but has a DOI, the chain now issues an extra Europe PMC search restricted to SRC:PPR, which returns nothing, plus a time.sleep(config.rate_limit_delay) (epmc_preprint.py:57). That's one added API call + delay per paper on the most common path.

The provider only sees ReferenceIdentifiers, not the ReferenceContent, so it can't cheaply consult is_preprint to short-circuit. Worth considering one of:

  • Documenting that this cost is accepted (the SRC:PPR filter keeps it correct, just slower), or
  • Threading the is_preprint hint into the provider so a known non-preprint skips the lookup.

Not blocking — just flagging the latency impact on the hot path.

🟡 Duplicated EPMC search logic between the two new files

_first_ppr_result() and the _EPMC_SEARCH_URL constant are copy-pasted in both ppr.py (:35, :124) and epmc_preprint.py (:33, :105), and both build nearly identical search params (format/resultType/pageSize/email). A small shared helper (e.g. etl/sources/epmc_common.py) would keep the source == "PPR" guard and query shape in one place. Minor maintainability point.

🟢 DataCite preprints are not flagged

_is_crossref_preprint is only consulted on the Crossref path; _fetch_from_datacite (doi.py:244) never sets is_preprint. Preprints whose DOIs resolve only via DataCite won't get the "not sole support" annotation. Likely out of scope (most bio preprints are in Crossref), but worth a comment noting the gap so it isn't mistaken for full coverage.

🟢 Minor style / robustness

  • doi.py:156-157: True if is_preprint else None — since is_preprint is already a bool, is_preprint or None is equivalent and reads cleaner.
  • epmc_preprint.py:100 / ppr.py:74: the pprid is interpolated unquoted into the Lucene query (f"{ids.pprid} AND SRC:PPR"). The source == "PPR" post-filter makes this safe in practice, but a value containing Lucene operators would alter the query shape. Low risk given the filter; a brief normalize/validate of pprid to ^PPR\d+$ would make intent explicit.
  • epmc_preprint.py:135: the constructed fallback URL uses type=FILE&mimeType=application/pdf while the real records (per the test fixture) also carry fileName=.... Since %PDF- validation gates it downstream, a stale/incorrect fallback URL just falls through harmlessly — good defensive design, no change needed.

✅ Strengths

  • _is_crossref_preprint correctly avoids the 10.1101 prefix trap (CSH journals) — documented in the docstring with a doctest.
  • The redundant SRC:PPR query restriction and source == "PPR" post-filter is good belt-and-suspenders against DOI-namespace collisions.
  • _maybe_retry_full_text / full_text_attempted semantics are respected so a preprint that transiently fails the fulltextRepo route stays retryable.
  • Cache frontmatter round-trip is tested in both directions (set and unset), incl. the important "non-preprint must not gain a status" case.

Note

I was unable to run the test suite in this environment (sandbox blocked uv run pytest), so I reviewed by reading. The new tests look correct and self-consistent; please confirm just test is green in CI before merge.
• branch claude/preprint-reference-support-qnc600

…qualify id queries

Verified against the live Europe PMC API:

- Remove the reconstructed `fulltextRepo` fallback URL: the working endpoint
  requires a per-record `fileName` query parameter that cannot be derived from
  the preprint id (a fileName-less request 500s, and the guessed ebi.ac.uk host
  404s). In practice `hasPDF=Y` always ships the PDF entry in `fullTextUrlList`,
  so a record with no usable PDF entry now yields no location instead of a URL
  that always fails.
- Field-qualify preprint-id lookups with `EXT_ID:` in both the epmc_preprint
  provider and the PPR source (confirmed to return the exact record), matching
  the already-qualified `DOI:"..."` branch.
- Route `peer_review_status` through `_quote_yaml_value` when writing the cache
  frontmatter, for parity with the other string fields.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Claude finished @cmungall's task in 1m 40s —— View job


PR Review — re-review after 245bb0c

  • Read latest changes (commit 245bb0c)
  • Check prior review items were addressed
  • Review for remaining bugs / quality issues
  • Post review feedback

The follow-up commit cleanly resolves the substantive findings from the first pass. Verifying each:

Addressed

  • 🟢 Broken fallback URL removed — the reconstructed fulltextRepo?pprId=… guess is gone (epmc_preprint.py:116-136). Confirming the endpoint needs a per-record fileName that can't be derived from the id, and returning None rather than a URL that always fails, is the right call. The test was correctly inverted (test_locate_no_pdf_entry_returns_none) — a record with hasPDF=Y but no usable entry now asserts loc is None.
  • 🟢 Lucene id queries qualified — both the provider (EXT_ID:{pprid} AND SRC:PPR, epmc_preprint.py:101) and the PPR source (ppr.py:74) now field-qualify, matching the already-quoted DOI:"…" branch. This addresses the unquoted-interpolation concern; the doctest was updated to match.
  • 🟢 YAML paritypeer_review_status now routes through _quote_yaml_value on write (reference_fetcher.py:498-500).

Verified correct elsewhere

  • Cache round-trip: is_preprint is written as is_preprint: true and parsed back via ruamel.yaml, so it reloads as a real bool (not the string "true"). test_save_and_load_preprint_metadata asserts loaded.is_preprint is True, and the companion test confirms a non-preprint never gains a status. Round-trip is sound.
  • build_identifiers crosswalk (PPR:pprid) and the SRC:PPR + source == "PPR" belt-and-suspenders guard remain intact.

Remaining minor items (all non-blocking, carried from the first review)

  • 🟡 Extra EPMC round-trip on the common path still stands: epmc_preprint runs for every non-PMC paper with a DOI, issuing a SRC:PPR search (+rate_limit_delay) that returns nothing. Correct, just slower on the hot path. Fine to accept; worth a one-line doc note if not.
  • 🟡 Duplication of _EPMC_SEARCH_URL + _first_ppr_result across ppr.py and epmc_preprint.py — a shared epmc_common.py would centralize the query shape. Cosmetic.
  • 🟢 doi.py:156True if is_preprint else None could read as is_preprint or None (equivalent since it's already a bool). Trivial.
  • 🟢 DataCite-only preprints (_fetch_from_datacite, doi.py:244) still aren't flagged. Likely out of scope for bio preprints (Crossref-resolved), but the gap is worth a comment so it isn't read as full coverage.
  • 🟢 Module docstring (epmc_preprint.py:5-7) still says "every … record carries a direct … PDF URL"; the code now correctly handles records that don't. A small wording softening would match the new behavior.

Testing note: I couldn't run just test / uv run pytest in this sandbox (command blocked). The "Build and test" CI job (28331630660) is in progress — please confirm it's green before merge. The preprint test additions read as correct and self-consistent.

Net: the fixes are solid and address everything that mattered. Remaining items are all optional polish. 👍
• branch claude/preprint-reference-support-qnc600

…non-preprints

- PPRSource.fetch now catches network/JSON errors at the external boundary and
  returns None (graceful skip), matching the PMID and ClinicalTrials sources, so
  a single transient failure can't abort a whole validation run.
- Skip the Europe PMC round-trip in the epmc_preprint provider for records a
  metadata source confirmed are peer-reviewed: thread is_preprint through
  ReferenceIdentifiers, have DOISource record is_preprint=False for Crossref
  works with a known non-preprint type (the type is authoritative), and
  short-circuit locate() when is_preprint is False. Unknown status (PMID /
  DataCite) is still attempted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SxPb8WHvSHJJkwB5njMsZL
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Claude finished @cmungall's task in 1m 40s —— View job


Re-review after 2cdbc1f

  • Read latest changes (commit 2cdbc1f)
  • Check prior review items addressed
  • Review for remaining bugs / quality issues
  • Post review feedback

This commit closes the two most important items carried from the earlier passes. Verifying each:

Addressed

  • 🟡→✅ Extra EPMC round-trip on the hot path is now skipped for DOI-resolved papers. is_preprint is threaded through ReferenceIdentifiers (models.py:638, identifiers.py:47) and locate() short-circuits when it's False (epmc_preprint.py:55-56) before any network call. DOISource now records is_preprint=False for any Crossref work with a known non-preprint type (doi.py:_crossref_preprint_status), treating Crossref's type as authoritative. So every peer-reviewed paper resolved via Crossref now skips the SRC:PPR search + rate_limit_delay entirely. The None ("unknown") case — PMID / DataCite records — is still attempted, which is correct (NIH Preprint Pilot preprints carry real PMIDs, so PMID can't be blanket-classified). Nicely scoped fix.
  • 🟢→✅ True if is_preprint else None cleaned up. Now returns the tri-state Optional[bool] directly. peer_review_status="preprint" if is_preprint else None still maps both False and None to no status, so a non-preprint is recorded as is_preprint=False without being positively asserted as peer-reviewed — matching the doc table update.
  • 🟢→✅ PPR source now degrades gracefully at the external boundary. PPRSource.fetch wraps the requests.get/.json() in a try/except returning None (ppr.py:80-93), matching the PMID/ClinicalTrials sources so one transient EPMC failure can't abort a whole run. This is the legitimate external-system exception to the repo's "avoid try/except" rule, and the comment says so.

Test coverage for the new behavior is solid

  • test_locate_skips_confirmed_non_preprint_without_network asserts mock_get.assert_not_called() — proves the short-circuit avoids the round-trip, not just the result.
  • test_locate_attempts_when_preprint_status_unknown covers the None path still hitting the network.
  • test_journal_article_is_marked_not_preprint (is_preprint is False, status None) and test_typeless_crossref_record_leaves_preprint_unknown (is_preprint is None) cover the new tri-state cleanly, plus the new _crossref_preprint_status({}) is None doctest.
  • test_build_carries_is_preprint_flag covers True/False/None propagation through build_identifiers.
  • test_fetch_network_exception_returns_none covers the graceful-skip path.

Remaining minor items (all non-blocking, carried forward)

  • 🟡 Duplication of _EPMC_SEARCH_URL + _first_ppr_result across ppr.py and epmc_preprint.py is still present — a shared epmc_common.py would centralize the query shape. Purely cosmetic. Fix this →
  • 🟢 Module docstring in epmc_preprint.py:5-7 still asserts "every SRC:PPR AND HAS_FT:Y core record carries a direct … PDF URL", but the code (and test_locate_no_pdf_entry_returns_none) now correctly handles records that don't. A one-line softening would match the actual behavior. Fix this →
  • 🟢 DataCite-only preprints (_fetch_from_datacite) still aren't flagged — they stay is_preprint=None, so they're attempted by the provider rather than mislabeled, which is the safe default. Worth a one-line code comment noting the gap so it isn't read as full coverage. Out of scope for Crossref-resolved bio preprints.

Note: I still couldn't run just test / uv run pytest in this sandbox (command blocked by the environment). The added tests read as correct and self-consistent — please confirm the CI "Build and test" job is green before merge.

Net: the substantive performance and robustness concerns are now resolved, with good targeted tests. Only optional polish remains. 👍
• branch claude/preprint-reference-support-qnc600

@cmungall cmungall merged commit 1b1c750 into main Jun 28, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants