
feat(seo): static sitemap.xml with git-based lastmod #222

Open
dcrawbuck wants to merge 3 commits into main from dcrawbuck/raleigh-v2

@dcrawbuck dcrawbuck commented Jun 25, 2026

What & why

/docs/sitemap.xml was generated at request time on the Cloudflare Worker and stamped
every URL with new Date().toISOString(). That tells Google "every page changed just now"
on every crawl, so Google learns to ignore our <lastmod> entirely (it only trusts the field
when it's consistently accurate).

This replaces that with accurate, build-time <lastmod> derived from git history, served
as a static asset. Verified end-to-end on a Cloudflare preview deploy: 592 URLs, all dated.

Approach

Cloudflare Workers have no filesystem / git at request time, so dates are resolved during
bun run build (Node, full repo) — same pattern as the existing generate-static-cache /
generate-search-index post-build scripts. The site has no runtime content source (MDX is
compiled into the bundle), so the page set is fixed at build time and a live route buys
nothing. The generated dist/client/docs/sitemap.xml is served at /docs/sitemap.xml — the
same delivery path proven by search-index.json (confirmed in preview: HTTP 200,
application/xml).

How dates are computed

  • One git log --no-merges --name-only --pretty=format:…%cs pass builds a
    file → latest-commit-date (YYYY-MM-DD) map (1 subprocess, not ~590).
  • URL → source file mapping:
    • Content pages → content/docs/<page.path>.
    • <include> dependencies are resolved transitively — 107 pages render shared bodies
      from content/shared/**, so an edit to a shared file bumps every page that includes it.
    • Component data dependencies: /docs/changelog renders <ChangelogTimeline/>, which
      imports the committed src/lib/changelog-entries.json; that file is added as a
      supplemental source so changelog regenerations bump the page's date.
    • /docs → src/routes/index.tsx; /home (301→dashboard) → dashboard content.
    • SDK landing pages (/ios, /android, …) inherit their content source and only get a
      priority bump (single source of truth).
  • Each entry's date = most recent commit among its source files. Unknown → <lastmod> omitted
    (never falls back to new Date()).
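
As a rough illustration of the single-pass approach above, here is a minimal TypeScript sketch. It is not the code in scripts/generate-sitemap.ts. The PR's actual --pretty format string is elided, so the sketch assumes a hypothetical `--pretty=format:@@%cs`, where `@@` is an invented marker that flags commit header lines; `parseGitLog` and `latestDateFor` are illustrative names as well.

```typescript
// Hypothetical marker: the PR's real format string is elided, so this sketch
// assumes commit header lines look like "@@2026-06-23" (from "@@%cs").
const COMMIT_MARKER = "@@";

/** Build a file -> latest commit date (YYYY-MM-DD) map from one git log dump. */
function parseGitLog(output: string): Map<string, string> {
  const latest = new Map<string, string>();
  let currentDate: string | undefined;
  for (const rawLine of output.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue; // blank separator between commits
    if (line.startsWith(COMMIT_MARKER)) {
      currentDate = line.slice(COMMIT_MARKER.length);
      continue;
    }
    if (currentDate === undefined) continue; // file line before any header
    const previous = latest.get(line);
    // ISO dates compare correctly as strings. Keep the max instead of relying
    // on log order, because committer dates are not guaranteed to be monotonic.
    if (previous === undefined || currentDate > previous) {
      latest.set(line, currentDate);
    }
  }
  return latest;
}

/** Most recent date among an entry's source files, or undefined if none is known. */
function latestDateFor(
  sources: readonly string[],
  dates: ReadonlyMap<string, string>,
): string | undefined {
  let best: string | undefined;
  for (const source of sources) {
    const date = dates.get(source);
    if (date !== undefined && (best === undefined || date > best)) best = date;
  }
  return best;
}
```

Returning `undefined` instead of a fallback date is what lets the caller omit <lastmod> for unknown files.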

Robustness — shallow clones self-heal

Deploy environments (Cloudflare Workers Builds) shallow-clone with no fetch-depth setting,
which would otherwise leave every page date-less. The generator detects a shallow clone and
deepens it with git fetch --unshallow (anonymous — the repo is public). Verified in the
Cloudflare build log: ✓ Fetched full git history → 592 urls (592 with <lastmod>). If history
still can't be obtained, it omits <lastmod> rather than publish a wrong date, and never
fails the build (git errors degrade gracefully).
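
The decision logic described above can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: `GitRunner`, `HistoryStatus`, and `ensureFullHistory` are hypothetical names, and git is injected as a function so the logic can be exercised without a real repository. The sketch also treats a failed shallow probe as "unavailable", which matches the policy a later commit in this PR describes.

```typescript
// Hypothetical runner: receives git arguments, returns stdout, throws on failure.
type GitRunner = (args: readonly string[]) => string;

// "full" means dates can be trusted; "unavailable" means omit <lastmod>.
type HistoryStatus = "full" | "unavailable";

function ensureFullHistory(runGit: GitRunner): HistoryStatus {
  let shallow: string;
  try {
    // Prints "true" or "false" on git versions that support this flag.
    shallow = runGit(["rev-parse", "--is-shallow-repository"]).trim();
  } catch {
    return "unavailable"; // cannot tell, so do not trust any dates
  }
  if (shallow === "false") return "full";
  if (shallow !== "true") return "unavailable"; // unexpected probe output
  try {
    runGit(["fetch", "--unshallow"]); // deepen the shallow clone
    return "full";
  } catch {
    return "unavailable"; // degrade gracefully, never fail the build
  }
}
```

In a Node build script, the runner could wrap `execFileSync("git", args, { encoding: "utf8" })` from `node:child_process`, which throws on a non-zero exit code.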

Changes

  • src/lib/sitemap.ts — pure, worker-safe: getSitemapSourceEntries (dedupe + priority
    merge), attachLastModified (date resolution injected by caller), optional <lastmod>.
  • scripts/generate-sitemap.ts — new build-time generator (git dates, include + component
    data resolution, shallow self-heal, graceful degradation), wired into build.
  • Deleted runtime route src/routes/sitemap[.]xml.ts (+ regenerated routeTree.gen.ts).
  • src/lib/seo-routes.test.ts — updated for the new API.
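
To make the "date resolution injected by caller" and "optional <lastmod>" ideas concrete, here is a small sketch. The types and function names (`SourceEntry`, `attachDates`, `renderSitemap`) are hypothetical stand-ins, and the real signatures in src/lib/sitemap.ts may differ. The helpers touch neither git nor the filesystem, which is what would keep them worker-safe and easy to test.

```typescript
interface SourceEntry {
  url: string;
  priority: number;
  sources: string[]; // files whose history determines the page's date
}

interface DatedEntry extends SourceEntry {
  lastModified?: string; // YYYY-MM-DD, absent when unknown
}

// The caller decides how dates are resolved (git, a fixture map in tests, ...).
type DateResolver = (sources: readonly string[]) => string | undefined;

function attachDates(entries: readonly SourceEntry[], resolve: DateResolver): DatedEntry[] {
  return entries.map((entry) => {
    const lastModified = resolve(entry.sources);
    return lastModified === undefined ? { ...entry } : { ...entry, lastModified };
  });
}

function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function renderSitemap(entries: readonly DatedEntry[]): string {
  const urls = entries.map((entry) => {
    const lines = [`    <loc>${escapeXml(entry.url)}</loc>`];
    // Emit <lastmod> only when a date is known; never substitute "now".
    if (entry.lastModified !== undefined) {
      lines.push(`    <lastmod>${entry.lastModified}</lastmod>`);
    }
    lines.push(`    <priority>${entry.priority.toFixed(1)}</priority>`);
    return `  <url>\n${lines.join("\n")}\n  </url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n") + "\n";
}
```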

Testing

  • bun test — 69 pass.
  • Cloudflare preview build: green; build log shows shallow→unshallow→592 dated URLs→deployed.
  • Local regen: 592 URLs, 592 <lastmod>, valid XML (xmllint); /docs/changelog correctly
    reflects max(wrapper, changelog JSON); include resolution verified.

Note on current dates

~586 of ~590 pages currently share 2026-06-23 because of recent bulk commits (#218/#219).
That's accurate git history; dates diverge naturally as pages are edited individually.

Notes


Note

Low Risk
SEO/build pipeline change only; graceful degradation on git issues and no auth or runtime behavior changes beyond sitemap delivery path.

Overview
Replaces request-time sitemap generation (every URL stamped with new Date()) with a build-time static dist/client/docs/sitemap.xml, served like search-index.json on Cloudflare Workers where git/fs are unavailable at runtime.

scripts/generate-sitemap.ts runs after vite build, loads the docs page list via a lightweight Vite SSR pass, maps each URL to backing source files (MDX paths, transitive <include> deps, and supplemental files like changelog JSON), and sets <lastmod> from a single git log pass. Shallow CI clones are deepened with git fetch --unshallow when possible; git failures omit <lastmod> without failing the build.
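
The transitive <include> resolution mentioned above amounts to a graph walk from each page to everything it includes, directly or indirectly. The sketch below assumes the include relationships have already been extracted into a map. `IncludeGraph` and `collectSources` are hypothetical names, and how the real script discovers includes is not shown in this PR description.

```typescript
// Hypothetical input: file path -> files it pulls in via <include>.
type IncludeGraph = ReadonlyMap<string, readonly string[]>;

/** Return the page plus every file it includes, directly or transitively. */
function collectSources(page: string, includes: IncludeGraph): string[] {
  const seen = new Set<string>();
  const stack = [page];
  while (stack.length > 0) {
    const file = stack.pop();
    // The seen-set also guards against include cycles.
    if (file === undefined || seen.has(file)) continue;
    seen.add(file);
    for (const dep of includes.get(file) ?? []) stack.push(dep);
  }
  return Array.from(seen).sort();
}
```

The page's date would then be the most recent commit date across the returned files, so an edit to a shared file bumps every page that reaches it.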

src/lib/sitemap.ts is refactored into pure helpers: getSitemapSourceEntries (dedupe, static landing priorities, source path mapping), attachLastModified (injected date resolution), and XML that only emits <lastmod> when a date is known. The TanStack sitemap[.]xml worker route is removed; tests in seo-routes.test.ts cover the new API.

Reviewed by Cursor Bugbot for commit c5373bd.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a400d681f1

Comment thread scripts/generate-sitemap.ts

cloudflare-workers-and-pages Bot commented Jun 25, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Deployment successful! | superwall-docs | c5373bd | Jun 26 2026, 10:46 PM |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d83f1de65f

Comment thread scripts/generate-sitemap.ts
Comment thread scripts/generate-sitemap.ts
Replace the request-time sitemap route (which stamped every URL with new
Date() on each crawl, training Google to ignore <lastmod>) with a
build-time static sitemap whose <lastmod> comes from real git history.

- New scripts/generate-sitemap.ts (runs in the build chain): one git log
  pass for per-file dates, resolves <include> deps into content/shared so
  shared edits bump the right pages, and serves dist/client/docs/sitemap.xml.
- Shallow clones omit <lastmod> rather than publish one wrong date; git
  failures degrade gracefully instead of breaking the build.
- src/lib/sitemap.ts refactored to pure, testable, worker-safe helpers.
- Remove runtime route src/routes/sitemap[.]xml.ts (regenerates routeTree).

Two follow-ups from PR review on the sitemap generator:

- Deploy environments (Cloudflare Workers Builds) shallow-clone with no
  fetch-depth setting, which left the deployed sitemap with no <lastmod>.
  Detect a shallow clone and deepen it with 'git fetch --unshallow'
  (anonymous; the repo is public). Falls back to omitting <lastmod> if
  history still can't be obtained — never fails the build.
- /docs/changelog renders <ChangelogTimeline/>, which imports the committed
  src/lib/changelog-entries.json. Add that data file as a supplemental source
  so changelog regenerations bump the page's date.
@dcrawbuck dcrawbuck force-pushed the dcrawbuck/raleigh-v2 branch from d83f1de to 622c80a Compare June 26, 2026 00:12

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 622c80a.

Comment thread scripts/generate-sitemap.ts Outdated
If 'git rev-parse --is-shallow-repository' itself errors, the old code
treated the repo as non-shallow and still computed dates — which on a
shallow clone could publish clustered, misleading <lastmod> values. Now an
unreadable depth probe omits <lastmod> instead, matching the script's
'never publish wrong dates' policy. Also drops the redundant post-fetch
re-check (a successful --unshallow already implies full history) and the
now-unused isShallowRepository helper.

Addresses Cursor Bugbot review on PR #222.