Add diagram anchor integrity check (EDUENG-613)#23138
ebembi-crdb wants to merge 1 commit into main
Adds a script and daily workflow that scan doc files for remote_include tags pulling from cockroachdb/generated-diagrams grammar_svg, fetch each referenced diagram HTML, and verify that every sql-grammar.html#ANCHOR link inside it resolves against stmt_block.html on the same branch.

This catches the exact failure that blocked production builds on 2026-01-29: show_statement_hints.html referenced sql-grammar.html#opt_with_show_hints_options but that anchor did not yet exist in stmt_block.html on release-26.1.

Files added:
- .github/scripts/validate_diagram_anchors.py
- .github/workflows/validate-diagram-anchors.yml
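The anchor check described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `ANCHOR_LINK_RE` and `find_missing_anchors` are hypothetical, and the sketch assumes the set of ids from stmt_block.html has already been collected elsewhere.

```python
from __future__ import annotations

import re

# Hypothetical pattern: captures the fragment of every sql-grammar.html#ANCHOR link.
ANCHOR_LINK_RE = re.compile(r"sql-grammar\.html#([A-Za-z0-9_\-]+)")


def find_missing_anchors(diagram_html: str, known_anchors: set[str]) -> list[str]:
    """Return anchors linked from the diagram that are absent from known_anchors."""
    referenced = set(ANCHOR_LINK_RE.findall(diagram_html))
    return sorted(referenced - known_anchors)


# Made-up diagram snippet modelled on the 2026-01-29 failure.
diagram = (
    '<a xlink:href="sql-grammar.html#opt_with_show_hints_options">opts</a>'
    '<a xlink:href="sql-grammar.html#table_name">table_name</a>'
)
missing = find_missing_anchors(diagram, {"table_name"})
print(missing)  # ['opt_with_show_hints_options']
```

A non-empty result is what would make the check fail for that diagram and branch.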
import urllib.request
from pathlib import Path
from typing import Optional
The script fetches the diagram HTML and stmt_block.html via raw.githubusercontent.com and adds an Authorization: Bearer header if GITHUB_TOKEN is present. The raw host does not handle GitHub API auth tokens the same way the API does; adding the header is benign, but it may not help with private repos or rate limits. For authenticated, more robust fetches, either:
- Use the GitHub REST API endpoint GET /repos/{owner}/{repo}/contents/{path}?ref={branch}, which returns base64-encoded content and works with Authorization: Bearer <GITHUB_TOKEN>.
- Or check that the raw URLs will not be rate-limited for your usage pattern (daily + PR checks is probably fine).
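A minimal stdlib-only sketch of the REST API option, for illustration. The helper names (`build_contents_request`, `decode_contents_payload`, `fetch_file`) are hypothetical and not part of the PR. It assumes the default JSON response with an inline base64 `content` field; for large files the API may not inline the content, in which case requesting the raw media type would be one alternative to look into.

```python
import base64
import json
import urllib.parse
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def build_contents_request(owner: str, repo: str, path: str, ref: str,
                           token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for /repos/{owner}/{repo}/contents/{path}?ref={ref}."""
    url = (f"{API_ROOT}/repos/{owner}/{repo}/contents/"
           f"{urllib.parse.quote(path)}?ref={urllib.parse.quote(ref, safe='')}")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def decode_contents_payload(payload: dict) -> str:
    """Decode the base64 'content' field of a contents API response."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    # b64decode ignores the line breaks GitHub inserts into the content field.
    return base64.b64decode(payload["content"]).decode("utf-8")


def fetch_file(owner: str, repo: str, path: str, ref: str,
               token: Optional[str] = None, timeout: float = 30.0) -> str:
    """Fetch and decode one file (performs a network call)."""
    req = build_contents_request(owner, repo, path, ref, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decode_contents_payload(json.load(resp))
```

Usage would look something like `fetch_file("cockroachdb", "generated-diagrams", "grammar_svg/stmt_block.html", "release-26.1", os.environ.get("GITHUB_TOKEN"))`, where the file path is an assumption about the repo layout.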
# ---------------------------------------------------------------------------
# Parsing helpers
# ---------------------------------------------------------------------------
get_stmt_block_anchors currently uses a re.findall regex to match id="..." attributes. This works in most cases but is fragile on edge-case HTML (e.g., attributes split across newlines, or id-like text inside HTML comments or inline scripts). Since you explicitly avoid external deps, consider using Python's stdlib html.parser to collect id attributes reliably.
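One possible shape for the html.parser approach, as a sketch only. `IdCollector` and `collect_ids` are hypothetical names, not the PR's code, and the sample HTML is made up to exercise the edge cases mentioned above.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects the value of every id attribute found on real start tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased by the parser.
        # Self-closing tags also reach this method via the default handle_startendtag.
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def collect_ids(html_text: str) -> set:
    """Return the set of id attribute values in html_text."""
    parser = IdCollector()
    parser.feed(html_text)
    parser.close()
    return parser.ids


sample = """
<a id="real">x</a>
<!-- <a id="ghost">commented out</a> -->
<script>var s = 'id="fake"';</script>
<a
   id='multi'>y</a>
<g id="selfclose"/>
"""
print(sorted(collect_ids(sample)))  # ['multi', 'real', 'selfclose']
```

Text inside comments and script bodies is routed to other handlers, so only ids on actual elements are collected.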
Summary
Adds a script and GitHub Actions workflow that catch broken sql-grammar.html#anchor references in SQL diagram files before they block production builds.

How it works:
- Scans doc files for {% remote_include ... crdb_branch_name .../grammar_svg/DIAGRAM.html %} tags
- Fetches each referenced diagram HTML from cockroachdb/generated-diagrams
- Verifies that every sql-grammar.html#ANCHOR link inside it resolves against stmt_block.html on the same branch

Files added:
- .github/scripts/validate_diagram_anchors.py: stdlib-only Python script, no pip installs required
- .github/workflows/validate-diagram-anchors.yml: triggers on *.md changes in PRs (changed files only) and runs a full scan daily at 07:15 UTC

On PR failure: posts/updates a bot comment with the broken diagram, branch, and anchor, plus which doc files reference it. Blocks merge.

On scheduled failure: opens a GitHub issue (or updates an existing open one) with label sql-diagram-validation.

Context
Addresses EDUENG-613, part of the follow-up to the production build outage on 2026-01-29 discussed in #docs-site-status.
Root cause of that outage: show_statement_hints.html was added with a link to sql-grammar.html#opt_with_show_hints_options, but that anchor didn't yet exist in stmt_block.html on release-26.1. This check would have caught it before merge.

The branch existence check (EDUENG-614) is in a separate PR: #23137
Test plan
- Run python .github/scripts/validate_diagram_anchors.py src/current/v26.1/show-statement-hints.md from repo root
- Confirm the workflow triggers on a .md change in a test PR and only scans changed files