Strip duplicated JSON blocks from component analysis Markdown #13
Merged
ethenotethan merged 1 commit into main on Apr 22, 2026
The component-analyzer emitted two fenced JSON blocks in every
`.md` — `## Citations` and `## Analysis Data` — that were also
redundant with machine-readable artefacts:
- `## Citations` was lifted into `{component}.citations.json` by the
citation extractor, which enriches it with `component`,
`source_repo`, `source_commit`, `citation_count`, and per-citation
`source_url`. The sidecar is strictly richer than the Markdown
block, so the copy in the `.md` was pure duplication.
- `## Analysis Data` was never consumed anywhere downstream (no
extractor, no reader, no test) — pure dead weight.
The Markdown is meant for humans; mixing in duplicated machine output
just adds noise (api.md lost 94 lines, ~3.7 KB). Strip both sections
from the Markdown after the sidecar is successfully written, so the
`.md` stays prose + Mermaid + structured human-facing sections only.
Stripping is gated on successful extraction: if the `## Citations`
block fails to parse, the block is left in place so the malformed
content is still visible for debugging.
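The gating described above can be sketched roughly as follows. This is a minimal illustration, not the actual `build_citations_index` code: the function name `extract_then_strip`, the regex, the `strip` callable parameter, and the `citations` key in the sidecar are all assumptions made for the example.

```python
import json
import re
from pathlib import Path
from typing import Callable

# Assumed layout: a "## Citations" heading followed by one fenced JSON block.
_CITATIONS_BLOCK = re.compile(
    r"^## Citations\s+```json\s*\n(.*?)\n```", re.DOTALL | re.MULTILINE
)


def extract_then_strip(md_path: Path, strip: Callable[[str], str]) -> bool:
    """Write the sidecar first; rewrite the Markdown only if that succeeded."""
    md = md_path.read_text(encoding="utf-8")
    match = _CITATIONS_BLOCK.search(md)
    if match is None:
        return False  # nothing to extract, Markdown untouched
    try:
        citations = json.loads(match.group(1))
    except json.JSONDecodeError:
        return False  # malformed block stays in the .md for debugging
    # Hypothetical sidecar shape; the real one carries more fields.
    sidecar = md_path.with_name(md_path.stem + ".citations.json")
    payload = {
        "component": md_path.stem,
        "citation_count": len(citations),
        "citations": citations,
    }
    sidecar.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    # Only reached after the sidecar write succeeded.
    md_path.write_text(strip(md), encoding="utf-8")
    return True
```

Ordering the sidecar write before the Markdown rewrite means a parse failure never leaves the repository with neither copy of the citations.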
- agent/utils/citation_extractor.py: new public
`strip_extracted_sections(md)` helper (idempotent, collapses blank
runs, normalises trailing newline). `build_citations_index` invokes
it in-place after each successful per-component sidecar write.
- agent/discovery/validator.py: `_validate_citations_in_file` ->
`_validate_citations_sidecar`. Reads the `.citations.json`
sidecar instead of regex-parsing the Markdown (since the block is
no longer guaranteed to be there). Spot-check logic unchanged — the
citation dict shape in the sidecar is identical to the former
Markdown JSON.
- tests/test_citations.py: 7 new tests covering the strip helper
directly (round-trip strip, idempotence, `## Analysis Summary`
false-positive guard, no-op on clean input, trailing-newline
normalisation) and the end-to-end `build_citations_index`
contract (MD stripped only when sidecar is written; MD untouched
when extraction fails).
All 254 tests pass.
0xClandestine approved these changes on Apr 22, 2026
Summary
The component-analyzer emits two fenced JSON blocks in every `.md`, `## Citations` and `## Analysis Data`, that duplicate (or go unread by) other artefacts. Strip them from the Markdown after extraction so the human-facing analysis stays prose-only.
- `## Citations` is already lifted into `{component}.citations.json` by the citation extractor, with richer content (`source_repo`, `source_commit`, `citation_count`, per-citation `source_url`). The Markdown copy was pure duplication.
- `## Analysis Data` was never consumed anywhere downstream: no extractor, no reader, no test. Pure dead weight.
Details
agent/utils/citation_extractor.py
- New `strip_extracted_sections(md)` helper: idempotent, collapses blank-line runs left behind, normalises trailing newline.
- `build_citations_index` invokes it in-place after each successful per-component sidecar write. Gated on success: if the `## Citations` block fails to parse, the block is left in the `.md` so malformed content is still visible for debugging.
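For illustration, a helper with the properties listed above (idempotent, collapses blank runs, normalises the trailing newline, leaves `## Analysis Summary` alone) might look like the sketch below. It is an assumption-laden approximation, not the code merged in this PR: the line-by-line approach, the `_EXTRACTED_HEADINGS` constant, and the rule that any level-1 or level-2 heading ends a stripped section are choices made for the example.

```python
import re

# Heading names taken from the PR text; matched exactly, not by prefix.
_EXTRACTED_HEADINGS = ("## Citations", "## Analysis Data")


def strip_extracted_sections(md: str) -> str:
    """Drop the extracted sections, collapse blank runs, end with one newline."""
    kept = []
    skipping = False
    in_fence = False
    for line in md.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            # Track fences so '#' lines inside code are not read as headings.
            in_fence = not in_fence
        elif not in_fence and re.match(r"#{1,2}\s", line):
            # Exact comparison keeps "## Analysis Summary" in the output.
            skipping = stripped in _EXTRACTED_HEADINGS
        if not skipping:
            kept.append(line)
    text = "\n".join(kept)
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip("\n") + "\n"  # exactly one trailing newline
```

Because the output contains neither heading and is already normalised, running the function on its own output returns the same string.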
agent/discovery/validator.py
- `_validate_citations_in_file` → `_validate_citations_sidecar`. Reads `{component}.citations.json` instead of regex-parsing the Markdown (since the block is no longer guaranteed to be there after extraction).
- Spot-check logic unchanged: the citation dict shape in the sidecar is identical to the former Markdown JSON.
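Reading citations from the sidecar rather than the Markdown could be sketched as below. The function name `load_sidecar_citations` and the top-level `citations` key are hypothetical; the PR only states that the sidecar carries `component`, `source_repo`, `source_commit`, `citation_count`, and per-citation `source_url`.

```python
import json
from pathlib import Path


def load_sidecar_citations(sidecar_path: Path) -> list[dict]:
    """Return the citation dicts from a sidecar, or [] if missing or malformed."""
    try:
        data = json.loads(sidecar_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return []
    # Assumed key name; adjust to the real sidecar schema.
    citations = data.get("citations", []) if isinstance(data, dict) else []
    if not isinstance(citations, list):
        return []
    # Keep only dict entries so later spot-checks can index into them safely.
    return [c for c in citations if isinstance(c, dict)]
```

Since the citation dict shape is stated to be unchanged, any spot-check that previously consumed the parsed Markdown JSON should be able to take this list as-is.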
tests/test_citations.py
Seven new tests covering:
- The strip helper directly (round-trip strip, idempotence, `## Analysis Summary` as a false-positive guard, no-op on clean input, trailing-newline normalisation).
- The end-to-end `build_citations_index` contract: Markdown stripped only when the sidecar was successfully written; Markdown untouched when extraction fails.
Impact
reductions across all 11 components in the omlx run.
All 254 tests pass (`.venv/bin/python -m pytest -x -q`).
`## Citations` block is present (and enriched) in the sidecar JSON.
Out of scope
Noticed while investigating but not fixed here: the architecture documenter emits the raw `templates/architecture_template.md` largely unfilled (placeholders like `{For each: ...}` left in place, the illustrative "User Registration Flow" example copied verbatim). Separate issue, separate PR.