Strip duplicated JSON blocks from component analysis Markdown by ethenotethan · Pull Request #13 · Layr-Labs/github-flashlight

ethenotethan · 2026-04-22T16:37:14Z

Summary

The component-analyzer has been emitting two fenced JSON blocks in every
`.md`, under `## Citations` and `## Analysis Data`. The first duplicates
another artefact and the second is never read. Strip both from the
Markdown after extraction so the human-facing analysis stays prose-only.

- `## Citations` is already lifted into `{component}.citations.json` by
  the citation extractor, with richer content (`source_repo`,
  `source_commit`, `citation_count`, per-citation `source_url`). The
  Markdown copy was pure duplication.
- `## Analysis Data` was never consumed anywhere downstream: no
  extractor, no reader, no test. Pure dead weight.

Details

`agent/utils/citation_extractor.py`

- New public `strip_extracted_sections(md)` helper: idempotent, collapses
  blank-line runs left behind, normalises the trailing newline.
- `build_citations_index` invokes it in-place after each successful
  per-component sidecar write. Stripping is gated on success: if the
  `## Citations` block fails to parse, the block is left in the `.md` so
  the malformed content is still visible for debugging.
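
For readers who want a concrete picture of the helper's contract, here is a minimal sketch of what a function like `strip_extracted_sections` could look like. It is not the code from this PR: the regex, the assumption that each section runs until the next `## ` heading or end of file, and the blank-line handling are all illustrative guesses built from the description above.

```python
import re

# Assumption: each machine-readable section starts at an exact H2 heading and
# runs until the next H2 heading or the end of the file. The heading names come
# from the PR description; the pattern itself is hypothetical.
_EXTRACTED_SECTION = re.compile(
    r"^## (?:Citations|Analysis Data)[ \t]*$"  # exact heading line only
    r".*?"                                      # section body, fenced JSON included
    r"(?=^## |\Z)",                             # stop before the next H2 or at EOF
    re.MULTILINE | re.DOTALL,
)


def strip_extracted_sections(md: str) -> str:
    """Remove the two JSON sections and tidy the whitespace left behind."""
    text = _EXTRACTED_SECTION.sub("", md)
    # Collapse runs of blank lines created by the removal.
    # Simplification: this also touches blank runs elsewhere in the file.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Normalise to exactly one trailing newline (empty input stays empty).
    text = text.rstrip()
    return text + "\n" if text else ""
```

Because the heading must match exactly, a section such as `## Analysis Summary` is left alone, and running the function on its own output returns the same string.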

`agent/discovery/validator.py`

- `_validate_citations_in_file` is renamed to `_validate_citations_sidecar`.
  It reads `{component}.citations.json` instead of regex-parsing the
  Markdown, since the block is no longer guaranteed to be there after
  extraction.
- Spot-check logic is unchanged: the citation dict shape in the sidecar
  is identical to the former Markdown JSON.
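
As a rough illustration of the sidecar-based read, the sketch below loads citations for a component from its JSON sidecar. The function name `load_sidecar_citations`, the assumption that the sidecar sits next to the `.md`, and the top-level `"citations"` key are all hypothetical; the PR only states which metadata fields the sidecar carries.

```python
import json
from pathlib import Path


def load_sidecar_citations(md_path: Path) -> list[dict]:
    """Return the citation dicts stored in the sidecar next to md_path.

    Assumptions (not confirmed by the PR): the sidecar is named
    `<stem>.citations.json`, lives beside the Markdown file, and keeps
    its citation list under a top-level "citations" key.
    """
    sidecar = md_path.with_name(md_path.stem + ".citations.json")
    if not sidecar.exists():
        return []  # nothing extracted for this component
    try:
        payload = json.loads(sidecar.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return []  # treat a corrupt sidecar as "no citations to check"
    citations = payload.get("citations", []) if isinstance(payload, dict) else []
    # Keep only dict entries so downstream spot-checks can index by key.
    return [c for c in citations if isinstance(c, dict)]
```

A validator built this way no longer depends on the Markdown layout, which is what allows the JSON blocks to be removed from the `.md`.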

`tests/test_citations.py`

Seven new tests covering:

- Strip helper behaviour: strips both sections, is idempotent, preserves
  `## Analysis Summary` (a false-positive guard), is a no-op on clean
  input, and normalises the trailing newline.
- End-to-end `build_citations_index` contract: Markdown is stripped only
  when the sidecar was successfully written; Markdown is left untouched
  when extraction fails.
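
The end-to-end contract can be sketched as follows. This is an illustration under stated assumptions, not the real `build_citations_index`: the function name `extract_then_strip`, the fenced-block regex, the sidecar layout, and passing the strip function in as a parameter are all hypothetical choices made to keep the example self-contained.

```python
import json
import re
from pathlib import Path
from typing import Callable

# Assumption: the citations live in the first fenced json block directly
# under an exact "## Citations" heading.
_CITATIONS_BLOCK = re.compile(
    r"^## Citations[ \t]*\n+```json\n(.*?)\n```",
    re.MULTILINE | re.DOTALL,
)


def extract_then_strip(md_path: Path, strip: Callable[[str], str]) -> bool:
    """Write the sidecar, then strip the Markdown only if that succeeded."""
    text = md_path.read_text(encoding="utf-8")
    match = _CITATIONS_BLOCK.search(text)
    if match is None:
        return False  # no block found: nothing to extract, nothing to strip
    try:
        citations = json.loads(match.group(1))
    except json.JSONDecodeError:
        return False  # malformed block stays in the .md for debugging
    if not isinstance(citations, list):
        return False  # unexpected shape: also leave the Markdown alone
    sidecar = md_path.with_name(md_path.stem + ".citations.json")
    sidecar.write_text(
        json.dumps({"citation_count": len(citations), "citations": citations}, indent=2),
        encoding="utf-8",
    )
    # Only reached after a successful sidecar write.
    md_path.write_text(strip(text), encoding="utf-8")
    return True
```

Putting the Markdown rewrite last means every failure path returns before the `.md` is modified, which is the property the two end-to-end tests check.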

Impact

- `api.md` dropped from 209 to 115 lines after cleaning (94 lines,
  ~3.7 KB removed). Similar reductions across all 11 components in the
  omlx run.
- All 254 tests pass (`.venv/bin/python -m pytest -x -q`).
- No information loss: every field in the former Markdown `## Citations`
  block is present (and enriched) in the sidecar JSON.

Out of scope

Noticed while investigating but not fixed here: the architecture
documenter emits the raw templates/architecture_template.md largely
unfilled (placeholders like {For each: ...} left in place, the
illustrative "User Registration Flow" example copied verbatim).
Separate issue, separate PR.

The component-analyzer emitted two fenced JSON blocks in every `.md`, `## Citations` and `## Analysis Data`, that were redundant with machine-readable artefacts:

- `## Citations` was lifted into `{component}.citations.json` by the citation extractor, which enriches it with `component`, `source_repo`, `source_commit`, `citation_count`, and per-citation `source_url`. The sidecar is strictly richer than the Markdown block, so the copy in the `.md` was pure duplication.
- `## Analysis Data` was never consumed anywhere downstream (no extractor, no reader, no test): pure dead weight.

The Markdown is meant for humans; mixing in duplicated machine output just adds noise (`api.md` lost 94 lines, ~3.7 KB). Strip both sections from the Markdown after the sidecar is successfully written, so the `.md` stays prose + Mermaid + structured human-facing sections only.

Stripping is gated on successful extraction: if the `## Citations` block fails to parse, the block is left in place so the malformed content is still visible for debugging.

- agent/utils/citation_extractor.py: new public `strip_extracted_sections(md)` helper (idempotent, collapses blank runs, normalises trailing newline). `build_citations_index` invokes it in-place after each successful per-component sidecar write.
- agent/discovery/validator.py: `_validate_citations_in_file` -> `_validate_citations_sidecar`. Reads the `.citations.json` sidecar instead of regex-parsing the Markdown (since the block is no longer guaranteed to be there). Spot-check logic unchanged: the citation dict shape in the sidecar is identical to the former Markdown JSON.
- tests/test_citations.py: 7 new tests covering the strip helper directly (round-trip strip, idempotence, `## Analysis Summary` false-positive guard, no-op on clean input, trailing-newline normalisation) and the end-to-end `build_citations_index` contract (MD stripped only when sidecar is written; MD untouched when extraction fails).

All 254 tests pass.

0xClandestine self-requested a review April 22, 2026 16:40

0xClandestine approved these changes Apr 22, 2026


ethenotethan merged commit 3d73793 into main Apr 22, 2026
9 checks passed

ethenotethan merged 1 commit into main from dedupe-citations-md

2 participants
