
Strip duplicated JSON blocks from component analysis Markdown#13

Merged: ethenotethan merged 1 commit into main from dedupe-citations-md on Apr 22, 2026

Conversation

@ethenotethan
Collaborator

Summary

  • The component-analyzer has been emitting two fenced JSON blocks in every
    .md (## Citations and ## Analysis Data) that duplicate, or go unread by,
    other artefacts. Strip them from the Markdown after extraction so the
    human-facing analysis stays prose-only.
  • ## Citations is already lifted into {component}.citations.json by the
    citation extractor, with richer content (source_repo, source_commit,
    citation_count, per-citation source_url). The Markdown copy was pure
    duplication.
  • ## Analysis Data was never consumed anywhere downstream — no extractor,
    no reader, no test. Pure dead weight.

Details

agent/utils/citation_extractor.py

  • New public strip_extracted_sections(md) helper: idempotent, collapses
    blank-line runs left behind, normalises trailing newline.
  • build_citations_index invokes it in-place after each successful
    per-component sidecar write. Gated on success — if the ## Citations
    block fails to parse, the block is left in the .md so malformed
    content is still visible for debugging.
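
For illustration, the helper's described behaviour (strip both sections, collapse blank-line runs, normalise the trailing newline, stay idempotent) could be sketched as below. This is a minimal sketch, not the merged code: the regex, the `FENCE` constant, and the assumption that each section is a heading followed by exactly one fenced block are all hypothetical.

```python
import re

# Built programmatically so this example contains no literal fence line.
FENCE = "`" * 3

# Assumed layout: an exact "## Citations" or "## Analysis Data" heading,
# optional blank lines, then one fenced block. "## Analysis Summary" does
# not match because the heading text must match exactly.
_SECTION_RE = re.compile(
    r"^## (?:Citations|Analysis Data)[ \t]*\n"   # heading alone on its line
    r"\s*" + re.escape(FENCE) + r"[^\n]*\n"      # opening fence, e.g. with a json tag
    r".*?^" + re.escape(FENCE) + r"[ \t]*$\n?",  # body up to the closing fence line
    re.DOTALL | re.MULTILINE,
)


def strip_extracted_sections(md: str) -> str:
    """Remove the two machine-readable sections from a Markdown string."""
    stripped = _SECTION_RE.sub("", md)
    # Collapse the blank-line runs left behind by the removed sections.
    stripped = re.sub(r"\n{3,}", "\n\n", stripped)
    # Normalise to exactly one trailing newline.
    return stripped.rstrip("\n") + "\n"
```

Because the output contains no matching section and already ends in a single newline, applying the function a second time returns the same string.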

agent/discovery/validator.py

  • _validate_citations_in_file -> _validate_citations_sidecar. Reads
    {component}.citations.json instead of regex-parsing the Markdown
    (since the block is no longer guaranteed to be there after extraction).
  • Spot-check logic unchanged — the citation dict shape in the sidecar is
    identical to the former Markdown JSON.
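
As a rough sketch of what reading the sidecar instead of the Markdown might look like: the function names below are hypothetical, and the top-level `citations` key is an assumption, since the PR only names `source_repo`, `source_commit`, `citation_count`, and per-citation `source_url`.

```python
import json
from pathlib import Path


def sidecar_path_for(md_path: Path) -> Path:
    """Map e.g. out/api.md to out/api.citations.json (assumed naming)."""
    return md_path.with_name(f"{md_path.stem}.citations.json")


def load_sidecar_citations(md_path: Path) -> list:
    """Return the citation dicts from the sidecar, or [] if unavailable."""
    sidecar = sidecar_path_for(md_path)
    if not sidecar.is_file():
        return []
    try:
        payload = json.loads(sidecar.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return []
    # "citations" is an assumed key name for the per-citation list.
    citations = payload.get("citations", []) if isinstance(payload, dict) else []
    return [c for c in citations if isinstance(c, dict)]
```

Since the citation dict shape is stated to be unchanged, any existing spot-check could consume the returned list as it previously consumed the parsed Markdown JSON.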

tests/test_citations.py

Seven new tests covering:

  • Strip helper behaviour (strips both sections, idempotent, preserves
    ## Analysis Summary as a false-positive guard, no-op on clean input,
    trailing-newline normalisation).
  • End-to-end build_citations_index contract: Markdown stripped only
    when the sidecar was successfully written; Markdown untouched when
    extraction fails.
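
The gating contract these tests pin down can be illustrated with a small stand-alone sketch. The function name, its callable parameters, and the sidecar payload shape are assumptions made for the example; the real `build_citations_index` is not shown in this PR description.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def write_sidecar_then_strip(
    md_path: Path,
    sidecar_path: Path,
    parse_citations: Callable[[str], Optional[list]],  # returns None on parse failure
    strip: Callable[[str], str],
) -> bool:
    """Strip the Markdown in place only after the sidecar has been written."""
    md = md_path.read_text(encoding="utf-8")
    citations = parse_citations(md)
    if citations is None:
        # Extraction failed: leave the Markdown untouched so the malformed
        # block stays visible for debugging.
        return False
    # Hypothetical, minimal payload; the real sidecar carries more fields.
    payload = {"citation_count": len(citations), "citations": citations}
    sidecar_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    md_path.write_text(strip(md), encoding="utf-8")
    return True
```

Ordering the sidecar write before the Markdown rewrite means a failure at either step never leaves the citations missing from both files.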

Impact

  • api.md dropped from 209 → 115 lines (~3.7 KB) after cleaning. Similar
    reductions across all 11 components in the omlx run.
  • All 254 tests pass (.venv/bin/python -m pytest -x -q).
  • No information loss: every field in the former Markdown ## Citations
    block is present (and enriched) in the sidecar JSON.

Out of scope

Noticed while investigating but not fixed here: the architecture
documenter emits the raw templates/architecture_template.md largely
unfilled (placeholders like {For each: ...} left in place, the
illustrative "User Registration Flow" example copied verbatim).
Separate issue, separate PR.

The component-analyzer emitted two fenced JSON blocks in every
`.md` — `## Citations` and `## Analysis Data` — that were also
redundant with machine-readable artefacts:

- `## Citations` was lifted into `{component}.citations.json` by the
  citation extractor, which enriches it with `component`,
  `source_repo`, `source_commit`, `citation_count`, and per-citation
  `source_url`. The sidecar is strictly richer than the Markdown
  block, so the copy in the `.md` was pure duplication.
- `## Analysis Data` was never consumed anywhere downstream (no
  extractor, no reader, no test) — pure dead weight.

The Markdown is meant for humans; mixing in duplicated machine output
just adds noise (api.md lost 94 lines, ~3.7 KB). Strip both sections
from the Markdown after the sidecar is successfully written, so the
`.md` stays prose + Mermaid + structured human-facing sections only.

Stripping is gated on successful extraction: if the `## Citations`
block fails to parse, the block is left in place so the malformed
content is still visible for debugging.

- agent/utils/citation_extractor.py: new public
  `strip_extracted_sections(md)` helper (idempotent, collapses blank
  runs, normalises trailing newline). `build_citations_index` invokes
  it in-place after each successful per-component sidecar write.
- agent/discovery/validator.py: `_validate_citations_in_file` ->
  `_validate_citations_sidecar`. Reads the `.citations.json`
  sidecar instead of regex-parsing the Markdown (since the block is
  no longer guaranteed to be there). Spot-check logic unchanged — the
  citation dict shape in the sidecar is identical to the former
  Markdown JSON.
- tests/test_citations.py: 7 new tests covering the strip helper
  directly (round-trip strip, idempotence, `## Analysis Summary`
  false-positive guard, no-op on clean input, trailing-newline
  normalisation) and the end-to-end `build_citations_index`
  contract (MD stripped only when sidecar is written; MD untouched
  when extraction fails).

All 254 tests pass.
@0xClandestine 0xClandestine self-requested a review April 22, 2026 16:40
@ethenotethan ethenotethan merged commit 3d73793 into main Apr 22, 2026
9 checks passed
2 participants