
fix: use full URL path in source display names #2

Open
swinney wants to merge 2 commits into dev from fix/source-display-name-truncation

@swinney swinney commented Apr 10, 2026

Summary

  • Bug: ScrapedResource._format_link_display() truncated URL paths to only the first segment, making all source citations from the same domain look identical
  • Fix: Use the full URL path in the display name instead of just the first path segment
  • Impact: Source citations in chat responses now show the actual page (e.g. docs.rc.fas.harvard.edu/kb/abaqus) instead of a generic prefix (docs.rc.fas.harvard.edu/kb)

Root Cause

In src/data_manager/collectors/scrapers/scraped_resource.py:66, the display name was built by splitting the URL path on / and taking only element [0]:

first_path = parsed_link.path.strip('/').split('/')[0]
display_name += f"/{first_path}"

For a URL like https://docs.rc.fas.harvard.edu/kb/cluster-storage/, the path /kb/cluster-storage/ was reduced to just kb, producing the display name docs.rc.fas.harvard.edu/kb for every single FASRC docs page.
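The difference between the two behaviors can be sketched as a pair of standalone functions. This is a minimal illustration, not the actual method body: the function names are hypothetical, and it assumes the display name starts from the URL's host as returned by `urllib.parse.urlparse`, which the PR does not show.

```python
from urllib.parse import urlparse


def display_name_first_segment(link: str) -> str:
    """Old behavior (sketch): keep only the first path segment."""
    parsed_link = urlparse(link)
    display_name = parsed_link.netloc  # assumption: the name starts from the host
    first_path = parsed_link.path.strip('/').split('/')[0]
    if first_path:
        display_name += f"/{first_path}"
    return display_name


def display_name_full_path(link: str) -> str:
    """New behavior (sketch): keep the whole path, minus surrounding slashes."""
    parsed_link = urlparse(link)
    display_name = parsed_link.netloc  # assumption: the name starts from the host
    full_path = parsed_link.path.strip('/')
    if full_path:
        display_name += f"/{full_path}"
    return display_name


url = "https://docs.rc.fas.harvard.edu/kb/cluster-storage/"
print(display_name_first_segment(url))
print(display_name_full_path(url))
```

Under these assumptions, the first call yields the truncated name and the second yields the page-specific one described above.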

Before/After

| URL | Before | After |
|---|---|---|
| .../kb/abaqus/ | docs.rc.fas.harvard.edu/kb | docs.rc.fas.harvard.edu/kb/abaqus |
| .../kb/cluster-storage/ | docs.rc.fas.harvard.edu/kb | docs.rc.fas.harvard.edu/kb/cluster-storage |
| .../kb/running-jobs/ | docs.rc.fas.harvard.edu/kb | docs.rc.fas.harvard.edu/kb/running-jobs |

Notes

  • The full URL was always stored correctly in extra["url"] metadata, so no data was lost; only the display_name used for chat citations was affected
  • Existing documents will need to be re-ingested for corrected display names to appear
  • Tested on our FASRC deployment only

Test plan

  • Re-ingest documents after deploying the fix
  • Verify chat responses now show distinct source citations per page
  • Verify other source types (Slurm docs, git repos) also display correctly
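One way to make the "distinct source citations per page" check less manual is to look for display names shared by more than one URL. The helper below is a hypothetical sketch: its name and the `(url, display_name)` pair format are assumptions for illustration, not part of the project's API.

```python
from collections import defaultdict


def find_colliding_display_names(sources):
    """Return display names that are shared by more than one distinct URL.

    `sources` is an iterable of (url, display_name) pairs (assumed format).
    """
    urls_by_name = defaultdict(set)
    for url, display_name in sources:
        urls_by_name[display_name].add(url)
    # Keep only the names that map to several different pages.
    return {
        name: sorted(urls)
        for name, urls in urls_by_name.items()
        if len(urls) > 1
    }


sources_after_fix = [
    ("https://docs.rc.fas.harvard.edu/kb/abaqus/",
     "docs.rc.fas.harvard.edu/kb/abaqus"),
    ("https://docs.rc.fas.harvard.edu/kb/cluster-storage/",
     "docs.rc.fas.harvard.edu/kb/cluster-storage"),
]
print(find_colliding_display_names(sources_after_fix))
```

An empty result means every page has its own display name; with the pre-fix names, both URLs would be grouped under docs.rc.fas.harvard.edu/kb.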

🤖 Generated with Claude Code

Austin Swinney and others added 2 commits April 10, 2026 17:16
_format_link_display() was truncating the URL path to only the first
segment, causing all scraped sources to show an identical, unhelpful
display name. For example, every FASRC docs page appeared as
"docs.rc.fas.harvard.edu/kb" regardless of the actual page, because
the code split the path on "/" and took only element [0].

Before: docs.rc.fas.harvard.edu/kb  (for /kb/abaqus/)
Before: docs.rc.fas.harvard.edu/kb  (for /kb/cluster-storage/)
After:  docs.rc.fas.harvard.edu/kb/abaqus
After:  docs.rc.fas.harvard.edu/kb/cluster-storage

This made it impossible for users to identify which documentation page
a response was sourced from, since every citation looked identical.

Note: the full URL was always stored correctly in the document metadata
(extra["url"]), so no data was lost. Only the display_name used for
citations in chat responses was affected. Existing documents will need
to be re-ingested to pick up corrected display names.

Tested on our FASRC deployment only. Other deployments with multi-segment
URL paths will also benefit from this fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Configure the marked.js renderer to add target="_blank" and
rel="noopener noreferrer" to all links in chat responses. This
ensures source citations open in a new browser tab instead of
navigating away from the chat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>