Monorepo for PDF ingestion, bibliography extraction, metadata enrichment, and download queueing.
- `frontend/`: Svelte + Vite app
- `backend/`: Node.js API and orchestration routes
- `backend/scripts/daemon/worker.py`: queue consumer daemon
- `dl_lit_project/`: canonical Python pipeline package (`dl_lit`)
- `dl_lit/`: legacy scripts (not canonical runtime)
The app is DB-first and queue-first.
- Backend writes jobs to `pipeline_jobs` in `dl_lit_project/data/literature.db`.
- `rag_feeder_worker` polls `pipeline_jobs` and executes jobs.
- Worker writes completion/failure payloads back to `pipeline_jobs.result_json`.
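The write/poll/complete cycle above can be sketched against a minimal, hypothetical `pipeline_jobs` schema (the real table's columns and `status` values may differ; this is an illustration of the pattern, not the worker's actual code):

```python
# Sketch of the claim-and-complete cycle on a hypothetical
# pipeline_jobs table (id, job_type, status, result_json).
import json
import sqlite3


def claim_next_job(conn: sqlite3.Connection):
    """Pick the oldest pending job, mark it in_progress, and return it."""
    row = conn.execute(
        "SELECT id, job_type FROM pipeline_jobs "
        "WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE pipeline_jobs SET status = 'in_progress' WHERE id = ?",
        (row[0],),
    )
    conn.commit()
    return row


def finish_job(conn: sqlite3.Connection, job_id: int, ok: bool, payload: dict):
    """Write the completion/failure payload back to result_json."""
    conn.execute(
        "UPDATE pipeline_jobs SET status = ?, result_json = ? WHERE id = ?",
        ("done" if ok else "failed", json.dumps(payload), job_id),
    )
    conn.commit()
```

The single-writer daemon makes this simple `SELECT`-then-`UPDATE` claim safe; a multi-worker setup would need an atomic claim instead.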
Supported daemon job types in current code:

- `enrich`
- `download`
- `pipeline_tick` (mark -> enrich -> download)
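One way to route these three job types is a dispatch table; the handler names and return shapes below are illustrative guesses, not the daemon's actual functions:

```python
# Hypothetical dispatch table for the three supported job types.
def handle_enrich(payload: dict) -> dict:
    # Would enrich metadata for queued references.
    return {"stage": "enrich", **payload}


def handle_download(payload: dict) -> dict:
    # Would process the download queue.
    return {"stage": "download", **payload}


def handle_pipeline_tick(payload: dict) -> dict:
    # pipeline_tick chains the full pass: mark -> enrich -> download.
    return {"stage": "pipeline_tick", "steps": ["mark", "enrich", "download"]}


HANDLERS = {
    "enrich": handle_enrich,
    "download": handle_download,
    "pipeline_tick": handle_pipeline_tick,
}


def run_job(job_type: str, payload: dict) -> dict:
    """Look up and invoke the handler; reject unsupported job types."""
    try:
        handler = HANDLERS[job_type]
    except KeyError:
        raise ValueError(f"unsupported job type: {job_type}")
    return handler(payload)
```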
Primary data flow:
`no_metadata` -> `with_metadata` -> `downloaded_references`

Download queue state is tracked in `with_metadata.download_state` (`queued`, `in_progress`, etc.).
`to_download_references` remains in the schema for compatibility, but the current queue path is state-based.
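The state-based queue path can be sketched as two helpers over `with_metadata`; any column other than `download_state` is a guess here, and the state names follow the ones listed above:

```python
# Sketch of a state-based download queue on with_metadata.download_state.
import sqlite3


def next_queued(conn: sqlite3.Connection):
    """Return the oldest reference still in the 'queued' state, or None."""
    return conn.execute(
        "SELECT id FROM with_metadata "
        "WHERE download_state = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()


def set_state(conn: sqlite3.Connection, ref_id: int, state: str):
    """Advance one reference to a new download_state (e.g. 'in_progress')."""
    conn.execute(
        "UPDATE with_metadata SET download_state = ? WHERE id = ?",
        (state, ref_id),
    )
    conn.commit()
```

Because the state lives on the row itself, no separate queue table (`to_download_references`) is needed to know what is pending.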
- `rag_feeder_frontend` on http://localhost:5175
- `rag_feeder_backend` on http://localhost:4000
- `rag_feeder_worker` (no HTTP port)
- Set `.env` values (at least `GOOGLE_API_KEY`, optionally `OPENALEX_API_KEY`).
- Start stack: `docker compose up -d`
- Open: http://localhost:5175
- SQLite DB: `dl_lit_project/data/literature.db`
- Uploaded PDFs inside container: `/usr/src/app/uploads`
- Upload volume: `rag_feeder_uploads`
- Logs volume: `rag_feeder_logs`
- Pipeline log file: `/usr/src/app/logs/backend-pipeline.log`
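After the stack is up, a quick sanity check is to count rows in the pipeline tables named above. A minimal sketch, assuming each table exists in the SQLite file and the script runs from the repo root:

```python
# Count rows in the main pipeline tables of literature.db.
import sqlite3

DB_PATH = "dl_lit_project/data/literature.db"

TABLES = ("no_metadata", "with_metadata", "downloaded_references", "pipeline_jobs")


def table_counts(db_path: str = DB_PATH) -> dict:
    """Return {table_name: row_count} for the core pipeline tables."""
    conn = sqlite3.connect(db_path)
    try:
        return {
            t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
            for t in TABLES
        }
    finally:
        conn.close()
```

Usage: `python -c "from check import table_counts; print(table_counts())"` (with the snippet saved as `check.py`).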
`/api/ingest/process-marked`, `/api/downloads/worker/start`, and `/api/downloads/worker/run-once` queue real jobs immediately.
`/api/pipeline/worker/start` and `/api/pipeline/worker/pause` currently only update in-memory API state; continuous interval scheduling is still transitional in the current implementation.
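A minimal client sketch for these endpoints, assuming the backend on `localhost:4000` accepts a `POST` with an empty JSON body (the request method and payload shape are assumptions, not confirmed by the backend docs):

```python
# Build and send POST requests to the worker control endpoints.
import json
import urllib.request

BASE = "http://localhost:4000"  # rag_feeder_backend, per the services list


def worker_request(path: str) -> urllib.request.Request:
    """Build a POST request for one of the worker control endpoints."""
    return urllib.request.Request(
        BASE + path,
        data=b"{}",  # empty JSON body; actual expected payload is a guess
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_once() -> dict:
    """Queue a single download pass and return the decoded JSON response.

    Requires the stack to be running (docker compose up -d).
    """
    with urllib.request.urlopen(worker_request("/api/downloads/worker/run-once")) as resp:
        return json.load(resp)
```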
- Backend details: `backend/README.md`
- Frontend details: `frontend/README.md`
- Python pipeline details: `dl_lit_project/README.md`