Category: Enhancement / Refactor
Overview
This issue addresses several maintainability and code quality concerns across the clean_validate.py and search_top_posts.py modules. The goal is to make the content pipeline more robust, configurable, and easier to debug or extend.
Problem Areas
- Hardcoded Thresholds & Magic Numbers
Minimum word count, paragraph requirements, and other constants are scattered and hardcoded in logic and prompts.
Proposed Solution:
Move all such values to a shared configuration or constants module, and refer to them symbolically in code and prompts. Update documentation to show how these values can be customized.
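A minimal sketch of what such a constants module might look like. The class name, field names, default values, and environment variable names below are all hypothetical; the actual thresholds in clean_validate.py may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityThresholds:
    # Hypothetical defaults; replace with the values currently hardcoded.
    min_word_count: int = 300
    min_paragraphs: int = 3


def load_thresholds(env=None) -> QualityThresholds:
    """Build thresholds from defaults, allowing environment overrides."""
    env = os.environ if env is None else env
    defaults = QualityThresholds()
    return QualityThresholds(
        min_word_count=int(env.get("MIN_WORD_COUNT", defaults.min_word_count)),
        min_paragraphs=int(env.get("MIN_PARAGRAPHS", defaults.min_paragraphs)),
    )


# Prompts refer to the same values symbolically instead of repeating numbers.
PROMPT_TEMPLATE = (
    "Reject posts with fewer than {min_word_count} words "
    "or fewer than {min_paragraphs} paragraphs."
)


def render_prompt(thresholds: QualityThresholds) -> str:
    """Fill the prompt template from the shared thresholds."""
    return PROMPT_TEMPLATE.format(
        min_word_count=thresholds.min_word_count,
        min_paragraphs=thresholds.min_paragraphs,
    )
```

With this shape, code and prompts read from one source, so changing a threshold does not require hunting for every literal.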
- Error Handling and Logging
Several broad except Exception blocks could hide bugs, and scraper, validation, and API errors are not clearly distinguished.
Debug print statements and logs could expose sensitive data or cause log noise in production.
Proposed Solution:
Refactor error handling for granularity and clarity, log actionable (but sanitized) details, and use proper log levels. Remove any direct print statements.
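One possible shape for the granular error handling described above, as a sketch only. The exception classes, the logger name, and the process_post helper are hypothetical and not taken from the existing modules.

```python
import logging

logger = logging.getLogger("content_pipeline")  # hypothetical logger name


class PipelineError(Exception):
    """Base class for expected pipeline failures."""


class ScraperError(PipelineError):
    pass


class ValidationError(PipelineError):
    pass


class ApiError(PipelineError):
    pass


def sanitize(text, max_len=80):
    """Collapse whitespace and truncate so logs stay short and free of bodies."""
    flat = " ".join(str(text).split())
    return flat if len(flat) <= max_len else flat[:max_len] + "..."


def process_post(url, scrape, validate):
    """Run scrape and validate, handling only the errors we expect."""
    try:
        body = scrape(url)
        validate(body)
    except ScraperError as exc:
        logger.warning("scrape failed for %s: %s", sanitize(url), sanitize(exc))
        return {"url": url, "status": "scrape_failed"}
    except ValidationError as exc:
        logger.info("post rejected for %s: %s", sanitize(url), sanitize(exc))
        return {"url": url, "status": "rejected"}
    # Anything else is a bug and is allowed to propagate.
    return {"url": url, "status": "ok"}
```

Catching only the named exception types means an unexpected error (for example a TypeError from a programming mistake) surfaces immediately instead of being swallowed.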
- Prompt & Template Handling, JSON Parsing
Prompt templates are loaded from a fixed path without robust fallback.
Ad-hoc regex parsing of LLM JSON responses is brittle and fails on small formatting changes.
Proposed Solution:
Implement utility functions for safe prompt/template loading and LLM JSON extraction/parsing. Provide error or fallback messages for missing templates; thoroughly test these utilities.
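The two utilities could be sketched roughly as follows, using only the standard library. The function names and the fallback prompt text are assumptions for illustration; the real templates and response formats may need different handling.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("content_pipeline")  # hypothetical logger name

# Hypothetical fallback text used when a template file cannot be read.
FALLBACK_PROMPT = "Summarize the following post."


def load_prompt(path, fallback=FALLBACK_PROMPT):
    """Read a prompt template, returning a fallback if the file is unreadable."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except OSError:
        logger.warning("prompt template missing or unreadable: %s", path)
        return fallback


def extract_json(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores any trailing text such as a closing code fence.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None
```

Using the JSON decoder rather than a regex lets the parser decide where the object ends, which tolerates nested braces and surrounding chatter in the model response.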
- Data Quality and Schema Consistency
Validation and filtering logic is duplicated, and the schema documentation for output data is inconsistent, especially on error paths.
Proposed Solution:
Move post quality validation into reusable functions and ensure all output structures are consistently documented and enforced.
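As an illustration, a single reusable validator could return the same structure on success and on failure. The ValidationResult fields, the error codes, and the default thresholds here are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ValidationResult:
    # The same fields are present whether or not the post passes.
    valid: bool
    word_count: int
    paragraph_count: int
    errors: list = field(default_factory=list)


def validate_post(text, min_words=300, min_paragraphs=3):
    """Check a post against word and paragraph minimums (hypothetical defaults)."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    word_count = len(text.split())
    errors = []
    if word_count < min_words:
        errors.append("too_few_words")
    if len(paragraphs) < min_paragraphs:
        errors.append("too_few_paragraphs")
    return ValidationResult(
        valid=not errors,
        word_count=word_count,
        paragraph_count=len(paragraphs),
        errors=errors,
    )
```

Callers can serialize the result with asdict(result), so error paths emit the same keys as success paths.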
- Testing: Fallbacks and Edge Cases
The current test suite does not mock Gemini, scraper, or search client failures, so fallback scenarios and error paths are undertested.
Proposed Solution:
Add unit and integration tests to cover all major edge/failure cases; use fixtures to simulate external dependency issues.
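A small sketch of how a dependency failure might be simulated with unittest.mock. The summarize_with_fallback function and the client.generate method are hypothetical stand-ins, not the real Gemini client API; the same mocks could be wrapped in pytest fixtures.

```python
from unittest.mock import Mock


def summarize_with_fallback(text, client):
    """Hypothetical pipeline step: use the LLM client, fall back on outage."""
    try:
        return {"summary": client.generate(text), "source": "llm"}
    except ConnectionError:
        # Fallback: use the start of the post as a crude summary.
        return {"summary": text[:50], "source": "fallback"}


def test_fallback_when_client_fails():
    client = Mock()
    # side_effect makes the mocked call raise instead of returning a value.
    client.generate.side_effect = ConnectionError("simulated outage")
    result = summarize_with_fallback("some post body", client)
    assert result == {"summary": "some post body", "source": "fallback"}
    client.generate.assert_called_once_with("some post body")


def test_happy_path():
    client = Mock()
    client.generate.return_value = "short summary"
    result = summarize_with_fallback("some post body", client)
    assert result == {"summary": "short summary", "source": "llm"}
```

Because the client is injected as a parameter, the tests need no network access and no patching of module globals.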
Acceptance Criteria
I am a GSSoC'25 contributor and would like to take up this issue. Please assign it to me!