Harden from_profile mode genome selection with new genomes, config loading, and output staging #243
Open
feat(metagenomic): add support for additional genomes with quality filtering

- Implement prioritization of additional genomes over reference genomes
- Add quality filtering parameters for additional genomes (completeness, contamination, contig count thresholds)
- Create configuration conversion logic for new genome selection parameters
- Update genome selection algorithm to handle quality-based grouping
- Add quality file validation and parsing functionality
- Support mixed reference and additional genome sampling strategies
- Update Nextflow pipeline to pass new parameters to genome selection script
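As a rough illustration of the quality filtering named in the commit message, the sketch below splits additional genomes into high- and low-quality groups using completeness, contamination, and contig-count thresholds. The class name, function names, default threshold values, and the inclusive comparisons are all assumptions made for this example; they are not taken from `get_genomes.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeQuality:
    """One quality record for an additional genome (hypothetical structure)."""
    genome_path: str
    completeness: float   # percent, 0-100
    contamination: float  # percent, 0-100
    num_contigs: int


def is_high_quality(genome, min_completeness=90.0, max_contamination=5.0,
                    max_num_contigs=500):
    """Return True when the genome passes all three thresholds.

    The default values are placeholders, not the pipeline's defaults.
    """
    return (
        genome.completeness >= min_completeness
        and genome.contamination <= max_contamination
        and genome.num_contigs <= max_num_contigs
    )


def split_by_quality(genomes, **thresholds):
    """Partition genomes into (high_quality, low_quality), preserving order."""
    high, low = [], []
    for genome in genomes:
        target = high if is_high_quality(genome, **thresholds) else low
        target.append(genome)
    return high, low
```

Keeping the low-quality group rather than discarding it matches the description's idea of using such genomes as a fallback pool.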
Improve portability and robustness of profile-driven runs without changing the intended matching logic.

- Resolve external `params.config` paths relative to the launch directory first, with a project-dir fallback, so runs started outside the repo can load configs reliably.
- Make profile-mode defaults explicit in `from_profile.nf` and publish `internal/genome_locations.tsv` plus `distributions/abundance_*.tsv` when `just_community_design=true`, so community-design-only runs emit the same reusable files as a full run.
- Preserve selection behavior in `get_genomes.py` while fixing edge cases: support headered and 3-column additional-reference files, fall back from `ete4` to `ete3`, fix taxonomy-name retry lookup, validate and draw strain counts correctly, and generate deterministic unique genome filenames.
- Speed up large profile runs by caching taxonomy lookups, avoiding quadratic `no_replace` removals, and streaming output files instead of reopening them for each OTU.
- Prevent cleanup collisions for genomes that share the same basename, accept lowercase boolean config values for `verbose`, refresh config comments/defaults, and ignore local run configs in Git.
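The resolution order in the first bullet (launch directory first, project directory as fallback) can be sketched in Python as follows. This is only a stand-in for the Groovy helper in `nextflow.config`, whose implementation is not shown here; the Python function name, its arguments, and the treatment of absolute paths are assumptions.

```python
from pathlib import Path


def resolve_external_config_path(config, launch_dir, project_dir):
    """Resolve a user-supplied config path (illustrative sketch only).

    Order assumed here:
      1. absolute paths are returned unchanged,
      2. a path that exists relative to the launch directory wins,
      3. otherwise fall back to the project directory.
    """
    path = Path(config)
    if path.is_absolute():
        return path

    launch_candidate = Path(launch_dir) / path
    if launch_candidate.exists():
        return launch_candidate

    # Backward-compatible fallback: pipeline-relative lookup.
    return Path(project_dir) / path
```

With this order, a config passed as `./some_config` from another working directory is found next to where the run was launched, while older invocations that relied on pipeline-relative paths keep working.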
This PR hardens the profile-driven metagenomic workflow around `get_genomes.py` and related Nextflow plumbing. The main goals are to support curated additional genomes with explicit quality gating, preserve the existing rank-first matching logic while adding source prioritization within a matched taxon, make profile-only community-design runs publish the expected core outputs, and fix several robustness issues observed in real runs.

Compared with `dev`, this branch changes the following files:

- `.gitignore`
- `convert_config.py`
- `nextflow.config`
- `pipelines/metagenomic/config/distribution.config`
- `pipelines/metagenomic/config/metagenomic.config`
- `pipelines/metagenomic/config/profile.config`
- `pipelines/metagenomic/from_profile.nf`
- `pipelines/metagenomic/metagenomic.nf`
- `pipelines/metagenomic/scripts/clean_up_sequences.py`
- `pipelines/metagenomic/scripts/get_genomes.py`
- `pipelines/shared/scripts/get_community_distribution.py`

What changed
1. Make external config loading work reliably from outside the repo

File changed:

- `nextflow.config`

Changes:

- Add `resolveExternalConfigPath(...)` and use it for `params.config`.
- Keep `projectDir` as a backward-compatible fallback.

Reason:

- Running `nextflow run ../pipeline/CAMISIM/main.nf --config ./some_config` from another working directory was brittle because config resolution depended on pipeline-relative paths.

2. Add profile-mode parameters for curated additional genomes and wire them through the workflow
Files changed:

- `convert_config.py`
- `pipelines/metagenomic/config/profile.config`
- `pipelines/metagenomic/from_profile.nf`

Changes:

- Add the new parameters:
  - `prioritize_additional_genomes`
  - `additional_genomes_quality_file`
  - `additional_genomes_max_contamination`
  - `additional_genomes_min_completeness`
  - `additional_genomes_max_num_contigs`
- Update `convert_config.py` to emit these parameters and normalize boolean/string handling for them.
- Update `from_profile.nf` to provide explicit defaults for profile-only params when an external config does not import `profile.config`.
- Pass the new parameters through to `get_genomes.py`.
- Guard `additional_references` access with `params.containsKey(...)` so profile runs do not crash when the parameter is absent.
- Update `profile.config` comments to document accepted `additional_references` formats and the new quality-driven selection behavior.

Reason:
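The boolean/string normalization mentioned for `convert_config.py` could look roughly like the sketch below. The function name and the set of accepted spellings are assumptions; the spellings are borrowed from the `verbose` handling described in item 10, and the actual conversion code may differ.

```python
# Spellings treated as true/false in this sketch (assumed, see lead-in).
TRUE_VALUES = {"true", "1", "yes"}
FALSE_VALUES = {"false", "0", "no"}


def parse_bool(value):
    """Convert a config value such as "True", "false", 1 or "yes" to a bool.

    Real booleans pass through unchanged; anything unrecognized raises
    ValueError instead of being silently treated as False.
    """
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"not a recognized boolean value: {value!r}")
```

Raising on unknown input is a design choice of this sketch: a typo such as `ture` then fails early rather than quietly disabling `prioritize_additional_genomes`.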
3. Preserve rank-first matching while adding source-aware genome selection

File changed:

- `pipelines/metagenomic/scripts/get_genomes.py`

Changes:

- If `prioritize_additional_genomes = true`, draw from high-quality additional genomes first, then public reference genomes, then low-quality additional genomes as a last resort.
- If `false`, draw from high-quality additional genomes and public references together, then use low-quality additional genomes only after the preferred pool is exhausted.
- `fill_up` fallback path.
- `additional_genomes_quality_file` when `additional_references` is set.
- Quality file columns: `genome_path`, `completeness`, `contamination`, and `num_contigs`.

Reason:
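The two draw orders listed under Changes for this item can be sketched as tiered pools that are consumed in sequence. The function names are hypothetical, and uniform sampling without replacement inside each tier is an assumption of this example rather than a statement about `get_genomes.py`.

```python
import random


def ordered_pools(high_quality_additional, references, low_quality_additional,
                  prioritize_additional_genomes):
    """Return candidate pools in the order they should be drawn from."""
    if prioritize_additional_genomes:
        # High-quality additional genomes, then references, then low quality.
        return [
            list(high_quality_additional),
            list(references),
            list(low_quality_additional),
        ]
    # High-quality additional genomes and references share one preferred pool.
    return [
        list(high_quality_additional) + list(references),
        list(low_quality_additional),
    ]


def draw_genomes(pools, n, rng):
    """Draw up to n genomes, exhausting each pool before touching the next."""
    chosen = []
    for pool in pools:
        if len(chosen) == n:
            break
        take = min(n - len(chosen), len(pool))
        chosen.extend(rng.sample(pool, take))
    return chosen


if __name__ == "__main__":
    pools = ordered_pools(["add_hq_1"], ["ref_1", "ref_2"], ["add_lq_1"], True)
    print(draw_genomes(pools, 3, random.Random(42)))
```

Passing an explicit `random.Random` instance keeps the draw reproducible for a given seed.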
4. Make additional-genome and reference-genome table parsing more robust

File changed:

- `pipelines/metagenomic/scripts/get_genomes.py`

Changes:

- Accept two layouts for `additional_references` files:
  - `ncbi_taxid`, `scientific_name`, `genome_path`
  - `ncbi_taxid`, `scientific_name`, `genome_path`, `novelty_category`
- Default `novelty_category` to `known_strain` when omitted.

Reason:
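A minimal sketch of a parser that accepts both layouts listed under Changes is shown below. It assumes tab-separated input and detects a header by checking whether the first field is literally `ncbi_taxid`; both of those choices, and the function name, are assumptions for illustration only.

```python
import csv
import io

COLUMNS = ("ncbi_taxid", "scientific_name", "genome_path", "novelty_category")


def parse_additional_references(handle, default_novelty="known_strain"):
    """Parse a 3- or 4-column additional_references table into dicts.

    Blank lines and an optional header row are skipped; a missing fourth
    column is filled with default_novelty.
    """
    rows = []
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        fields = [field.strip() for field in fields]
        if not any(fields):
            continue  # blank line
        if fields[0].lower() == COLUMNS[0]:
            continue  # header row (assumed detection rule)
        if len(fields) == 3:
            fields.append(default_novelty)
        elif len(fields) != 4:
            raise ValueError(
                f"line {line_no}: expected 3 or 4 columns, got {len(fields)}"
            )
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


if __name__ == "__main__":
    example = "ncbi_taxid\tscientific_name\tgenome_path\n562\tEscherichia coli\t/data/ecoli.fna\n"
    print(parse_additional_references(io.StringIO(example)))
```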
5. Fix taxonomy-name fallback and improve compatibility with installed ETE versions

File changed:

- `pipelines/metagenomic/scripts/get_genomes.py`

Changes:

- Fall back from `ete4` to `ete3` for `NCBITaxa`.

Reason:

- Some environments had `ete3` available but not `ete4`, causing the profile workflow to fail before matching started.

6. Fix strain-count handling for profile mode
Files changed:

- `pipelines/metagenomic/scripts/get_genomes.py`
- `pipelines/metagenomic/config/metagenomic.config`

Changes:

- Validate `min_strains_per_otu` and `max_strains_per_otu` bounds explicitly.
- Update `draw_num_strains(...)` to handle edge cases cleanly.
- `max_strains_per_otu = 1` is valid in profile mode.

Reason:

- `min=1, max=2`, and did not explicitly guard invalid bounds.

7. Publish the expected community-design outputs even when `just_community_design = true`

File changed:

- `pipelines/metagenomic/from_profile.nf`

Changes:

- Create `outdir/distributions/` during profile-mode genome selection.
- Publish `abundance_*.tsv` into `outdir/distributions/`.
- Publish `genome_to_id.tsv` into both:
  - `outdir/internal/genome_to_id.tsv`
  - `outdir/internal/genome_locations.tsv`

Reason:

- `just_community_design = true` stopped the workflow early before these files were published, which made the output layout incomplete for downstream reuse.

8. Prevent genome filename collisions when many input FASTAs share the same basename
Files changed:

- `pipelines/metagenomic/scripts/get_genomes.py`
- `pipelines/metagenomic/metagenomic.nf`
- `pipelines/metagenomic/scripts/clean_up_sequences.py`

Changes in `get_genomes.py`:

Changes in `metagenomic.nf`:

- Pass `[genome_id, path]` entries into `cleanup_and_filter_sequences` instead of flattening to basenames.

Changes in `clean_up_sequences.py`:

- `genomic.fna`.

Reason:
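One possible way to obtain deterministic, collision-free filenames when many inputs are all called `genomic.fna` is sketched below. The naming scheme (sanitized genome ID plus a short hash of the source path) and the function name are assumptions for illustration; the scheme actually used in `get_genomes.py` is not spelled out in this description.

```python
import hashlib
import re


def unique_genome_filename(genome_id, source_path, suffix=".fna"):
    """Build a stable filename from a genome ID and its source path.

    The same (genome_id, source_path) pair always yields the same name,
    and two genomes that share a basename still get different names.
    """
    # Replace characters that are awkward in filenames.
    safe_id = re.sub(r"[^A-Za-z0-9._-]+", "_", genome_id)
    # A short digest of the full path disambiguates identical basenames.
    digest = hashlib.sha1(str(source_path).encode("utf-8")).hexdigest()[:8]
    return f"{safe_id}_{digest}{suffix}"


if __name__ == "__main__":
    print(unique_genome_filename("Genome1", "/data/sample_a/genomic.fna"))
    print(unique_genome_filename("Genome2", "/data/sample_b/genomic.fna"))
```

Because the name depends only on its inputs, reruns produce the same filenames, which keeps `genome_to_id.tsv` style mappings reproducible.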
9. Improve `get_genomes.py` performance without changing the intended matching output

File changed:

- `pipelines/metagenomic/scripts/get_genomes.py`

Changes:

- `no_replace` removal does not scan every rank/taxon bucket repeatedly.
- Stream `genome_to_id.tsv`, `metadata.tsv`, and `abundance_*.tsv` instead of reopening files for every OTU.

Reason:

- `get_genomes.py` was noticeably slow on larger profiles, and the previous removal and file-writing paths had avoidable overhead.

10. Normalize boolean handling for community-distribution verbosity
Files changed:

- `pipelines/shared/scripts/get_community_distribution.py`
- `pipelines/metagenomic/config/distribution.config`

Changes:

- Accept common boolean spellings for `verbose`, including `true/false`, `True/False`, `1/0`, and `yes/no`.
- Change the default from `"False"` to `false`.

Reason:

- Previously only `True` and `False` were accepted, which was inconsistent with the rest of the Nextflow configs.

11. Ignore local run-specific config files in Git
File changed:

- `.gitignore`

Changes:

- `configs/camisim.cfg`
- `metax_bench_data.config`

Reason:
12. Correct `*,cover` in `.gitignore`

File changed:

- `.gitignore`

Changes:

- Change `*,cover` to `*.cover`.

Reason: