-
Notifications
You must be signed in to change notification settings - Fork 0
FEAT: Reapply mask to discarded mutations #412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
- following what Fede (federica.brando@irbbarcelona.org) did in the PR #153
- Also adding a new feature to extract flagged regions to a bed file. - Adding wrapper function to clean-up main. e On branch 118-reapply-mask-discarded-mutation
e Your branch is ahead of 'origin/118-reapply-mask-discarded-mutation' by 1 commit.
- add output as pipeline in 'module.config' - add output in `filtermaf` module - add output in the `mutationpreprocessing`subworkflow
- Added ANNOTATEPANELS with the VEP and other annotation steps - CREATEPANELS now only contains the creation of the modalities of panels
- Bed files emited by ANNOTATEPANEL are used by MUTPREPROCESSING - Added arguments to CREATECONSENSUSPANELS for the ANNOTATEPANEL subworkflow.
- Get the flagged bed into 0-based half-open coordinates. - Useful to use bedtools and filter the positions from the panels.
- publishDir was wrongly done, already corrected - uncommented the filtering logic in CREATEPANELS although it is not doing anything as of now
- Created new process using bedtools to remove positions from panel tsv file using the flagged bed obtained from MUTPREPROCESSING - Added this process to the CREATE_PANELS subworkflow - Added metadata about this process - Added flagged bed as input to CREATEPANELS
-Changed to ANNOTATEPANELS
- Also removed debug printing of channels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request introduces a new filtering mechanism to remove variants from annotated panels that overlap with flagged genomic regions, significantly refactors the panel creation workflow into two distinct subworkflows (ANNOTATEPANELS and CREATEPANELS), and reapplies code reviews from PR #153. The changes enable masking of sites where filters (repetitive_variant, cohort_n_rich, other_sample_SNP) were applied to ensure these regions are properly excluded from downstream analysis.
Changes:
- Added a new FILTER_CAPTURED_PANEL module that uses bedtools intersect to remove variants overlapping flagged regions and track removed variants
- Split panel creation into two workflows: ANNOTATEPANELS (handles VEP annotation, custom annotation, and domain annotation) and CREATEPANELS (applies filtering and generates final panels)
- Refactored filter_cohort.py with improved code structure, logging, documentation, and a new extract_flagged_regions_bed function to output flagged positions in BED format
- Updated workflow dependencies so MUTPREPROCESSING outputs flagged_bed that feeds into CREATEPANELS for filtering
Reviewed changes
Copilot reviewed 10 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| workflows/deepcsa.nf | Reorganized workflow to use ANNOTATEPANELS before MUTPREPROCESSING, then CREATEPANELS with flagged_bed input; updated references from CREATEPANELS to ANNOTATEPANELS outputs where appropriate |
| subworkflows/local/mutationpreprocessing/main.nf | Added flagged_bed emission from FILTERBATCH output to enable downstream filtering |
| subworkflows/local/createpanels/createpanels/main.nf | Refactored to accept complete_annotated_panel and flagged_bed inputs, applies FILTERCAPTUREDPANEL before generating panels, removed annotation logic moved to ANNOTATEPANELS |
| subworkflows/local/createpanels/createpanels/meta.yml | Added empty meta.yml file (requires documentation) |
| subworkflows/local/createpanels/annotatepanels/main.nf | New subworkflow containing annotation logic (VEP, custom annotation, domain annotation) previously in createpanels, generates initial panels for MUTPREPROCESSING |
| subworkflows/local/createpanels/annotatepanels/meta.yml | Comprehensive documentation for the annotation workflow with detailed input/output descriptions |
| modules/local/filtermaf/main.nf | Added flagged_muts output emission for flagged-pos.bed file |
| modules/local/filter_captured_panel/main.nf | New module that filters annotated panel using bedtools intersect to remove variants overlapping flagged regions, outputs both filtered panel and removed variants |
| modules/local/filter_captured_panel/meta.yml | Documentation for the filtering module |
| conf/modules.config | Split FILTERBATCH configuration to publish .bed files, disabled publishing for MERGEBATCH |
| conf/tools/panels.config | Updated process paths from CREATEPANELS to ANNOTATEPANELS for VEP and related processes |
| bin/utils.py | Improved documentation for add_filter function with proper parameter descriptions |
| bin/filter_cohort.py | Major refactoring: split monolithic main function into separate functions (flag_repetitive_variants, flag_cohort_n_rich, flag_other_samples_snp, flag_gnomad_snp, expand_filter_column, extract_flagged_regions_bed), added logging, improved documentation, added BED output generation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Publish FILTERCAPTUREDPANEL output into createpanels folder - Add bedtools to versions instead of python
FerriolCalvet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good Marta!
I added some minor comments in some files but I also have some more general comments that I would like to discuss before merging, mainly because there are different masking scenarios that probably I did not acknowledge properly :pray-hands:
(I am trying to explain what I understood to make sure that I got it properly)
The way it works for the masking is by removing the positions from the panels let's say that there is a flagged position in the middle of an exon, this would result in the exon in the BED regions of the consensus panel being split in two right? and on that position not being in the consensus panel TSV.
OncodriveFML requires a BED file with the coordinates of the different genomic regions, this is probably not impacting at all, but we could look at the consequence of this change outside omega.
Initially I thought that a possibility could be to "just" putting the depths to 0 of that position and then this could have a similar effect to the final definitions of panels. The reality is that with the current implementation this works perfect for positions that need to be masked for the entire cohort, but does not allow the flexibility of removing a position only in one sample.
We can discuss whether we want to merge this implementation and then work on an additional processing that enables the sample-specific tuning of depths that could then impact the panel building process downstream.
Minor more general comments:
- do we want to use the ymls? I guess it does no harm here, but we should discuss it, and probably apply it as part of the clean branch
| label 'process_single' | ||
| label 'process_medium_high_memory' | ||
|
|
||
| container "community.wave.seqera.io/library/bedtools_pybedtools_pandas_pip_pruned:78080da05d53636d" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be good to take the Dockerfile and upload the image to our dockerhub and the dockerfile to our docker github repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revise where else it is used and maybe replace everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is in createcapturedpanels
|
Thank you for the review Ferriol! I will check the comments tomorrow, but I wanted to answer the general comment so everything is here in case we need it. You're right, removing a position from the panel makes the regions in the bed to be split. About the ymls, I think it could be useful for new people trying to understand how the pipeline works. But I am okay with moving them to the clean branch. |
|
Some time ago I made some fixes to prevent the 0s problem Fede experienced above (4551e17), so hopefully this should not be a problem now. I don't remember if there were other methods for which this was also a problem but I think it is worth testing it again. And yes, agree with the ymls, I would leave the ones you created here and then try to add updated ones for all steps in the clean branch. let's discuss soon! |
|
We have decided to try applying the option of forcing the depth to 0 for those positions to be removed in specific samples. If it is too complex, we could merge this branch and do it later. |
- Following review commentsy
Following Fede's modifications: - Add logs + docstrings - Refactor code to improve modularity - Add `mask_panel_regions` to assign depth 0 (commented until getting bed per sample)
- Expand filter column - Extract flagged regions bed So we can reuse them to extract the sample-specific flagged positions bed. e (use "git push" to publish your local commits)
- Added publish dir (at least temporal) - Using utils_filter - Creates the bed files used to do per sample filtering of the depths
- WIth the code refactoring, this input was not needed (input.csv)
- Discussed in #412 - Removed duplication in module.conf while adding filter_criteria and filter_criteria_somatic to FILTERBATCH step.
UpdateAbout the depth = 0 mask The problem
I like option 3. Do you have any thoughts about this? |
|
I was thinking on something like option 3 as well! I understand the memory concerns but the good thing is that since the masking of each position is independent from the others the processing of the matrix could be split in chunks if needed. But for now I would not worry about it and we can test it in the duplexomes once you have a working version of it for a smaller panel such as the bladder one. |
- Python script creating mask matrix from bed file -> position as rows and samples as columns (1 if we keep the position 0 if we mask it) - Module to get all the bed files and use the create_mask_matrix.py
- Create channel with all the bed files to be used by CREATEMASKMATRIX - Updated outputs from MUTPREPROCESSING to emit the mask_matrix - Added mask matrix as input for ANNOTATEDEPTHS
- Update function to use mask_matrix based filtering
- Create mask matrix - Apply mask matrix
- I have gone back to the status previous to the modifications done to create panels - All the filtering is done now by masking the depth file with depth = 0 and the sample_flagged_positions_2bed
Copilot summary
This pull request introduces a new filtering step for annotated panels to remove variants overlapping flagged genomic regions, updates the panel creation workflow to integrate this filtering, and makes several related configuration and workflow changes. Additionally, some code reviews from PR153 are reapplied
Key changes:
New module for filtering captured panels:
filter_captured_panelprocess (modules/local/filter_captured_panel/main.nf) and its metadata (modules/local/filter_captured_panel/meta.yml), which filters annotated panels usingbedtools intersectto remove variants overlapping flagged regions. Outputs both filtered panels and a list of removed variants. [1] [2]Integration of filtering into panel creation workflow:
subworkflows/local/createpanels/createpanels/main.nf, previouslymain.nf) to include the new filtering step (FILTERCAPTUREDPANEL). The workflow now takes aflagged_bedinput fromMUTPREPROCESSINGand generates panels from the filtered annotated panel. The output now includes the list of removed variants. [1] [2] [3]subworkflows/local/createpanels/annotatepanels/main.nf). [1] [2]Workflow and configuration updates:
workflows/deepcsa.nf) to use the new, more modularCREATEPANELSandANNOTATEPANELSsubworkflows.flagged_bedoutput from theFILTERBATCHprocess, enabling downstream filtering.FILTERBATCHprocess inconf/modules.configto publish.bedfiles for flagged mutations, and adjustedMERGEBATCHand related processes to disable publishing..bedfiles to theFILTER_BATCHprocess inmodules/local/filtermaf/main.nf.