3 Copy number calling
3.1 Transform bam files
To be able to efficiently count reads in (variable width) genomic windows, we first transform bam files into bed files. We do this using the bamtobed
function from bedtools
. We make sure we only select reads that have a high mapping quality and are properly mapped (and paired in the case of paired-end sequencing). We only select chromosomes 1-22 and X and make sure the the bed file is properly sorted afterwards. We call this as follows:
3.2 Count reads
We then use a file containing variable width bins, which can be found in the github repository here, and count the reads in each genomic region. We use bedtools intersect
to do this. We run this per sample and then combine all the samples (normally all 384 cells from one library) into one large file.
3.3 Calling copy numbers
The above generated files are then used to call copy numbers. We use a custom script to call copy number profiles and there are, depending on the use-case, different ways to run this script. Below is an example call and we will explain the parameters used after that.
Rscript cnv_calling.R --counts {counts} --bins {bins} --binsize {binsize} --blacklist {blacklist} --normseg {normseg} --gc {gc_file} --segmentation {segmentation_type} --penalty {segmentation_penalty} --type single --randomforest {path_to_randomforest_model} --rfthreshold {rf_threshold} --removethreshold {removethreshold} --minploidy {minploidy} --maxploidy {maxploidy} --minpurity {minpurity} --maxpurity {maxpurity} --sex {sex} --threads {threads} --output {output}
Below you can find a brief explanation of the different parameters:
- counts: file containing binned counts
- bins: file containing bins
- binsize: integer denoting binsize used
- blacklist: file containing regions that should be filtered out (telomeric/centromeric regions for instance)
- normseg: additional normalization of regions if required
- segmentation: either ‘joint’ or ‘single’ for
multipcf
orCBS
segmentation, respectively. - penalty: penalty threshold used for joint segmentation. If using single segmentation this should be replaced by
--prune {prune_penalty}
and--alpha {alpha_penalty}
- type: ‘single’ or ‘bulk’ depending on if it’s single-cell sequencing or bulk sequencing
- randomforest: path to the randomforest model
- rfthreshold: threshold used for randomforest classification
- minploidy, maxploidy, minpurity and maxpurity: minimum and maximum purity/ploidies for grid search
- sex: either male or female depending on sex of the sample sequenced
- threads: number of threads to use for parallel processing
- output: output directory
3.4 Generate plots
Finally, we generate some profile plots and genomewide heatmaps based on the accomplished copy number calling. We do this with a custom R script which can be called as follows:
Just like the preprocessing section, the copy number calling is also automatized and wrapped in a snakemake pipeline and is available in the github repository here.