3 Copy number calling

3.1 Transform bam files

To be able to efficiently count reads in (variable width) genomic windows, we first transform bam files into bed files. We do this using the bamtobed function from bedtools. We make sure we only select reads that have a high mapping quality and are properly mapped (and paired in the case of paired-end sequencing). We only select chromosomes 1-22 and X and make sure the the bed file is properly sorted afterwards. We call this as follows:

samtools view -q 30 -F 1024 -b {input} | bedtools bamtobed | grep -E '^[0-9]|^X|^chr[0-9]|^chrX' |\
sort -k1,1V -k2,2n -k3,3n | gzip -c > {output}

3.2 Count reads

We then use a file containing variable width bins, which can be found in the github repository here, and count the reads in each genomic region. We use bedtools intersect to do this. We run this per sample and then combine all the samples (normally all 384 cells from one library) into one large file.

echo {sample_name} > {output}
bedtools intersect -nonamecheck -F 0.5 -sorted -c -a {variable_width_bin_file} -b {input.bed} | cut -f4 >> {output}

paste {input_files} | gzip > {output_file}

3.3 Calling copy numbers

The above generated files are then used to call copy numbers. We use a custom script to call copy number profiles and there are, depending on the use-case, different ways to run this script. Below is an example call and we will explain the parameters used after that.

Rscript cnv_calling.R --counts {counts} --bins {bins} --binsize {binsize} --blacklist {blacklist} --normseg {normseg} --gc {gc_file} --segmentation {segmentation_type} --penalty {segmentation_penalty} --type single --randomforest {path_to_randomforest_model} --rfthreshold {rf_threshold} --removethreshold {removethreshold} --minploidy {minploidy} --maxploidy {maxploidy} --minpurity {minpurity} --maxpurity {maxpurity} --sex {sex} --threads {threads} --output {output}

Below you can find a brief explanation of the different parameters:

counts: file containing binned counts
bins: file containing bins
binsize: integer denoting binsize used
blacklist: file containing regions that should be filtered out (telomeric/centromeric regions for instance)
normseg: additional normalization of regions if required
segmentation: either ‘joint’ or ‘single’ for multipcf or CBS segmentation, respectively.
penalty: penalty threshold used for joint segmentation. If using single segmentation this should be replaced by --prune {prune_penalty} and --alpha {alpha_penalty}
type: ‘single’ or ‘bulk’ depending on if it’s single-cell sequencing or bulk sequencing
randomforest: path to the randomforest model
rfthreshold: threshold used for randomforest classification
minploidy, maxploidy, minpurity and maxpurity: minimum and maximum purity/ploidies for grid search
sex: either male or female depending on sex of the sample sequenced
threads: number of threads to use for parallel processing
output: output directory

3.4 Generate plots

Finally, we generate some profile plots and genomewide heatmaps based on the accomplished copy number calling. We do this with a custom R script which can be called as follows:

Rscript  plotting.R --rds {input} --runtype {run_type} --threads {threads} --outdir {outdir}

Just like the preprocessing section, the copy number calling is also automatized and wrapped in a snakemake pipeline and is available in the github repository here.