2 Preprocessing
2.1 Demultiplexing of fastq files
Because scCUTseq libraries consist of, typically, 384 or 96 cells, we need to demultiplex the fastq files based on the cellular barcodes that are contained within the reads. We do this using a custom python script. Briefly, we extract the cellular barcode and the UMI and move this information to the read name. Unfortunately, we are not able to share the raw fastq files but the two Python scripts used can be found here. This script can be called through the command line and an example of this is as follows:
processBarcode.py -i {input_fastq} -o {output_directory} -b {list_of_barcodes} -d {number_of_mismatches} -p {barcoding_pattern} -t {threads} -r {sequencing_type} -v
Below you can find a brief explanation of the different parameters:
- input_fastq: the path to the input fastq file
- output_directory: the path to the output directory
- list of barcodes: a text file with each row containing one cellular barcode, an example of this file can be found here.
- number_of_mismatches: denotes the number of mismatches that a barcode can have to still be assigned to one of the cellular barcodes, this is by default 1 or 2 depending on the length of the barcodes (8 or 11 in our libraries).
- barcoding_pattern: The pattern of the barcode/UMI in the reads. This is a character string consisting of U/B/D, denoting which base belongs to the UMI (U), Barcode (B) and recognition site (D). In the case of our 96 and 384 cell libraries either UUUUUUUUBBBBBBBBDDDD or UUUUUUUUBBBBBBBBBBBDDDD, respectively.
- threads: Number of threads to use for parallel processing
- sequencing_type: Either ‘single’ or ‘paired’ depending on sequencing type
- v: Verbose processing
2.2 Adapter trimming
In the case of paired-end sequencing on fragments that are too short, we do additional adapter trimming. We use fastp
for this, which is described here. An example of the command used is as follows:
2.3 Aligning of cell specific fastq files
Following the demultiplexing (and potential trimming) we align the cell specific fastq files to the reference genome, in our case GRCh37. We perform the alignment using bwa-mem
, piping it into samtools sort
to save time and disk space. The call used is as follows:
2.5 Deduplication
To deduplicate the bam files we use umi_tools
which takes advantage of the fact that we have UMIs incorporated in our sequencing reads. Once again, this script is called through bash and you can find an example of this below:
umi_tools dedup -I {input_bam} -S {output_bam} -L {output_log} --extract-umi-method tag --umi-tag 'RX:Z' --mapping-quality 30 && samtools index -@ {threads} {output_bam}
You can find the documentation of umi-tools describing all options here.
2.6 QC summary
To make a QC summary report we use a tool called alfred
which can be found here. We run this before and after deduplication as follows:
Following this we summarize the output of both files in an tsv file using the following bash call
zgrep ^ME {prededup_bamfile} | cut -f 2- | sed -n '1p;0~2p' > {outdir}/all.tsv
zgrep ^ME {postdedup_bamfile} | cut -f 2- | sed -n '1p;0~2p' > {outdir}/dedup.tsv
The entire preprocessing workflow is automated and available as a snakemake file and is available in the github repository here.