2 Preprocessing

2.1 Demultiplexing of fastq files

Because scCUTseq libraries consist of, typically, 384 or 96 cells, we need to demultiplex the fastq files based on the cellular barcodes that are contained within the reads. We do this using a custom python script. Briefly, we extract the cellular barcode and the UMI and move this information to the read name. Unfortunately, we are not able to share the raw fastq files but the two Python scripts used can be found here. This script can be called through the command line and an example of this is as follows:

processBarcode.py -i {input_fastq} -o {output_directory} -b {list_of_barcodes} -d {number_of_mismatches} -p {barcoding_pattern} -t {threads} -r {sequencing_type} -v

Below you can find a brief explanation of the different parameters:

input_fastq: the path to the input fastq file
output_directory: the path to the output directory
list of barcodes: a text file with each row containing one cellular barcode, an example of this file can be found here.
number_of_mismatches: denotes the number of mismatches that a barcode can have to still be assigned to one of the cellular barcodes, this is by default 1 or 2 depending on the length of the barcodes (8 or 11 in our libraries).
barcoding_pattern: The pattern of the barcode/UMI in the reads. This is a character string consisting of U/B/D, denoting which base belongs to the UMI (U), Barcode (B) and recognition site (D). In the case of our 96 and 384 cell libraries either UUUUUUUUBBBBBBBBDDDD or UUUUUUUUBBBBBBBBBBBDDDD, respectively.
threads: Number of threads to use for parallel processing
sequencing_type: Either ‘single’ or ‘paired’ depending on sequencing type
v: Verbose processing

2.2 Adapter trimming

In the case of paired-end sequencing on fragments that are too short, we do additional adapter trimming. We use fastp for this, which is described here. An example of the command used is as follows:

fastp -i {input.fastq1} -I {input.fastq2} -o {output.fastq1} -O {output.fastq2} --detect_adapter_for_pe --cut_front --cut_tail --cut_mean_quality 30

2.3 Aligning of cell specific fastq files

Following the demultiplexing (and potential trimming) we align the cell specific fastq files to the reference genome, in our case GRCh37. We perform the alignment using bwa-mem, piping it into samtools sort to save time and disk space. The call used is as follows:

bwa mem -M -t {threads} -R {readgroup_info} {input_fastqs} {reference_file} | \
samtools sort -o {output_bam} && 
samtools index -@ {threads} {output_bam}

2.4 Moving barcodes and UMIs to bam tags

After alignment we move the UMI and barcode information, which is stored in the read name, to the bam tags. We do this using a custom python script which can be found here. This script is also called through bash as follows:

moveBamtags.py -i {input_bam} | samtools view -hbo {output_bam} &&
samtools index {output_bam}

2.5 Deduplication

To deduplicate the bam files we use umi_tools which takes advantage of the fact that we have UMIs incorporated in our sequencing reads. Once again, this script is called through bash and you can find an example of this below:

umi_tools dedup -I {input_bam} -S {output_bam} -L {output_log} --extract-umi-method tag --umi-tag 'RX:Z' --mapping-quality 30 && samtools index -@ {threads} {output_bam}

You can find the documentation of umi-tools describing all options here.

2.6 QC summary

To make a QC summary report we use a tool called alfred which can be found here. We run this before and after deduplication as follows:

alfred qc -r {input_ref} -o {output} {input_bam}

Following this we summarize the output of both files in an tsv file using the following bash call

zgrep ^ME {prededup_bamfile} | cut -f 2- | sed -n '1p;0~2p' > {outdir}/all.tsv
zgrep ^ME {postdedup_bamfile} | cut -f 2- | sed -n '1p;0~2p' > {outdir}/dedup.tsv

The entire preprocessing workflow is automated and available as a snakemake file and is available in the github repository here.