Legacy Alignment QC (not supported from 2022)
Please check out help documentation of new analysis pipelines.
Table of Contents
- Overview
- QC of transcriptomic data
- QC of genomic data
- QC of single cell transcriptomic data
- Software and version information
- Command line
- Output format
- List of resources
- Change logs
Overview
We process sequencing data through our QC pipeline and generate alignment quality report for a variety of assay types. A list of supported library types and species information can be found below.
List of supported library type and species
Library type | Species |
PolyA-RNA / Total-RNA | Human |
ChIP-Seq | Human |
ATAC-Seq | Human |
Single cell 3' RNA (10X Genomics) | Human |
WGS | Human |
QC of transcriptomic data
Sequencing data for RNA-Seq samples are adapter trimmed using Fastp and mapped against a reference transcriptome using splice aware aligner STAR. We follow the Alternate Protocol 7 from Mapping RNA-seq Reads with STAR and generate both genomic and transcriptomic bam files after aligning fastq files from different lanes and flowcells. Each lane level bams are tagged with relevant read group information before they are merged together to create a library level bam followed by marking the duplicate reads (without removal) using Picard tool. We generate raw counts per gene using the FeatureCounts tool from the merged genomic bam. Also, normalised counts are generated using RSEM tool from the merged transcriptome bam. A bigwig signal file is generated from the genomic bam file using STAR, following the Alternate Protocols 4 from the above mentioned publication. All the Picard and Samtools metrics are merged using MultiQC for a library level report.
QC of genomic data
Sequencing data for genomic samples are adapter trimmed using Fastp and mapped against a reference genome using alignment tool BWA. We follow GATK’s Data pre-processing for variant discovery document and post process the raw alignment. Each lane level bams are tagged with relevant read group information before they are merged to a library level bam, after which duplicate reads are marked. We generate Picard and Samtools metrics for the library level bam and combine them using MultiQC. For the epigenome assays (purple arrow), we generate few additional metrics using Phantompeakqualtools and deepTools and add them to the same MultiQC report. For WGS samples (orange arrow), we process the duplicate marked bams with GATK BaseRecalibrator and ApplyBQSR before generating per sample GVCF files using HaplotypeCaller tool.
QC of single cell transcriptomic data
Please check the Single Cell Transcriptome Analysis page for more details.
Software and version information
- Fastp v0.20.1
- STAR v2.6.0a
- BWA v0.7.17
- RSEM v1.3.1
- FeatureCounts v1.6.4
- Picard v2.19
- Samtools v1.9
- MultiQC v1.9
- Phantompeakqualtools v1.2.0
- deepTools v3.2.1
- GATK v4.1.4.1
Command line
Read preprocessing
Adapter trimming
Tool name: Fastp
fastp
-a auto
--in1 /path/input/sample.R1.fastq.gz
--out1 /path/trimmed/sample.R1.fastq.gz
--html /path/trimmed/sample.report.html
--json /path/trimmed/sample.report.json
--report_title sample
--thread threads
--in2 /path/input/sample.R2.fastq.gz
--out2 /path/trimmed/sample.R2.fastq.gz
--qualified_quality_phred=15
--length_required=15
--trim_poly_g # for NEXTSEQ
Read alignment
Genomic DNA alignment
Tool name: BWA
bwa
mem
-t threads
-M
/path/bwa_ref_genome
/path/trimmed/sample.R1.fastq.gz /path/trimmed/sample.R2.fastq.gz | samtools view -bo aligned.bam -
RNA-Seq alignment
Tool name: STAR
STAR
--runThreadN threads
--outFileNamePrefix /path/mapped/sample
--outSAMattributes NH HI AS NM MD
--runMode alignReads
--outFilterType BySJout
--quantMode TranscriptomeSAM GeneCounts
--sjdbGTFfile gene.gtf
--genomeLoad NoSharedMemory
--outSAMunmapped Within
--outSAMtype BAM SortedByCoordinate
--genomeDir /path/star_ref_transcriptome
--outFilterMultimapNmax 20
--alignIntronMin 20
--alignSJDBoverhangMin 1
--outFilterMismatchNoverReadLmax 0.04
--alignMatesGapMax 1000000
--limitBAMsortRAM 12000000000
--outFilterMismatchNmax 999
--alignIntronMax 1000000
--alignSJoverhangMin 8
--twopassMode Basic
--readFilesCommand zcat
--readFilesIn /path/trimmed/sample.R1.fastq.gz /path/trimmed/sample.R2.fastq.gz
Post-alignment processing
Add Read Group tags
Tool name: Picard AddOrReplaceReadGroups
java
-XX:ParallelGCThreads=threads
-XmxMg
-Djava.io.tmpdir=/path/temp
-jar picard.jar
AddOrReplaceReadGroups
RGPL=PLATFORM
RGPU=UNIQUE_RG_PU
RGLB=LIBRARY_ID
SORT_ORDER=coordinate # or unsorted for STAR transcriptome BAM
RGSM=SAMPLE_ID
RGCN=CENTER_NAME
RGID=UNIQUE_RG_ID
I=/path/mapped/sampleAligned.sortedByCoord.out.bam
O=/path/mapped/sampleAligned.sortedByCoord.out.AddOrReplaceReadGroups.bam
Mark duplicate reads
Tool name: Picard Mark duplicates
java
-XX:ParallelGCThreads=thread
-XmxMg
-Djava.io.tmpdir=/path/temp
-jar picard.jar
MarkDuplicates
O=/path/mapped/sample.genome.MarkDuplicates.bam
M=/path/mapped/sample.genome.MarkDuplicates.summary.txt
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 # for HISEQ 4000 and NextSeq
I=/path/mapped/sampleAligned.sortedByCoord.out.AddOrReplaceReadGroups.bam
Signal bigwig
Epigenome signal bigwig
Tool name: deepTools bamCoverage
bamCoverage
--bam filtered.bam
--outFileFormat bigwig
--outFileName bam_cov.bw
--blackListFileName /ENCODE/ENCFF419RSJ.bed
--numberOfProcessors threads
Transcriptome signal bigwig
Tool name: STAR bigwig
STAR
--runThreadN threads
--runMode inputAlignmentsFromBAM
--outFileNamePrefix /path/signal/sample
--genomeLoad NoSharedMemory
--outWigType bedGraph
--outWigStrand Stranded
--inputBAMfile /path/mapped/sample.genome.MarkDuplicates.bam
Transcriptome specific analysis
Expression quantification
Tool name: FeatureCounts
featureCounts
-a genes.gtf
-o /path/output
-T threads
/path/mapped/sample.genome.MarkDuplicates.bam
Tool name: RSEM
rsem-calculate-expression
--quiet
--no-bam-output
--alignments
--strandedness reverse
--num-threads threads
--ci-memory 4000
--estimate-rspd
--paired-end
/path/mapped/sample.merged.transcriptome.bam
RSEM_REF
/path/rsem/sample
Epigenome quality check
Filter alignment
Tool name: Samtools
samtools view
-q 20
-F 1804 ## For PE, or 1796 for SE
-bo filtered.bam
raw_input.bam
Strand cross-correlation metrics
Tool name: Phantompeakqualtools
Rscript phantompeakqualtools/run_spp.R
-c=filtered.bam
-rf
-p=threads
-savp=PPQT.spp.pdf
-out=PPQT.spp.out
-odir=/path/output
Sequence coverage
Tool name: deepTools plotCoverage
plotCoverage
-b filtered.bam
-p threads
--blackListFileName /ENCODE/ENCFF419RSJ.bed
--outRawCounts coverage.tab > plot_cov.out
Epigenome quality fingerprint
Tool name: deepTools plotFingerprint
plotFingerprint
--bamfiles filtered.bam
--outQualityMetrics fingerprint.m
-p 4
--blackListFileName /ENCODE/ENCFF419RSJ.bed
--outRawCounts fingerprint.tab
WGS post alignment processing
GATK BaseRecalibrator
gatk
BaseRecalibrator
-I /path/sample.MarkDuplicates.bam
-O /path/sample.MarkDuplicates.bam_BaseRecalibrator.table
--reference /path/fasta
--java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
--known-sites /path/dbsnp.vcf
--known-sites /path/known_indels.vcf.gz
GATK ApplyBQSR
gatk
ApplyBQSR
--bqsr-recal-file /path/sample.MarkDuplicates.bam_BaseRecalibrator.table
--create-output-bam-index
--emit-original-quals
-I /path/sample.MarkDuplicates.bam
-O /path/sample.MarkDuplicates_ApplyBQSR.bam
--reference /path/fasta
--java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
GATK BaseRecalibrator (post ApplyBQSR)
gatk
BaseRecalibrator
-I /path/sample.MarkDuplicates_ApplyBQSR.bam
-O /path/sample.MarkDuplicates_ApplyBQSR.bam_BaseRecalibrator.table
--reference /path/fasta
--java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
--known-sites /path/dnsnp.vcf
--known-sites /path/known_indels.vcf.gz
GATK HaplotypeCaller GVCF
gatk
HaplotypeCaller
-I /path/sample.MarkDuplicates_ApplyBQSR.bam
-O /path/sample.MarkDuplicates_ApplyBQSR_HaplotypeCaller.g.vcf.gz
--reference /path/fasta
--dbsnp /path/dbsnp.vcf
--java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
--emit-ref-confidence GVCF
Output format
Alignment file
- Cram: Per sample cram file. Chech the Reference genome datasets page for details about the reference genome build used for cram conversion.
Gene count files
- FeatureCounts: Raw read counts per gene
- RSEM: Normalised (TPM and RPKM) read counts per gene and isoforms
Signal files
- STAR bigwig: Per sample bigwig signal file generated by STAR
- deepTools bigwig: Per sample bigwig signal file generated by deepTools bamCoverage
Strand cross-correlation pdf
- Phantompeakqualtools pdf: Strand cross-correlation plot with NSC and RSC metrics
GVCF
- GATK HaplotypeCaller GVCF: Per sample GVCF file produced by GATK HaplotypeCaller
MultiQC report
A multiqc report for the alignment bam is produced (per sample) combining metrics from the following tools
- Fastp
- STAR (RNA-Seq)
- Picard Mark Duplicates
- Picard CollectAlignmentSummaryMetrics
- Picard CollectBaseDistributionByCycle
- Picard CollectGcBiasMetrics
- Picard QualityScoreDistribution
- Picard CollectRnaSeqMetrics (RNA-Seq)
- Samtools idxstats
- Samtools stats
- FeatureCounts (RNA-Seq)
- Phantompeakqualtools (Epigenome)
- deepTools (Epigenome)
- GATK BQSR (pre and post ApplyBQSR) (WGS)
List of resources
Change logs
- 29 January 2021
- Moved to Gencode v36 build from v30
- Updated Fastp from v0.19.8 to v0.20.1
- 20 December 2020
- Updated MultiQC from v 1.7 to v 1.9
- 03 January 2020
- Added WGS post alignment processing by GATK
- Added GATK HaplotypeCaller GVCF generation for WGS samples
- 25th June 2019
- Added epigenome specific metrics
- Added “-a auto” flag to Fastp command
- Moved to Gencode v30 build from v28
- 24th April 2019
- Updated MultiQC from v 1.6 to v 1.7
- Updated Fastp from v 0.19.3 to v 0.19.8
- Added metrics from Samtools stats and removed Samtools flagstat