Legacy Alignment QC (not supported from 2022)

Please check out help documentation of new analysis pipelines.

Overview
- List of supported library type and species
QC of transcriptomic data
QC of genomic data
QC of single cell transcriptomic data
Software and version information
Command line
Output format
List of resources
Change logs

Overview

We process sequencing data through our QC pipeline and generate alignment quality report for a variety of assay types. A list of supported library types and species information can be found below.

List of supported library type and species

Library type	Species
PolyA-RNA / Total-RNA	Human
ChIP-Seq	Human
ATAC-Seq	Human
Single cell 3' RNA (10X Genomics)	Human
WGS	Human

Go to Top

QC of transcriptomic data

Sequencing data for RNA-Seq samples are adapter trimmed using Fastp and mapped against a reference transcriptome using splice aware aligner STAR. We follow the Alternate Protocol 7 from Mapping RNA-seq Reads with STAR and generate both genomic and transcriptomic bam files after aligning fastq files from different lanes and flowcells. Each lane level bams are tagged with relevant read group information before they are merged together to create a library level bam followed by marking the duplicate reads (without removal) using Picard tool. We generate raw counts per gene using the FeatureCounts tool from the merged genomic bam. Also, normalised counts are generated using RSEM tool from the merged transcriptome bam. A bigwig signal file is generated from the genomic bam file using STAR, following the Alternate Protocols 4 from the above mentioned publication. All the Picard and Samtools metrics are merged using MultiQC for a library level report.

Go to Top

QC of genomic data

Sequencing data for genomic samples are adapter trimmed using Fastp and mapped against a reference genome using alignment tool BWA. We follow GATK’s Data pre-processing for variant discovery document and post process the raw alignment. Each lane level bams are tagged with relevant read group information before they are merged to a library level bam, after which duplicate reads are marked. We generate Picard and Samtools metrics for the library level bam and combine them using MultiQC. For the epigenome assays (purple arrow), we generate few additional metrics using Phantompeakqualtools and deepTools and add them to the same MultiQC report. For WGS samples (orange arrow), we process the duplicate marked bams with GATK BaseRecalibrator and ApplyBQSR before generating per sample GVCF files using HaplotypeCaller tool.

Go to Top

QC of single cell transcriptomic data

Please check the Single Cell Transcriptome Analysis page for more details.

Software and version information

Go to Top

Command line

Read preprocessing

Adapter trimming

Tool name: Fastp


  fastp
    -a auto
    --in1 /path/input/sample.R1.fastq.gz
    --out1 /path/trimmed/sample.R1.fastq.gz
    --html /path/trimmed/sample.report.html
    --json /path/trimmed/sample.report.json
    --report_title sample
    --thread threads
    --in2 /path/input/sample.R2.fastq.gz
    --out2 /path/trimmed/sample.R2.fastq.gz
    --qualified_quality_phred=15
    --length_required=15
    --trim_poly_g         # for NEXTSEQ

Go to Top

Read alignment

Genomic DNA alignment

Tool name: BWA


  bwa
    mem
    -t threads 
    -M
    /path/bwa_ref_genome
    /path/trimmed/sample.R1.fastq.gz /path/trimmed/sample.R2.fastq.gz | samtools view -bo aligned.bam -

RNA-Seq alignment

Tool name: STAR


  STAR
    --runThreadN threads
    --outFileNamePrefix /path/mapped/sample
    --outSAMattributes NH HI AS NM MD
    --runMode alignReads
    --outFilterType BySJout
    --quantMode TranscriptomeSAM GeneCounts
    --sjdbGTFfile gene.gtf
    --genomeLoad NoSharedMemory
    --outSAMunmapped Within
    --outSAMtype BAM SortedByCoordinate
    --genomeDir /path/star_ref_transcriptome
    --outFilterMultimapNmax 20
    --alignIntronMin 20
    --alignSJDBoverhangMin 1
    --outFilterMismatchNoverReadLmax 0.04
    --alignMatesGapMax 1000000
    --limitBAMsortRAM 12000000000
    --outFilterMismatchNmax 999
    --alignIntronMax 1000000
    --alignSJoverhangMin 8
    --twopassMode Basic
    --readFilesCommand zcat
    --readFilesIn /path/trimmed/sample.R1.fastq.gz /path/trimmed/sample.R2.fastq.gz

Go to Top

Post-alignment processing

Add Read Group tags

Tool name: Picard AddOrReplaceReadGroups


  java
    -XX:ParallelGCThreads=threads
    -XmxMg
    -Djava.io.tmpdir=/path/temp
    -jar picard.jar
    AddOrReplaceReadGroups
    RGPL=PLATFORM
    RGPU=UNIQUE_RG_PU
    RGLB=LIBRARY_ID
    SORT_ORDER=coordinate # or unsorted for STAR transcriptome BAM
    RGSM=SAMPLE_ID
    RGCN=CENTER_NAME
    RGID=UNIQUE_RG_ID
    I=/path/mapped/sampleAligned.sortedByCoord.out.bam
    O=/path/mapped/sampleAligned.sortedByCoord.out.AddOrReplaceReadGroups.bam

Mark duplicate reads

Tool name: Picard Mark duplicates


  java
    -XX:ParallelGCThreads=thread
    -XmxMg
    -Djava.io.tmpdir=/path/temp
    -jar picard.jar
    MarkDuplicates
    O=/path/mapped/sample.genome.MarkDuplicates.bam
    M=/path/mapped/sample.genome.MarkDuplicates.summary.txt
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500                   # for HISEQ 4000 and NextSeq
    I=/path/mapped/sampleAligned.sortedByCoord.out.AddOrReplaceReadGroups.bam

Go to Top

Signal bigwig

Epigenome signal bigwig

Tool name: deepTools bamCoverage


  bamCoverage
    --bam filtered.bam
    --outFileFormat bigwig
    --outFileName bam_cov.bw
    --blackListFileName /ENCODE/ENCFF419RSJ.bed
    --numberOfProcessors threads

Transcriptome signal bigwig

Tool name: STAR bigwig


  STAR
    --runThreadN threads
    --runMode inputAlignmentsFromBAM
    --outFileNamePrefix /path/signal/sample
    --genomeLoad NoSharedMemory
    --outWigType bedGraph
    --outWigStrand Stranded
    --inputBAMfile /path/mapped/sample.genome.MarkDuplicates.bam

Go to Top

Transcriptome specific analysis

Expression quantification

Tool name: FeatureCounts


  featureCounts
    -a genes.gtf
    -o /path/output
    -T threads
    /path/mapped/sample.genome.MarkDuplicates.bam

Tool name: RSEM


  rsem-calculate-expression
    --quiet
    --no-bam-output
    --alignments
    --strandedness reverse
    --num-threads threads
    --ci-memory 4000
    --estimate-rspd
    --paired-end 
    /path/mapped/sample.merged.transcriptome.bam
    RSEM_REF
    /path/rsem/sample

Go to Top

Epigenome quality check

Filter alignment

Tool name: Samtools


  samtools view
   -q 20
   -F 1804   ## For PE, or 1796 for SE
   -bo filtered.bam
   raw_input.bam

Strand cross-correlation metrics

Tool name: Phantompeakqualtools


  Rscript phantompeakqualtools/run_spp.R
   -c=filtered.bam
   -rf
   -p=threads
   -savp=PPQT.spp.pdf
   -out=PPQT.spp.out
   -odir=/path/output

Sequence coverage

Tool name: deepTools plotCoverage


  plotCoverage
    -b filtered.bam
    -p threads
    --blackListFileName /ENCODE/ENCFF419RSJ.bed
    --outRawCounts coverage.tab > plot_cov.out

Epigenome quality fingerprint

Tool name: deepTools plotFingerprint


  plotFingerprint
    --bamfiles filtered.bam
    --outQualityMetrics fingerprint.m
    -p 4
    --blackListFileName /ENCODE/ENCFF419RSJ.bed
    --outRawCounts fingerprint.tab

Go to Top

WGS post alignment processing

GATK BaseRecalibrator


  gatk 
    BaseRecalibrator 
      -I /path/sample.MarkDuplicates.bam 
      -O /path/sample.MarkDuplicates.bam_BaseRecalibrator.table 
      --reference /path/fasta
      --java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
      --known-sites /path/dbsnp.vcf 
      --known-sites /path/known_indels.vcf.gz

GATK ApplyBQSR


  gatk
    ApplyBQSR
      --bqsr-recal-file /path/sample.MarkDuplicates.bam_BaseRecalibrator.table
      --create-output-bam-index
      --emit-original-quals
      -I /path/sample.MarkDuplicates.bam 
      -O /path/sample.MarkDuplicates_ApplyBQSR.bam 
      --reference /path/fasta 
      --java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'

GATK BaseRecalibrator (post ApplyBQSR)


  gatk
    BaseRecalibrator
      -I /path/sample.MarkDuplicates_ApplyBQSR.bam
      -O /path/sample.MarkDuplicates_ApplyBQSR.bam_BaseRecalibrator.table
      --reference /path/fasta
      --java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
      --known-sites /path/dnsnp.vcf
      --known-sites /path/known_indels.vcf.gz

GATK HaplotypeCaller GVCF


  gatk
    HaplotypeCaller
    -I /path/sample.MarkDuplicates_ApplyBQSR.bam
    -O /path/sample.MarkDuplicates_ApplyBQSR_HaplotypeCaller.g.vcf.gz
    --reference /path/fasta
    --dbsnp /path/dbsnp.vcf
    --java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
    --emit-ref-confidence GVCF

Go to Top

Output format

Alignment file

Cram: Per sample cram file. Chech the Reference genome datasets page for details about the reference genome build used for cram conversion.

Gene count files

FeatureCounts: Raw read counts per gene
RSEM: Normalised (TPM and RPKM) read counts per gene and isoforms

Signal files

STAR bigwig: Per sample bigwig signal file generated by STAR
deepTools bigwig: Per sample bigwig signal file generated by deepTools bamCoverage

Strand cross-correlation pdf

Phantompeakqualtools pdf: Strand cross-correlation plot with NSC and RSC metrics

GVCF

GATK HaplotypeCaller GVCF: Per sample GVCF file produced by GATK HaplotypeCaller

MultiQC report

A multiqc report for the alignment bam is produced (per sample) combining metrics from the following tools

Fastp
STAR (RNA-Seq)
Picard Mark Duplicates
Picard CollectAlignmentSummaryMetrics
Picard CollectBaseDistributionByCycle
Picard CollectGcBiasMetrics
Picard QualityScoreDistribution
Picard CollectRnaSeqMetrics (RNA-Seq)
Samtools idxstats
Samtools stats
FeatureCounts (RNA-Seq)
Phantompeakqualtools (Epigenome)
deepTools (Epigenome)
GATK BQSR (pre and post ApplyBQSR) (WGS)

Go to Top

List of resources

Go to Top

Change logs

29 January 2021
- Moved to Gencode v36 build from v30
- Updated Fastp from v0.19.8 to v0.20.1
20 December 2020
- Updated MultiQC from v 1.7 to v 1.9
03 January 2020
- Added WGS post alignment processing by GATK
- Added GATK HaplotypeCaller GVCF generation for WGS samples
25th June 2019
- Added epigenome specific metrics
- Added “-a auto” flag to Fastp command
- Moved to Gencode v30 build from v28
24th April 2019
- Updated MultiQC from v 1.6 to v 1.7
- Updated Fastp from v 0.19.3 to v 0.19.8
- Added metrics from Samtools stats and removed Samtools flagstat

Legacy Alignment QC (not supported from 2022)

Table of Contents

Overview

List of supported library type and species

QC of transcriptomic data

QC of genomic data

QC of single cell transcriptomic data

Software and version information

Command line

Read preprocessing

Adapter trimming

Read alignment

Genomic DNA alignment

RNA-Seq alignment

Post-alignment processing

Add Read Group tags

Mark duplicate reads

Signal bigwig

Epigenome signal bigwig

Transcriptome signal bigwig

Transcriptome specific analysis

Expression quantification

Epigenome quality check

Filter alignment

Strand cross-correlation metrics

Sequence coverage

Epigenome quality fingerprint

WGS post alignment processing

GATK BaseRecalibrator

GATK ApplyBQSR

GATK BaseRecalibrator (post ApplyBQSR)

GATK HaplotypeCaller GVCF

Output format

Alignment file

Gene count files

Signal files

Strand cross-correlation pdf

GVCF

MultiQC report

List of resources

Change logs