Legacy Alignment QC (not supported from 2022)

Please check out help documentation of new analysis pipelines.

Table of Contents

Overview

We process sequencing data through our QC pipeline and generate alignment quality report for a variety of assay types. A list of supported library types and species information can be found below.

List of supported library type and species

Library type Species
PolyA-RNA / Total-RNA Human
ChIP-Seq Human
ATAC-Seq Human
Single cell 3' RNA (10X Genomics) Human
WGS Human
Go to Top

QC of transcriptomic data

Sequencing data for RNA-Seq samples are adapter trimmed using Fastp and mapped against a reference transcriptome using splice aware aligner STAR. We follow the Alternate Protocol 7 from Mapping RNA-seq Reads with STAR and generate both genomic and transcriptomic bam files after aligning fastq files from different lanes and flowcells. Each lane level bams are tagged with relevant read group information before they are merged together to create a library level bam followed by marking the duplicate reads (without removal) using Picard tool. We generate raw counts per gene using the FeatureCounts tool from the merged genomic bam. Also, normalised counts are generated using RSEM tool from the merged transcriptome bam. A bigwig signal file is generated from the genomic bam file using STAR, following the Alternate Protocols 4 from the above mentioned publication. All the Picard and Samtools metrics are merged using MultiQC for a library level report.

Go to Top

QC of genomic data

Sequencing data for genomic samples are adapter trimmed using Fastp and mapped against a reference genome using alignment tool BWA. We follow GATK’s Data pre-processing for variant discovery document and post process the raw alignment. Each lane level bams are tagged with relevant read group information before they are merged to a library level bam, after which duplicate reads are marked. We generate Picard and Samtools metrics for the library level bam and combine them using MultiQC. For the epigenome assays (purple arrow), we generate few additional metrics using Phantompeakqualtools and deepTools and add them to the same MultiQC report. For WGS samples (orange arrow), we process the duplicate marked bams with GATK BaseRecalibrator and ApplyBQSR before generating per sample GVCF files using HaplotypeCaller tool.

Go to Top

QC of single cell transcriptomic data

Please check the Single Cell Transcriptome Analysis page for more details.

Software and version information

Go to Top

Command line

Read preprocessing

Adapter trimming

Tool name: Fastp


  fastp
    -a auto
    --in1 /path/input/sample.R1.fastq.gz
    --out1 /path/trimmed/sample.R1.fastq.gz
    --html /path/trimmed/sample.report.html
    --json /path/trimmed/sample.report.json
    --report_title sample
    --thread threads
    --in2 /path/input/sample.R2.fastq.gz
    --out2 /path/trimmed/sample.R2.fastq.gz
    --qualified_quality_phred=15
    --length_required=15
    --trim_poly_g         # for NEXTSEQ
  
Go to Top

Read alignment

Genomic DNA alignment

Tool name: BWA


  bwa
    mem
    -t threads 
    -M
    /path/bwa_ref_genome
    /path/trimmed/sample.R1.fastq.gz /path/trimmed/sample.R2.fastq.gz | samtools view -bo aligned.bam -
  

RNA-Seq alignment

Tool name: STAR


  STAR
    --runThreadN threads
    --outFileNamePrefix /path/mapped/sample
    --outSAMattributes NH HI AS NM MD
    --runMode alignReads
    --outFilterType BySJout
    --quantMode TranscriptomeSAM GeneCounts
    --sjdbGTFfile gene.gtf
    --genomeLoad NoSharedMemory
    --outSAMunmapped Within
    --outSAMtype BAM SortedByCoordinate
    --genomeDir /path/star_ref_transcriptome
    --outFilterMultimapNmax 20
    --alignIntronMin 20
    --alignSJDBoverhangMin 1
    --outFilterMismatchNoverReadLmax 0.04
    --alignMatesGapMax 1000000
    --limitBAMsortRAM 12000000000
    --outFilterMismatchNmax 999
    --alignIntronMax 1000000
    --alignSJoverhangMin 8
    --twopassMode Basic
    --readFilesCommand zcat
    --readFilesIn /path/trimmed/sample.R1.fastq.gz /path/trimmed/sample.R2.fastq.gz
  
Go to Top

Post-alignment processing

Add Read Group tags

Tool name: Picard AddOrReplaceReadGroups


  java
    -XX:ParallelGCThreads=threads
    -XmxMg
    -Djava.io.tmpdir=/path/temp
    -jar picard.jar
    AddOrReplaceReadGroups
    RGPL=PLATFORM
    RGPU=UNIQUE_RG_PU
    RGLB=LIBRARY_ID
    SORT_ORDER=coordinate # or unsorted for STAR transcriptome BAM
    RGSM=SAMPLE_ID
    RGCN=CENTER_NAME
    RGID=UNIQUE_RG_ID
    I=/path/mapped/sampleAligned.sortedByCoord.out.bam
    O=/path/mapped/sampleAligned.sortedByCoord.out.AddOrReplaceReadGroups.bam
  

Mark duplicate reads

Tool name: Picard Mark duplicates


  java
    -XX:ParallelGCThreads=thread
    -XmxMg
    -Djava.io.tmpdir=/path/temp
    -jar picard.jar
    MarkDuplicates
    O=/path/mapped/sample.genome.MarkDuplicates.bam
    M=/path/mapped/sample.genome.MarkDuplicates.summary.txt
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500                   # for HISEQ 4000 and NextSeq
    I=/path/mapped/sampleAligned.sortedByCoord.out.AddOrReplaceReadGroups.bam
  
Go to Top

Signal bigwig

Epigenome signal bigwig

Tool name: deepTools bamCoverage


  bamCoverage
    --bam filtered.bam
    --outFileFormat bigwig
    --outFileName bam_cov.bw
    --blackListFileName /ENCODE/ENCFF419RSJ.bed
    --numberOfProcessors threads
  

Transcriptome signal bigwig

Tool name: STAR bigwig


  STAR
    --runThreadN threads
    --runMode inputAlignmentsFromBAM
    --outFileNamePrefix /path/signal/sample
    --genomeLoad NoSharedMemory
    --outWigType bedGraph
    --outWigStrand Stranded
    --inputBAMfile /path/mapped/sample.genome.MarkDuplicates.bam
  
Go to Top

Transcriptome specific analysis

Expression quantification

Tool name: FeatureCounts


  featureCounts
    -a genes.gtf
    -o /path/output
    -T threads
    /path/mapped/sample.genome.MarkDuplicates.bam
  

Tool name: RSEM


  rsem-calculate-expression
    --quiet
    --no-bam-output
    --alignments
    --strandedness reverse
    --num-threads threads
    --ci-memory 4000
    --estimate-rspd
    --paired-end 
    /path/mapped/sample.merged.transcriptome.bam
    RSEM_REF
    /path/rsem/sample
  
Go to Top

Epigenome quality check

Filter alignment

Tool name: Samtools


  samtools view
   -q 20
   -F 1804   ## For PE, or 1796 for SE
   -bo filtered.bam
   raw_input.bam
  

Strand cross-correlation metrics

Tool name: Phantompeakqualtools


  Rscript phantompeakqualtools/run_spp.R
   -c=filtered.bam
   -rf
   -p=threads
   -savp=PPQT.spp.pdf
   -out=PPQT.spp.out
   -odir=/path/output
  

Sequence coverage

Tool name: deepTools plotCoverage


  plotCoverage
    -b filtered.bam
    -p threads
    --blackListFileName /ENCODE/ENCFF419RSJ.bed
    --outRawCounts coverage.tab > plot_cov.out
  

Epigenome quality fingerprint

Tool name: deepTools plotFingerprint


  plotFingerprint
    --bamfiles filtered.bam
    --outQualityMetrics fingerprint.m
    -p 4
    --blackListFileName /ENCODE/ENCFF419RSJ.bed
    --outRawCounts fingerprint.tab
  
Go to Top

WGS post alignment processing

GATK BaseRecalibrator


  gatk 
    BaseRecalibrator 
      -I /path/sample.MarkDuplicates.bam 
      -O /path/sample.MarkDuplicates.bam_BaseRecalibrator.table 
      --reference /path/fasta
      --java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
      --known-sites /path/dbsnp.vcf 
      --known-sites /path/known_indels.vcf.gz
  

GATK ApplyBQSR


  gatk
    ApplyBQSR
      --bqsr-recal-file /path/sample.MarkDuplicates.bam_BaseRecalibrator.table
      --create-output-bam-index
      --emit-original-quals
      -I /path/sample.MarkDuplicates.bam 
      -O /path/sample.MarkDuplicates_ApplyBQSR.bam 
      --reference /path/fasta 
      --java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
  

GATK BaseRecalibrator (post ApplyBQSR)


  gatk
    BaseRecalibrator
      -I /path/sample.MarkDuplicates_ApplyBQSR.bam
      -O /path/sample.MarkDuplicates_ApplyBQSR.bam_BaseRecalibrator.table
      --reference /path/fasta
      --java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
      --known-sites /path/dnsnp.vcf
      --known-sites /path/known_indels.vcf.gz
  

GATK HaplotypeCaller GVCF


  gatk
    HaplotypeCaller
    -I /path/sample.MarkDuplicates_ApplyBQSR.bam
    -O /path/sample.MarkDuplicates_ApplyBQSR_HaplotypeCaller.g.vcf.gz
    --reference /path/fasta
    --dbsnp /path/dbsnp.vcf
    --java-options '-XX:ParallelGCThreads=1 -Xmx8g -Djava.io.tmpdir=/tmpdir'
    --emit-ref-confidence GVCF
  
Go to Top

Output format

Alignment file

  • Cram: Per sample cram file. Chech the Reference genome datasets page for details about the reference genome build used for cram conversion.

Gene count files

  • FeatureCounts: Raw read counts per gene
  • RSEM: Normalised (TPM and RPKM) read counts per gene and isoforms

Signal files

  • STAR bigwig: Per sample bigwig signal file generated by STAR
  • deepTools bigwig: Per sample bigwig signal file generated by deepTools bamCoverage

Strand cross-correlation pdf

  • Phantompeakqualtools pdf: Strand cross-correlation plot with NSC and RSC metrics

GVCF

  • GATK HaplotypeCaller GVCF: Per sample GVCF file produced by GATK HaplotypeCaller

MultiQC report

A multiqc report for the alignment bam is produced (per sample) combining metrics from the following tools

  • Fastp
  • STAR (RNA-Seq)
  • Picard Mark Duplicates
  • Picard CollectAlignmentSummaryMetrics
  • Picard CollectBaseDistributionByCycle
  • Picard CollectGcBiasMetrics
  • Picard QualityScoreDistribution
  • Picard CollectRnaSeqMetrics (RNA-Seq)
  • Samtools idxstats
  • Samtools stats
  • FeatureCounts (RNA-Seq)
  • Phantompeakqualtools (Epigenome)
  • deepTools (Epigenome)
  • GATK BQSR (pre and post ApplyBQSR) (WGS)
Go to Top

List of resources

Go to Top

Change logs

  • 29 January 2021
    • Moved to Gencode v36 build from v30
    • Updated Fastp from v0.19.8 to v0.20.1
  • 20 December 2020
    • Updated MultiQC from v 1.7 to v 1.9
  • 03 January 2020
    • Added WGS post alignment processing by GATK
    • Added GATK HaplotypeCaller GVCF generation for WGS samples
  • 25th June 2019
    • Added epigenome specific metrics
    • Added “-a auto” flag to Fastp command
    • Moved to Gencode v30 build from v28
  • 24th April 2019
    • Updated MultiQC from v 1.6 to v 1.7
    • Updated Fastp from v 0.19.3 to v 0.19.8
    • Added metrics from Samtools stats and removed Samtools flagstat