De-multiplexing Fastq

De-multiplexing of Illumina sequencing runs

Illumina sequencing platforms generate binary BCL files for each run. These raw data files are picked up by genomic facility pipelines and processed for fastq file generation.

Software versions

Instrument name   Software name   Software version   Update date
NextSeq 2000      BCLConvert      v4.0.3             Nov 2022
NovaSeq 6000      BCLConvert      v4.0.3             Nov 2022
MiSeq             BCLConvert      v4.0.3             Nov 2022
NextSeq 2000      BCLConvert      v3.9.3             March 2022
NovaSeq 6000      BCLConvert      v3.9.3             May 2022
MiSeq             BCLConvert      v3.9.3             July 2022
NovaSeq 6000      Bcl2Fastq       v2.20              March 2021
NextSeq 500       Bcl2Fastq       v2.20              Feb 2018
MiSeq             Bcl2Fastq       v2.20              Feb 2018
HiSeq 4000        Bcl2Fastq       v2.20              Feb 2018
HiSeq 4000        Bcl2Fastq       v2.18              June 2017

De-multiplexing of Illumina sequencing runs using BCLConvert

The BCLConvert tool is used for de-multiplexing data from the following sequencing platforms:

  • Illumina NextSeq 2000
  • Illumina NovaSeq 6000
  • Illumina MiSeq

BCLConvert process summary

Sequencing runs are processed with the BCLConvert tool in the following steps:

  • Generate SampleSheet.csv file for the target sequencing run
  • Replace sample barcode ids for any single cell sample (prepared using 10X Genomics library prep kit) with the actual index barcode sequence
  • Group samples based on Project name, Lane id (if available) and Index barcode length
  • Edit the SampleSheet.csv file and add the correct OverrideCycles info for each sample group
  • Add the following Settings to the SampleSheet.csv file
    • CreateFastqForIndexReads,1
    • MinimumTrimmedReadLength,8
    • FastqCompressionFormat,gzip
    • MaskShortReads,8
    • TrimUMI,0 (Optional for samples with UMI barcodes)
  • Configure and run BCLConvert tool for each sample group separately using the edited SampleSheet.csv file
  • Merge the fastq files into one set of fastqs if more than one set of index barcodes is present for a single cell sample (mostly for single index 10X Genomics samples)
  • Generate an HTML version of the de-multiplexing report
  • Generate FastQC, Fastq Screen and MultiQC reports
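
The grouping step in the list above could be sketched with awk as follows. This is only an illustration under stated assumptions, not the facility's pipeline code: it assumes a v1 SampleSheet whose [Data] section uses the column names listed in this document (Lane is optional), and the helper name group_samples is hypothetical.

```shell
#!/usr/bin/env bash
# Print one line per sample group: project, lane, index length(s), sample count.
# Assumes a v1 SampleSheet with a [Data] section; "group_samples" is a
# hypothetical helper name, not part of any Illumina tool.
group_samples() {
  awk -F',' '
    { sub(/\r$/, "") }                           # tolerate Windows line endings
    /^\[Data\]/ { in_data = 1; header = 1; next }
    /^\[/       { in_data = 0 }                  # any other section ends [Data]
    !in_data || $0 == "" { next }
    header {                                     # map column names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      header = 0; next
    }
    {
      lane = ("Lane" in col) ? $(col["Lane"]) : "all"
      len  = length($(col["index"]))
      # Dual index samples are reported as "i7 length+i5 length"
      if (("index2" in col) && $(col["index2"]) != "")
        len = len "+" length($(col["index2"]))
      count[$(col["Sample_Project"]) "\t" lane "\t" len]++
    }
    END { for (g in count) print g "\t" count[g] }
  ' "$1" | sort
}
```

Calling group_samples /path/SampleSheet.csv prints tab-separated groups, which can then be used to write one edited SampleSheet per group.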

BCLConvert command line

  bcl-convert \
    --bcl-input-directory /path/sequencing_run \
    --output-directory /path/output \
    --sample-sheet /path/SampleSheet.csv \
    --bcl-num-conversion-threads 4 \
    --bcl-num-compression-threads 2 \
    --bcl-num-decompression-threads 2 \
    --bcl-num-parallel-tiles 4 \
    --bcl-sampleproject-subdirectories true \
    --strict-mode true \
    --bcl-only-lane lane_id
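
Because BCLConvert is run separately for each sample group, the command could be wrapped in a small helper that is called once per group. The sketch below only prints the command (a dry run); the helper name and the per-group paths are hypothetical, and the options are the ones shown in this section.

```shell
#!/usr/bin/env bash
# Build and print the bcl-convert command for one sample group.
# The helper name and all paths are illustrative assumptions.
bcl_convert_cmd() {
  local run_dir=$1 out_dir=$2 sample_sheet=$3 lane=${4:-}
  local cmd=(bcl-convert
    --bcl-input-directory "$run_dir"
    --output-directory "$out_dir"
    --sample-sheet "$sample_sheet"
    --bcl-num-conversion-threads 4
    --bcl-num-compression-threads 2
    --bcl-num-decompression-threads 2
    --bcl-num-parallel-tiles 4
    --bcl-sampleproject-subdirectories true
    --strict-mode true)
  # Restrict the run to one lane only when the group has a lane id.
  if [ -n "$lane" ]; then
    cmd+=(--bcl-only-lane "$lane")
  fi
  # %q quotes each argument so the printed line can be pasted into a shell.
  printf '%q ' "${cmd[@]}"
  printf '\n'
}

# Dry run for a hypothetical group on lane 1
bcl_convert_cmd /path/sequencing_run /path/output/group1 /path/group1/SampleSheet.csv 1
```

Printing first makes it easy to review the per-group paths before anything is executed.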
  

BCLConvert samplesheet format

SampleSheet file should have the following info:

  • IGF de-multiplexing pipeline uses a v1 format SampleSheet file for the BCLConvert runs
  • SampleSheet should contain the following Settings
    • OverrideCycles,CYCLE
    • CreateFastqForIndexReads,1
    • MinimumTrimmedReadLength,8
    • FastqCompressionFormat,gzip
    • MaskShortReads,8
    • TrimUMI,0 (Optional for samples with UMI barcodes)
  • SampleSheet should contain the following Data columns
    • Lane (optional)
    • Sample_ID
    • Sample_Name
    • Sample_Plate
    • Sample_Well
    • I7_Index_ID
    • index
    • I5_Index_ID
    • index2
    • Sample_Project
    • Description
  • No adapter trimming Settings are added to the SampleSheet by default, so reads are not adapter trimmed unless these Settings are added
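
As an illustration of the layout described above, the following sketch writes a small SampleSheet with the listed Settings and Data columns. The read length, sample ids, index names, index sequences and project name are made-up placeholders, the [Header] section is left out for brevity, and the helper name is hypothetical; only the setting names and column names come from this section.

```shell
#!/usr/bin/env bash
# Write an illustrative v1 SampleSheet for BCLConvert.
# All values in the [Data] rows are placeholders, not real samples.
write_example_samplesheet() {
  local out=$1 read_len=${2:-151}
  # One OverrideCycles element per read, in run order: R1; I1; I2; R2.
  # I8 matches the 8 bp placeholder index sequences used below.
  local override="Y${read_len};I8;I8;Y${read_len}"
  cat > "$out" <<EOF
[Settings]
OverrideCycles,${override}
CreateFastqForIndexReads,1
MinimumTrimmedReadLength,8
FastqCompressionFormat,gzip
MaskShortReads,8

[Data]
Lane,Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,SAMPLE_001,sample_001,,,I7_A,ATTACTCG,I5_A,TATAGCCT,PROJECT_A,
1,SAMPLE_002,sample_002,,,I7_B,TCCGGAGA,I5_B,ATAGAGGC,PROJECT_A,
EOF
}
```

For example, write_example_samplesheet SampleSheet.csv 151 writes a sheet for a 2 x 151 bp dual index run.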

BCLConvert output directory structure

Fastq files from the BCLConvert run are organized using the following directory structure

<ROOT DIR>/<SEQRUN DATE>/<FLOWCELL ID>/<LANE ID>/<INDEX GROUP>/<SAMPLE ID>/

  • ROOT DIR: It can be PROJECT_NAME/fastq or only PROJECT_NAME
  • SEQRUN DATE: Sequencing run date or PROJECT_NAME_FLOWCELL-ID_SEQRUN-DATE
  • FLOWCELL ID: Flowcell id from sequencing run
  • LANE ID: Lane id information for fastq files
  • INDEX GROUP: Sample barcode length and tag information
  • SAMPLE ID: IGF sample id

Illumina uses the following file name convention for output fastq files. For example: sample-id_S1_L001_R1_001.fastq.gz

  • sample-id : Sample ID provided in the samplesheet
  • S1 : Number of sample based on the sample order on the samplesheet
  • L001 : Lane number of the flowcell
  • R1 : The read. For example, R1 indicates Read 1 and R2 indicates Read 2 of a paired-end run
  • 001 : It's always 001
  • .fastq.gz : File extension. It's a gzipped fastq file
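
The naming convention above can be taken apart with a bash regular expression. This is a minimal sketch; parse_fastq_name is a hypothetical helper name, and the pattern assumes lane-split files (an L001 style field is present).

```shell
#!/usr/bin/env bash
# Split an Illumina fastq file name into its parts with a bash regex.
# "parse_fastq_name" is a hypothetical helper name.
parse_fastq_name() {
  local name=${1##*/}   # drop any leading directory
  local re='^(.+)_S([0-9]+)_L([0-9]{3})_([RI][12])_001\.fastq\.gz$'
  if [[ $name =~ $re ]]; then
    # 10# forces base 10 so that "001" is not read as an octal number
    printf 'sample=%s sample_number=%s lane=%s read=%s\n' \
      "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" \
      "$((10#${BASH_REMATCH[3]}))" "${BASH_REMATCH[4]}"
  else
    echo "not an Illumina fastq name: $name" >&2
    return 1
  fi
}

parse_fastq_name sample-id_S1_L001_R1_001.fastq.gz
# prints: sample=sample-id sample_number=1 lane=1 read=R1
```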

Each SAMPLE ID level dir should contain the following fastq reads

  • R1 and R2 are the read 1 and read 2 (check more information about Illumina paired-end sequencing reads)
  • I1 and I2 are the index reads

Each INDEX GROUP level dir can optionally contain these following files

  • De-multiplexing html report dir (/Reports) with samplesheet file (for the lane)
  • Manifest file (md5_manifest.tsv) containing the md5 checksum of the fastq files

BCLConvert fastq file validation

Steps:

  • Download fastq files from Globus (if you are using Globus based transfer)
  • In Mac/Linux
    • Open a terminal and cd to INDEX GROUP level directory
    • Run md5sum for file validation, e.g.

    grep -v file_path md5_manifest.tsv |md5sum -c

    • Alternatively, use the following command to check the md5sum of the fastq files from all the flowcells

  cd ROOT_DIR
  for i in `find . -name md5_manifest.tsv|xargs dirname`; \
    do \
      cd $i; \
      grep -v md5 md5_manifest.tsv |md5sum -c|grep -v OK; \
      cd -; \
    done
  

De-multiplexing of Illumina sequencing runs using Bcl2Fastq

The Bcl2Fastq tool is used for de-multiplexing data from the following sequencing platforms:

  • Illumina HiSeq 4000
  • Illumina NextSeq 500
  • Illumina NovaSeq 6000 (until May 2022)
  • Illumina MiSeq (until May 2022)

Bcl2Fastq process summary

Illumina sequencing platforms generate binary BCL files for each run. These raw data files are picked up by genomic facility pipelines and converted to fastq files using Bcl2Fastq. A samplesheet file containing the correct index barcode information is essential for this “de-multiplexing” process, which allocates fastq reads to the individual samples and filters out the artifacts present in the raw data.

Bcl2Fastq command line

Example Bcl2Fastq command.

  bcl2fastq \
    --runfolder-dir /path/input \
    --sample-sheet /path/SampleSheet.csv \
    --output-dir /path/output \
    --reports-dir /path/Reports \
    --use-bases-mask BASES_MASK \
    --stats-dir /path/Stats \
    -p 2 \
    --create-fastq-for-index-reads \
    -w 1 \
    --barcode-mismatches 1 \
    -r 1 \
    --auto-set-to-zero-barcode-mismatches

Additionally, for short read cycles (less than 22 bp for R1 or R2), the default value of the bcl2fastq parameter --mask-short-adapter-reads is replaced with the length of the read cycle.
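
The short read rule above could be expressed as a small shell helper that emits the extra option only when it is needed. This is a sketch, not the pipeline's code: the helper name is hypothetical, and passing the length of the shorter of R1 and R2 is an assumption about how the rule is applied.

```shell
#!/usr/bin/env bash
# Emit the extra bcl2fastq option for short reads, or nothing otherwise.
# The threshold of 22 comes from the text above; the helper name is hypothetical.
mask_short_option() {
  local read_len=$1 threshold=22
  if [ "$read_len" -lt "$threshold" ]; then
    printf -- '--mask-short-adapter-reads=%s\n' "$read_len"
  fi
}

mask_short_option 8     # prints --mask-short-adapter-reads=8
mask_short_option 151   # prints nothing
```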

Bcl2Fastq samplesheet format

Samplesheets are plain text, comma-separated files named SampleSheet.csv. Each file is divided into multiple sections, each marked by a line starting with a section label. Please check the Illumina documentation for more details about the samplesheet file format specification.

  • SampleSheet should contain the following Data columns
    • Lane (optional)
    • Sample_ID
    • Sample_Name
    • Sample_Plate
    • Sample_Well
    • I7_Index_ID
    • index
    • I5_Index_ID
    • index2
    • Sample_Project
    • Description

Bcl2Fastq adapter trimming setting

De-multiplexing pipeline is configured to trim Illumina generic adapters from the reads, with the default run settings.

Adapter name   Adapter sequence
Adapter        AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2   AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
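
In a Bcl2Fastq samplesheet, adapter trimming is driven by Adapter and AdapterRead2 entries in the [Settings] section. The sketch below prints such a section with the two sequences from the table above; whether the pipeline writes exactly these lines is an assumption, and the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Print a [Settings] section that enables adapter trimming in bcl2fastq.
# Sequences are copied from the table above; the helper name is hypothetical.
adapter_settings() {
  printf '%s\n' \
    '[Settings]' \
    'Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA' \
    'AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT'
}

adapter_settings
```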

Bcl2Fastq fastq output

Fastq files can be accessed from our iRODS data distribution server. Please check the Data access page for more details on this topic. The following files are present in each of the lane level tar files:

  • Fastq reads
    • R1 and R2 are the read 1 and read 2 (check more information about Illumina paired-end sequencing reads)
    • I1 and I2 are the index reads
  • De-multiplexing html report
  • Manifest file containing the md5 checksum of the fastq files
  • Samplesheet file (for the lane)

Bcl2Fastq fastq file name

Illumina uses the following file name convention for the output fastq files

For example: sample-name_S1_L001_R1_001.fastq.gz

  • sample-name : Name of the sample provided in the samplesheet
  • S1 : Number of sample based on the sample order on the samplesheet
  • L001 : Lane number of the flowcell
  • R1 : The read. For example, R1 indicates Read 1 and R2 indicates Read 2 of a paired-end run
  • 001 : It's always 001
  • .fastq.gz : File extension. It's a gzipped fastq file

Please check the Illumina BCL2Fastq documentation for more information.

Bcl2Fastq fastq file validation

Steps:

  • Download the tar files from the iRODS server and extract them (use 7zip on Windows)
  • In Mac/Linux
    • Open a terminal and cd to the top level dir (look for PROJECT_NAME_file_manifest.csv)
    • Run md5sum for file validation, e.g.

    awk '{print $2"  " $1}' PROJECT_NAME_file_manifest.csv |grep -v file|md5sum -c
    

Bcl2Fastq de-multiplexing of single cell samples (10xgenomics)

De-multiplexing of single cell samples is done using the specific set of single cell barcodes, following the 10x Genomics documentation.

Bcl2Fastq command line for single cell samples

Example Bcl2Fastq command.


  bcl2fastq \
    --runfolder-dir /path/input \
    --sample-sheet /path/SampleSheet.csv \
    --output-dir /path/output \
    --reports-dir /path/Reports \
    --use-bases-mask BASES_MASK \
    --stats-dir /path/Stats \
    -p 2 \
    --create-fastq-for-index-reads \
    -w 1 \
    --barcode-mismatches 1 \
    -r 1 \
    --auto-set-to-zero-barcode-mismatches \
    --mask-short-adapter-reads=8 \
    --minimum-trimmed-read-length=8
  
