De-multiplexing Fastq

De-multiplexing of Illumina sequencing runs
- Software versions
De-multiplexing of Illumina sequencing runs using BCLConvert
De-multiplexing of Illumina sequencing runs using Bcl2Fastq
List of resources

De-multiplexing of Illumina sequencing runs

Illumina sequencing platforms generate binary BCL files for each run. These raw data files are picked up by genomic facility pipelines and processed for fastq file generation.

Software versions

Instrument name	Software name	Software version	Update date
NextSeq 2000	BCLConvert	v4.0.3	Nov 2022
NovaSeq 6000	BCLConvert	v4.0.3	Nov 2022
MiSeq	BCLConvert	v4.0.3	Nov 2022
NextSeq 2000	BCLConvert	v3.9.3	March 2022
NovaSeq 6000	BCLConvert	v3.9.3	May 2022
MiSeq	BCLConvert	v3.9.3	July 2022
NovaSeq 6000	Bcl2Fastq	v2.20	March 2021
NextSeq 500	Bcl2Fastq	v2.20	Feb 2018
MiSeq	Bcl2Fastq	v2.20	Feb 2018
HiSeq 4000	Bcl2Fastq	v2.20	Feb 2018
HiSeq 4000	Bcl2Fastq	v2.18	June 2017

De-multiplexing of Illumina sequencing runs using BCLConvert

The BCLConvert tool is used for de-multiplexing data from the following sequencing platforms:

Illumina NextSeq 2000
Illumina NovaSeq 6000
Illumina MiSeq

BCLConvert process summary

Sequencing runs are processed with the BCLConvert tool in the following steps:

Generate a SampleSheet.csv file for the target sequencing run
Replace the sample barcode ids of any single cell sample (prepared using a 10X Genomics library prep kit) with the actual index barcode sequences
Group samples by Project name, Lane id (if available) and Index barcode length
Edit the SampleSheet.csv file and add the correct OverrideCycles info for each sample group
Add the following Settings to the SampleSheet.csv file
- CreateFastqForIndexReads,1
- MinimumTrimmedReadLength,8
- FastqCompressionFormat,gzip
- MaskShortReads,8
- TrimUMI,0 (Optional for samples with UMI barcodes)
Configure and run the BCLConvert tool separately for each sample group, using the edited SampleSheet.csv file
Merge multiple fastq files into one set of fastqs if more than one set of index barcodes is present for a single cell sample (mostly for single index 10X Genomics samples)
Generate an HTML version of the de-multiplexing report
Generate FastQC, Fastq Screen and MultiQC reports
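
The grouping step in the list above can be sketched in shell as follows. This is only an illustration, not the IGF pipeline's own code: the function name group_samples is hypothetical, and the sketch assumes a v1 SampleSheet.csv whose [Data] section is the last section, with no quoted commas inside the fields.

```shell
# Hypothetical helper, not part of the IGF pipeline.
# Prints "<Sample_Project> <Lane> <index length>_<index2 length> <sample count>"
# (tab separated) for every sample group found in a v1 SampleSheet.csv.
# Assumes [Data] is the last section and fields contain no quoted commas.
group_samples() {
  awk -F',' '
    { sub(/\r$/, "") }                       # tolerate Windows line endings
    /^\[Data\]/ { in_data = 1; header = 1; next }
    !in_data || NF == 0 { next }             # skip other sections and blank lines
    header {                                 # map column names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      header = 0
      next
    }
    $(col["Sample_ID"]) == "" { next }       # skip rows without a sample id
    {
      lane = ("Lane" in col) ? $(col["Lane"]) : "all"   # Lane column is optional
      i7 = $(col["index"])
      i5 = ("index2" in col) ? $(col["index2"]) : ""
      key = $(col["Sample_Project"]) "\t" lane "\t" length(i7) "_" length(i5)
      count[key]++
    }
    END { for (k in count) print k "\t" count[k] }
  ' "$1" | sort
}
```

Calling `group_samples /path/SampleSheet.csv` prints one line per group. Each group would then get its own OverrideCycles value and its own BCLConvert run.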

BCLConvert command line

  bcl-convert \
    --bcl-input-directory /path/sequencing_run \
    --output-directory /path/output \
    --sample-sheet /path/SampleSheet.csv \
    --bcl-num-conversion-threads 4 \
    --bcl-num-compression-threads 2 \
    --bcl-num-decompression-threads 2 \
    --bcl-num-parallel-tiles 4 \
    --bcl-sampleproject-subdirectories true \
    --strict-mode true \
    --bcl-only-lane lane_id

BCLConvert samplesheet format

The SampleSheet file should have the following info:

The IGF de-multiplexing pipeline uses a v1 format SampleSheet file for the BCLConvert runs
SampleSheet should contain the following Settings entries
- OverrideCycles,CYCLE
- CreateFastqForIndexReads,1
- MinimumTrimmedReadLength,8
- FastqCompressionFormat,gzip
- MaskShortReads,8
- TrimUMI,0 (Optional for samples with UMI barcodes)
SampleSheet should contain the following Data columns
- Lane (optional)
- Sample_ID
- Sample_Name
- Sample_Plate
- Sample_Well
- I7_Index_ID
- index
- I5_Index_ID
- index2
- Sample_Project
- Description
No adapter trimming Settings are added to the SampleSheet by default, so adapters are not trimmed unless such Settings are added
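
To illustrate the layout described above, the sketch below writes a minimal SampleSheet fragment with a [Settings] and a [Data] section. The function name write_example_samplesheet, the sample rows, the index sequences and the OverrideCycles value Y151;I8;I8;Y151 are all made up for the example. A real sheet produced by the pipeline may carry further sections (such as a header) that are omitted here.

```shell
# Hypothetical example: write a minimal v1-style SampleSheet.csv fragment.
# All sample rows, index sequences and the OverrideCycles value are made up.
write_example_samplesheet() {
  cat > "$1" <<'EOF'
[Settings]
OverrideCycles,Y151;I8;I8;Y151
CreateFastqForIndexReads,1
MinimumTrimmedReadLength,8
FastqCompressionFormat,gzip
MaskShortReads,8

[Data]
Lane,Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,sample_a,sample_a,,,i7_01,ACGTACGT,i5_01,TTGCAATG,project_x,
1,sample_b,sample_b,,,i7_02,GGCATTCA,i5_02,CCGTAAGT,project_x,
EOF
}
```

For example, `write_example_samplesheet SampleSheet.csv` creates the file in the current directory. Both samples share one project, lane and index length, so they would belong to the same sample group.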

BCLConvert output directory structure

Fastq files from the BCLConvert run are organized using the following directory structure

<ROOT DIR>/<SEQRUN DATE>/<FLOWCELL ID>/<LANE ID>/<INDEX GROUP>/<SAMPLE ID>/

ROOT DIR: It can be PROJECT_NAME/fastq or only PROJECT_NAME
SEQRUN DATE: Sequencing run date or PROJECT_NAME_FLOWCELL-ID_SEQRUN-DATE
FLOWCELL ID: Flowcell id from sequencing run
LANE ID: Lane id information for fastq files
INDEX GROUP: Sample barcode length and tag information
SAMPLE ID: IGF sample id

Illumina uses the following file name convention for output fastq files. For example: sample-id_S1_L001_R1_001.fastq.gz

sample-id : Sample ID provided in the samplesheet
S1 : Sample number, based on the order of samples in the samplesheet
L001 : Lane number of the flowcell
R1 : The read. For example, R1 indicates Read 1 and R2 indicates Read 2 of a paired-end run
001 : The last segment is always 001
.fastq.gz : File extension. It is a gzipped fastq file
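
The naming convention above can be checked mechanically. The following is a small sketch, with a hypothetical function name parse_fastq_name, that splits a fastq file name into sample id, sample number, lane and read. It assumes the sample id contains no spaces.

```shell
# Hypothetical helper: split an Illumina fastq file name into its parts.
# Prints "<sample id> <sample number> <lane> <read>" for names such as
# sample-id_S1_L001_R1_001.fastq.gz; prints nothing and returns 1 otherwise.
parse_fastq_name() {
  name=$(basename "$1")
  parsed=$(printf '%s\n' "$name" |
    sed -n 's/^\(.*\)_S\([0-9][0-9]*\)_L\([0-9][0-9]*\)_\([RI][0-9]\)_001\.fastq\.gz$/\1 \2 \3 \4/p')
  [ -n "$parsed" ] && printf '%s\n' "$parsed"
}
```

For instance, `parse_fastq_name sample-id_S1_L001_R1_001.fastq.gz` prints `sample-id 1 001 R1`. Index reads (I1, I2) are accepted as well as R1 and R2.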

Each SAMPLE ID level dir should contain the following fastq reads

R1 and R2 are the read 1 and read 2 (check more information about Illumina paired-end sequencing reads)
I1 and I2 are the index reads

Each INDEX GROUP level dir can optionally contain the following files

De-multiplexing html report dir (/Reports) with samplesheet file (for the lane)
Manifest file (md5_manifest.tsv) containing the md5 checksum of the fastq files

BCLConvert fastq file validation

Steps:

Download fastq files from Globus (if you are using Globus based transfer)
On Mac/Linux
- Open a terminal and cd to the INDEX GROUP level directory
- Run md5sum for file validation, e.g.
grep -v file_path md5_manifest.tsv | md5sum -c
- Alternatively, use the following command to check the md5 checksums of fastq files from all the flowcells


  cd ROOT_DIR
  # Visit every directory that holds an md5_manifest.tsv and print only failed checks
  find . -name md5_manifest.tsv | while read -r manifest; do
    (
      cd "$(dirname "$manifest")" &&
      grep -v md5 md5_manifest.tsv | md5sum -c | grep -v OK
    )
  done

De-multiplexing of Illumina sequencing runs using Bcl2Fastq

The Bcl2Fastq tool is used for de-multiplexing data from the following sequencing platforms:

Illumina HiSeq 4000
Illumina NextSeq 500
Illumina NovaSeq 6000 (until May 2022)
Illumina MiSeq (until May 2022)

Bcl2Fastq process summary

Illumina sequencing platforms generate binary BCL files for each run. These raw data files are picked up by genomic facility pipelines and processed for fastq file generation using the Bcl2Fastq software. A samplesheet file containing the correct index barcode information is essential for this “de-multiplexing” process, in order to allocate fastq reads to the individual samples and to filter out artifacts present in the raw data.

Bcl2Fastq command line

Example Bcl2Fastq command.

  bcl2fastq \
    --runfolder-dir /path/input \
    --sample-sheet /path/SampleSheet.csv \
    --output-dir /path/output \
    --reports-dir /path/Reports \
    --use-bases-mask BASES_MASK \
    --stats-dir /path/Stats \
    -p 2 \
    --create-fastq-for-index-reads \
    -w 1 \
    --barcode-mismatches 1 \
    -r 1 \
    --auto-set-to-zero-barcode-mismatches

Additionally, for short read cycles (fewer than 22 bp for R1 or R2), the default value of the bcl2fastq parameter --mask-short-adapter-reads is replaced with the length of the read cycle.
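
A minimal sketch of that adjustment is shown below. It assumes the bcl2fastq default of 22 for --mask-short-adapter-reads and, as a further assumption not stated above, uses the shortest read when more than one read is below 22 bp. The function name mask_short_adapter_reads is hypothetical.

```shell
# Hypothetical helper: choose a --mask-short-adapter-reads value.
# Arguments are the read cycle lengths (e.g. R1 and R2). The result is 22
# (the bcl2fastq default) unless a read is shorter, in which case the
# shortest read length is used (an assumption made for this sketch).
mask_short_adapter_reads() {
  min=22
  for len in "$@"; do
    if [ "$len" -lt "$min" ]; then
      min=$len
    fi
  done
  printf '%s\n' "$min"
}
```

The result could then be passed on the command line, for example as `--mask-short-adapter-reads "$(mask_short_adapter_reads 8 91)"`.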

Bcl2Fastq samplesheet format

Samplesheets are comma-separated plain text files named SampleSheet.csv. Each file is divided into multiple sections, each marked by a line starting with a section label. Please check the Illumina documentation for more details about the samplesheet file format specification.

SampleSheet should contain the following Data columns
- Lane (optional)
- Sample_ID
- Sample_Name
- Sample_Plate
- Sample_Well
- I7_Index_ID
- index
- I5_Index_ID
- index2
- Sample_Project
- Description

Bcl2Fastq adapter trimming setting

The de-multiplexing pipeline is configured to trim Illumina generic adapters from the reads under the default run settings.

Adapter name	Adapter Sequence
Adapter	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Bcl2Fastq fastq output

Fastq files can be accessed from our iRODS data distribution server. Please check the Data access page for more details on this topic. The following files are present in each lane level tar file

Fastq reads
- R1 and R2 are the read 1 and read 2 (check more information about Illumina paired-end sequencing reads)
- I1 and I2 are the index reads
De-multiplexing html report
Manifest file containing the md5 checksum of the fastq files
Samplesheet file (for the lane)

Bcl2Fastq fastq file name

Illumina uses the following file name convention for the output fastq files

For example: sample-name_S1_L001_R1_001.fastq.gz

sample-name : Name of the sample provided in the samplesheet
S1 : Sample number, based on the order of samples in the samplesheet
L001 : Lane number of the flowcell
R1 : The read. For example, R1 indicates Read 1 and R2 indicates Read 2 of a paired-end run
001 : The last segment is always 001
.fastq.gz : File extension. It is a gzipped fastq file

Please check the Illumina BCL2Fastq documentation for more information.

Bcl2Fastq fastq file validation

Steps:

Download the tar files from the iRODS server and extract them (use 7zip on Windows)
On Mac/Linux
- Open a terminal and cd to the top level dir (look for PROJECT_NAME_file_manifest.csv)
- Run md5sum for file validation, e.g.
```
awk '{print $2"  " $1}' PROJECT_NAME_file_manifest.csv |grep -v file|md5sum -c
```

Bcl2Fastq de-multiplexing of single cell samples (10xgenomics)

De-multiplexing of single cell samples is done using the specific set of single cell barcodes, following the 10x Genomics documentation.

Bcl2Fastq command line for single cell samples

Example Bcl2Fastq command.


  bcl2fastq \
    --runfolder-dir /path/input \
    --sample-sheet /path/SampleSheet.csv \
    --output-dir /path/output \
    --reports-dir /path/Reports \
    --use-bases-mask BASES_MASK \
    --stats-dir /path/Stats \
    -p 2 \
    --create-fastq-for-index-reads \
    -w 1 \
    --barcode-mismatches 1 \
    -r 1 \
    --auto-set-to-zero-barcode-mismatches \
    --mask-short-adapter-reads=8 \
    --minimum-trimmed-read-length=8

List of resources
