De-multiplexing Fastq
Table of Contents
- De-multiplexing of Illumina sequencing runs
- De-multiplexing of Illumina sequencing runs using BCLConvert
- De-multiplexing of Illumina sequencing runs using Bcl2Fastq
- List of resources
De-multiplexing of Illumina sequencing runs
Illumina sequencing platforms generate binary BCL files for each run. These raw data files are picked up by genomic facility pipelines and processed for fastq file generation.
Software versions
Instrument name | Software name | Software version | Update date |
NextSeq 2000 | BCLConvert | v4.0.3 | Nov 2022 |
NovaSeq 6000 | BCLConvert | v4.0.3 | Nov 2022 |
MiSeq | BCLConvert | v4.0.3 | Nov 2022 |
NextSeq 2000 | BCLConvert | v3.9.3 | March 2022 |
NovaSeq 6000 | BCLConvert | v3.9.3 | May 2022 |
MiSeq | BCLConvert | v3.9.3 | July 2022 |
NovaSeq 6000 | Bcl2Fastq | v2.20 | March 2021 |
NextSeq 500 | Bcl2Fastq | v2.20 | Feb 2018 |
MiSeq | Bcl2Fastq | v2.20 | Feb 2018 |
HiSeq 4000 | Bcl2Fastq | v2.20 | Feb 2018 |
HiSeq 4000 | Bcl2Fastq | v2.18 | June 2017 |
De-multiplexing of Illumina sequencing runs using BCLConvert
The BCLConvert tool is used for de-multiplexing data from the following sequencing platforms
- Illumina NextSeq 2000
- Illumina NovaSeq 6000
- Illumina MiSeq
BCLConvert process summary
Sequencing runs are processed with the BCLConvert tool in the following steps:
- Generate SampleSheet.csv file for the target sequencing run
- Replace sample barcode ids for any single cell sample (prepared using 10X Genomics library prep kit) with the actual index barcode sequence
- Group samples based on Project name, Lane id (if available) and Index barcode length
- Edit the SampleSheet.csv file and add the correct OverrideCycles info for each sample group
- Add the following Settings to the SampleSheet.csv file
- CreateFastqForIndexReads,1
- MinimumTrimmedReadLength,8
- FastqCompressionFormat,gzip
- MaskShortReads,8
- TrimUMI,0 (Optional for samples with UMI barcodes)
- Configure and run BCLConvert tool for each sample group separately using the edited SampleSheet.csv file
- Merge multiple fastq files into one set of fastqs, if more than one set of index barcodes is present for any single cell sample (mostly for single index 10X Genomics samples)
- Generate HTML version of de-multiplexing report
- Generate FastQC, Fastq Screen and MultiQC reports
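The sample grouping step above can be sketched with awk. This is only an illustration, not the pipeline's actual implementation: it assumes a v1 SampleSheet.csv with a [Data] section and the column names listed in this document, it takes the combined length of index and index2 as the "index barcode length", and the file name, sample rows and the `group_samples` helper are all hypothetical.

```shell
# Hypothetical sample sheet, for illustration only
cat > example_SampleSheet.csv <<'EOF'
[Header]
IEMFileVersion,4
[Data]
Lane,Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,IGF001,sampleA,,,,ATCACGTT,,AACCGGTT,projectA,
1,IGF002,sampleB,,,,CGATGTAA,,TTGGCCAA,projectA,
2,IGF003,sampleC,,,,TTAGGC,,,projectB,
EOF

# Print "project,lane,index_length,sample_id" for every row of the [Data] section
group_samples() {
  awk -F',' '
    /^\[Data\]/ { in_data = 1; header_seen = 0; next }   # start of the Data section
    /^\[/       { in_data = 0; next }                     # any other section
    in_data && !header_seen {
      # Map column names to positions so column order does not matter
      for (i = 1; i <= NF; i++) col[$i] = i
      header_seen = 1
      next
    }
    in_data && NF > 1 {
      lane = ("Lane" in col) ? $(col["Lane"]) : "all"
      idx_len = length($(col["index"]))
      if ("index2" in col) idx_len += length($(col["index2"]))
      print $(col["Sample_Project"]) "," lane "," idx_len "," $(col["Sample_ID"])
    }
  ' "$1" | sort
}

# One line per sample group (project, lane, index length)
group_samples example_SampleSheet.csv | cut -d',' -f1-3 | sort -u
```

For the hypothetical sheet above, the last command lists two groups: `projectA,1,16` and `projectB,2,6`. Each group would then get its own SampleSheet.csv and its own BCLConvert run.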
BCLConvert command line
bcl-convert \
  --bcl-input-directory /path/sequencing_run \
  --output-directory /path/output \
  --sample-sheet /path/SampleSheet.csv \
  --bcl-num-conversion-threads 4 \
  --bcl-num-compression-threads 2 \
  --bcl-num-decompression-threads 2 \
  --bcl-num-parallel-tiles 4 \
  --bcl-sampleproject-subdirectories true \
  --strict-mode true \
  --bcl-only-lane lane_id
BCLConvert samplesheet format
SampleSheet file should have the following info:
- IGF de-multiplexing pipeline uses a v1 format SampleSheet file for the BCLConvert runs
- SampleSheet should contain the following Settings
- OverrideCycles,CYCLE
- CreateFastqForIndexReads,1
- MinimumTrimmedReadLength,8
- FastqCompressionFormat,gzip
- MaskShortReads,8
- TrimUMI,0 (Optional for samples with UMI barcodes)
- SampleSheet should contain the following Data columns
- Lane (optional)
- Sample_ID
- Sample_Name
- Sample_Plate
- Sample_Well
- I7_Index_ID
- index
- I5_Index_ID
- index2
- Sample_Project
- Description
- No Settings for adapter trimming are added to the SampleSheet by default (i.e. no adapter trimming by default)
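A minimal sketch of such a SampleSheet is shown below. It assumes the usual v1 layout with [Header], [Settings] and [Data] sections; the read and index cycle numbers, the sample row, the output file name and the helper functions `index_mask` and `override_cycles` are hypothetical. The helper also assumes that index cycles beyond the barcode length are masked with trailing N cycles, which may not suit every run configuration.

```shell
# Mask for one index read: $1 = index read cycles, $2 = barcode length of the sample group
index_mask() {
  cycles=$1
  barcode_len=$2
  if [ "$barcode_len" -eq 0 ]; then
    echo "N${cycles}"                                        # no barcode: mask the whole index read
  elif [ "$cycles" -gt "$barcode_len" ]; then
    echo "I${barcode_len}N$((cycles - barcode_len))"         # short barcode: mask the extra cycles
  else
    echo "I${cycles}"
  fi
}

# usage: override_cycles R1_CYCLES I1_CYCLES I1_BARCODE_LEN I2_CYCLES I2_BARCODE_LEN R2_CYCLES
override_cycles() {
  oc="Y$1;$(index_mask "$2" "$3")"
  if [ "$4" -gt 0 ]; then oc="$oc;$(index_mask "$4" "$5")"; fi
  if [ "$6" -gt 0 ]; then oc="$oc;Y$6"; fi
  echo "$oc"
}

# Hypothetical paired-end run: 151 read cycles, 8 index cycles, 8 bp dual-index barcodes
cycles=$(override_cycles 151 8 8 8 8 151)

cat > example_bclconvert_SampleSheet.csv <<EOF
[Header]
IEMFileVersion,4
[Settings]
OverrideCycles,$cycles
CreateFastqForIndexReads,1
MinimumTrimmedReadLength,8
FastqCompressionFormat,gzip
MaskShortReads,8
[Data]
Lane,Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,IGF001,sampleA,,,,ATCACGTT,,AACCGGTT,projectA,
EOF

cat example_bclconvert_SampleSheet.csv
```

With 6 bp barcodes on the same run, `override_cycles 151 8 6 8 6 151` would give `Y151;I6N2;I6N2;Y151` instead, which is why each index length group gets its own SampleSheet.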
BCLConvert output directory structure
Fastq files from the BCLConvert run are organized using the following directory structure
<ROOT DIR>/<SEQRUN DATE>/<FLOWCELL ID>/<LANE ID>/<INDEX GROUP>/<SAMPLE ID>/
- ROOT DIR: Either PROJECT_NAME/fastq or only PROJECT_NAME
- SEQRUN DATE: Sequencing run date or PROJECT_NAME_FLOWCELL-ID_SEQRUN-DATE
- FLOWCELL ID: Flowcell id from sequencing run
- LANE ID: Lane id information for fastq files
- INDEX GROUP: Sample barcode length and tag information
- SAMPLE ID: IGF sample id
Illumina uses the following file name convention for output fastq files. For example: sample-id_S1_L001_R1_001.fastq.gz
- sample-id : Sample ID provided in the samplesheet
- S1 : Sample number, based on the order of the samples in the samplesheet
- L001 : Lane number of the flowcell
- R1 : The read, e.g. R1 indicates Read 1 and R2 indicates Read 2 of a paired-end run
- 001 : The last segment is always 001
- .fastq.gz : File extension, indicating a gzipped fastq file
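The naming convention above can be taken apart with a short shell helper. This is a sketch only: `parse_fastq_name` is a hypothetical function, and it assumes file names that follow the pattern exactly, including the lane segment.

```shell
# Split an Illumina-style fastq file name into its parts.
# Prints: sample_id sample_number lane read
parse_fastq_name() {
  name=$(basename "$1" .fastq.gz)     # e.g. sample-id_S1_L001_R1_001
  # The sample id may itself contain underscores, so the pattern is anchored at the end
  echo "$name" | sed -n -E 's/^(.+)_S([0-9]+)_L([0-9]{3})_([RI][12])_001$/\1 \2 \3 \4/p'
}

parse_fastq_name /path/output/sample-id_S1_L001_R1_001.fastq.gz
```

For the example file name, the helper prints `sample-id 1 001 R1`. Names that do not match the pattern produce no output.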
Each SAMPLE ID level dir should contain the following fastq reads
- R1 and R2 are the read 1 and read 2 (check more information about Illumina paired-end sequencing reads)
- I1 and I2 are the index reads
Each INDEX GROUP level dir can optionally contain the following files
- De-multiplexing html report dir (/Reports) with the samplesheet file (for the lane)
- Manifest file (md5_manifest.tsv) containing the md5 checksum of the fastq files
BCLConvert fastq file validation
Steps:
- Download fastq files from Globus (if you are using Globus based transfer)
- In Mac/Linux
- Open a terminal and cd to the INDEX GROUP level directory
- Run md5sum for file validation, e.g.
grep -v file_path md5_manifest.tsv |md5sum -c
- Alternatively, use the following commands to check the md5sum of the fastq files from all the flowcells
- Open a terminal, cd to ROOT_DIR and run
for i in $(find . -name md5_manifest.tsv | xargs dirname); do
  # run each check in a subshell so the working directory is restored afterwards;
  # only files that fail validation are printed
  (cd "$i" && grep -v md5 md5_manifest.tsv | md5sum -c | grep -v OK)
done
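To see what a passing and a failing check look like before running it on real data, the validation can be tried on a throwaway file. The sketch below is self-contained and hypothetical: the temporary directory, the file names and the `failed_files` helper are made up for illustration, and the checksum file is a plain `md5sum` listing rather than a real md5_manifest.tsv.

```shell
# Hypothetical demo in a temporary directory
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/sample_R1.fastq"

# Record the checksum in the "checksum  file" layout that md5sum -c reads
(cd "$demo_dir" && md5sum sample_R1.fastq > checksums.md5)

# List only the files that fail validation
failed_files() {
  failed=$(cd "$1" && md5sum -c checksums.md5 2>/dev/null | grep -v ': OK$' || true)
  if [ -z "$failed" ]; then
    echo "all files OK"
  else
    echo "$failed"
  fi
}

failed_files "$demo_dir"                          # intact file
printf 'extra\n' >> "$demo_dir/sample_R1.fastq"   # simulate a corrupted download
failed_files "$demo_dir"                          # the modified file is now reported
```

Any file reported as FAILED should be downloaded again before further analysis.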
De-multiplexing of Illumina sequencing runs using Bcl2Fastq
The Bcl2Fastq tool is used for de-multiplexing data from the following sequencing platforms
- Illumina HiSeq 4000
- Illumina NextSeq 500
- Illumina NovaSeq 6000 (until May 2022)
- Illumina MiSeq (until May 2022)
Bcl2Fastq process summary
Illumina sequencing platforms generate binary BCL files for each run. These raw data files are picked up by genomic facility pipelines and processed for fastq file generation using the Bcl2Fastq software. A samplesheet file containing the correct index barcode information is essential for this “de-multiplexing” process, in order to allocate fastq reads to the individual samples and to filter out the artifacts present in the raw data.
Bcl2Fastq command line
Example Bcl2Fastq command.
bcl2fastq \
  --runfolder-dir /path/input \
  --sample-sheet /path/SampleSheet.csv \
  --output-dir /path/output \
  --reports-dir /path/Reports \
  --use-bases-mask BASES_MASK \
  --stats-dir /path/Stats \
  -p 2 \
  --create-fastq-for-index-reads \
  -w 1 \
  --barcode-mismatches 1 \
  -r 1 \
  --auto-set-to-zero-barcode-mismatches
Additionally, for short read cycles (less than 22 bp for R1 or R2), the default value of the bcl2fastq parameter --mask-short-adapter-reads is replaced with the length of the read cycle.
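The rule for short read cycles can be sketched as a small helper. The function name `mask_short_adapter_reads` is hypothetical, and the sketch assumes that 22 is the bcl2fastq default for this parameter, as implied by the threshold above.

```shell
# Return the value to pass to --mask-short-adapter-reads for a given read length.
# Reads shorter than the default threshold use their own length instead.
mask_short_adapter_reads() {
  read_cycles=$1
  default_mask=22          # assumed bcl2fastq default
  if [ "$read_cycles" -lt "$default_mask" ]; then
    echo "$read_cycles"
  else
    echo "$default_mask"
  fi
}

# Hypothetical read lengths: an 8 bp read and a 151 bp read
echo "--mask-short-adapter-reads $(mask_short_adapter_reads 8)"
echo "--mask-short-adapter-reads $(mask_short_adapter_reads 151)"
```

The printed option can be appended to the bcl2fastq command when the run contains a short R1 or R2.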
Bcl2Fastq samplesheet format
Samplesheets are plain text, comma-separated files named SampleSheet.csv. Each file is divided into multiple sections, each marked by a line starting with a section label. Please check the Illumina documentation for more details about the samplesheet file format specification.
- SampleSheet should contain the following Data columns
- Lane (optional)
- Sample_ID
- Sample_Name
- Sample_Plate
- Sample_Well
- I7_Index_ID
- index
- I5_Index_ID
- index2
- Sample_Project
- Description
Bcl2Fastq adapter trimming setting
The de-multiplexing pipeline is configured to trim the Illumina generic adapters from the reads with the default run settings.
Adapter name | Adapter Sequence |
Adapter | AGATCGGAAGAGCACACGTCTGAACTCCAGTCA |
AdapterRead2 | AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT |
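With bcl2fastq, adapter sequences for trimming are supplied through the [Settings] section of the samplesheet, using the Adapter and AdapterRead2 keys. The sketch below prints such a section with the two sequences from the table above. Whether the pipeline writes the section in exactly this form is an assumption, and `adapter_settings` is a hypothetical helper.

```shell
# Print a [Settings] section carrying the two adapter sequences listed above
adapter_settings() {
  cat <<'EOF'
[Settings]
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
EOF
}

adapter_settings
```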
Bcl2Fastq fastq output
Fastq files can be accessed from our iRODS data distribution server. Please check the Data access page for more details on this topic. The following files are present in each of the lane level tar files
- Fastq reads
- R1 and R2 are the read 1 and read 2 (check more information about Illumina paired-end sequencing reads)
- I1 and I2 are the index reads
- De-multiplexing html report
- Manifest file containing the md5 checksum of the fastq files
- Samplesheet file (for the lane)
Bcl2Fastq fastq file name
Illumina uses the following file name convention for the output fastq files
For example: sample-name_S1_L001_R1_001.fastq.gz
- sample-name : Name of the sample provided in the samplesheet
- S1 : Sample number, based on the order of the samples in the samplesheet
- L001 : Lane number of the flowcell
- R1 : The read, e.g. R1 indicates Read 1 and R2 indicates Read 2 of a paired-end run
- 001 : The last segment is always 001
- .fastq.gz : File extension, indicating a gzipped fastq file
Please check the Illumina BCL2Fastq documentation for more information.
Bcl2Fastq fastq file validation
Steps:
- Download the tar files from the iRODS server and extract them (use 7zip on Windows)
- In Mac/Linux
- Open a terminal and cd to the top level dir (look for PROJECT_NAME_file_manifest.csv)
- Run md5sum for file validation, e.g.
awk '{print $2" " $1}' PROJECT_NAME_file_manifest.csv |grep -v file|md5sum -c
Bcl2Fastq de-multiplexing of single cell samples (10xgenomics)
De-multiplexing of single cell samples is done using the specific set of single cell barcodes, following the 10x Genomics documentation.
Bcl2Fastq command line for single cell samples
Example Bcl2Fastq command.
bcl2fastq \
  --runfolder-dir /path/input \
  --sample-sheet /path/SampleSheet.csv \
  --output-dir /path/output \
  --reports-dir /path/Reports \
  --use-bases-mask BASES_MASK \
  --stats-dir /path/Stats \
  -p 2 \
  --create-fastq-for-index-reads \
  -w 1 \
  --barcode-mismatches 1 \
  -r 1 \
  --auto-set-to-zero-barcode-mismatches \
  --mask-short-adapter-reads=8 \
  --minimum-trimmed-read-length=8