Analysis of whole human DNA sequence data
For the analysis of whole human DNA sequence data, we recommend the wf-human-variation workflow. This end-to-end software pipeline is written in the Nextflow workflow language and provides methods for calling single nucleotide polymorphisms (SNPs) and structural variants (SVs), and for reporting DNA methylation information.
The wf-human-variation workflow is best run from the BAM file produced by MinKNOW when the modified base model for basecalling is selected. If sequence read mapping to the reference genome is not performed by MinKNOW, the analysis workflow will automatically perform the read mapping when provided with the reference sequence and will store the mapping data in a CRAM format output file.
The three tools below are used in the analysis workflow and can be run in isolation or together:
Sniffles2 calls SVs; its output includes an HTML report of QC metrics and a VCF format list of variants and their quality scores.
Clair3 calls SNPs; its output includes an HTML report of QC metrics and a VCF format list of variants and their quality scores.
modkit extracts methylation information from the provided BAM file which is summarised in a BED format file.
The wf-human-variation workflow is preconfigured using appropriate parameters and requires tuning only for the choice of reference genome and Clair3 model. Please see the project’s documentation for further details.
The results from the wf-human-variation workflow can be explored further in a track-based genome browser such as IGV or JBrowse2, or assessed for known pathogenicity using tertiary analysis software.
EPI2ME analysis workflow
The wf-human-variation workflow is intended to be run from the Nextflow software at the command line. For users who prefer a graphical user interface (GUI), the EPI2ME software provides a simplified interface where analysis runs can be specified, configured, and run.
New users can consult the quick start guide, which outlines how to use this interface.
How to run the workflow on EPI2ME
Set up:
To run the workflow on EPI2ME, ensure Nextflow is installed to manage compute and software resources, alongside either Docker or Singularity.
Running the wf-human-variation workflow:
To test the workflow, users can run it on the demonstration data.
1. To obtain the workflow and list the available options, use the following command:
nextflow run epi2me-labs/wf-human-variation --help
2. Test the workflow software using the demonstration data, which can be downloaded and unpacked with the following commands:
wget -O demo_data.tar.gz \
  https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-human-variation/demo_data.tar.gz
tar -xzvf demo_data.tar.gz
3. Next, choose which subworkflows of the analysis workflow to run, together or in isolation, with the following command line options:
- Basecalling:
--fast5_dir <input_dir>
- SNP calling:
--snp
- SV calling:
--sv
- To improve SV calling by specifying the tandem repeats in the reference sequence:
--tr_bed
- Methylation aggregation:
--methyl
- For 5mC aggregation, ensure the modified bases option and the 5mC basecaller model were selected during the MinKNOW set up. If not, the data will need to be re-basecalled.
Each subworkflow runs only when its command line option is supplied; if the option is omitted, that subworkflow is skipped.
4. To activate the basecalling workflow with all the subworkflows, use all the above command line options as follows:
OUTPUT=output
nextflow run epi2me-labs/wf-human-variation \
  -w ${OUTPUT}/workspace \
  -profile standard \
  --snp --sv --methyl \
  --fast5_dir path/to/fast5/dir \
  --basecaller_cfg 'dna_r10.4.1_e8.2_400bps_hac@v3.5.2' \
  --remora_cfg 'dna_r10.4.1_e8.2_400bps_hac@v3.5.2_5mCG@v2' \
  --bed path/to.bed \
  --ref path/to.fasta \
  --out_dir ${OUTPUT}
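Because each subworkflow runs only when its flag is present, it can help to assemble the flag list in one place and inspect the command before launching it. The following is a minimal sketch under stated assumptions, not part of the workflow: the helper name `build_wf_cmd` and its argument order are hypothetical, and it uses only the options already listed in this guide. It prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not provided by the workflow).
# Usage: build_wf_cmd <ref.fasta> <out_dir> <analysis>...
# Each <analysis> is one of: snp, sv, methyl.
build_wf_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: build_wf_cmd <ref.fasta> <out_dir> <analysis>..." >&2
        return 1
    fi
    ref="$1"
    out_dir="$2"
    shift 2
    # A subworkflow runs only if its flag is given, so require at least one.
    if [ "$#" -eq 0 ]; then
        echo "no analysis selected" >&2
        return 1
    fi
    cmd="nextflow run epi2me-labs/wf-human-variation -profile standard --ref $ref --out_dir $out_dir"
    for analysis in "$@"; do
        case "$analysis" in
            snp|sv|methyl) cmd="$cmd --$analysis" ;;
            *) echo "unknown analysis: $analysis" >&2; return 1 ;;
        esac
    done
    # Print only; the caller decides whether to run the command.
    printf '%s\n' "$cmd"
}
```

For example, `build_wf_cmd path/to.fasta output snp sv` prints a command with `--snp` and `--sv` appended and `--methyl` left out. The input options (such as `--fast5_dir` and the basecaller settings shown in step 4) still need to be added before the printed command is run.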
Workflow outputs
The primary workflow outputs include:
- gzipped VCF file containing the SNPs in the dataset from
--snp
- gzipped VCF file containing the SVs in the dataset from
--sv
- gzipped bedMethyl file aggregating modified CpG base counts from
--methyl
- HTML report detailing the primary findings of the workflow for SNP and SV calling
- For basecalling and alignment, the workflow will output two sorted, indexed CRAM files of basecalls aligned to the provided reference, with reads separated by their quality score:
<sample_name>.pass.cram
The above contains reads with a qscore ">= threshold" (only pass reads are used to make downstream variant calls).
<sample_name>.fail.cram
The above contains reads with a qscore "< threshold".
- If an unaligned BAM file was provided, the workflow will output a CRAM file containing the alignments used to make the downstream variant calls.
The secondary workflow outputs include:
- {sample_name}.mapula.csv and {sample_name}.mapula.json: provide basic alignment metrics such as primary and secondary counts, read N50 and median accuracy
- mosdepth outputs include:
  - {sample_name}.mosdepth.global.dist.txt: a cumulative distribution indicating the proportion of total bases for each and all reference sequences
  - {sample_name}.regions.bed.gz: the mean coverage for each region in the provided BED file
  - {sample_name}.thresholds.bed.gz: the number of bases in each region that are covered at or above each threshold value (1, 10, 20, 30X)
- {sample_name}.readstats.tsv.gz: a gzipped TSV summarising per-alignment statistics produced by bamstats
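The thresholds file lends itself to a quick per-region coverage check. The sketch below is illustrative only and assumes the usual mosdepth thresholds layout: a tab-separated file with a header line starting with `#`, followed by the columns chrom, start, end, region name, then the 1X, 10X, 20X and 30X base counts. The function name `coverage_20x_fraction` is hypothetical. Confirm the column order against your own file before relying on the result.

```shell
#!/bin/sh
# Sketch: for each region, print the fraction of bases covered at >= 20X.
# Assumed columns: chrom, start, end, region, 1X, 10X, 20X, 30X (tab-separated).
coverage_20x_fraction() {
    gzip -dc "$1" | awk -F '\t' '
        /^#/ { next }                  # skip the header line
        {
            len = $3 - $2              # region length in bases
            if (len > 0)
                printf "%s\t%.3f\n", $4, $7 / len
        }'
}

# Example call (the file name is a placeholder):
# coverage_20x_fraction sample.thresholds.bed.gz
```

Regions with a low fraction are candidates for closer inspection in a genome browser before interpreting variant calls there.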
Workflow tips:
- Users familiar with wf-human-snp and wf-human-sv are recommended to familiarise themselves with any parameter changes by using --help, as there are slight differences between the workflows. For example, all arms of the variation calling workflow use --ref rather than --reference, and --bed rather than --target_bedfile.
- To improve the accuracy of SV calling, specify a suitable tandem repeat BED for your reference with --tr_bed.
- Aggregation of methylation calls with --methyl requires data to be basecalled with a model that includes base modifications, providing the MM and ML BAM tags. To do so on MinKNOW, ensure the 'Modified bases' option is selected during basecalling set up, with the '5mC' model selected.
- Retain the input reference when basecalling or alignment is performed, as CRAM files cannot be read without the corresponding input reference.
For a full list of available basecalling models, refer to the Dorado documentation.
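Before running methylation aggregation, it can be worth confirming that the input reads actually carry the MM and ML tags mentioned in the tips. The following is a minimal sketch rather than a workflow feature: the function name `has_mod_tags` is hypothetical, and it assumes SAM-format text on standard input (for example from `samtools view`), where optional tags begin at column 12.

```shell
#!/bin/sh
# Sketch: succeed only if at least one SAM record on stdin carries both the
# MM and ML modified-base tags.
has_mod_tags() {
    awk -F '\t' '
        /^@/ { next }                        # skip SAM header lines
        {
            mm = 0; ml = 0
            for (i = 12; i <= NF; i++) {     # optional tags start at column 12
                if ($i ~ /^MM:Z:/) mm = 1
                if ($i ~ /^ML:B:/) ml = 1
            }
            if (mm && ml) { found = 1; exit }
        }
        END { exit (found ? 0 : 1) }'
}

# Example call (reads.bam is a placeholder; requires samtools):
# samtools view reads.bam | head -n 1000 | has_mod_tags && echo "MM/ML tags present"
```

If the check fails on a representative sample of reads, the data will most likely need to be re-basecalled with a modified base model, as described above.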