-
Analysis of whole human DNA sequence data
For the analysis of whole human DNA sequence data, we recommend the wf-human-variation workflow. This end-to-end software pipeline is implemented in the Nextflow workflow language and provides methods for calling single nucleotide polymorphisms (SNPs) and structural variants (SVs), and for reporting DNA methylation information.
The wf-human-variation workflow is best run from the BAM file produced by MinKNOW when the modified base model for basecalling is selected. If sequence read mapping to the reference genome is not performed by MinKNOW, we recommend performing the basecalling using the wf-basecalling workflow. Ensure you save the outputs in BAM format by providing the --output_bam option.
The tools below are used in the analysis workflow and can be run in isolation or together:
Sniffles2 calls SVs; its file outputs include an HTML report of QC metrics and a VCF-format list of variants and their quality scores.
Clair3 calls SNPs; its file outputs include an HTML report of QC metrics and a VCF-format list of variants and their quality scores.
modkit extracts methylation information from the provided BAM file, which is summarised in a bedMethyl format file.
The wf-human-variation workflow is preconfigured using appropriate parameters and requires tuning only for the choice of reference genome and Clair3 model. Please see the project’s documentation for further details.
The results from the wf-human-variation workflow can be further explored in a track-based genome browser such as IGV, and can be assessed for known pathogenicity through tertiary analysis software.
-
EPI2ME analysis workflow
The wf-human-variation workflow is intended to be run from the Nextflow software at the command line.
For new users or users who prefer to interact with software through Graphical User Interfaces (GUI), we recommend using the EPI2ME software. This provides a simplified user interface where analysis runs can be specified, configured, and run.
For new users, the quick start guide outlines how to use this interface.
-
Command line interface (CLI) analysis workflow
How to run the workflow from command line interface (CLI)
Set up:
To run the workflow from CLI, ensure Nextflow is installed to manage compute and software resources, alongside either Docker or Singularity.
Running the wf-human-variation workflow:
To test the workflow, you can run it on the demonstration data.
1. To obtain the workflow and list the available options, use the following command:
nextflow run epi2me-labs/wf-human-variation --help
2. Test the workflow software using the demonstration data and the following command:
wget -O demo_data.tar.gz \
  https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-human-variation/demo_data.tar.gz
tar -xzvf demo_data.tar.gz
3. Next, the analysis workflow is made up of subworkflows that can be run together or in isolation using the following command line options:
- SNP calling:
--snp
- SV calling:
--sv
- To specify tandem repeats in the reference sequence with SV calling to improve calling:
--tr_bed
- Methylation aggregation:
--mod
- For 5mC aggregation, ensure the modified bases option and the 5mC basecaller model were selected during the MinKNOW set up. If not, the data will need to be re-basecalled.
Each subworkflow runs only when its command line option is provided.
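As a rough illustration of how these options combine, the following shell sketch prints (rather than runs) a command that enables only the requested subworkflows. The build_cmd helper and the placeholder paths are hypothetical; only the --snp, --sv, --mod, --bam, --ref and --out_dir options are taken from this guide.

```shell
# Hypothetical helper (not part of the workflow): print a wf-human-variation
# command that enables only the subworkflows named as arguments.
build_cmd() {
  flags=""
  for sub in "$@"; do
    case "$sub" in
      snp|sv|mod) flags="$flags --$sub" ;;
      *) echo "unknown subworkflow: $sub" >&2; return 1 ;;
    esac
  done
  # Placeholder paths; replace them with your own files.
  echo "nextflow run epi2me-labs/wf-human-variation$flags --bam path/to/input.bam --ref path/to.fasta --out_dir output"
}

# Example: SV calling and methylation aggregation only.
build_cmd sv mod
```

Printing the command first makes it easy to review which subworkflows will run before any compute is used.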
4. To activate the variant calling workflow with all the subworkflows, use all the command line options previously mentioned as follows:
OUTPUT=output
nextflow run epi2me-labs/wf-human-variation \
  -w ${OUTPUT}/workspace \
  -profile standard \
  --snp --sv --mod \
  --bam path/to/input.bam \
  --bed path/to.bed \
  --ref path/to.fasta \
  --out_dir ${OUTPUT}
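Because a full run can take a long time, it may be worth confirming that the input files exist before launching it. The sketch below is one possible pre-flight check; the check_inputs function is a hypothetical helper and the paths are the placeholders used in this guide, not real files.

```shell
# Hypothetical pre-flight check: confirm each input file exists and is
# non-empty before starting a long workflow run.
check_inputs() {
  status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty input: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Placeholder paths; replace them with your own BAM, BED and reference files.
if check_inputs path/to/input.bam path/to.bed path/to.fasta; then
  echo "inputs look fine"
else
  echo "fix the inputs before running the workflow"
fi
```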
-
wf-human-variation workflow outputs
The primary workflow outputs include:
- gzipped VCF file containing the SNPs in the dataset from
--snp
- gzipped VCF file containing the SVs in the dataset from
--sv
- gzipped bedMethyl file aggregating modified base counts from
--mod
- HTML report detailing the primary findings of the workflow for QC metrics, and SNP and SV calling
- If an unaligned BAM file was provided, the workflow will output a CRAM file containing the alignments used to make the downstream variant calls.
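A quick sanity check on the gzipped VCF outputs is to count the variant records they contain. The following is a minimal sketch using standard gzip and awk; the function names and the example path are hypothetical, and the exact output file names depend on your sample name and workflow version.

```shell
# Count all variant records (non-header lines) in a gzipped VCF.
count_variants() {
  gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}

# Count only records whose FILTER column (column 7) is PASS.
count_pass() {
  gzip -dc "$1" | awk -F'\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }'
}

# Hypothetical path; point this at a VCF produced by --snp or --sv.
vcf="output/sample.vcf.gz"
if [ -f "$vcf" ]; then
  echo "total records: $(count_variants "$vcf")"
  echo "PASS records:  $(count_pass "$vcf")"
fi
```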
The secondary workflow outputs include:
- mosdepth outputs include:
  - {sample_name}.mosdepth.global.dist.txt: a cumulative distribution indicating the proportion of total bases for each and all reference sequences
  - {sample_name}.regions.bed.gz: the mean coverage for each region in the provided BED file
  - {sample_name}.thresholds.bed.gz: the number of bases in each region that are covered at or above each threshold value (1, 10, 20, 30X)
- bamstats outputs include:
  - {sample_name}.readstats.tsv.gz: a gzipped TSV summarising per-alignment statistics produced by bamstats
  - {sample_name}.flagstat.tsv: a text file with summary alignment statistics for each reference sequence
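The mosdepth thresholds file can be summarised into a single coverage figure. The sketch below estimates the fraction of bases across all BED regions covered at 20X or more. It assumes a tab-separated layout of chrom, start, end, region, 1X, 10X, 20X, 30X with a header line starting with '#'; check your own file before relying on the column positions. The function name is hypothetical.

```shell
# Fraction of bases across all regions covered at >= 20X.
# Assumed columns: chrom, start, end, region, 1X, 10X, 20X, 30X (tab-separated),
# with a header line that starts with '#'.
frac_at_20x() {
  gzip -dc "$1" | awk -F'\t' '
    /^#/ { next }                                   # skip the header line
    { total += $3 - $2; covered += $7 }             # region length, bases >= 20X
    END {
      if (total > 0) printf "%.3f\n", covered / total
      else print "NA"
    }'
}

# Usage (path is a placeholder):
#   frac_at_20x output/sample.thresholds.bed.gz
```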
-
wf-human-variation workflow tips
- It is possible to phase SNPs, SVs and modified bases by providing the --phased option.
- To improve the accuracy of SV calling, specify a suitable tandem repeat BED for your reference with --tr_bed.
- Aggregation of methylation calls with --mod requires data to be basecalled with a model that includes base modifications, providing the MM and ML BAM tags. To do so on MinKNOW, ensure the 'Modified bases' option is selected during basecalling set up, with the '5mC' model selected.
- Ensure you retain the input reference when basecalling or alignment is performed, as CRAM files cannot be read without the corresponding input reference.
For a full list of available basecalling models, refer to the Dorado documentation.
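Before running with --mod, it can help to confirm that reads actually carry the MM and ML tags. The sketch below counts SAM records on standard input that have both tags; the count_mod_tagged function is a hypothetical helper, and the usage line assumes samtools is installed separately.

```shell
# Count SAM records on stdin that carry both the MM and ML tags.
# Optional tags start at column 12 of a SAM record.
count_mod_tagged() {
  awk -F'\t' '
    /^@/ { next }                       # skip SAM header lines
    {
      mm = 0; ml = 0
      for (i = 12; i <= NF; i++) {
        if ($i ~ /^MM:Z:/) mm = 1
        if ($i ~ /^ML:B:/) ml = 1
      }
      if (mm && ml) n++
    }
    END { print n + 0 }'
}

# Usage (requires samtools; the path is a placeholder):
#   samtools view path/to/input.bam | head -n 1000 | count_mod_tagged
```

A result of 0 on a sample of reads suggests the data was basecalled without a modified base model and would need to be re-basecalled.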