Duplex basecalling

Note: We recommend using our Dorado basecaller to perform duplex basecalling. For more information, please see the Dorado page on GitHub or the "Basecalling Kit 14 duplex data" section of the Kit 14 Sequencing and Duplex Basecalling info sheet.

The Guppy toolkit now supports duplex basecalling, in which the basecall data from the template and complement strands of a read are combined to produce a more accurate sequence. To perform duplex basecalling, the template and complement read pairs must first be identified.

The guppy_basecaller_duplex tool currently provides two options for pairing reads:
- from_pair_list will instruct the Guppy duplex basecaller to use a text file containing read information indicating the source reads to be paired. This file may contain either two or eight whitespace-separated columns per line. The first two columns are the read ids of the reads to be duplexed. If the two reads are the results of a read being split, additional pairs of columns should be present specifying the parent read id, start time and duration (in seconds) of the read segments.
- from_1d_summary will instruct the Guppy duplex basecaller to read in a previously-generated Guppy 1D basecall summary file. This file will be used to identify pairs of reads which were sequenced through the same flow cell and channel in rapid succession, marking them as potential pairs. Split reads are handled automatically.
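
As a pre-flight check for from_pair_list, the sketch below verifies only the one property of the pair file that the description above pins down: each line has either two or eight whitespace-separated columns. The function name check_pair_list is hypothetical, skipping blank lines is an assumption, and the order of the six extra columns is deliberately not interpreted.

```python
def check_pair_list(lines):
    """Return (line_number, column_count) for every line that has
    neither 2 nor 8 whitespace-separated columns."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            # Assumption: blank lines are ignored here. How Guppy treats
            # them is not stated in this guide.
            continue
        if len(fields) not in (2, 8):
            problems.append((number, len(fields)))
    return problems


# Placeholder values only; the extra columns are not given any meaning here.
example = [
    "read-a read-b",
    "read-c read-d col3 col4 col5 col6 col7 col8",
    "read-e",
]
print(check_pair_list(example))  # [(3, 1)]
```

Running a check like this before launching the basecaller is cheap and may catch a truncated or malformed pair file early.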
The guppy_basecaller_duplex executable can be launched as follows:

guppy_basecaller_duplex --input_path <path to reads> --save_path <output folder> -x "cuda:0" --config dna_r10.4.1_e8.2_400bps_sup.cfg --duplex_pairing_mode from_pair_list --duplex_pairing_file <text pair file>

Note that duplex basecalling is very resource-intensive (especially when using the highest accuracy models), so it is strongly recommended to use GPU mode basecalling if possible.
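
When the basecaller is launched from a script, it can help to assemble the command as an argument list and reject an invalid pairing mode before starting an expensive run. The following is a minimal sketch that uses only the flags shown in the command above; the function name build_duplex_command and the example paths are hypothetical.

```python
import shlex

VALID_PAIRING_MODES = ("from_1d_summary", "from_pair_list")


def build_duplex_command(input_path, save_path, config,
                         pairing_mode, pairing_file, device="cuda:0"):
    """Assemble a guppy_basecaller_duplex argument list (hypothetical helper)."""
    if pairing_mode not in VALID_PAIRING_MODES:
        raise ValueError(f"pairing mode must be one of {VALID_PAIRING_MODES}")
    if not pairing_file:
        raise ValueError("a pairing file must be specified")
    return [
        "guppy_basecaller_duplex",
        "--input_path", input_path,
        "--save_path", save_path,
        "-x", device,
        "--config", config,
        "--duplex_pairing_mode", pairing_mode,
        "--duplex_pairing_file", pairing_file,
    ]


cmd = build_duplex_command(
    "reads", "duplex_out", "dna_r10.4.1_e8.2_400bps_sup.cfg",
    "from_1d_summary", "sequencing_summary.txt",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```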

For further information on duplex basecalling, please see our Duplex Tools page on GitHub.
Additional arguments

In addition to the arguments supported by the 1D basecaller, guppy_basecaller_duplex supports these additional arguments:
- duplex_pairing_mode: The read pairing mode to use for duplex basecalling. Must be one of 'from_1d_summary' or 'from_pair_list'. This argument must be specified.
- duplex_pairing_file: The input filename to use for duplex pairing. Must be a list of pairs of read ids for from_pair_list pairing, or a Guppy sequencing_summary file for from_1d_summary pairing. This argument must be specified.
Note that the sequencing summary produced by duplex basecalling has a different set of columns. The following columns have a different meaning from the one they have in 1D basecalls:
- read_id: The uuid that uniquely identifies the template source strand of this duplex read.
- filename: The name of the input read file which the template read came from.

The following column will be added:
- duplex_pair_read_id: The uuid that uniquely identifies the complement source strand of this duplex read.
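
To illustrate how these columns relate, the sketch below maps each template read id to its complement read id from a duplex sequencing summary. It assumes the summary is a tab-separated file with a header row; the function name duplex_pairs and the sample rows are made up for illustration.

```python
import csv
import io


def duplex_pairs(summary_file):
    """Map template read_id -> complement duplex_pair_read_id."""
    # Assumption: the summary is tab-separated with a header line.
    reader = csv.DictReader(summary_file, delimiter="\t")
    return {row["read_id"]: row["duplex_pair_read_id"] for row in reader}


# Made-up sample content, not real Guppy output.
sample = io.StringIO(
    "filename\tread_id\tduplex_pair_read_id\n"
    "batch_0.fast5\ttemplate-1\tcomplement-1\n"
    "batch_0.fast5\ttemplate-2\tcomplement-2\n"
)
for template, complement in duplex_pairs(sample).items():
    print(template, "<->", complement)
```

With a real file, pass an open file handle instead of the in-memory sample.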

Duplex basecalling support in Guppy is still at the prototype stage, and there are some limitations to be aware of:

The maximum read size supported for duplex calling on GPU depends on the available device memory. It can be controlled by setting the --chunks_per_runner option for the duplex basecaller. For best runtime performance, it is currently recommended to use the highest --chunks_per_runner setting that the device can support. The recommendations below will need to be adjusted down if other work is being done on the GPU (such as using Guppy for barcoding):

Available GPU memory	--chunks_per_runner setting	Approximate maximum duplex read length
40 GB (e.g. A100)	1200	400 kb
32 GB (e.g. GV100)	900	300 kb
16 GB (e.g. V100)	450	150 kb
12 GB (e.g. GTX 1080 Ti)	320	106 kb

This limit on read length will be removed in future releases.
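
The table above can be encoded as a simple lookup. The sketch below returns the recommended setting for the largest listed memory tier that fits in the memory given; choosing the tier at or below the available memory, rather than interpolating between rows, is a conservative assumption and not documented Guppy behaviour. The function name recommend_chunks_per_runner is hypothetical.

```python
# (GPU memory in GB, --chunks_per_runner, approximate max duplex read length in kb)
# Values copied from the table above, largest tier first.
RECOMMENDATIONS = [
    (40, 1200, 400),
    (32, 900, 300),
    (16, 450, 150),
    (12, 320, 106),
]


def recommend_chunks_per_runner(gpu_memory_gb):
    """Return (chunks_per_runner, max_read_kb) for the largest listed tier
    that fits, or None if the GPU has less memory than the smallest tier."""
    for memory_gb, chunks, max_read_kb in RECOMMENDATIONS:
        if gpu_memory_gb >= memory_gb:
            return chunks, max_read_kb
    return None


print(recommend_chunks_per_runner(24))  # (450, 150)
print(recommend_chunks_per_runner(8))   # None
```

If the GPU is shared with other work, pass the memory actually free for basecalling rather than the card's total.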

Duplex pipeline

ont_guppy_duplex_pipeline is a Python module that performs the sequence of processes required for duplex basecalling with Guppy. It is available on PyPI and can be installed via pip:

pip install ont-guppy-duplex-pipeline

The duplex pipeline comprises the following steps:
1. (Optional) simplex (1D) basecalling.
2. Identification of duplex pairs in the simplex basecall results.
3. (Optional) duplex basecalling of those pairs.
4. (Optional) simplex basecalling of all reads that were not part of a duplex pair, using the same configuration as the duplex basecalling.
To process reads with the duplex pipeline, call guppy_duplex with the required parameters:

guppy_duplex -i <read_folder> -s <output_folder>

The guppy_duplex pipeline also supports a number of optional parameters in addition to those required above.

Executables
- Path to the basecaller executable (--basecaller_exe): If this is not set, the pipeline assumes that guppy_basecaller is already in the path.
- Path to the duplex basecaller executable (--duplex_basecaller_exe): If this is not set, the pipeline assumes that guppy_basecaller_duplex is already in the path.
Input/output
- Recursive (-r or --recursive): Search through all subfolders of the input folder and basecall any read files found in them.
Processing steps
- Skip simplex (--skip_simplex): Skip the initial simplex basecall step. The pipeline will assume that the sequencing_summary.txt file exists.
- Skip duplex (--skip_duplex): Skip the duplex basecall step.
- Non-duplex reads (--call_non_duplex_reads): When specified, this flag runs an additional basecall (using the same configuration as the duplex basecall) on reads that did not participate in a duplex pair. As the duplex step is typically performed using a higher-accuracy configuration than the simplex step, this provides high-accuracy basecalls of the non-duplex reads as part of the pipeline.
- Read splitting (--do_read_splitting): Perform read splitting during the initial simplex basecall to separate potentially concatenated reads.
Configurations
- Simplex configuration (--simplex_config): Sets the configuration to use for the initial simplex basecall. This will typically be a "fast" model. Defaults to dna_r10.4.1_e8.2_400bps_fast.cfg.
- Duplex configuration (--duplex_config): Sets the configuration to use for the duplex basecall. This will typically be a "hac" or "sup" model. This configuration will also be used to call the non-duplex reads if --call_non_duplex_reads is set. Defaults to dna_r10.4.1_e8.2_400bps_sup.cfg.
Optimisation
- GPU device (-d or --device): Specify the CUDA-enabled GPU to use to perform basecalling. See the Optimisation section under "Guppy features, settings and analysis" for more details.
- Duplex chunks per runner (--duplex_chunks_per_runner): Passed to the Guppy duplex basecaller when performing duplex basecalling. Decrease this value in case of out-of-memory errors.
Other
- Disable logging (--disable_logging): Turns off logging.
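
The optional parameters listed above can be combined into a guppy_duplex call from a script. The sketch below is illustrative only: it assumes the on/off parameters are passed as bare switches and the others take a single value after the flag, which this guide implies but does not state for every option. The function name build_pipeline_command is hypothetical.

```python
# Long option names taken from the lists above.
SWITCHES = {"recursive", "skip_simplex", "skip_duplex",
            "call_non_duplex_reads", "do_read_splitting", "disable_logging"}
VALUED = {"basecaller_exe", "duplex_basecaller_exe", "simplex_config",
          "duplex_config", "device", "duplex_chunks_per_runner"}


def build_pipeline_command(read_folder, output_folder, switches=(), **options):
    """Assemble a guppy_duplex argument list (hypothetical helper)."""
    cmd = ["guppy_duplex", "-i", read_folder, "-s", output_folder]
    for name in switches:
        if name not in SWITCHES:
            raise ValueError(f"unknown switch: {name}")
        cmd.append(f"--{name}")
    for name, value in options.items():
        if name not in VALUED:
            raise ValueError(f"unknown option: {name}")
        # Assumption: each valued option takes exactly one argument.
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_pipeline_command(
    "reads", "pipeline_out",
    switches=["call_non_duplex_reads"],
    duplex_config="dna_r10.4.1_e8.2_400bps_sup.cfg",
    duplex_chunks_per_runner=450,
)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```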
