Duplex basecalling
Note: We recommend using our Dorado basecaller to perform duplex basecalling. For more information, please see the Dorado page on GitHub or the "Basecalling Kit 14 duplex data" section of the Kit 14 Sequencing and Duplex Basecalling info sheet.
The Guppy toolkit now supports performing duplex basecalling, where the template and complement strands of a read can have their basecall data combined to provide a more accurate sequence. To perform duplex basecalling, the template and complement read pairs must first be identified.
The guppy_basecaller_duplex tool currently provides two options for pairing reads:
- from_pair_list instructs the Guppy duplex basecaller to use a text file that lists the source reads to be paired. Each line of this file contains either two or eight whitespace-separated columns. The first two columns are the read ids of the reads to be duplexed. If the two reads are the result of a read being split, additional pairs of columns should be present specifying the parent read id, start time and duration (in seconds) of the read segments.
- from_1d_summary instructs the Guppy duplex basecaller to read in a previously generated Guppy 1D basecall summary file. This file is used to identify pairs of reads which were sequenced through the same flow cell and channel in rapid succession, marking them as potential pairs. Split reads are handled automatically.
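As an illustration of the two-column form of the pair file described above, the following minimal sketch writes one template/complement pair per line. The function name and the read ids are hypothetical, and the eight-column form for split reads is not covered here.

```python
from pathlib import Path


def write_pair_list(pairs, path):
    """Write (template_id, complement_id) tuples as a two-column pair file.

    Hypothetical helper: only the two-column layout is produced.
    """
    lines = []
    for template_id, complement_id in pairs:
        for read_id in (template_id, complement_id):
            # Columns are whitespace-separated, so an id must not contain any.
            if not read_id or any(ch.isspace() for ch in read_id):
                raise ValueError(f"invalid read id: {read_id!r}")
        lines.append(f"{template_id} {complement_id}")
    Path(path).write_text("".join(line + "\n" for line in lines))
    return len(lines)


# Hypothetical read ids, for illustration only.
example_pairs = [
    ("11111111-aaaa-4aaa-8aaa-000000000001", "11111111-aaaa-4aaa-8aaa-000000000002"),
    ("22222222-bbbb-4bbb-8bbb-000000000001", "22222222-bbbb-4bbb-8bbb-000000000002"),
]
```

The resulting file could then be passed to the duplex basecaller as the pairing file when the pairing mode is from_pair_list.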
The guppy_basecaller_duplex executable can be launched as follows:
guppy_basecaller_duplex --input_path <path to reads> --save_path <output folder> -x "cuda:0" --config dna_r10.4.1_e8.2_400bps_sup.cfg --duplex_pairing_mode from_pair_list --duplex_pairing_file <text pair file>
Note that duplex basecalling is very resource-intensive (especially when using the highest accuracy models), so it is strongly recommended to use GPU mode basecalling if possible.
For further information on duplex basecalling, please see our Duplex Tools page on GitHub.
Additional arguments
In addition to the arguments supported by the 1D basecaller, guppy_basecaller_duplex supports these additional arguments:
- duplex_pairing_mode: The read pairing mode to use for duplex basecalling. Must be one of 'from_1d_summary' or 'from_pair_list'. This argument must be specified.
- duplex_pairing_file: The input filename to use for duplex pairing. Must be a list of pairs of read ids for from_pair_list pairing, or a Guppy sequencing_summary file for from_1d_summary pairing. This argument must be specified.
Note that when performing duplex basecalling, the sequencing summary will have different columns available. The following columns carry different information than they do in 1D basecalls:
- read_id: The uuid that uniquely identifies the template source strand of this duplex read.
- filename: The name of the input read file which the template read came from.
The following column will be added:
- duplex_pair_read_id: The uuid that uniquely identifies the complement source strand of this duplex read.
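To show how these columns relate, here is a minimal sketch that maps each template read_id to its complement duplex_pair_read_id. It assumes the duplex sequencing summary is a tab-separated text file with a header row; the function name and the example rows are hypothetical.

```python
import csv
import io


def duplex_pairs_from_summary(handle):
    """Map each template read_id to its complement duplex_pair_read_id.

    Assumes a tab-separated summary with a header row (an assumption,
    not something the text above states).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = {}
    for row in reader:
        pairs[row["read_id"]] = row["duplex_pair_read_id"]
    return pairs


# Tiny in-memory summary with hypothetical values.
example = io.StringIO(
    "filename\tread_id\tduplex_pair_read_id\n"
    "reads_0.fast5\ttemplate-1\tcomplement-1\n"
    "reads_0.fast5\ttemplate-2\tcomplement-2\n"
)
pairs = duplex_pairs_from_summary(example)
```

For a real run, the same function could be given an open file handle to the summary written into the save path.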
Duplex basecalling is still in prototype support in Guppy, and there are some limitations to be aware of:
There is a maximum read size which is supported for duplex calling on GPU, based on the available device memory. It can be controlled by setting the --chunks_per_runner option for the duplex basecaller. To obtain best runtime performance, it is currently recommended to use the highest possible setting for --chunks_per_runner which the device can support. Here are some recommendations; these will need to be adjusted down by the user if they are doing other work on the GPU (such as using Guppy for barcoding):

Available GPU memory          --chunks_per_runner setting    Approximate maximum duplex read length
40 GB (e.g. A100)             1200                           400 kb
32 GB (e.g. GV100)            900                            300 kb
16 GB (e.g. V100)             450                            150 kb
12 GB (e.g. GTX 1080 Ti)      320                            106 kb

This limit on read length will be removed in future releases.
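The recommendations above can be expressed as a small lookup. This is only a sketch: the function name is hypothetical, and the rule of picking the largest listed memory size that does not exceed the available memory is a conservative assumption, not documented Guppy behaviour.

```python
# (GPU memory in GB, --chunks_per_runner, approx. max duplex read length in kb)
# Values copied from the recommendations above, largest first.
RECOMMENDATIONS = [
    (40, 1200, 400),
    (32, 900, 300),
    (16, 450, 150),
    (12, 320, 106),
]


def recommend_chunks_per_runner(available_gb):
    """Return (chunks_per_runner, max_read_kb) for the largest listed
    memory size that fits in available_gb, or None if none fits."""
    for memory_gb, chunks, max_kb in RECOMMENDATIONS:
        if available_gb >= memory_gb:
            return chunks, max_kb
    return None
```

For example, a GPU with 24 GB free falls back to the 16 GB row. The value may still need lowering if other work shares the GPU.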
Duplex pipeline
ont_guppy_duplex_pipeline is a Python module that performs the sequence of processes required for duplex basecalling with Guppy. It is available on PyPI and can be installed via pip:
pip install ont-guppy-duplex-pipeline
The duplex pipeline comprises the following steps:
- (Optional) simplex (1D) basecalling.
- Identification of duplex pairs in the simplex basecall results.
- (Optional) duplex basecalling of those pairs.
- (Optional) simplex basecalling of all reads that were not part of a duplex pair, using the same configuration as the duplex basecalling.
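The following sketch models which of these steps run for a given combination of the pipeline's --skip_simplex, --skip_duplex and --call_non_duplex_reads flags. It illustrates the step list only and is not the pipeline's actual implementation. The function name is hypothetical, and treating the three flags as independent of one another is an assumption.

```python
def planned_steps(skip_simplex=False, skip_duplex=False, call_non_duplex_reads=False):
    """Return the ordered list of pipeline steps implied by the flags."""
    steps = []
    if not skip_simplex:
        steps.append("simplex basecalling")
    # Pair identification is the only step not marked optional.
    steps.append("duplex pair identification")
    if not skip_duplex:
        steps.append("duplex basecalling")
    if call_non_duplex_reads:
        steps.append("simplex basecalling of non-duplex reads")
    return steps
```

With no flags set, the sketch yields the first three steps. The fourth is added only when non-duplex calling is requested.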
To process reads with the duplex pipeline, call guppy_duplex with the required parameters:
guppy_duplex -i <read_folder> -s <output_folder>
The guppy_duplex pipeline also supports a number of optional parameters in addition to those required above.
Executables
- Path to the basecaller executable (--basecaller_exe): If this is not set, the pipeline assumes that guppy_basecaller is already in the path.
- Path to the duplex basecaller executable (--duplex_basecaller_exe): If this is not set, the pipeline assumes that guppy_basecaller_duplex is already in the path.
Input/output
- Recursive (-r or --recursive): Search through all subfolders contained in the --input_path value and basecall any files found in them.
Processing steps
- Skip simplex (--skip_simplex): Skip the initial simplex basecall step. The pipeline will assume that the sequencing_summary.txt file already exists.
- Skip duplex (--skip_duplex): Skip the duplex basecall step.
- Non-duplex reads (--call_non_duplex_reads): When specified, this flag runs an additional basecall (using the same configuration as the duplex basecall) on reads that did not participate in a duplex pair. As the duplex step is typically performed using a higher-accuracy configuration than the simplex step, this provides high-accuracy basecalls of the non-duplex reads as part of the pipeline.
- Read splitting (--do_read_splitting): Perform read splitting during the initial simplex basecall to separate potentially concatenated reads.
Configurations
- Simplex configuration (--simplex_config): Sets the configuration to use for the initial simplex basecall. This will typically be a "fast" model. Defaults to dna_r10.4.1_e8.2_400bps_fast.cfg.
- Duplex configuration (--duplex_config): Sets the configuration to use for the duplex basecall. This will typically be a "hac" or "sup" model. This configuration will also be used to call the non-duplex reads if --call_non_duplex_reads is set. Defaults to dna_r10.4.1_e8.2_400bps_sup.cfg.
Optimisation
- GPU device (-d or --device): Specify the CUDA-enabled GPU to use to perform basecalling. See the Optimisation section under "Guppy features, settings and analysis" for more details.
- Duplex chunks per runner (--duplex_chunks_per_runner): Passed to the Guppy duplex basecaller when performing duplex basecalling. Decrease this value in case of out-of-memory errors.
Other
- Disable logging (--disable_logging): Turns off logging.
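As a rough sketch of how the required and optional parameters combine, the helper below assembles a guppy_duplex argument list without running it. The helper name and the folder names are hypothetical. The "cuda:0" device string is borrowed from the guppy_basecaller_duplex example and is assumed to be accepted by --device as well.

```python
import shlex


def build_guppy_duplex_command(
    read_folder,
    output_folder,
    recursive=False,
    skip_simplex=False,
    call_non_duplex_reads=False,
    duplex_config=None,
    device=None,
    duplex_chunks_per_runner=None,
):
    """Assemble (but do not run) a guppy_duplex command line as a list."""
    cmd = ["guppy_duplex", "-i", str(read_folder), "-s", str(output_folder)]
    if recursive:
        cmd.append("--recursive")
    if skip_simplex:
        cmd.append("--skip_simplex")
    if call_non_duplex_reads:
        cmd.append("--call_non_duplex_reads")
    if duplex_config is not None:
        cmd += ["--duplex_config", duplex_config]
    if device is not None:
        cmd += ["--device", device]
    if duplex_chunks_per_runner is not None:
        cmd += ["--duplex_chunks_per_runner", str(duplex_chunks_per_runner)]
    return cmd


# Hypothetical folders; the device string format is an assumption.
command = build_guppy_duplex_command(
    "reads", "out", recursive=True, device="cuda:0", duplex_chunks_per_runner=450
)
print(shlex.join(command))
```

Building the arguments as a list keeps each value as a single argument, so the list can be handed to subprocess.run without shell quoting concerns.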