-
Guppy basecalling software
Guppy is a data processing toolkit that contains Oxford Nanopore Technologies' production basecalling algorithms and several bioinformatic post-processing features. It is run from the command line on Windows, Mac OS, and multiple Linux platforms. Guppy is also integrated with our sequencing instrument software, MinKNOW, and a subset of Guppy features is available via the MinKNOW UI. A selection of configuration files allows basecalling of DNA and RNA libraries made with Oxford Nanopore Technologies' current sequencing kits, in a range of flow cells.
The Guppy software contains many configurable parameters that can be used to specify exactly how the data analysis is performed. Adjusting some of these parameters requires a deep knowledge of nanopore data, and as such, Guppy is aimed at more advanced users. For those who are new to sequencing or have limited knowledge of sequencing data analysis, we recommend using the options presented in the MinKNOW software UI for basecalling.
-
Introduction to basecalling
Basecalling is the process of converting the electrical signals generated by a DNA or RNA strand passing through the nanopore into the corresponding base sequence of the strand. The general data flow in a nanopore sequencing experiment is shown below.
Raw data - a direct measurement of the changes in ionic current as a DNA/RNA strand passes through the pore, which are recorded by the MinKNOW software. MinKNOW also processes the signal into "reads", each read corresponding to a single strand of DNA/RNA. These reads are optionally written out as .fast5 files.
These .fast5 files use the HDF5 format to store data (http://www.hdfgroup.org/HDF5/); libraries exist to read and write these in many popular computer languages (e.g. R, Python, Perl, C, C++, Java).
Guppy also supports loading reads from the POD5 file format.
Basecalling - the raw signal is further processed by the basecalling algorithm to generate the base sequence of the read.
Basecalling is made up of a series of steps that are executed one by one. Ionic current measurements from the sequencing device are collected by the MinKNOW software and processed into a read. The reads are transformed into basecalls using mathematical models. The results of these analyses are written into FASTQ or BAM files, with a default of 4000 reads per file. Note that .fast5 file writing support, which has been deprecated for some time now, has been officially removed and is no longer available.
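As an illustration of what the FASTQ output contains, the sketch below parses four-line FASTQ records and computes a simple per-read quality summary. It is a minimal example, not part of Guppy: the function names `read_fastq` and `mean_phred` are hypothetical, the input is an in-memory stand-in for a real output file, and the quality summary is a plain arithmetic mean of Phred+33 scores, which is not necessarily how Guppy itself derives its mean Q-score.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, content ignored
        quality = handle.readline().rstrip("\n")
        fields = header[1:].split()
        if not header.startswith("@") or not fields or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        # The read ID is the first whitespace-delimited token after '@'.
        yield fields[0], sequence, quality


def mean_phred(quality: str) -> float:
    """Arithmetic mean of Phred scores, assuming Phred+33 encoding (an assumption)."""
    if not quality:
        raise ValueError("empty quality string")
    return sum(ord(char) - 33 for char in quality) / len(quality)


# Hypothetical two-read example standing in for a basecaller output file.
example = io.StringIO("@read_1 ch=12\nACGT\n+\nIIII\n@read_2 ch=7\nGG\n+\n5?\n")
records = list(read_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, len(sequence), mean_phred(quality))
```

For a real file, the same generator can be fed an open text handle, for example one returned by `gzip.open(path, "rt")` for compressed output.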
-
Guppy basecalling models are based on a Recurrent Neural Network (RNN)
The Guppy basecalling models are based on RNNs. For more information about RNNs, as well as other basecalling options and algorithms, please refer to the Data Analysis document in the Nanopore Community.
-
The Guppy toolkit contains:
The basecaller
The Guppy basecaller implements a neural network algorithm that allows raw data to be transformed into canonical bases of DNA or RNA, and several types of modified bases.
- Calibration strand detection: The basecaller is also capable of detecting calibration strands by aligning calibration sequences. Reads are aligned against a calibration reference using the basecalled data from an internally present DNA molecule in the flow cell. Calibration strands serve as a quality control for the pore and experimental processing. If the current read is identified as a calibration strand, no barcoding or alignment steps are performed.
- Adapter trimming: This is the processing and removal of the sequencing adapter (e.g. AMX, BAM, AMII) signal in the basecalled data:
- For DNA adapters it will exclude the non-sequence adapter region up to a characteristic signal in the adapter that is recognised by the basecaller.
- For (m)RNA, where the strands are sequenced in the 3' to 5' direction, it will attempt to exclude all data up to the polyA tail.
Barcoding/demultiplexing
The beginning and the end of each strand are aligned against the barcodes currently provided by Oxford Nanopore Technologies. Demultiplexing occurs directly from the basecalled results.
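The idea of matching read ends against a barcode set can be sketched as follows. This is a toy illustration only and does not reproduce Guppy's barcoding algorithm or scoring: the barcode sequences, the 20-base window, the edit-distance threshold, and the choice to check the reverse complement of the read end are all assumptions made for the example.

```python
from typing import Dict, Optional

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + (char_a != char_b)))  # substitution
        prev = curr
    return prev[-1]


def best_window_distance(barcode: str, window: str) -> int:
    """Smallest edit distance between the barcode and any same-length slice of the window."""
    span = len(barcode)
    if len(window) < span:
        return edit_distance(barcode, window)
    return min(edit_distance(barcode, window[i:i + span])
               for i in range(len(window) - span + 1))


def assign_barcode(read: str, barcodes: Dict[str, str],
                   window: int = 20, max_distance: int = 2) -> Optional[str]:
    """Return the best-matching barcode name, or None if nothing is close enough."""
    front = read[:window]
    rear = reverse_complement(read[-window:])  # assumed orientation of the end barcode
    best_name, best_score = None, max_distance + 1
    for name, barcode in barcodes.items():
        score = min(best_window_distance(barcode, front),
                    best_window_distance(barcode, rear))
        if score < best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical barcode set, not real Oxford Nanopore barcode sequences.
BARCODES = {"bc01": "AAGGTTCC", "bc02": "CTGACTGA"}
front_tagged = "TT" + "AAGGTTCC" + "ACGT" * 8
rear_tagged = "ACGT" * 6 + reverse_complement("CTGACTGA") + "AA"
print(assign_barcode(front_tagged, BARCODES), assign_barcode(rear_tagged, BARCODES))
```

Reads for which no barcode falls within the threshold return `None`, which corresponds to leaving the read unclassified.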
Alignment
The user can provide a reference file in FASTA or minimap2 index format. If so, the reads are aligned against this reference via the integrated minimap2 aligner using the standard Oxford Nanopore Technologies preset parameters.
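A minimal sketch of assembling a basecalling command with optional alignment and barcoding is shown below. The helper `build_guppy_command` is hypothetical, and the paths, configuration file name, barcode kit, and device string are placeholders. The flag names (`--input_path`, `--save_path`, `--config`, `--align_ref`, `--barcode_kits`, `--device`) are given as assumptions and should be checked against `guppy_basecaller --help` for the installed version.

```python
import shlex
from typing import List, Optional


def build_guppy_command(input_dir: str, output_dir: str, config: str,
                        reference: Optional[str] = None,
                        barcode_kit: Optional[str] = None,
                        device: Optional[str] = None) -> List[str]:
    """Assemble an argument list suitable for subprocess.run (flag names are assumptions)."""
    command = ["guppy_basecaller",
               "--input_path", input_dir,
               "--save_path", output_dir,
               "--config", config]
    if reference:
        command += ["--align_ref", reference]      # FASTA or minimap2 index
    if barcode_kit:
        command += ["--barcode_kits", barcode_kit]
    if device:
        command += ["--device", device]            # e.g. a CUDA device for GPU builds
    return command


# Placeholder values for illustration only.
command = build_guppy_command("reads/", "basecalled/", "dna_r9.4.1_450bps_hac.cfg",
                              reference="reference.fasta", device="cuda:0")
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.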
Modified basecalling
It is possible to use Guppy to identify certain types of modified bases: currently 5mC. This requires the use of a specific basecalling model which is trained to identify both modified and unmodified bases.
Note that 1D2 basecalling is no longer included in current versions of the Guppy software.
-
Current assumptions and limitations of Guppy
The Guppy basecalling software currently provides basecalling for 1D and duplex chemistry.
Read .fast5/.pod5 files, used as input to the basecalling software, must contain raw data. Raw data has been included by default in read files generated by the MinKNOW software for the last several years, so it should not be necessary to update them. Oxford Nanopore Technologies offers two sets of tools for working with .fast5 files that users may find helpful:
- ont_fast5_api: Provides a simple interface to the .fast5 format, including tools for converting between single- and multi-read formats.
- ont_h5_validator: Provides a tool for validating .fast5 file structures against official Oxford Nanopore Technologies file schemas.
Tools are also available for manipulating POD5 files in the pod5-file-format repository at https://github.com/nanoporetech/pod5-file-format .
Guppy provides configurations for currently-available chemistries and also provides a model compatible with data generated using older PromethION firmware.
Both the alignment and barcoding pipelines accept compressed and uncompressed FASTQ files as input. These can be generated either by the Guppy basecallers, or by the MinKNOW software.
-
General system requirements for running Guppy
These system requirements are guidelines - the actual amount of memory and disk space required to run Guppy tools will heavily depend on options and input data.
- 4 GB RAM plus 1 GB per thread for basecalling (more RAM may be required for duplex basecalling)
- Administrator access for .deb or .msi installers
- ~2 GB of drive space is required for installation. A minimum of 512 GB storage space for basecalled read files is recommended.
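The memory guideline above is simple arithmetic: a 4 GB base plus 1 GB per thread. The helper below (a hypothetical name, not a Guppy utility) works through it. It reflects only the stated guideline, so actual usage may be higher, particularly for duplex basecalling.

```python
def estimate_basecalling_ram_gb(threads: int,
                                base_gb: float = 4.0,
                                per_thread_gb: float = 1.0) -> float:
    """Guideline RAM estimate: base allowance plus a per-thread allowance."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return base_gb + per_thread_gb * threads


# Example: 8 basecalling threads -> 4 GB + 8 x 1 GB
print(estimate_basecalling_ram_gb(8))
```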
-
CPU and GPU basecalling with Guppy
Oxford Nanopore Technologies provides Guppy executables that can be run on Central Processing Units (CPUs) on Windows, Mac OS and Linux, or on Graphics Processing Units (GPUs) on Windows and certain Linux platforms:
Windows: ont-guppy-cpu .msi installer (CPU) or ont-guppy .msi installer (GPU)
macOS: ont-guppy-cpu .dmg installer (CPU only)
Linux CPU:
- ont-guppy-cpu .deb for Ubuntu 16
- ont-guppy-cpu .deb for Ubuntu 18
- ont-guppy-cpu .deb for Ubuntu 20
- ont-guppy-cpu .rpm for Centos 7
- ont-guppy-cpu .rpm for Centos 8
- ont-guppy-cpu .tar.gz – general Linux archives with pre-built binaries (compatible with most Linux versions)
Linux GPU:
- ont-guppy .deb for Ubuntu 16
- ont-guppy .deb for Ubuntu 18
- ont-guppy .deb for Ubuntu 20
- ont-guppy .rpm for Centos 7
- ont-guppy .rpm for Centos 8
- ont-guppy .tar.gz – general Linux archives with pre-built binaries (compatible with most Linux versions). On the ARM platform these archives are split into CUDA 9 and CUDA 10 versions (for use with Linux 4 Tegra running Ubuntu 16 and Ubuntu 18, respectively).
Note that GPU basecalling is only supported on Linux and Windows systems. Mac OS systems do not currently have NVIDIA GPUs or CUDA support.
GPU basecalling requires NVIDIA drivers which support a minimum CUDA version of:
- CUDA 10 for Linux 4 Tegra running Ubuntu 18
- CUDA 11.1 for Linux x86 systems
- CUDA 11.4 for Windows systems
Please note that Guppy is not currently compatible with CUDA 12.0 onwards. The last compatible version of the CUDA toolkit is 11.8. CUDA 11.8 can be downloaded from the NVIDIA download archive.
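The version constraints above can be expressed as a small check. This is a sketch under stated assumptions: the platform keys and function names are hypothetical, the driver-supported CUDA version and the toolkit version are treated as a single number for simplicity, and the 12.0 upper bound is assumed to apply to every platform.

```python
from typing import Dict, Tuple

Version = Tuple[int, int]

# Minimum CUDA versions taken from the list above; keys are hypothetical labels.
MIN_CUDA: Dict[str, Version] = {
    "linux_tegra_ubuntu18": (10, 0),
    "linux_x86": (11, 1),
    "windows": (11, 4),
}
FIRST_INCOMPATIBLE: Version = (12, 0)  # CUDA 12.0 onwards is not supported


def parse_version(text: str) -> Version:
    """Turn '11.8' (or '11') into a (major, minor) tuple."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def cuda_version_ok(platform_key: str, cuda_version: str) -> bool:
    """True if the version meets the platform minimum and is below 12.0."""
    version = parse_version(cuda_version)
    return MIN_CUDA[platform_key] <= version < FIRST_INCOMPATIBLE


for platform_key, version in [("linux_x86", "11.8"), ("linux_x86", "12.0"), ("windows", "11.1")]:
    print(platform_key, version, cuda_version_ok(platform_key, version))
```

Comparing versions as integer tuples avoids the pitfall of comparing them as strings or floats, where "11.10" would sort before "11.4".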
In general it is recommended to install the latest GPU drivers available for your system and graphics card. See the NVIDIA driver download page for details.
Using external GPUs can dramatically increase basecalling speed. Guppy works only with NVIDIA GPUs, and has been tested using the following specific models:
- NVIDIA Tesla V100
- NVIDIA Quadro GV100
- NVIDIA GTX1080Ti
- NVIDIA Jetson TX2
- NVIDIA Jetson Xavier
If working with a different model of NVIDIA GPU than those listed above, the Guppy software requires CUDA Compute Capability >6.1 (for more information about CUDA-enabled GPUs, see the NVIDIA website).
It is possible to use other NVIDIA GPUs for basecalling; however, Oxford Nanopore Technologies develops and tests software on the models stated above, so support for other models is limited.
-
Fast, High Accuracy and Super Accurate models and compatibilities
The Guppy basecaller offers three different basecalling models: a Fast model, a High accuracy (HAC) model, and a Super accurate (SUP) model.
The Fast model is designed to keep up with data generation on Oxford Nanopore devices (MinION Mk1C, GridION, PromethION). The HAC model provides a higher raw read accuracy than the Fast model and is more computationally intensive. The Super accurate model has an even higher raw read accuracy, and is more computationally intensive again than the HAC model.
For more information about basecalling accuracy, please consult the Accuracy page on the Oxford Nanopore website.
A comparison of the speed of the models is provided in the table below:
The number of keep-up flow cells assumes a 30 Gbase flow cell output in 72 hours for MinION and GridION, and 150 Gbase output in 72 hours for PromethION.
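The keep-up assumption translates into a required basecalling rate: output divided by run time. The sketch below works through that arithmetic. The function names are hypothetical, and the basecaller speed of 1,000,000 bases per second is an arbitrary illustrative figure, not a measured Guppy result.

```python
import math


def required_bases_per_second(gbases: float, hours: float = 72.0) -> float:
    """Average rate needed to basecall one flow cell's output within the run time."""
    return gbases * 1e9 / (hours * 3600.0)


def keep_up_flow_cells(basecaller_bases_per_second: float, gbases: float,
                       hours: float = 72.0) -> int:
    """Whole number of flow cells a given basecalling rate can keep up with."""
    return math.floor(basecaller_bases_per_second / required_bases_per_second(gbases, hours))


minion_rate = required_bases_per_second(30)       # 30 Gbases in 72 hours
promethion_rate = required_bases_per_second(150)  # 150 Gbases in 72 hours
print(round(minion_rate), round(promethion_rate))

# Illustrative basecaller speed only; substitute a measured value for a real system.
print(keep_up_flow_cells(1_000_000, 30), keep_up_flow_cells(1_000_000, 150))
```

Under these assumptions, one MinION or GridION flow cell requires roughly 116 thousand bases per second and one PromethION flow cell roughly 579 thousand bases per second.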
-
Basecalling speed for Guppy
Aside from the basecalling model, the time taken to basecall a folder of reads depends on the specifications of the computer, the number of threads assigned, the options which Guppy is invoked with, and the number of reads analysed. Guppy is optimised for NVIDIA GPUs using CUDA, and can perform several orders of magnitude faster running on a modern GPU compared to a standard desktop CPU.