-
Config files - variable parameters
In addition, Guppy must know which basecalling configuration to use. This can be provided in one of two ways:
- By selecting a config file:
  - Config (-c or --config): either the name of the config file to use, or a full path to a config file (see the section below). If the argument is only the name of a config file then it must correspond to one of the standard configuration files provided by the package.
- Or by selecting a flow cell and a kit:
  - Flow cell (-f or --flowcell): the name of the flow cell used for sequencing (e.g. FLO-MIN106).
  - Kit (-k or --kit): the name of the kit used for sequencing (e.g. SQK-LSK109).
Note: If you use the --config argument, then the --flowcell and --kit arguments are not needed and will be ignored.
-
Choosing a config file for Guppy
Guppy contains several types of basecalling configurations, many of which are not available by using the flow cell and kit selector. These models will usually have their own config file, and they may then be used with the --config argument.
Generally speaking, the configuration file names are structured as follows:
<strand_type>_<pore_type>_<enzyme_type>_[modbases_specifier]_<model_type>_[instrument_type].cfg
- strand_type: This will be either the string "dna" or "rna", depending on the type of sequencing being performed.
- pore_type: The pore the basecalling model was trained for, indicated by the letter "r" followed by a version number. For example: "r9.4.1" or "r10.4".
- enzyme_type: The enzyme motor the model was trained for. This will either be the letter "e" followed by a version number, or a number indicating the enzyme speed, followed by "bps". For example: "e8.1" or "450bps".
- modbases_specifier: Optional. If specified, indicates that modified base detection will be performed. This will be the string "modbases_" followed by an indicator of the modification supported, such as "5mc_cg" or "5hmc_5mc_cg".
- model_type: The type of basecalling model to use, depending on whether you want optimal basecalling speed or accuracy. See below.
- instrument_type: Optional. If this is not specified, then the configuration is targeted to a GridION device or a PC. The strings "mk1c" or "prom" are used to indicate that the configuration parameters and model are optimised for the MinION Mk1C or PromethION devices, respectively. Note that if the kit and flow cell are specified on the command-line instead of a specific config file, then the config file chosen will be one without an instrument type specified.
The model types are:
- sup: Super-accurate basecalling.
- hac: High accuracy basecalling. These are the configurations that will be selected when a kit and flow cell are specified on the command-line instead of a specific config file.
- fast: Fast basecalling.
- sketch: Sketch basecalling. This is primarily for use with adaptive sampling on the MinION Mk1C device to minimise latency.
For example, to basecall data generated with the R10.4 pore and the E8.1 enzyme, using the Fast CRF model:
guppy_basecaller -c dna_r10.4_e8.1_fast.cfg [...]
If you were running this on a MinION Mk1C device, you would use:
guppy_basecaller -c dna_r10.4_e8.1_fast_mk1c.cfg [...]
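The naming scheme above can be assembled mechanically. The following is a minimal sketch, not part of Guppy: guppy_config_name is a hypothetical helper that joins the components in the documented order and leaves out the optional ones when they are empty. It only builds a name; it does not check that a config file with that name ships with your Guppy installation.

```shell
# Hypothetical helper (not part of Guppy): build a config file name from
# its components, following the documented pattern
#   <strand>_<pore>_<enzyme>_[modbases]_<model>_[instrument].cfg
# Usage: guppy_config_name STRAND PORE ENZYME MODEL [INSTRUMENT] [MODBASES]
guppy_config_name() {
    strand=$1
    pore=$2
    enzyme=$3
    model=$4
    instrument=${5:-}
    modbases=${6:-}

    name="${strand}_${pore}_${enzyme}"
    # Optional modified-base specifier, e.g. "modbases_5mc_cg"
    if [ -n "$modbases" ]; then
        name="${name}_${modbases}"
    fi
    name="${name}_${model}"
    # Optional instrument type, e.g. "mk1c" or "prom"
    if [ -n "$instrument" ]; then
        name="${name}_${instrument}"
    fi
    printf '%s.cfg\n' "$name"
}

guppy_config_name dna r10.4 e8.1 fast        # dna_r10.4_e8.1_fast.cfg
guppy_config_name dna r10.4 e8.1 fast mk1c   # dna_r10.4_e8.1_fast_mk1c.cfg
```

The two calls reproduce the config names used in the examples above.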
-
Config files - selecting kit and flow cell
The flow cell and kit names should be clearly labelled on the corresponding boxes. Flow cell names almost always start with "FLO" and kit names almost always start with "SQK" or "VSK".
To see the supported flow cells and kits, run Guppy with the --print_workflows option:
guppy_basecaller --print_workflows
...which will produce output like this:
Available flowcell + kit combinations are:
flowcell    kit           barcoding config_name                 model version
FLO-MIN114  SQK-LSK114              dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-LSK114-XL           dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-ULK114              dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-RAD114              dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-NBD114-24 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-NBD114-96 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-RBK114-24 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-RBK114-96 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-PRO002  SQK-LSK112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-LSK112-XL           dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-RAD112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-NBD112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-NBD112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-RBK112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-RBK112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-LSK112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-LSK112-XL           dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-RAD112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-NBD112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-NBD112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-RBK112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-RBK112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-LSK112              dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-LSK112-XL           dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-RAD112              dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-NBD112-24 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-NBD112-96 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-RBK112-24 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-RBK112-96 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO111  SQK-CS9109              dna_r10.3_450bps_hac_prom   2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-PRO111  SQK-DCS108              dna_r10.3_450bps_hac_prom   2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-PRO111  SQK-DCS109              dna_r10.3_450bps_hac_prom   2021-04-20_dna_r10.3_minion_promethion_384_72309afc
[...]
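To find the config that a particular flow cell and kit map to, the listing can be filtered rather than read by eye. This is a minimal sketch, not part of Guppy: workflow_config is a hypothetical helper, and it assumes the listing is whitespace-separated with the barcoding column either empty or set to a single word, as in the sample output above.

```shell
# Hypothetical helper (not part of Guppy): read a --print_workflows listing
# on stdin and print the config name for one flow cell + kit pair.
# Assumes whitespace-separated columns; the barcoding column may be empty,
# so a matching row has either 4 or 5 fields.
# Usage: guppy_basecaller --print_workflows | workflow_config FLOWCELL KIT
workflow_config() {
    awk -v fc="$1" -v kit="$2" '
        $1 == fc && $2 == kit {
            # With a barcoding entry the config name moves to column 4.
            print (NF >= 5 ? $4 : $3)
            exit
        }'
}
```

With the sample listing above, piping it into workflow_config FLO-MIN114 SQK-NBD114-24 would print dna_r10.4.1_e8.2_400bps_hac. Nothing is printed when the combination is not listed.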
In the case of kits which come with their own barcodes included, the barcoding column will specify "included". Reads which have been prepared with these kits can be demultiplexed using guppy_barcoder (see below).
-
Optional parameters
In addition to the required parameters described in the Quick Start section, Guppy has many optional parameters. You can use them if they are applicable to your experiment. The following optional parameters are commonly used:
Data features:
- Q-score filtering (--disable_qscore_filtering): Flag to disable filtering of reads into pass/fail folders inside the output folder, based on their strand q-score. See --min_qscore.
- Alignment filtering (--alignment_filtering): Flag for filtering of reads into pass/fail folders inside the output folder, based on their number of alignments. Can be set to none (default) or fail to disable or enable this feature.
- Minimum q-score (--min_qscore): The minimum q-score a read must attain to pass q-score filtering. The default value for this varies by configuration, ranging from 7.0 for the lower-accuracy models up to 10.0 for the "Sup" models. This should have a minimal impact on output.
- Calibration strand detection (--calib_detect): Flag to enable calibration strand detection and filtering. If enabled, any reads which align to the calibration strand reference will be filtered into a separate output folder to simplify downstream processing. Off by default.
- Alignment reference file (-a or --align_ref): Optional reference genome file name. If an align_ref is provided, Guppy will perform alignment against the reference for called strands, using the minimap2 library. Providing an align_ref will automatically enable BAM output (see --bam_out). See the Alignment section for more information on alignment in Guppy.
- Reverse RNA sequence (--reverse_sequence): Reverse the called sequence (used for RNA sequencing, as RNA strands translocate through the pore in the 3’ to 5’ direction). The default value is FALSE for DNA sequencing and TRUE for RNA sequencing.
- Perform T to U substitution (--u_substitution): Replace all 'T's in the called sequence with 'U's for RNA sequencing. The default value is FALSE for DNA sequencing and TRUE for RNA sequencing.
- Read splitting (--do_read_splitting): Split potentially concatenated input reads into separate outputs, based on the score obtained from mid-strand adapter detection. See --min_score_read_splitting. If enabled, reads which exceed this threshold will be split into two.
- Read splitting depth (--max_read_split_depth): Limit the number of times a read will be passed into the read splitter. For example, --max_read_split_depth 2 would permit the read to be split, and then each resulting read to be split a second time, resulting in up to four reads. The default value is 2.
- Minimum read splitting score (--min_score_read_splitting): The minimum score a read must generate from mid-strand adapter detection for the read to be considered a concatamer and to be split into two reads for subsequent processing and output. The default is 58.
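The "up to four reads" figure for --max_read_split_depth 2 follows from each pass splitting a read into at most two, so the upper bound doubles with every level of depth. A small sketch of that arithmetic (max_split_reads is a hypothetical helper, not a Guppy command, and it assumes every split yields exactly two reads):

```shell
# Upper bound on the number of reads produced from one input read when
# every permitted split succeeds: each pass can double the count, so the
# bound is 2^depth.
max_split_reads() {
    depth=$1
    echo $(( 1 << depth ))
}

max_split_reads 2   # 4, matching the example for --max_read_split_depth 2
```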
Input/output:
- Quiet mode (-z or --quiet): This option prevents the Guppy basecaller from outputting anything to stdout. Stdout is short for “standard output” and is the default location to which a running program sends its output. For a command line executable, stdout will typically be sent to the terminal window from which the program was run.
- Verbose logging (--verbose_logs): Flag to enable verbose logging (outputting a verbose log file, in addition to the standard log files, which contains detailed information about the application). Off by default.
- Reads per FASTQ file (-q or --records_per_fastq): The number of reads to put in a single FASTQ file (see output format below). Set this to zero to output all reads into one file (per run id, per caller). The default value is 4000.
- Perform FASTQ compression (--compress_fastq): Flag to enable gzip compression of output FASTQ files; this reduces file size to about 50% of the original.
- Recursive (-r or --recursive): Flag to search through all subfolders contained in the --input_path value, and basecall any .fast5 files found in them.
- .bam file output (--bam_out): Flag to enable output of .bam files containing basecall result sequence. If a modified base model was used, the modified base locations and probabilities will be emitted. If alignment was performed, the results will also be emitted. Off by default.
- .bam file indexing (--index): Flag to enable the generation of the .bai index file for .bam file output. Requires --bam_out. BAM file output will be implicitly enabled if --align_ref is populated or a modbase model is selected. Off by default.
- Emit move tables (--moves_out): Return move table in output BAM file.
- Methylation probability cutoff (--bam_methylation_threshold): The value below which a predicted methylation probability will not be emitted into a BAM file, expressed as a percentage. Default is 5.0(%). Note that if the configuration being used specifies a context to look for base modifications within, then this parameter will not be applied. Instead, any instances of the base which match the context will be emitted in the BAM file, even if the predicted methylation probability is zero.
- Override default data path (-d or --data_path): Option to explicitly specify the path to use for loading any data files the application requires (for example, if you have created your own model files or config files).
- Input File List (--input_file_list): Optional file containing a list of input read files (.fast5/POD5) to process from the input_path.
- Nested output folder structure (--nested_output_folder): Optional flag, which if set will cause FASTQ files to be output to a nested folder structure similar to that used by MinKNOW.
- Progress stats reporting frequency (--progress_stats_frequency): Frequency in seconds at which to report progress statistics. If supplied, this replaces the default progress display.
- Maximum queue size (--max_queued_reads): Maximum number of reads "in flight". Defaults to 2000. Helps to limit the amount of memory used in the case where basecalling cannot keep up with the speed at which reads are loaded.
Optimisation:
- Chunks per caller (--chunks_per_caller): A soft limit on the number of chunks in each basecaller's chunk queue. When a read is sent to the basecaller, it is broken up into “chunks” of signal, and each chunk is basecalled in isolation. Once all the chunks for a read have been basecalled, they are combined to produce a full basecall. --chunks_per_caller sets a limit on how many chunks will be collected before they are dispatched for basecalling. On GPU platforms this is an important parameter to obtain good performance, as it directly influences how much computation can be done in parallel by a single basecaller.
- Number of parallel callers (--num_callers): Number of parallel basecallers to create. A thread will be spawned for each basecaller to use. Increasing this number will allow Guppy to make better use of multi-core CPU systems, but may impact overall system performance.
- GPU device (-x or --device): Specify a GPU device to use in order to accelerate basecalling. If this option is not selected, Guppy will default to CPU usage. You can specify one or more devices as well as optionally limiting the amount of GPU memory used (to leave space for other tasks to run on GPUs). GPUs are counted from zero, and the memory limit can be specified as percentage of total GPU memory or as size in bytes. Examples:
  device                  result
  cuda:0                  Use the first GPU in the system, no memory limit
  cuda:0,1                Use the first two GPUs in the system, no memory limit
  "cuda:0 cuda:1"         Same as cuda:0,1
  cuda:all:100%           Use all GPUs in the system, no memory limit
  cuda:1,2:50%            Use the second and third GPU in the system, and use only up to half of the GPU memory of each GPU
  "cuda:0 cuda:1,2:8G"    Use the first three GPUs in the system. Use a maximum of 8 GiB on each of GPUs 1 and 2.
  auto                    Same as cuda:0
  Note: Spaces are only allowed between multiple cuda: specifications. In this case it is necessary to put the entire device specification in quotes.
  It is strongly recommended to use a supported GPU if one is available, as basecalling will typically perform orders of magnitude faster.
- Resume previous run (--resume): Flag to enable resuming a previous basecalling run. This option can be used to resume a partially completed basecall if it was interrupted for some reason, or to re-basecall an input directory if more reads were added.
-
CPU/GPU basecalling usage
There are two parameters that govern how many CPU threads Guppy uses: callers and CPU threads per caller.
When performing GPU basecalling, there is always one CPU support thread per GPU caller, so the number of callers (--num_callers) dictates the maximum number of CPU threads used. Modifying the number of CPU threads per caller (--num_cpu_threads_per_caller) will have no effect.
When performing CPU basecalling both callers and threads per caller may be set, making the maximum number of CPU threads used equal to num_callers * cpu_threads_per_caller.
The number of CPU threads used should generally not exceed either of these two values:
- The number of logical CPU cores your machine has (as there will probably not be sufficient computational power available for Guppy to run any faster than this).
- When performing CPU basecalling, the number of CPU threads your machine's RAM can support. Allow 4 GB plus 1 GB per CPU thread for 1D basecalling.
So if your machine has 8 GB of RAM, you can support a maximum of 4 CPU threads for 1D basecalling.
This assumes your machine is not performing any other computationally-intensive tasks except for using Guppy (e.g. it assumes you are not running MinKNOW).
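The two limits above can be combined into a single calculation. This is a sketch of the rule of thumb only: max_cpu_threads is a hypothetical helper, not a Guppy command, and it takes the RAM size in whole gigabytes and the number of logical cores as arguments instead of detecting them.

```shell
# Sketch of the sizing rule for CPU basecalling (1D).
# Arguments: total RAM in whole GB, number of logical CPU cores.
# Prints the largest thread count that satisfies both limits:
#   threads <= logical cores
#   4 GB + 1 GB * threads <= RAM
max_cpu_threads() {
    ram_gb=$1
    cores=$2
    by_ram=$(( ram_gb - 4 ))
    if [ "$by_ram" -lt 0 ]; then
        by_ram=0
    fi
    if [ "$by_ram" -lt "$cores" ]; then
        echo "$by_ram"
    else
        echo "$cores"
    fi
}

max_cpu_threads 8 16    # 4: limited by RAM, as in the 8 GB example
max_cpu_threads 64 8    # 8: limited by the number of logical cores
```

For CPU basecalling, the product of --num_callers and --num_cpu_threads_per_caller would then be chosen so that it does not exceed the printed value.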
-
Resuming runs
If a run of the Guppy basecaller is interrupted for some reason, it is possible to use the --resume option to attempt to re-start the basecall from where it was halted. This is useful if basecalling fails during processing particularly large batches of files. Resume should be used with exactly the same parameters as the previous run, or undefined behaviour may occur. If the --resume option is specified, the following steps occur:
- The basecaller checks the output directory to find log files from any previous runs
- The log files are interrogated to discover any successfully completed reads (and their source files) from previous runs
- Any files in the output directory, which do not belong to successfully completed reads, are removed (i.e. reads which were partially completed)
- The data for previously completed reads is extracted from the summary file for the previous run
The basecaller then proceeds as normal, filtering out any input reads which were previously processed.
After resumption of a basecall run, a single summary file will have been produced with all reads from the input folder in it, as if the run was completed normally.
Note: It is permissible to chain resume operations together, and it is permissible to resume from a successfully completed operation. This allows the resume functionality to be used to re-basecall an input folder in order to basecall just the read files which have appeared in that folder since the last basecall operation was invoked on it.
The resume system works by batching reads internally, and recording to the logfile when those batches have been completed and written to disk. The --read_batch_size argument can be set to control the size of these batches, and so controls the granularity at which resume operations can occur. Increasing the batch size will reduce the fragmentation of output FASTQ files but can increase the amount of time a resume operation takes, because reads whose batch was not completed have to be basecalled again.
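Because a resumed run should use exactly the same parameters as the run it continues, it can help to pass one argument list to both invocations. The following is a minimal sketch under stated assumptions, not a Guppy feature: basecall_with_resume and the GUPPY_BIN variable are hypothetical names, and the wrapper treats any non-zero exit status as an interrupted run.

```shell
# Hypothetical wrapper (not part of Guppy): run the basecaller and, if it
# exits with an error, retry once with the same arguments plus --resume.
# GUPPY_BIN is an assumed override so that the command can be swapped out;
# it defaults to guppy_basecaller.
basecall_with_resume() {
    bin=${GUPPY_BIN:-guppy_basecaller}
    "$bin" "$@" || "$bin" "$@" --resume
}
```

It would be called with the usual basecaller arguments, for example basecall_with_resume -c dna_r10.4_e8.1_fast.cfg followed by the input and output options for the run. Only one retry is attempted; a run that fails twice still needs manual attention.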