Config files - variable parameters

In addition, Guppy must know which basecalling configuration to use. This can be provided in one of two ways:
- By selecting a config file:
  - Config (-c or --config): either the name of the config file to use, or a full path to a config file (see the section below). If the argument is only the name of a config file then it must correspond to one of the standard configuration files provided by the package.
- Or by selecting a flow cell and a kit:
  - Flow cell (-f or --flowcell): the name of the flow cell used for sequencing (e.g. FLO-MIN106).
  - Kit (-k or --kit): the name of the kit used for sequencing (e.g. SQK-LSK109).
Note: If you use the --config argument, then --flowcell and --kit arguments are not needed and will be ignored.
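The choice between the two selection methods can be sketched as a small shell helper. This is only an illustration: `config_selection_args` is a hypothetical function, not part of Guppy, and it simply prints the arguments that would be passed to guppy_basecaller. It mirrors the note above by preferring a config file whenever one is given.

```shell
# Hypothetical helper (not part of Guppy): print the configuration
# arguments for guppy_basecaller.
#   $1 = config file name or path (may be empty)
#   $2 = flow cell name, $3 = kit name (used only when $1 is empty)
config_selection_args() {
    config=$1
    flowcell=$2
    kit=$3
    if [ -n "$config" ]; then
        # A config file takes priority; flow cell and kit are not needed.
        echo "--config $config"
    elif [ -n "$flowcell" ] && [ -n "$kit" ]; then
        echo "--flowcell $flowcell --kit $kit"
    else
        echo "error: give a config, or both a flow cell and a kit" >&2
        return 1
    fi
}

config_selection_args "" FLO-MIN106 SQK-LSK109
config_selection_args dna_r10.4_e8.1_fast.cfg "" ""
```

The first call prints the flow cell and kit arguments; the second prints only the --config argument.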

Choosing a config file for Guppy

Guppy contains several types of basecalling configurations, many of which are not available by using the flow cell and kit selector. These models will usually have their own config file, and they may then be used with the --config argument.

Generally speaking, the configuration file names are structured as follows:

<strand_type>_<pore_type>_<enzyme_type>_[modbases_specifier]_<model_type>_[instrument_type].cfg
- strand_type: This will be either the string "dna" or "rna", depending on the type of sequencing being performed.
- pore_type: The pore the basecalling model was trained for, indicated by the letter "r" followed by a version number. For example: "r9.4.1" or "r10.4".
- enzyme_type: The enzyme motor the model was trained for. This will either be the letter "e" followed by a version number, or a number indicating the enzyme speed, followed by "bps". For example: "e8.1" or "450bps".
- modbases_specifier: Optional. If specified, indicates that modified base detection will be performed. This will be the string "modbases_" followed by an indicator of the modification supported, such as "5mc_cg" or "5hmc_5mc_cg".
- model_type: The type of basecalling model to use, depending on whether you want optimal basecalling speed or accuracy. See below.
- instrument_type: Optional. If this is not specified, then the configuration is targeted to a GridION device or a PC. The strings "mk1c" or "prom" are used to indicate that the configuration parameters and model are optimised for the MinION Mk1C or PromethION devices, respectively. Note that if the kit and flow cell are specified on the command-line instead of a specific config file, then the config file chosen will be one without an instrument type specified.
The model types are:
- sup: Super-accurate basecalling.
- hac: High accuracy basecalling. These are the configurations that will be selected when a kit and flow cell are specified on the command-line instead of a specific config file.
- fast: Fast basecalling.
- sketch: Sketch basecalling. This is primarily for use with adaptive sampling on the MinION Mk1C device to minimise latency.
For example, to basecall data generated with the R10.4 pore and the E8.1 enzyme, using the Fast CRF model:

guppy_basecaller -c dna_r10.4_e8.1_fast.cfg [...]

If you were running this on a MinION Mk1C device, you would use:

guppy_basecaller -c dna_r10.4_e8.1_fast_mk1c.cfg [...]
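
The naming scheme above can be sketched as a shell function that joins the components and skips the optional ones when they are empty. `guppy_config_name` is a hypothetical helper, not something Guppy provides, and a name it produces is only usable if a config file with that name actually ships with your Guppy package.

```shell
# Hypothetical helper (not part of Guppy): assemble a config file name.
#   $1 strand_type   $2 pore_type   $3 enzyme_type
#   $4 modbases_specifier (optional, pass "" to omit)
#   $5 model_type
#   $6 instrument_type (optional, pass "" to omit)
guppy_config_name() {
    strand=$1
    pore=$2
    enzyme=$3
    modbases=$4
    model=$5
    instrument=$6
    name="${strand}_${pore}_${enzyme}"
    # Optional: modified base specifier, e.g. modbases_5mc_cg
    if [ -n "$modbases" ]; then
        name="${name}_${modbases}"
    fi
    name="${name}_${model}"
    # Optional: instrument type, e.g. mk1c or prom
    if [ -n "$instrument" ]; then
        name="${name}_${instrument}"
    fi
    printf '%s.cfg\n' "$name"
}

guppy_config_name dna r10.4 e8.1 "" fast ""      # prints dna_r10.4_e8.1_fast.cfg
guppy_config_name dna r10.4 e8.1 "" fast mk1c    # prints dna_r10.4_e8.1_fast_mk1c.cfg
```

The two calls reproduce the config names used in the examples above.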

Config files - selecting kit and flow cell

The flow cell and kit names should be clearly labelled on the corresponding boxes. Flow cells almost always start with "FLO" and kits almost always start with "SQK" or "VSK".

To see the supported flow cells and kits, run Guppy with the --print_workflows option:

guppy_basecaller --print_workflows

...which will produce output like this:

Available flowcell + kit combinations are:
flowcell    kit           barcoding config_name                 model version
FLO-MIN114  SQK-LSK114              dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-LSK114-XL           dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-ULK114              dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-RAD114              dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-NBD114-24 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-NBD114-96 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-RBK114-24 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-RBK114-96 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-PRO002  SQK-LSK112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-LSK112-XL           dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-RAD112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-NBD112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-NBD112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-RBK112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-RBK112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-LSK112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-LSK112-XL           dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-RAD112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-NBD112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-NBD112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-RBK112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-RBK112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-LSK112              dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-LSK112-XL           dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-RAD112              dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-NBD112-24 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-NBD112-96 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-RBK112-24 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-RBK112-96 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO111  SQK-CS9109              dna_r10.3_450bps_hac_prom   2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-PRO111  SQK-DCS108              dna_r10.3_450bps_hac_prom   2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-PRO111  SQK-DCS109              dna_r10.3_450bps_hac_prom   2021-04-20_dna_r10.3_minion_promethion_384_72309afc
[...]

In the case of kits which come with their own barcodes included, the barcoding column will specify "included". Reads which have been prepared with these kits will be able to be demultiplexed using guppy_barcoder (see below).
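
To find the config that a given flow cell and kit map to, the workflow listing can be filtered with awk. The following is a minimal sketch: `lookup_config` is a hypothetical helper, the sample rows are copied from the output shown above, and the helper assumes the real output uses the same whitespace-separated columns. Because the barcoding column is sometimes empty, the config name is read as the second-to-last field rather than by a fixed position.

```shell
# Sample rows copied from the --print_workflows output shown above.
sample_workflows='FLO-MIN114 SQK-LSK114 dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114 SQK-NBD114-24 included dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-PRO002 SQK-LSK112 dna_r9.4.1_e8.1_hac_prom 2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb'

# Hypothetical helper (not part of Guppy): read workflow rows on stdin and
# print the config_name for one flow cell + kit pair.
#   $1 = flow cell name, $2 = kit name
lookup_config() {
    awk -v fc="$1" -v kit="$2" '$1 == fc && $2 == kit { print $(NF - 1) }'
}

printf '%s\n' "$sample_workflows" | lookup_config FLO-PRO002 SQK-LSK112
# prints dna_r9.4.1_e8.1_hac_prom
```

With Guppy installed, the same helper could be fed from `guppy_basecaller --print_workflows` instead of the sample text.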

Optional parameters

In addition to the required parameters described in the Quick Start section, Guppy has many optional parameters. You can use them if they are applicable to your experiment. The following optional parameters are commonly used:

Data features:

Q-score filtering (--disable_qscore_filtering): Flag to disable filtering of reads into pass/fail folders inside the output folder, based on their strand q-score. See --min_qscore.
Alignment filtering (--alignment_filtering): Option for filtering reads into pass/fail folders inside the output folder, based on their number of alignments. Set to none (default) to disable this feature, or to fail to enable it.
Minimum q-score (--min_qscore): The minimum q-score a read must attain to pass q-score filtering. The default value for this varies by configuration, ranging from 7.0 for the lower-accuracy models up to 10.0 for the "Sup" models. This should have a minimal impact on output.
Calibration strand detection (--calib_detect): Flag to enable calibration strand detection and filtering. If enabled, any reads which align to the calibration strand reference will be filtered into a separate output folder to simplify downstream processing. Off by default.
Alignment reference file (-a or --align_ref): Optional reference genome file name. If an align_ref is provided, Guppy will perform alignment against the reference for called strands, using the minimap2 library. Providing an align_ref will automatically enable BAM output (see --bam_out). See the Alignment section for more information on alignment in Guppy.
Reverse RNA sequence (--reverse_sequence): Reverse the called sequence (used for RNA sequencing, as RNA strands translocate through the pore in the 3’ to 5’ direction). The default value is FALSE for DNA sequencing and TRUE for RNA sequencing.
Perform T to U substitution (--u_substitution): Replace all 'T's in the called sequence with 'U's for RNA sequencing. The default value is FALSE for DNA sequencing and TRUE for RNA sequencing.
Read splitting (--do_read_splitting): Split potentially concatenated input reads into separate outputs, based on the score obtained from mid-strand adapter detection. If enabled, reads whose score exceeds the --min_score_read_splitting threshold will be split into two.
Read splitting depth (--max_read_split_depth): Limit the number of times a read will be passed into the read splitter. For example, --max_read_split_depth 2 would permit the read to be split, and then each resulting read to be split a second time, resulting in up to four reads. The default value is 2.
Minimum read splitting score (--min_score_read_splitting): The minimum score a read must generate from mid-strand adapter detection for the read to be considered a concatamer and to be split into two reads for subsequent processing and output. The default is 58.
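
The effect of --max_read_split_depth on output size can be worked through with a little arithmetic. Since each pass splits a read into at most two, a depth of d gives at most 2^d reads per input read. This generalisation is inferred from the depth 2 example above, and `max_split_reads` is a hypothetical helper used only for illustration.

```shell
# Hypothetical helper (not part of Guppy): upper bound on the number of
# output reads from one input read, assuming every pass splits in two.
#   $1 = value given to --max_read_split_depth
max_split_reads() {
    depth=$1
    result=1
    i=0
    while [ "$i" -lt "$depth" ]; do
        result=$((result * 2))
        i=$((i + 1))
    done
    echo "$result"
}

max_split_reads 2    # prints 4, matching the example above
max_split_reads 3    # prints 8
```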

Input/output:

Quiet mode (-z or --quiet): This option prevents the Guppy basecaller from outputting anything to stdout. Stdout is short for “standard output” and is the default location to which a running program sends its output. For a command line executable, stdout will typically be sent to the terminal window from which the program was run.
Verbose logging (--verbose_logs): Flag to enable verbose logging (outputting a verbose log file, in addition to the standard log files, which contains detailed information about the application). Off by default.
Reads per FASTQ file (-q or --records_per_fastq): The number of reads to put in a single FASTQ file (see output format below). Set this to zero to output all reads into one file (per run id, per caller). The default value is 4000.
Perform FASTQ compression (--compress_fastq): Flag to enable gzip compression of output FASTQ files; this reduces file size to about 50% of the original.
Recursive (-r or --recursive): Flag to search recursively through all subfolders contained in the --input_path value, and basecall any .fast5 files found in them.
.bam file output (--bam_out): Flag to enable output of .bam files containing basecall result sequence. If a modified base model was used, the modified base locations and probabilities will be emitted. If alignment was performed, the results will also be emitted. Off by default.
.bam file indexing (--index): Flag to enable the generation of the .bai index file for .bam file output. Requires --bam_out. BAM file output will be implicitly enabled if --align_ref is populated or a modbase model is selected. Off by default.
Emit move tables (--moves_out): Return move table in output BAM file.
Methylation probability cutoff (--bam_methylation_threshold): The value below which a predicted methylation probability will not be emitted into a BAM file, expressed as a percentage. Default is 5.0(%). Note that if the configuration being used specifies a context to look for base modifications within, then this parameter will not be applied. Instead, any instances of the base which match the context will be emitted in the BAM file, even if the predicted methylation probability is zero.
Override default data path (-d or --data_path): Option to explicitly specify the path to use for loading any data files the application requires (for example, if you have created your own model files or config files).
Input File List (--input_file_list): Optional file containing list of input read files (.fast5/POD5) to process from the input_path.
Nested output folder structure (--nested_output_folder): Optional flag, which if set will cause FASTQ files to be output to a nested folder structure similar to that used by MinKNOW.
Progress stats reporting frequency (--progress_stats_frequency): Frequency, in seconds, at which to report progress statistics. If supplied, this replaces the default progress display.
Maximum queue size (--max_queued_reads): Maximum number of reads "in flight". Defaults to 2000. Helps to limit the amount of memory used in the case where basecalling cannot keep up with the speed at which reads are loaded.

Optimisation:

Chunks per caller (--chunks_per_caller): A soft limit on the number of chunks in each basecaller's chunk queue. When a read is sent to the basecaller, it is broken up into “chunks” of signal, and each chunk is basecalled in isolation. Once all the chunks for a read have been basecalled, they are combined to produce a full basecall. --chunks_per_caller sets a limit on how many chunks will be collected before they are dispatched for basecalling. On GPU platforms this is an important parameter to obtain good performance, as it directly influences how much computation can be done in parallel by a single basecaller.
Number of parallel callers (--num_callers): Number of parallel basecallers to create. A thread will be spawned for each basecaller to use. Increasing this number will allow Guppy to make better use of multi-core CPU systems, but may impact overall system performance.

GPU device (-x or --device): Specify a GPU device to use in order to accelerate basecalling. If this option is not selected, Guppy will default to CPU usage. You can specify one or more devices, and optionally limit the amount of GPU memory used (to leave space for other tasks to run on the GPUs). GPUs are counted from zero, and the memory limit can be specified as a percentage of total GPU memory or as a size in bytes. Examples:

device	result
`cuda:0`	Use the first GPU in the system, no memory limit
`cuda:0,1`	Use the first two GPUs in the system, no memory limit
`"cuda:0 cuda:1"`	Same as `cuda:0,1`
`cuda:all:100%`	Use all GPUs in the system, no memory limit
`cuda:1,2:50%`	Use the second and third GPU in the system, and use only up to half of the GPU memory of each GPU
`"cuda:0 cuda:1,2:8G"`	Use the first three GPUs in the system. Use a maximum of 8 GiB on each of GPUs 1 and 2.
`auto`	Same as `cuda:0`

Note: Spaces are only allowed between multiple cuda: specifications. In this case it is necessary to put the entire device specification in quotes. It is strongly recommended to use a supported GPU if one is available, as basecalling will typically perform orders of magnitude faster.

Resume previous run (--resume): Flag to enable resuming a previous basecalling run. This option can be used to resume a partially completed basecall if it was interrupted for some reason, or to re-basecall an input directory if more reads were added.

CPU/GPU basecalling usage

There are two parameters that govern how many CPU threads Guppy uses: callers and CPU threads per caller.

When performing GPU basecalling, there is always one CPU support thread per GPU caller, so the number of callers (--num_callers) dictates the maximum number of CPU threads used. Modifying the number of CPU threads per caller (--num_cpu_threads_per_caller) will have no effect.

When performing CPU basecalling, both callers and threads per caller may be set, making the maximum number of CPU threads used equal to num_callers * num_cpu_threads_per_caller.

The number of CPU threads used should generally not exceed either of these two values:
- The number of logical CPU cores your machine has (as there will probably not be sufficient computational power available for Guppy to run any faster than this).
- When performing CPU basecalling, the number of CPU threads your machine's RAM can support. For 1D basecalling, allow 4 GB plus 1 GB per CPU thread.
So if your machine has 8 GB of RAM, you can support a maximum of 4 CPU threads for 1D basecalling.

This assumes your machine is not performing any other computationally-intensive tasks except for using Guppy (e.g. it assumes you are not running MinKNOW).
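
The two limits above can be combined into a quick calculation. This is a sketch only: `max_cpu_threads` is a hypothetical helper, it takes RAM in whole GB, and it applies the 4 GB + 1 GB per thread rule for 1D CPU basecalling on an otherwise idle machine.

```shell
# Hypothetical helper (not part of Guppy): upper bound on CPU threads for
# 1D CPU basecalling.
#   $1 = number of logical CPU cores, $2 = RAM in whole GB
max_cpu_threads() {
    cores=$1
    ram_gb=$2
    # RAM rule from the text: 4 GB base + 1 GB per CPU thread.
    by_ram=$((ram_gb - 4))
    if [ "$by_ram" -lt 0 ]; then
        by_ram=0
    fi
    # The limit is the smaller of the two bounds.
    if [ "$by_ram" -lt "$cores" ]; then
        echo "$by_ram"
    else
        echo "$cores"
    fi
}

max_cpu_threads 16 8    # prints 4 (limited by RAM, as in the example above)
max_cpu_threads 8 64    # prints 8 (limited by logical cores)
```

The result is the total to divide between --num_callers and --num_cpu_threads_per_caller, for example 2 callers with 2 threads each when the limit is 4.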

Resuming runs

If a run of the Guppy basecaller is interrupted for some reason, it is possible to use the --resume option to attempt to re-start the basecall from where it was halted. This is useful if basecalling fails while processing particularly large batches of files. Resume should be used with exactly the same parameters as the previous run, or undefined behaviour may occur. If the --resume option is specified, the following steps occur:
- The basecaller checks the output directory to find log files from any previous runs
- The log files are interrogated to discover any successfully completed reads (and their source files) from previous runs
- Any files in the output directory, which do not belong to successfully completed reads, are removed (i.e. reads which were partially completed)
- The data for previously completed reads is extracted from the summary file for the previous run
The basecaller then proceeds as normal, filtering out any input reads which were previously processed.

After resumption of a basecall run, a single summary file will have been produced with all reads from the input folder in it, as if the run was completed normally.

Note: It is permissible to chain resume operations together, and it is permissible to resume from a successfully completed operation. This allows the resume functionality to be used to re-basecall an input folder in order to basecall just the read files which have appeared in that folder since the last basecall operation was invoked on it.

The resume system works by batching reads internally, and recording to the logfile when those batches have been completed and written to disk. The --read_batch_size argument can be set to control the size of these batches, and controls the granularity at which resume operations can occur. Increasing the batch size will reduce the fragmentation of output FASTQ files but can increase the amount of time a resume operation takes, as more previously basecalled reads may be re-called, because their batch was not completed.
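
Because a resume must use exactly the same parameters as the original run, one way to avoid drift is to keep the arguments in a single variable and build both commands from it. The sketch below only prints the commands rather than running them. The paths `reads` and `basecalls` are hypothetical, and --save_path (the output folder option) is assumed here as it is not described in this section.

```shell
# Arguments shared by the first run and every resume.
# Hypothetical paths and config; substitute your own.
BASECALL_ARGS="--input_path reads --save_path basecalls --config dna_r10.4_e8.1_fast.cfg"

# Print (rather than run) the command for the initial basecall.
first_run_command() {
    echo "guppy_basecaller $BASECALL_ARGS"
}

# Print the matching resume command: identical arguments plus --resume.
resume_run_command() {
    echo "guppy_basecaller $BASECALL_ARGS --resume"
}

first_run_command
resume_run_command
```

If a non-default --read_batch_size was used for the first run, it would belong in the shared arguments as well.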
