-
Input files
Read .fast5 files, used as input to the basecalling software, must contain raw data. Raw data is included by default in .fast5 files generated by the MinKNOW software. Make sure you are using recent .fast5 files from the latest version of MinKNOW, as older files may not basecall properly with the set-out models and parameters provided in stand-alone Guppy.
POD5 files are also supported as input.
Both the alignment and barcoding software accept FASTQ files as input. These can be generated either by the Guppy basecallers or by the MinKNOW software.
-
Output file size
If you start with a .fast5 file that only has raw data in it and .fast5 output is enabled, file size increases to roughly 2X original size for 1D basecalling.
-
Folder structure
If using a version of MinKNOW which outputs reads in separate subfolders, it is necessary to use the --recursive option listed above to search through them to find input read files.
For example, if MinKNOW's output folder structure looks like this:
minknow_output_folder/
--- 0/
| --- file1.fast5
| --- file2.fast5
| [...]
--- 1/
| --- file10.fast5
| --- file11.fast5
| [...]
Then calling Guppy as follows will search through the numbered subfolders for input read files:
guppy_basecaller --input_path minknow_output_folder --recursive [...]
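To preview which files a recursive search would pick up before launching a run, a short script along these lines may help. It is only a sketch of the idea and does not reproduce Guppy's own file discovery; the function name `find_read_files` and the choice of the `.fast5` and `.pod5` suffixes are assumptions made for illustration.

```python
from pathlib import Path


def find_read_files(input_path, recursive=False, suffixes=(".fast5", ".pod5")):
    """Return a sorted list of read files under input_path.

    With recursive=False only the top-level folder is inspected; with
    recursive=True all subfolders are searched as well.
    (Hypothetical helper, not part of Guppy.)
    """
    root = Path(input_path)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix in suffixes)
```

Called on the example folder above without `recursive=True`, this returns an empty list, because all read files sit inside the numbered subfolders.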
-
Output formats
Guppy supports outputting FASTQ files, and optionally BAM, via the --bam_out argument. By default, FASTQ or BAM files will contain 4000 reads per file, according to the --records_per_fastq argument.
Multiple input files from the same run_id will be grouped into batches, where the number of reads in a batch is less than or equal to --read_batch_size. Individual input files will not be split across batches, even if this means a batch is larger than --read_batch_size. Output files for a batch will be split when --records_per_fastq reads have been recorded. In the case where --records_per_fastq is set to 0, all reads from a batch will be written into a single file (per run_id).
The default FASTQ header is:
{read_id} runid={run_id} read={read_number} ch={channel_id} start_time={start_time_utc}
- read_id is the unique ID for the read.
- sample_id is the user-specified sample ID which the read belongs to (read from tracking_id/sample_id in the source read file).
- read_number is the sequential read number for the channel (read from the read's read_number in the source read file).
- channel_id is the source channel within the flow cell for the read (read from the read's channel_number in the source read file).
- start_time_utc is the read's start time (calculated from tracking_id/exp_start_time and the read's start_time in the source read file).
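Because the header is a read ID followed by space-separated key=value pairs, it can be split generically. The sketch below assumes the header line may or may not carry the leading "@" of the FASTQ format; the function name `parse_fastq_header` is hypothetical.

```python
def parse_fastq_header(line):
    """Split a FASTQ header into (read_id, fields).

    fields maps each key in a key=value token to its value as a string.
    Tokens without '=' are ignored. (Hypothetical helper.)
    """
    line = line.strip()
    if line.startswith("@"):
        line = line[1:]
    tokens = line.split()
    read_id = tokens[0]
    fields = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return read_id, fields
```

Since the parser does not hard-code field names, any additional key=value fields that appear in the header are picked up without changes.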
If barcoding was performed, the FASTQ header will also include a barcodeid={barcode} field, where barcode is the normalised ID of the detected barcode arrangement.
If read splitting was performed, the FASTQ header will also include a parent_read_id={parent_read_id} field, where parent_read_id is the read_id of the original read from which this read was split.
-
Contents of the output folder
The save path will have the following structure once Guppy has finished running:
- guppy_basecaller_<time_and_date>.log
A log file of what Guppy did during this basecall session.
- sequencing_summary.txt
A tab-delimited text file containing useful information for each read analysed during this Guppy basecall.
- fastq_runid_<run_id>_<batch_id>_<file_number>.fastq
A collection of FASTQ files will be emitted containing the basecall results. Each FASTQ file may contain many reads. A set of FASTQ files will be generated for each run ID in the input file set. Additionally, depending on the --read_batch_size and --records_per_fastq settings, a single run ID may generate multiple FASTQ files.
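To get a rough feel for how --read_batch_size and --records_per_fastq interact, the batching rules can be modelled for a single run ID as below. This is a simplified sketch, not Guppy's implementation: it assumes files are packed greedily in input order, and the function names `plan_batches` and `files_per_batch` are hypothetical.

```python
import math


def plan_batches(reads_per_file, read_batch_size):
    """Group (file_name, n_reads) pairs into batches without splitting a file.

    A new batch is started when adding the next file would exceed
    read_batch_size. A single file larger than read_batch_size still forms
    its own (oversized) batch. Greedy packing is an assumption.
    """
    batches, current, count = [], [], 0
    for name, n_reads in reads_per_file:
        if current and count + n_reads > read_batch_size:
            batches.append(current)
            current, count = [], 0
        current.append(name)
        count += n_reads
    if current:
        batches.append(current)
    return batches


def files_per_batch(n_reads, records_per_fastq):
    """Number of output files for a batch of n_reads (n_reads > 0).

    records_per_fastq == 0 means everything goes into a single file.
    """
    if records_per_fastq == 0:
        return 1
    return math.ceil(n_reads / records_per_fastq)
```

For instance, under these assumptions a batch of 9000 reads with --records_per_fastq set to 4000 would be written as three files, and as one file if the value is 0.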
Note: The FASTQ files in the output folder may be separated into "pass", "fail", and "calibration_strands" folders, depending on whether they pass or fail the filtering conditions or whether they have been identified as a calibration strand. This behaviour may be controlled with the --disable_qscore_filtering and --calib_detect options. For example, if both q-score filtering and calibration strand detection are enabled, the output folder structure would look like this:
guppy_output_folder/
--- pass/
| fastq_runid_777_0.fastq
| fastq_runid_abc_0.fastq
| fastq_runid_abc_1.fastq
--- fail/
| fastq_runid_777_0.fastq
--- calibration_strands/
| fastq_runid_777_0.fastq
Whereas turning both q-score filtering and calibration strand detection off would produce a folder layout like this:
guppy_output_folder/
| fastq_runid_777_0.fastq
| fastq_runid_abc_0.fastq
| fastq_runid_abc_1.fastq
If barcode detection was performed, Guppy will demultiplex the reads into separate subfolders (within the 'pass' and 'fail' and 'calibration_strands' folders if applicable), like this example:
```
guppy_output_folder/
--- pass/
| ---barcode01/
| | fastq_runid_abc_0_0.fastq
| ---unclassified/
| | fastq_runid_abc_0_0.fastq
--- fail/
| ---unclassified/
| | fastq_runid_abc_0_0.fastq
```
Guppy will not empty the save path before writing the output, but it will overwrite existing FASTQ files.
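When post-processing results, the filtering status and barcode of a FASTQ file can be inferred from the subfolders it sits in, following the flat and demultiplexed layouts described above. The sketch below assumes only the folder names shown in those examples; `classify_fastq` is a hypothetical helper and does not handle the nested output layouts.

```python
from pathlib import Path

# Folder names used for filtering status in the layouts above.
FILTER_FOLDERS = {"pass", "fail", "calibration_strands"}


def classify_fastq(path, output_root):
    """Return (filter_status, barcode) for a FASTQ file under output_root.

    Either element is None when the corresponding subfolder is absent,
    for example when filtering or barcoding was not performed.
    """
    folders = Path(path).relative_to(output_root).parts[:-1]
    status = next((f for f in folders if f in FILTER_FOLDERS), None)
    barcode = next((f for f in folders if f not in FILTER_FOLDERS), None)
    return status, barcode
```

Combined with `Path(output_root).rglob("*.fastq")`, this gives a simple way to tabulate files per barcode and filtering status.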
Nested output folders
Guppy also supports an alternative output folder structure, designed to match that produced by MinKNOW. This can be enabled using the command line switch --nested_output_folder. When enabled, Guppy will further organise the output subfolders as follows:
guppy_output_folder/
--- {protocol_group_id} (if it exists in source fast5 files)/
| ---{sample_id}/
| | ---{experiment_start_time}_{device_id}_{flow_cell_id}_{protocol_run_id}/
| | | ---fastq_pass or fastq_fail/
| | | | ---{barcode classification} (if it exists for the read, otherwise this folder is absent)
An alternative nested folder output is available which is very similar to the above, but places the barcode classification directly under the protocol group id. This scheme can be enabled using the command line switch --barcode_nested_output_folder. The folders are organised as follows:
guppy_output_folder/
--- {protocol_group_id} (if it exists in source fast5 files)/
| ---{barcode classification}/ (if it exists for the read, otherwise this folder is absent)
| | ---{sample_id}/
| | | ---{experiment_start_time}_{device_id}_{flow_cell_id}_{protocol_run_id}/
| | | | ---fastq_pass or fastq_fail
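The two nested layouts differ only in where the barcode folder is placed. The following sketch assembles the expected folder for a read from its metadata, which can be useful when locating files in a script. The function `nested_output_dir` and its parameters are hypothetical, and `run_folder` stands for the combined {experiment_start_time}_{device_id}_{flow_cell_id}_{protocol_run_id} name.

```python
from pathlib import PurePosixPath


def nested_output_dir(root, sample_id, run_folder, passed,
                      protocol_group_id=None, barcode=None,
                      barcode_first=False):
    """Build the nested output folder for one read.

    barcode_first=False mirrors the --nested_output_folder layout;
    barcode_first=True mirrors --barcode_nested_output_folder.
    Optional levels are skipped when their value is missing.
    """
    path = PurePosixPath(root)
    if protocol_group_id:
        path = path / protocol_group_id
    if barcode and barcode_first:
        path = path / barcode
    path = path / sample_id / run_folder
    path = path / ("fastq_pass" if passed else "fastq_fail")
    if barcode and not barcode_first:
        path = path / barcode
    return path
```

`PurePosixPath` is used so that the result is the same on every platform; swap in `pathlib.Path` when working against the real filesystem.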
-
Ping information
Guppy collects high-level summary information when it is used, and by default this information is sent over your internet connection to Oxford Nanopore Technologies. This is important information that allows us to analyse the performance of Guppy and identify areas where we need to improve. Nothing specific about the genomic content of individual reads is included - only generic information is logged, such as sequence length and q-score, aggregated over all the reads processed by Guppy. The sending of this summary information can be turned off if desired by providing the --disable_pings option to Guppy.
Guppy collects this high-level summary information as follows:
- Individual reads are added to an aggregator as they are basecalled
- The summary ping(s) are written out to a file (.js)
- If not disabled, the summary ping(s) are sent to Oxford Nanopore
This type of information is collected:
- General information about the configuration of Guppy and the run(s) that the data came from:
  - the options provided to Guppy
  - the total number of reads seen, and those seen per channel
- Basecalling information:
  - the numbers of reads which passed or failed basecalling
  - the average sequence length
  - the distribution of mean q-scores
  - the distribution of basecalling speeds
Users are encouraged to browse the summary_telemetry.js file if they wish to see exactly what information Guppy is aggregating for telemetry.
-
Summary file contents
Guppy produces a summary file named sequencing_summary.txt during basecalling, which contains high-level information on every read analysed by the basecaller. This file is a tab-delimited text file which can be imported into common spreadsheet applications such as Excel or LibreOffice Calc, or read by software libraries such as NumPy or Pandas. Every read that is sent to the basecaller will have an entry in the summary file, regardless of whether or not that read was successfully basecalled.
When enabling extra functionality such as barcoding or alignment, additional columns will be added to the summary file. For this reason, and because the columns may occasionally be re-ordered, it is recommended that specific columns are accessed by their name (e.g. the read_id column) instead of the order in which they occur in the file.
Below is a list of summary file columns with a description of their contents. Very occasionally new columns may be added to the file without being described here; these columns should be considered unreliable and subject to change or removal.
- filename The name of the input read file the read came from.
- read_id The uuid that uniquely identifies this read.
- parent_read_id The uuid that uniquely identifies the original input read from which this read was generated. This column will only be present if --do_read_splitting is enabled. For unsplit reads, this value will be identical to read_id.
- run_id The uuid that uniquely identifies the sequencing run that this read came from.
- batch_id Integer identifier of the batch that Guppy put this read in. See the --read_batch_size parameter and the --resume option.
- channel The channel on the flow cell that the read came from.
- mux The mux in the channel that the read came from.
- start_time Start time of the read, in seconds since the beginning of the run.
- duration Duration of the read, in seconds.
- minknow_events The number of events detected by MinKNOW. Defaults to zero if unknown, or if the value cannot be determined due to read-splitting.
- passes_filtering Whether or not the read passed the qscore and alignment filters (the value is not affected by the --disable_qscore_filtering and --alignment_filtering flags). See the --min_qscore parameter.
- template_start Start time of the portion of the read that was sent to the basecaller after adapter trimming, in seconds since the beginning of the run. See the --trim_threshold, --trim_min_events, --max_search_len, --trim_strategy, and --dmean_win_size parameters.
- num_events_template Legacy field -- template_duration should be used instead.
- template_duration Duration of the portion of the read that was sent to the basecaller after adapter trimming, in seconds.
- sequence_length_template Number of bases in the output sequence, taking into account any sequence trimming. See "Barcode trimming".
- mean_qscore_template The qscore corresponding to the mean error rate of the sequence.
- strand_score_template Legacy field - no longer populated reliably.
- median_template The median current of the read, in pA.
- scaling_median_template The "median_template" value used by the basecaller to scale incoming data. May be different than median_template if adapter scaling or scaling overrides are used. See the --scaling_med parameter.
- scaling_mad_template The "mad_template" value used by the basecaller to scale incoming data. May be different than mad_template if adapter scaling or scaling overrides are used. See the --scaling_mad parameter.
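The mean_qscore_template description (the qscore corresponding to the mean error rate) is not the same as an arithmetic mean of per-base q-scores. The sketch below illustrates the difference using the standard Phred relation Q = -10 log10(p). It is an interpretation of the column description rather than Guppy's code, and whether Guppy excludes any bases from the calculation is not covered here.

```python
import math


def mean_qscore(qscores):
    """Q-score corresponding to the mean per-base error probability.

    Each q-score is converted to an error probability, the probabilities
    are averaged, and the average is converted back to a q-score.
    """
    error_probs = [10 ** (-q / 10) for q in qscores]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)
```

For two bases with q-scores 10 and 30 the result is roughly 13, well below the arithmetic mean of 20, because the low-quality base dominates the mean error rate.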
If barcoding/demultiplexing is enabled via the --barcode_kits argument, then the following columns are added to the sequencing summary file:
- barcode_arrangement The normalized name of the barcode classification, without a kit (e.g. "barcode01"), or "unclassified" if no classification could be made.
- barcode_full_arrangement The full name for the highest-scoring barcode match, including kit, variation, and direction (e.g. "RAB19_var2").
- barcode_kit The kit name belonging to the highest-scoring barcode match (e.g. "RAB").
- barcode_variant Which of the forward / reverse variants the highest-scoring barcode matched (e.g. "var1"), or "n/a" if no variants are available.
- barcode_score The score for either the front or rear barcode, whichever is higher. The maximum score is 100, with no minimum.
- barcode_front_id The full name for the barcode at the front of the strand, including direction (forward/reverse) and variant (1st/2nd) (e.g. "RAB19_2nd_FWD").
- barcode_front_score The score for the barcode at the front of the strand.
- barcode_front_refseq The reference sequence the barcode at the front of the strand was matched against.
- barcode_front_foundseq The sequence of the barcode at the front of the strand that matched barcode_front_refseq.
- barcode_front_foundseq_length The length of barcode_front_foundseq.
- barcode_front_begin_index The position in the called sequence, counting from the beginning, that barcode_front_foundseq begins at.
- barcode_rear_score The score for the barcode at the rear of the strand.
- barcode_rear_refseq The reference sequence the barcode at the rear of the strand was matched against.
- barcode_rear_foundseq The sequence of the barcode at the rear of the strand that matched barcode_rear_refseq.
- barcode_rear_foundseq_length The length of barcode_rear_foundseq.
- barcode_rear_end_index The position in the called sequence, counting backwards from the end, that barcode_rear_foundseq ends at.
If dual barcoding is used the following additional columns will be present:
- barcode_front_id_inner
- barcode_front_score_inner
- barcode_rear_id_inner
- barcode_rear_score_inner
These columns have the same meaning as the standard "id" and "score" columns above, but apply only to the inner front and rear barcodes. The standard "id" and "score" columns now apply to the outer barcodes.
For further details on how barcoding works see the "Barcoding/demultiplexing" section.
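As recommended above, summary columns are best looked up by name rather than position. A minimal sketch using only the Python standard library is shown below; it counts reads per barcode and filtering status. The function name `count_by_barcode` is hypothetical, and the passes_filtering values are kept as the raw strings found in the file.

```python
import csv
from collections import Counter


def count_by_barcode(summary_path):
    """Count reads per (barcode_arrangement, passes_filtering) pair.

    Columns are accessed by header name, so the result does not depend
    on column order or on which optional columns are present.
    """
    counts = Counter()
    with open(summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[(row["barcode_arrangement"], row["passes_filtering"])] += 1
    return counts
```

A `KeyError` on `barcode_arrangement` indicates that the summary file was produced without barcoding enabled.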
If LamPORE detection is enabled via the --lamp_detect argument, the following additional columns will be present:
- lamp_barcode_id The normalized name of the LAMP FIP barcode classification (e.g. "FIP01"), or "unclassified" if no classification could be made.
- lamp_barcode_score The alignment score for the best-scoring LAMP FIP barcode. Note that if the best score is below the threshold specified by --min_score_lamp, the score will still be reported here, although the classification will be "unclassified".
- lamp_target_id The target name of the LAMP target classification (e.g. "ACTB"), or "unclassified" if no classification could be made.
- lamp_target_score The alignment score for the best-scoring LAMP target. Note that if the best score is below the threshold specified by --min_score_lamp_target, the score will still be reported here, although the classification will be "unclassified".
If adapter detection is enabled via the --detect_adapter argument, the following additional columns will be present:
- adapter_front_id The name of the adapter (if any) found at the front of the strand. This will be "unclassified" if no adapter was found.
- adapter_front_score The alignment score of the adapter at the front of the strand. If unclassified this will be the score that was highest among the rejected sequences.
- adapter_front_begin_index The position in the called sequence of the beginning of the adapter, counting from the beginning of the strand.
- adapter_front_foundseq_length The length of the portion of the strand that aligned to adapter.
- adapter_rear_id The name of the adapter (if any) found at the rear of the strand. This will be "unclassified" if no adapter was found.
- adapter_rear_score The alignment score of the adapter at the rear of the strand. If unclassified this will be the score that was highest among the rejected sequences.
- adapter_rear_end_index The position in the called sequence of the end of the adapter, counting from the end of the strand.
- adapter_rear_foundseq_length The length of the portion of the strand that aligned to adapter.
If primer detection is enabled via the --detect_primer argument, the following additional columns will be present:
- primer_front_id The name of the primer (if any) found at the front of the strand. This will be "unclassified" if no primer was found.
- primer_front_score The alignment score of the primer at the front of the strand. If unclassified, this will be the score that was highest among the rejected sequences.
- primer_front_begin_index The position in the called sequence of the beginning of the primer, counting from the beginning of the strand.
- primer_front_foundseq_length The length of the portion of the strand that aligned to primer.
- primer_rear_id The name of the primer (if any) found at the rear of the strand. This will be "unclassified" if no primer was found.
- primer_rear_score The alignment score of the primer at the rear of the strand. If unclassified, this will be the score that was highest among the rejected sequences.
- primer_rear_end_index The position in the called sequence of the end of the primer, counting from the end of the strand.
- primer_rear_foundseq_length The length of the portion of the strand that aligned to primer.
If barcode trimming is enabled via --enable_trim_barcodes, or adapter or primer trimming is enabled via the --trim_adapters or --trim_primers arguments, the following additional columns will also be present:
- front_total_trimmed The number of bases removed from the front of the sequence as part of trimming.
- rear_total_trimmed The number of bases removed from the rear of the sequence as part of trimming.
If alignment is enabled via the --align_ref argument, then the following columns are added to the sequencing summary file:
- alignment_genome The name of the reference which the read aligned to, or "*" if no alignment was found.
- alignment_genome_start The position in the reference where the alignment started, or 0 if no alignment was found.
- alignment_genome_end The position in the reference where the alignment ended, or 0 if no alignment was found.
- alignment_strand_start The position in the called sequence where the alignment started, or 0 if no alignment was found.
- alignment_strand_end The position in the called sequence where the alignment ended, or 0 if no alignment was found.
- alignment_num_insertions The number of insertions in the alignment, or -1 if no alignment was found.
- alignment_num_deletions The number of deletions in the alignment, or -1 if no alignment was found.
- alignment_num_aligned The number of bases in the called sequence which aligned to bases in the reference, or -1 if no alignment was found.
- alignment_num_correct The number of aligned bases in the called sequence which match their corresponding reference base, or -1 if no alignment was found.
- alignment_identity The percentage of aligned bases which correctly match their corresponding reference base (alignment_num_correct/alignment_num_aligned), or -1 if no alignment was found.
- alignment_accuracy The percentage of all bases in the alignment which are correct (alignment_num_correct/(alignment_num_aligned + alignment_num_insertions + alignment_num_deletions)), or -1 if no alignment was found.
- alignment_score The score returned by minimap2, or -1 if no alignment was found.
- alignment_coverage The percentage of either the called sequence or the reference (whichever is shorter) that aligns (e.g. (alignment_strand_end - alignment_strand_start + 1)/sequence_length_template), or -1 if no alignment was found.
- alignment_direction The direction of the alignment, either forwards (+) or reverse (-), or "*" if no alignment was found. Note that genome positions (e.g. alignment_genome_start) are always given in the forwards direction.
- alignment_mapping_quality The mapping quality of the alignment. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value of 255 indicates that the mapping quality is not available.
- alignment_num_alignments The total number of alignments found. This will be zero if no alignment was found.
- alignment_num_secondary_alignments The number of alignments that were flagged by minimap2 as secondary alignments.
- alignment_num_supplementary_alignments The number of alignments that were flagged by minimap2 as supplementary alignments.
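The alignment_identity and alignment_accuracy formulas given above can be recomputed from the count columns, for example to cross-check a summary file. The sketch below returns the plain ratios exactly as written in the column descriptions; whether the file stores them scaled to 0-100 is not stated here, so treat the scale as an assumption. The function name `alignment_metrics` is hypothetical.

```python
def alignment_metrics(num_correct, num_aligned, num_insertions, num_deletions):
    """Return (identity, accuracy) as ratios from the alignment count columns.

    identity = num_correct / num_aligned
    accuracy = num_correct / (num_aligned + num_insertions + num_deletions)
    When no alignment was found (counts reported as -1), (-1.0, -1.0) is
    returned to mirror the sentinel used in the summary file.
    """
    if num_aligned <= 0:
        return -1.0, -1.0
    identity = num_correct / num_aligned
    accuracy = num_correct / (num_aligned + num_insertions + num_deletions)
    return identity, accuracy
```

Accuracy is never higher than identity, because insertions and deletions only enlarge the denominator.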