Modified base calling
It is now possible to use Guppy to identify certain types of modified bases. This requires the use of a specific basecalling model which is trained to identify one or more types of modification. Configuration files for these new models can generally be identified by the inclusion of "modbases" in their name (e.g. dna_r9.4.1_450bps_modbases_5mc_hac.cfg). The tokens following "modbases" will generally provide information about the type of modifications that will be looked for. For example, "5mc_cg" indicates that it will look for 5mC modifications in a CG context.
Modified base call results can currently be stored in BAM files. BAM file output (--bam_out) will automatically be enabled if a modified basecall model is detected in the configuration. It is also possible to extract the raw modified base information from a called read via the Guppy client API in C++ or Python. Note that to get back modified base information via the Guppy client API, move and trace data must be enabled (see the API documentation for more details).
Raw modified base table format
The raw modified table (as available via the client API) is a two-dimensional array, where each row of the table relates to the corresponding base in the associated canonical sequence. For example, the first row of the table (row 0) will correspond to the first base in the canonical basecall sequence.
Each row contains a number of columns equal to the number of canonical bases (four) plus the number of modifications present in the model. The columns list the bases in alphabetical order (ACGT for DNA, ACGU for RNA), and each base is immediately followed by columns corresponding to the modifications that apply to that particular base. For example, with a model that identified modifications for 6mA and 5mC, the column ordering would be A 6mA C 5mC G T.
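The column ordering rule can be sketched in a few lines of Python. This is only an illustration of the rule described above, not part of the Guppy client API: the function name modbase_columns and the mods_by_base dictionary are hypothetical, and the modification labels are whatever the chosen model reports.

```python
def modbase_columns(mods_by_base, alphabet="ACGT"):
    """Build column labels for a raw modified base table.

    mods_by_base is a hypothetical mapping from a canonical base to the
    list of modifications the model can report for that base.
    Use alphabet="ACGU" for RNA.
    """
    columns = []
    for base in alphabet:
        # Each canonical base comes first...
        columns.append(base)
        # ...immediately followed by the modifications that apply to it.
        columns.extend(mods_by_base.get(base, []))
    return columns


# A model that identifies 6mA and 5mC, as in the example above.
columns = modbase_columns({"A": ["6mA"], "C": ["5mC"]})
print(columns)  # ['A', '6mA', 'C', '5mC', 'G', 'T']
```

The number of columns is four plus the number of modifications, so this example yields six columns.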
Each table row gives, for the base that was called at that position, the likelihood that the base is either a canonical one (i.e. a base that the model considers to be "unmodified") or one of the modifications contained within the model. The contents of the table are integers in the range of 0-255, which represent likelihoods in the range of 0-100% (storing these values as integers allows us to reduce .fast5 file size). For example, a likelihood of 100% corresponds to a table entry of 255. Within a given row, the table entries for a particular base will sum to 100% (that is, to 255).
Following from our previous example with 6mA and 5mC, you might see a table with row entries like these:
[63, 192, 0, 0, 0, 0],
[0, 0, 255, 0, 0, 0],
[0, 0, 0, 0, 255, 0],
[0, 0, 0, 0, 0, 255],
This would mean that:
- An A was called for the first base, and the likelihood that it is a canonical A is ~25% (63 / 255), and the likelihood that it is 6mA is ~75% (192 / 255).
- A C was called for the second base, and the likelihood that it is a canonical C is 100% (255 / 255), with no chance (0 / 255) of it being a 5mC.
- A G and then a T were called for the third and fourth bases. The likelihood that they are canonical bases is 100% (255 / 255 -- this should always be the case, as the model does not include any modification states for G or T).
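The arithmetic in this example can be reproduced with a short Python sketch. It assumes the table has already been retrieved as a list of integer rows and that the column order is A 6mA C 5mC G T; the names COLUMNS and decode_row are hypothetical helpers, not part of the Guppy client API.

```python
# Column order for a model with 6mA and 5mC (assumed, as in the example above).
COLUMNS = ["A", "6mA", "C", "5mC", "G", "T"]


def decode_row(row, columns=COLUMNS):
    """Convert one table row of 0-255 integers into likelihoods between 0 and 1.

    Only non-zero entries are returned, since entries for bases that were
    not called are expected to be 0 with the current models.
    """
    return {name: value / 255 for name, value in zip(columns, row) if value > 0}


table = [
    [63, 192, 0, 0, 0, 0],
    [0, 0, 255, 0, 0, 0],
    [0, 0, 0, 0, 255, 0],
    [0, 0, 0, 0, 0, 255],
]

for position, row in enumerate(table):
    likelihoods = decode_row(row)
    print(position, {name: round(p, 2) for name, p in likelihoods.items()})
```

For the first row this gives roughly 0.25 for A and 0.75 for 6mA, matching the percentages quoted above.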
Note that for the current modified base models in Guppy, the likelihoods will all be 0 for the bases that were not called, since the modification detection is performed after determining the called sequence. This was not the case for previous versions of the software, which used a different method to determine the probabilities.
BAM file modified base format
If a modified base call model is selected, Guppy will emit BAM files as if the --bam_out flag had been set. Modified bases will be encoded into the BAM modified base format in the metadata tags MM and ML. For configurations that only look for modifications within a specific context (which is currently the case for all of our supported modified base configurations), a ? will be used in the MM tag to indicate that the modification probability is unknown for any bases of the specified type that were skipped, and results will only be output for bases that match the context. If any context-free modification configurations are used, then the ? will not appear in the tag, and only instances of the base that exceed the specified threshold will be output. For more information on the BAM modified base format, see the "Base Modifications" section of the SAM optional fields specification here: https://samtools.github.io/hts-specs/SAMtags.pdf
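As a rough illustration of how the MM and ML tags fit together, the following Python sketch parses an MM string and its ML values according to the layout described in the SAM optional fields specification. It is a minimal sketch, not Guppy code: the function parse_mm_ml and the example tag values are made up for illustration, and in practice the tags would be read from the BAM record with a BAM library. Per the specification, an ML value N corresponds to a probability between N/256 and (N + 1)/256; the sketch reports the lower bound.

```python
import re

# MM entry header: canonical base, strand, modification code(s), optional . or ? flag.
MM_HEADER = re.compile(r"([ACGTUN])([+-])([a-z]+|[0-9]+)([.?]?)")


def parse_mm_ml(mm, ml):
    """Pair each modification call in an MM string with its ML value.

    mm is the MM tag string (e.g. "C+m?,1,0;") and ml is the list of
    0-255 integers from the ML tag. "occurrence" is the 0-based index of
    the call among bases of that type in the read sequence.
    """
    records = []
    ml_values = iter(ml)
    for entry in filter(None, mm.split(";")):
        header, *skips = entry.split(",")
        match = MM_HEADER.fullmatch(header)
        if match is None:
            raise ValueError(f"unrecognised MM entry: {entry!r}")
        base, strand, codes, flag = match.groups()
        # Several single-letter codes may share one entry (e.g. "C+mh").
        mod_codes = list(codes) if codes.isalpha() else [codes]
        occurrence = -1
        for skip in skips:
            # Each number is how many bases of this type to skip first.
            occurrence += int(skip) + 1
            for code in mod_codes:
                value = next(ml_values)
                records.append({
                    "base": base,
                    "strand": strand,
                    "code": code,
                    "occurrence": occurrence,
                    "ml": value,
                    "min_probability": value / 256,
                    # "?" means skipped bases have unknown modification status.
                    "skipped_unknown": flag == "?",
                })
    return records


# Made-up example: 5mC calls on the 2nd and 3rd C in the read.
for record in parse_mm_ml("C+m?,1,0;", [200, 13]):
    print(record)
```

In this made-up example the first C in the read is skipped, and because the entry carries the ? flag its modification status is treated as unknown rather than unmodified.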