Alignment overview
The Guppy toolchain provides the guppy_aligner executable, which aligns basecalled reads to a reference genome. Alignment is performed against the supplied reference by an integrated minimap2 aligner; full details of minimap2 can be found at https://github.com/lh3/minimap2. To perform alignment, invoke the Guppy aligner with the minimum required parameters:
guppy_aligner --input_path <folder containing input files> --save_path <output folder> --align_ref <reference FASTA>
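When the aligner is driven from a script, the minimal invocation above can be assembled as an argument list. The following is only a sketch: the helper name build_aligner_command and the example paths are hypothetical, and nothing is executed.

```python
import shlex


def build_aligner_command(input_path, save_path, align_ref, extra_args=()):
    """Return the argument list for a guppy_aligner run (nothing is executed).

    Only the three required parameters documented above are set here;
    extra_args is a hypothetical hook for optional flags.
    """
    return [
        "guppy_aligner",
        "--input_path", str(input_path),
        "--save_path", str(save_path),
        "--align_ref", str(align_ref),
        *extra_args,
    ]


# Example paths are placeholders, not real files.
cmd = build_aligner_command("reads", "aligned", "ref/genome.fasta")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when folder names contain spaces.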
The input path is searched for FASTQ, FASTA, SAM and BAM files to align. The --align_ref parameter specifies the reference genome. Sequences from a SAM or BAM file that were stored as the reverse complement are reverse-complemented before alignment, so that realigning an output file with the same options produces the same results. When performing alignment, the Guppy aligner creates the following files in the output folder:
- alignment_summary.txt: Contains information about the best-quality alignment result for each read, such as alignment start, end, accuracy, etc. See "Summary file contents" in the "Input and output files" section for details.
- read_processor_log-<date and time>.log: A log file with information about the execution run.
- .sam or .bam: A SAM or BAM file is produced for each corresponding input file located in the input folder. If a successful alignment is found which passes the coverage filter, the SAM/BAM file will contain a CIGAR string representing the alignment. The default alignment coverage required to consider a result successful is 60%. If BAM file output is enabled, BAM files will be sorted by reference ID and then the leftmost coordinate.
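The BAM sort order described above (reference ID first, then leftmost coordinate) can be illustrated with a small sketch. The Record type and its field names are hypothetical stand-ins for alignment records, not part of Guppy.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for one alignment record."""
    read_id: str
    ref_id: int  # index of the reference sequence
    pos: int     # leftmost aligned coordinate on that reference


records = [
    Record("read_a", 1, 500),
    Record("read_b", 0, 9000),
    Record("read_c", 1, 20),
    Record("read_d", 0, 150),
]

# Sort by reference ID, then by leftmost coordinate within each reference.
sorted_records = sorted(records, key=lambda r: (r.ref_id, r.pos))
print([r.read_id for r in sorted_records])
# -> ['read_d', 'read_b', 'read_c', 'read_a']
```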
Guppy aligner supports the following optional parameters:
- Version (--version): Prints the version of Guppy aligner.
- Help (-h or --help): Prints a help message describing usage and all the available parameters.
- Quiet mode (-z or --quiet): Prevents the Guppy aligner from outputting anything to stdout. Stdout is short for “standard output” and is the default location to which a running program sends its output. For a command line executable, stdout will typically be sent to the terminal window from which the program was run.
- Verbose logging (--verbose_logs): Enables verbose logging, which writes a verbose log file containing detailed information about the application in addition to the standard log files. Off by default.
- Worker thread count (-t or --worker_threads): The number of worker threads to spawn for the aligner to use. Increasing this number will allow Guppy aligner to make better use of multi-core CPU systems, but may impact overall system performance.
- Recursive (-r or --recursive): Searches through all subfolders contained in the --input_path value, and performs alignment on any .fastq, .fq, .fasta or .fa files found in them.
- BAM file output (--bam_out): Enables BAM file output. If the flag is not present, guppy_aligner defaults to SAM output.
- BAM file indexing (--index): Enables BAM file indexing. If the flag is present, guppy_aligner sorts the BAM file output and generates the BAI index file. This flag requires that --bam_out is also set. Disabled by default.
- Minimap options (--minimap_opt_string): Specifies alignment options for the integrated minimap2 alignment algorithm, using the same flags and format supported by the minimap2 program. See "Supported minimap2 options" for the list of supported flags.
- Max records per output file (-q or --records_per_file): The maximum number of records to put in a single SAM or BAM file. Set this to zero to allow unlimited records per file. Note: setting this to zero will have a performance impact, because all the records are held in memory until they are written to disk. The default value is 4000.
- Perform read filtering based on alignment (--alignment_filtering): Filters reads based on their alignment status. Reads with alignment results will be written to the pass folder, and unaligned reads to the fail folder.
- BED file (--bed_file): Path to a .bed file containing areas of interest in the reference genome. The emitted alignment_summary file will contain an alignment_bed_hits column for the regions of interest.
- Alignment type (--align_type): Specifies whether to perform full or coarse alignment. Valid values are auto, full and coarse.
- Progress stats reporting frequency (--progress_stats_frequency): Frequency, in seconds, at which to report progress statistics. If supplied, this replaces the default progress display.
- Trace category logs (--trace_category_logs): Enables trace logs. Takes a list of strings with the desired names.
- Trace domains config (--trace_domains_config): Configuration file containing the list of trace domains to include in verbose logging (if enabled).
- Disable pings (--disable_pings): Disables sending any telemetry information to Oxford Nanopore Technologies. See the "Ping information" section for a summary of what is included in the Guppy telemetry.
- Telemetry URL (--ping_url): Overrides the default URL for sending telemetry pings.
- Ping segment duration (--ping_segment_duration): Duration in minutes of each ping segment.
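Some of the optional parameters above depend on each other; for example, --index requires --bam_out. A minimal sketch of a pre-flight check that assembles a few of these flags is shown below. The function name and keyword arguments are hypothetical, and the way values are passed (flag followed by a separate value) is an assumption based on the command format shown earlier.

```python
def optional_aligner_args(bam_out=False, index=False, recursive=False,
                          worker_threads=None, records_per_file=None):
    """Assemble a subset of the documented optional flags as a list.

    Hypothetical helper: it only checks the constraints stated in the
    option descriptions and does not run guppy_aligner.
    """
    if index and not bam_out:
        raise ValueError("--index requires --bam_out")
    if records_per_file is not None and records_per_file < 0:
        raise ValueError("--records_per_file must be zero (unlimited) or positive")

    args = []
    if recursive:
        args.append("--recursive")
    if bam_out:
        args.append("--bam_out")
    if index:
        args.append("--index")
    if worker_threads is not None:
        args += ["--worker_threads", str(worker_threads)]
    if records_per_file is not None:
        args += ["--records_per_file", str(records_per_file)]
    return args


print(optional_aligner_args(bam_out=True, index=True, worker_threads=8))
# -> ['--bam_out', '--index', '--worker_threads', '8']
```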
If the aligner reports more than one possible alignment, only the best one is output. An alignment that covers less than 60% of the read or of the reference will be rejected.
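The coverage rule can be sketched as follows. This is an interpretation, not Guppy's actual implementation: it assumes an alignment is kept when it covers at least 60% of either the read or the reference, and all function and parameter names are hypothetical.

```python
def passes_coverage(aligned_read_bases, read_length,
                    aligned_ref_bases, ref_length, min_coverage=0.60):
    """Return True if the alignment covers enough of the read or the reference.

    Assumption: the filter passes when EITHER coverage fraction reaches the
    threshold; the source text does not spell out the exact rule.
    """
    read_coverage = aligned_read_bases / read_length
    ref_coverage = aligned_ref_bases / ref_length
    return max(read_coverage, ref_coverage) >= min_coverage


# 700 of 1000 read bases aligned to a large genome: read coverage 70%.
print(passes_coverage(700, 1000, 700, 4_600_000))   # True
# Only 300 of 1000 read bases aligned: both fractions are below 60%.
print(passes_coverage(300, 1000, 300, 4_600_000))   # False
# A long read spanning most of a short 500-base reference: reference coverage 90%.
print(passes_coverage(450, 2000, 450, 500))         # True
```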
Index files produced by the bwa aligner should also work as an align_ref but are not explicitly supported. By default, the integrated minimap2 aligner is run with no additional arguments supplied to it, so the default values are used for all alignments. The arguments can only be modified through --minimap_opt_string.
The minimap library integration Oxford Nanopore uses is available on our GitHub page here: http://github.com/nanoporetech/ont_minimap2
For more explanation of alignment-related columns output in the sequencing summary file, please refer to the Input and output files section of this protocol.
Supported minimap2 options
To see the list of flags currently supported by --minimap_opt_string, run:
<guppy_executable> --minimap_opt_string --help
In the list of flags below, NUM represents an integer in human-readable format, e.g. 4000 can be specified as 4k.
The default value for each option is reported in square brackets after its description.
Indexing flags:
-H [ --hpc ]: use homopolymer-compressed k-mer
-k [ --kmer-size ] INT: k-mer size (no larger than 28) [15]
-w [ --window-size ] INT: minimiser window size [10]
-I [ --batch-size ] NUM: split index for every ~NUM input bases [4G]
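The human-readable NUM format used by options such as -I can be expanded with a short sketch like the one below. It assumes decimal multipliers (k = 10^3, M = 10^6, G = 10^9) and case-insensitive suffixes; the function name parse_num is hypothetical.

```python
def parse_num(text):
    """Convert a human-readable NUM such as '4k' or '4G' to an integer.

    Assumption: suffixes are decimal multipliers and case-insensitive.
    """
    suffixes = {"k": 10**3, "m": 10**6, "g": 10**9}
    text = text.strip()
    if not text:
        raise ValueError("empty NUM value")
    multiplier = suffixes.get(text[-1].lower())
    if multiplier is None:
        return int(text)  # plain integer, no suffix
    return int(float(text[:-1]) * multiplier)


for value in ("4k", "5000", "200k", "500M", "4G"):
    print(value, parse_num(value))
```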
Mapping flags:
-f [ --mid-occ-frac ] FLOAT: filter out top FLOAT fraction of repetitive minimisers [0.0002]
-g [ --max-gap ] NUM: stop chain elongation if there are no minimisers in NUM-bp [5000]
-G [ --max-intron-len ] NUM: max intron length (effective with -x splice + changing -r) [200k]
-F [ --max-frag-len ] NUM: max fragment length (effective with -x sr or in the fragment mode) [0]
-r [ --bandwidth ] NUM[,NUM]: chaining/alignment bandwidth and long-join bandwidth [500,20000]
-n [ --min-count ] INT: minimal number of minimisers on a chain [3]
-m [ --min-chain-score ] INT: minimal chaining score (matching bases minus log gap penalty) [40]
-X [ --skip-self-dual ]: skip self and dual mappings (for the all-vs-all mode)
-p [ --pri-ratio ] FLOAT: min secondary-to-primary score ratio [0.8]
-N [ --best-n ] INT: retain at most INT secondary alignments [5]
Alignment flags:
-A [ --match ] INT: matching score [2]
-B [ --mismatch ] INT: mismatch penalty (larger value for lower divergence) [4]
-O [ --gap-open ] INT[,INT]: gap open penalty [4,24]
-E [ --gap-extension ] INT[,INT]: gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
-z [ --z-drop ] INT[,INT]: Z-drop score and inversion Z-drop score [400,200]
-s [ --min-dp-score ] INT: minimal peak DP alignment score [80]
-u [ --gt-ag ] CHAR: how to find GT-AG. f:transcript strand, b:both strands, n:do not match GT-AG [n]
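As a worked example of the gap cost formula given for -E, the sketch below evaluates min{O1+k*E1, O2+k*E2} with the default values listed above (-O 4,24 and -E 2,1). The function name gap_cost is hypothetical.

```python
def gap_cost(k, gap_open=(4, 24), gap_ext=(2, 1)):
    """Cost of a k-long gap: min{O1 + k*E1, O2 + k*E2}.

    Defaults are the -O and -E defaults listed above.
    """
    (o1, o2), (e1, e2) = gap_open, gap_ext
    return min(o1 + k * e1, o2 + k * e2)


for k in (1, 10, 20, 30):
    print(k, gap_cost(k))
# 1 -> 6, 10 -> 24, 20 -> 44, 30 -> 54
```

With these defaults the first penalty pair is cheaper for gaps shorter than 20 bases and the second pair is cheaper for longer gaps, so long gaps are penalised less steeply per base.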
Input/Output flags:
-L [ --long-cigar ]: write CIGAR with >65535 ops at the CG tag
-c [ --cg ]: output CIGAR in PAF
--cs STR: output the cs tag; STR is 'short' (if absent) or 'long' [none]
--MD: output the MD tag
--eqx: write =/X CIGAR operators
-Y [ --softclip ]: use soft clipping for supplementary alignments
-t [ --threads ] INT: number of threads [1]
-K [ --mb-size ] NUM: minibatch size for mapping [500M]
-V [ --version ]: show version number
Preset flags:
-x [ --preset ] STR: preset (always applied before other options; see man minimap2.1 for details) []
map-pb/map-ont - PacBio CLR/Nanopore vs reference mapping
map-hifi - PacBio HiFi reads vs reference mapping
ava-pb/ava-ont - PacBio/Nanopore read overlap
asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence
splice/splice:hq - long-read/PacBio-CCS spliced alignment
sr - genomic short-read mapping
Unsupported flags:
-d [ --dump-index ] FILE: dump index to FILE []
-a [ --sam ]: output in the SAM format (PAF by default)
-o [ --output ] FILE: output alignments to FILE [stdout]
-R [ --rg ] STR: SAM read group line in a format like '@RG\tID:foo\tSM:bar' []
See man minimap2.1 for detailed description of these and other advanced command-line options.
Alignment index files
When aligning to large references (≥100 Mb) it is recommended to prepare an index file in advance for performance (to avoid generating the index during each run).
To create a minimap2 index file:
- Download and install the minimap2 tool from: https://github.com/lh3/minimap2
- Run the command:
minimap2 -I 32G -d <output.idx> <input.fasta>
-I 32G sets the number of reference bases that are indexed before sharding occurs. This should be set larger than your reference length.
Sharding: By default, minimap2 splits references larger than 4 Gb into 'shards' within the index in order to reduce RAM usage. The strand is then aligned separately against each reference shard, which can lead to Guppy returning an incorrect alignment if the strand aligns to a reference that is not within the first shard.
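To decide whether a larger -I value is needed, the total reference length can be compared with the default batch size. The sketch below assumes an uncompressed FASTA file and treats the 4G default as 4,000,000,000 bases; the function names and the demo file are hypothetical.

```python
import os
import tempfile


def reference_length(fasta_path):
    """Sum the sequence lengths in an uncompressed FASTA file."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):  # skip header lines
                total += len(line.strip())
    return total


def exceeds_batch_size(total_bases, batch_size_bases=4_000_000_000):
    """True if the reference would be split into more than one shard."""
    return total_bases > batch_size_bases


# Tiny demo reference written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">chr1\nACGTACGT\nACGT\n>chr2\nGGCC\n")
    demo_path = tmp.name

total = reference_length(demo_path)
os.remove(demo_path)
print(total, exceeds_batch_size(total))  # 16 False
```

If the check returns True for a real reference, choosing an -I value above the reported total keeps the whole reference in a single shard.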
File conversion
The Guppy toolchain provides the bam_convert executable to convert files between the SAM and BAM formats. To convert a file, invoke bam_convert with the minimum required parameters:
bam_convert --input [input file name] --save [output filename]
bam_convert can also be used to merge multiple files into one by specifying a directory containing one or more BAM files and using the --merge flag:
bam_convert --input [input file path] --save [output filename] --merge
The input directory will be searched for BAM files, and the contents merged into a single output file.
bam_convert also supports the following optional parameters:
- Help (-h or --help): Print a help message describing usage and all the available parameters.
- Sort (--sort): Sort the records in the exported file by reference ID and then the leftmost coordinate.
- Recursive (-r or --recursive): When performing a merge, search the input directory recursively for input files.
- Index (--index): Generate an index file for the output BAM file.
- Merge header (--merge_headers): Regenerate IDs for program group and read group tags to prevent clashes. If this option is omitted, bam_convert will use only the headers from the first file to be merged. This option is only valid when --merge is also present.
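A merge invocation with the options above can be assembled in a script as sketched below. The helper name and the example paths are hypothetical; the only constraint enforced is the documented one that --merge_headers is valid only together with --merge. Nothing is executed.

```python
def build_bam_convert_command(input_path, save_path, merge=False,
                              merge_headers=False, sort=False, index=False):
    """Return the argument list for a bam_convert run (nothing is executed)."""
    if merge_headers and not merge:
        raise ValueError("--merge_headers is only valid together with --merge")

    cmd = ["bam_convert", "--input", str(input_path), "--save", str(save_path)]
    if merge:
        cmd.append("--merge")
    if merge_headers:
        cmd.append("--merge_headers")
    if sort:
        cmd.append("--sort")
    if index:
        cmd.append("--index")
    return cmd


# Example paths are placeholders, not real files.
merge_cmd = build_bam_convert_command("bam_dir", "merged.bam",
                                      merge=True, merge_headers=True, sort=True)
print(" ".join(merge_cmd))
```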