Alignment overview

The Guppy toolchain provides the guppy_aligner executable to allow users to perform reference genome alignment on basecalled reads. Alignment is performed against the supplied reference via an integrated minimap2 aligner; full details can be found at https://github.com/lh3/minimap2. To perform alignment, invoke the Guppy aligner with the minimum required parameters:

guppy_aligner --input_path <folder containing input files> --save_path <output folder> --align_ref <reference FASTA>

The input path will be searched for input FASTQ, FASTA, SAM and BAM files to perform alignment on. The align_ref is used to specify the reference genome. Sequences from a SAM or BAM file that have been stored as the reverse complement will be reverse-complemented before alignment in order to ensure the same results are produced when realigning the output file with the same options. When performing alignment, the Guppy aligner creates the following files in the output folder:
- alignment_summary.txt: Contains information about the best-quality alignment result for each read, such as alignment start, end, accuracy, etc. See "Summary file contents" in the "Input and output files" section for details.
- read_processor_log-\<date and time\>.log: A log file with information about the execution run.
- .sam or .bam: A SAM or BAM file is produced for each corresponding input file located in the input folder. If a successful alignment is found which passes the coverage filter, the SAM/BAM file will contain a CIGAR string representing the alignment. The default alignment coverage required to consider a result successful is 60%. If BAM file output is enabled, BAM files will be sorted by reference ID and then the leftmost coordinate.
Guppy aligner supports the following optional parameters:
- Version (--version): Prints the version of Guppy aligner.
- Help (-h or --help): Print a help message describing usage and all the available parameters.
- Quiet mode (-z or --quiet): This option prevents Guppy aligner from outputting anything to stdout. Stdout is short for “standard output” and is the default location to which a running program sends its output. For a command line executable, stdout will typically be sent to the terminal window from which the program was run.
- Verbose logging (--verbose_logs): Flag to enable verbose logging (outputting a verbose log file, in addition to the standard log files, which contains detailed information about the application). Off by default.
- Worker thread count (-t or --worker_threads): The number of worker threads to spawn for the aligner to use. Increasing this number will allow Guppy aligner to make better use of multi-core CPU systems, but may impact overall system performance.
- Recursive (-r or --recursive): Search through all subfolders contained in the --input_path value, and perform alignment on any .fastq, .fq, .fasta or .fa files found in them.
- BAM file output (--bam_out): This flag enables BAM file output. If the flag is not present, guppy_aligner defaults to SAM output.
- BAM file indexing (--index): This flag enables BAM file indexing. If the flag is present, guppy_aligner sorts the BAM file output and generates the BAI index file. This flag requires that --bam_out is also set. Disabled by default.
- Minimap options (--minimap_opt_string): This option specifies alignment options for the integrated minimap2 aligner, using the same flags and format supported by the minimap2 program. See the "Supported minimap2 options" section for the list of supported flags.
- Max records per output file (-q or --records_per_file): The maximum number of records to put in a single SAM or BAM file. Set this to zero to allow unlimited records per file. Note: setting to zero will have a performance impact due to holding all the records in memory until writing to disk. The default value is 4000.
- Perform read filtering based on alignment (--alignment_filtering): This flag allows reads to be filtered based on their alignment status. Reads with alignment results will be written to the pass folder, and unaligned reads to the fail folder.
- BED file (--bed_file): Path to .bed file containing areas of interest in reference genome. The emitted alignment_summary file will contain a column of alignment_bed_hits for the regions of interest.
- Alignment type (--align_type): Specify whether you want full or coarse alignment. Valid values are (auto/full/coarse).
- Progress stats reporting frequency (--progress_stats_frequency): Frequency, in seconds, at which to report progress statistics. If supplied, this replaces the default progress display.
- Trace category logs (--trace_category_logs): Enable trace logs. Takes a list of strings naming the desired trace categories.
- Trace domains config (--trace_domains_config): Configuration file containing the list of trace domains to include in verbose logging (if enabled).
- Disable pings (--disable_pings): Flag to disable sending any telemetry information to Oxford Nanopore Technologies. See the "Ping information" section for a summary of what is included in the Guppy telemetry.
- Telemetry URL (--ping_url): Override the default URL for sending telemetry pings.
- Ping segment duration (--ping_segment_duration): Duration in minutes of each ping segment.
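
As a minimal sketch of how the options above combine, the shell function below assembles a guppy_aligner command line and prints it rather than running it. The function name aligner_cmd, its yes/no arguments and the example paths are illustrative assumptions, not part of Guppy; only the flags themselves come from the list above. The sketch also enforces the documented rule that --index requires --bam_out.

```shell
# Print a guppy_aligner command line without executing it (dry run).
# Arguments: input_dir output_dir reference bam_out(yes/no) index(yes/no)
# aligner_cmd is a hypothetical helper; paths with spaces are not handled.
aligner_cmd() {
  # --index is documented as requiring --bam_out
  if [ "$5" = "yes" ] && [ "$4" != "yes" ]; then
    echo "error: --index requires --bam_out" >&2
    return 1
  fi
  cmd="guppy_aligner --input_path $1 --save_path $2 --align_ref $3"
  if [ "$4" = "yes" ]; then cmd="$cmd --bam_out"; fi
  if [ "$5" = "yes" ]; then cmd="$cmd --index"; fi
  echo "$cmd"
}

# Example with hypothetical paths:
aligner_cmd ./basecalled ./aligned ./reference.fasta yes yes
# prints: guppy_aligner --input_path ./basecalled --save_path ./aligned --align_ref ./reference.fasta --bam_out --index
```

To run the alignment, copy the printed line into a terminal once the paths match your own folders.
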
If the aligner reports more than one possible alignment, only the best one is output. An alignment that covers less than 60% of the read or of the reference will be rejected.
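
The 60% coverage rule can be illustrated with integer arithmetic. This is a sketch only: passes_coverage is a hypothetical helper, not a Guppy command, and the caller decides whether the total length passed in is the read length or the reference length.

```shell
# Succeed (exit status 0) if aligned_len covers at least 60% of total_len.
# Integer form of aligned/total >= 60/100:  aligned*100 >= total*60
# passes_coverage is a hypothetical helper for illustration.
passes_coverage() {
  [ $(( $1 * 100 )) -ge $(( $2 * 60 )) ]
}

# 750 of 1000 bases aligned (75%)
if passes_coverage 750 1000; then echo "pass"; else echo "fail"; fi   # prints: pass
# 500 of 1000 bases aligned (50%)
if passes_coverage 500 1000; then echo "pass"; else echo "fail"; fi   # prints: fail
```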

Index files produced by the bwa aligner should also work as an align_ref but are not explicitly supported.

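By default, the integrated minimap2 aligner is run with no additional arguments supplied to it, so the minimap2 default values are used for all alignments. The arguments can be modified only through --minimap_opt_string, and only for the flags listed in the "Supported minimap2 options" section.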

The minimap library integration Oxford Nanopore uses is available on our GitHub page here: http://github.com/nanoporetech/ont_minimap2

For more explanation of alignment-related columns output in the sequencing summary file, please refer to the Input and output files section of this protocol.

Supported minimap2 options

To list the flags currently supported by --minimap_opt_string, run:

<guppy_executable> --minimap_opt_string --help

In the list of flags below NUM represents an integer in human-readable format, e.g. 4000 can be specified as 4k.
The default value for each option is reported in square brackets after its description.
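
As an illustration of the NUM notation, the sketch below expands a suffixed value into a plain integer. It assumes decimal multipliers (k = 1,000, M = 1,000,000, G = 1,000,000,000), which is consistent with 4k meaning 4000, and it accepts only an integer before the suffix. parse_num is a hypothetical helper and is not part of Guppy or minimap2.

```shell
# Expand a NUM such as 4k, 500M or 4G into a plain integer.
# Assumption: decimal multipliers; the part before the suffix is an integer.
parse_num() {
  case "$1" in
    *[kK]) echo $(( ${1%[kK]} * 1000 )) ;;
    *[mM]) echo $(( ${1%[mM]} * 1000000 )) ;;
    *[gG]) echo $(( ${1%[gG]} * 1000000000 )) ;;
    *)     echo "$1" ;;   # no suffix: value is already a plain number
  esac
}

parse_num 4k     # prints: 4000
parse_num 500M   # prints: 500000000
parse_num 4G     # prints: 4000000000
```
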
- Indexing flags:
  - -H [ --hpc ] use homopolymer-compressed k-mer
  - -k [ --kmer-size ] INT k-mer size (no larger than 28) [15]
  - -w [ --window-size ] INT minimiser window size [10]
  - -I [ --batch-size ] NUM split index for every ~NUM input bases [4G]
- Mapping flags:
  - -f [ --mid-occ-frac ] FLOAT filter out top FLOAT fraction of repetitive minimisers [0.0002]
  - -g [ --max-gap ] NUM stop chain elongation if there are no minimisers in NUM-bp [5000]
  - -G [ --max-intron-len ] NUM max intron length (effective with -xsplice + changing -r) [200k]
  - -F [ --max-frag-len ] NUM max fragment length (effective with -xsr or in the fragment mode) [0]
  - -r [ --bandwidth ] NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]
  - -n [ --min-count ] INT minimal number of minimisers on a chain [3]
  - -m [ --min-chain-score ] INT minimal chaining score (matching bases minus log gap penalty) [40]
  - -X [ --skip-self-dual ] skip self and dual mappings (for the all-vs-all mode)
  - -p [ --pri-ratio ] FLOAT min secondary-to-primary score ratio [0.8]
  - -N [ --best-n ] INT retain at most INT secondary alignments [5]
- Alignment flags:
  - -A [ --match ] INT matching score [2]
  - -B [ --mismatch ] INT mismatch penalty (larger value for lower divergence) [4]
  - -O [ --gap-open ] INT[,INT] gap open penalty [4,24]
  - -E [ --gap-extension ] INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
  - -z [ --z-drop ] INT[,INT] Z-drop score and inversion Z-drop score [400,200]
  - -s [ --min-dp-score ] INT minimal peak DP alignment score [80]
  - -u [ --gt-ag ] CHAR how to find GT-AG. f:transcript strand, b:both strands, n:do not match GT-AG [n]
- Input/Output flags:
  - -L [ --long-cigar ] write CIGAR with >65535 ops at the CG tag
  - -c [ --cg ] output CIGAR in PAF
  - --cs STR output the cs tag; STR is 'short' (if absent) or 'long' [none]
  - --MD output the MD tag
  - --eqx write =/X CIGAR operators
  - -Y [ --softclip ] use soft clipping for supplementary alignments
  - -t [ --threads ] INT number of threads [1]
  - -K [ --mb-size ] NUM minibatch size for mapping [500M]
  - -V [ --version ] show version number
- Preset flags:
  - -x [ --preset ] STR preset (always applied before other options; see man minimap2.1 for details) []
    - map-pb/map-ont - PacBio CLR/Nanopore vs reference mapping
    - map-hifi - PacBio HiFi reads vs reference mapping
    - ava-pb/ava-ont - PacBio/Nanopore read overlap
    - asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence
    - splice/splice:hq - long-read/PacBio-CCS spliced alignment
    - sr - genomic short-read mapping
- Unsupported flags:
  - -d [ --dump-index ] FILE dump index to FILE []
  - -a [ --sam ] output in the SAM format (PAF by default)
  - -o [ --output ] FILE output alignments to FILE [stdout]
  - -R [ --rg ] STR SAM read group line in a format like '@RG\tID:foo\tSM:bar' []
See man minimap2.1 for detailed description of these and other advanced command-line options.
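
The gap cost formula given for -E, min{O1+k*E1,O2+k*E2}, can be worked through with the listed defaults (-O 4,24 and -E 2,1). The sketch below is an illustration only; gap_cost is a hypothetical helper, not part of Guppy or minimap2.

```shell
# Cost of a gap of length k: min(O1 + k*E1, O2 + k*E2).
# Usage: gap_cost k [O1 E1 O2 E2]; defaults are the listed -O 4,24 and -E 2,1.
gap_cost() {
  k=$1; o1=${2:-4}; e1=${3:-2}; o2=${4:-24}; e2=${5:-1}
  first=$(( o1 + k * e1 ))     # first penalty pair
  second=$(( o2 + k * e2 ))    # second penalty pair
  if [ "$first" -le "$second" ]; then echo "$first"; else echo "$second"; fi
}

gap_cost 5    # prints: 14  (4 + 5*2 beats 24 + 5*1 = 29)
gap_cost 20   # prints: 44  (both pairs give 44)
gap_cost 50   # prints: 74  (24 + 50*1 beats 4 + 50*2 = 104)
```

With the defaults, the first pair is cheaper for gaps shorter than 20 bases and the second pair is cheaper for longer gaps.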

Alignment index files

When aligning to large references (≥100 Mb) it is recommended to prepare an index file in advance for performance (to avoid generating the index during each run).

To create a minimap2 index file:
1. Download and install the minimap2 tool from: https://github.com/lh3/minimap2
2. Run the command:
minimap2 -I 32G -d <output.idx> <input.fasta>

-d writes the index to <output.idx>. -I 32G sets the number of reference bases that can be indexed before sharding occurs; this should be set to be larger than your reference length.

Sharding: By default, minimap2 splits references greater than 4 Gb into 'shards' within the index in order to reduce RAM usage. The strand is then aligned separately against each reference shard, which can lead to Guppy returning an incorrect alignment if the strand aligns to a reference that is not within the first shard.
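
To judge whether a reference is large enough to shard, its total length can be counted from the FASTA file. The sketch below is an illustration under stated assumptions: ref_bases is a hypothetical helper, the demo file is made up, the FASTA is uncompressed plain text, and the 4G threshold is the default -I value listed under the indexing flags.

```shell
# Sum the sequence lengths in a FASTA file (header lines start with '>').
# ref_bases is a hypothetical helper; the input must be uncompressed text.
ref_bases() {
  awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# Hypothetical demo reference holding 12 bases in total.
printf '>chr1\nACGTACGT\n>chr2\nACGT\n' > demo_ref.fasta
total=$(ref_bases demo_ref.fasta)
echo "reference bases: $total"          # prints: reference bases: 12

# Default -I is 4G (4000000000 bases); a longer reference would be sharded.
limit=4000000000
if [ "$total" -gt "$limit" ]; then
  echo "set -I above $total to avoid sharding"
else
  echo "default -I is sufficient"       # printed for the demo file
fi
rm -f demo_ref.fasta
```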

File conversion

The Guppy toolchain provides the bam_convert executable to convert files between the SAM and BAM formats. To convert a file, invoke bam_convert with the minimum required parameters:

bam_convert --input [input file name] --save [output filename]

bam_convert can also be used to merge multiple files into one by specifying a directory containing one or more BAM files and using the --merge flag:

bam_convert --input [input file path] --save [output filename] --merge

The input directory will be searched for BAM files, and the contents merged into a single output file.

bam_convert also supports the following optional parameters:
- Help (-h or --help): Print a help message describing usage and all the available parameters.
- Sort (--sort): Sort the records in the exported file by reference ID and then the leftmost coordinate.
- Recursive (-r or --recursive): When performing a merge, search the input directory recursively for input files.
- Index (--index): Generate an index file for the output BAM file.
- Merge headers (--merge_headers): Regenerate IDs for program group and read group tags to prevent clashes. If this option is omitted, bam_convert will use only the headers from the first file to be merged. This option is only valid when --merge is also present.
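
When a folder holds several SAM files, the basic conversion command can be generated once per file. The sketch below only prints the commands (dry run). convert_all and the demo folder are hypothetical, and the sketch assumes that bam_convert selects the output format from the output filename; check this against your installation.

```shell
# Print one bam_convert command per .sam file in a folder (dry run).
# convert_all is a hypothetical helper. Remove "echo" to run the conversions.
convert_all() {
  for sam in "$1"/*.sam; do
    [ -e "$sam" ] || continue     # folder has no .sam files
    echo bam_convert --input "$sam" --save "${sam%.sam}.bam"
  done
}

# Demo with two empty placeholder files in a hypothetical folder.
mkdir -p demo_sam
: > demo_sam/reads_0.sam
: > demo_sam/reads_1.sam
convert_all demo_sam
# prints:
# bam_convert --input demo_sam/reads_0.sam --save demo_sam/reads_0.bam
# bam_convert --input demo_sam/reads_1.sam --save demo_sam/reads_1.bam
rm -rf demo_sam
```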
