Larger Pangenomes

Although a nice example (:ref:`quick_start_example), the settings for the small, highly-diverse DRB1-3123 gene in the human HLA are typically too sensitive when building whole genomes.)

In practice, we usually need to keep -s in the 5-20kbp range, depending on context, to ensure that the resulting graphs maintain a structure reflective of the underlying homology of large regions of the genome, and not spurious matches caused by small repeats.

To ensure that we only get high-quality alignments, we might need to set -p higher, near the expected pairwise diversity of the sequences we're using (including structural variants in the diversity metric). In general, increasing -s, and -p decreases runtime and memory usage. However, it also decreases the compression of the graph.

For instance, a good setting for 10-20 genomes from the same species, with diversity from 1-5% would be -s 10000 -p 95 [-n 10]. However, if we wanted to include genomes from another species with higher divergence (say 20%), we might use -s 10000 -p 70 -n 10. The exact configuration depends on the application, and testing must be used to determine what is appropriate for a given study.

When SPOA/abPOA digests very complex and deep blocks, it might consume a huge amount of memory. This can be addressed with -T to specifically control the number of threads during the POA step. This leads to a lower memory consumption.