Larger Pangenomes

Although a nice example, the settings for the small, highly-diverse DRB1-3123 gene in the human HLA are typically too sensitive when building whole genomes. # TODO add link to first quick start example.

In practice, we usually need to set -s much higher, up to 50000 or 100000 depending on context, to ensure that the resulting graphs maintain a structure reflective of the underlying homology of large regions of the genome, and not spurious matches caused by small repeats.

To ensure that we only get high-quality alignments, we might need to set -p higher, near the expected pairwise diversity of the sequences we're using (including structural variants in the diversity metric). In general, increasing -s, and -p decreases runtime and memory usage. However, it also decreases the compression of the graph.

For instance, a good setting for 10-20 genomes from the same species, with diversity from 1-5% would be -s 100000 -p 90 -n 10. However, if we wanted to include genomes from another species with higher divergence (say 20%), we might use -s 100000 -p 70 -n 10. The exact configuration depends on the application, and testing must be used to determine what is appropriate for a given study.

When abPOA digests very complex and deep blocks, it might consume a huge amount of memory. This can be addressed with -T to specifically control the number of threads during the POA step. This leads to a lower memory consumption.