Divergence estimation

Synopsis

pggb exposes parameters that allow users to influence the structure of the graph that will represents the input sequences. In particular, reducing the mapping identity (-p parameter) increases the sensitivity of the alignment, leading to more compressed graphs. It is recommended to change this parameter depending on how divergent are the input sequences.

Steps

Here we show how to estimate the sequence divergence of 7 Saccharomyces cerevisiae assemblies (Yue et al., 2016) from the Yeast Population Reference Panel (YPRP).

Download the assemblies

Assuming that your current working directory is the root of the pggb repository, to download the YPRP panel, execute:

mkdir -p assemblies/yprp_panel
cd assemblies/yprp_panel
cat ../../docs/data/scerevisiae.yprp.urls | parallel -j 4 'wget -q {} && echo got {}'

For each sample, to put genome and mitochondrial sequences together, execute:

ls *.fa.gz | cut -f 1 -d '.' | uniq | while read f; do
    echo $f
    zcat $f.* > $f.fa
done

Pangenome Sequence Naming

To change the sequence names according to PanSN-spec, we use fastix:

ls *.fa | while read f; do
    sample_name=$(echo $f | cut -f 1 -d '.');
    echo ${sample_name}
    fastix -p "${sample_name}#1#" $f >> scerevisiae7.fasta
done
bgzip -@ 4 scerevisiae7.fasta
samtools faidx scerevisiae7.fasta.gz

We specify haplotype_id equals to 1 for all the assemblies, as they are all haploid.

Sequence divergence

Assuming we will work with one chromosome at a time, we estimate the sequence divergence for each set of chromosomes. To partition the sequences by chromosome, execute:

cut -f 1 scerevisiae7.fasta.gz.fai | cut -f 3 -d '#' | sort | uniq | while read CHROM; do
    CHR_FASTA=scerevisiae7.$CHROM.fasta.gz
    samtools faidx scerevisiae7.fasta.gz $(grep -P $"$CHROM\t" scerevisiae7.fasta.gz.fai | cut -f 1) | bgzip -@ 4 > $CHR_FASTA

    echo "Generated $CHR_FASTA"
done

For each set of chromosomes, to estimate the distance of each input sequence to every other sequence in the set, we use mash. In particular, the mash triangle command outputs a lower-triangular distance matrix. For example, to compute the distances between all mitochondrial sequences, execute:

mash triangle scerevisiae7.chrMT.fasta.gz > scerevisiae7.chrMT.mash_triangle.txt
cat scerevisiae7.chrMT.mash_triangle.txt | column -t

7
DBVPG6044#1#chrMT
DBVPG6765#1#chrMT    0.0192445
S288C#1#chrMT        0.0151342  0.0182524
SK1#1#chrMT          0.0023533  0.0202797  0.0144049
UWOPS034614#1#chrMT  0.0186813  0.0210181  0.0185579  0.0179508
Y12#1#chrMT          0.0188053  0.0208145  0.0126347  0.0178312  0.0148187
YPS128#1#chrMT       0.0170687  0.0198213  0.0136991  0.0175939  0.0141502  0.0131603

The distance is between 0 (identical sequences) and 1. This shows that we have 7 sequences and the distances are up to a few percent. To identify the maximum divergence, execute:

sed 1,1d scerevisiae7.chrMT.mash_triangle.txt | tr '\t' '\n' | grep chr -v | LC_ALL=C sort -g -k 1nr | uniq | head -n 1

0.0210181

To compute the maximum divergence for each set of chromosomes, execute:

ls scerevisiae7.*.fasta.gz | while read CHR_FASTA; do
    CHROM=$(echo $CHR_FASTA | cut -f 2 -d '.')
    MAX_DIVERGENCE=$(mash triangle -p 4 $CHR_FASTA | sed 1,1d | tr '\t' '\n' | grep chr -v | LC_ALL=C  sort -g -k 1nr | uniq | head -n 1)

    echo -e "$CHROM\t$MAX_DIVERGENCE" >> scerevisiae7.divergence.txt
done

cat scerevisiae7.divergence.txt | column -t

chrI     0.0178312
chrII    0.00804257
chrIII   0.0121679
chrIV    0.00759618
chrIX    0.0106545
chrMT    0.0210181
chrV     0.00892796
chrVI    9.55247e-05
chrVII   0.0639874
chrVIII  0.0385787
chrX     0.0357395
chrXI    0.0324062
chrXII   0.00900687
chrXIII  0.052117
chrXIV   0.00838426
chrXV    0.0081558
chrXVI   0.00838426

From this analysis, chrVII, chrVIII, and chrXIII sets show the higher sequence divergence, with maximum value of 0.0639874. In general, we should set a mapping identity value lower than or equal to 100 - max_divergence * 100. That is, to analyze this YPRP panel, we have to specify -p lower than or equal to 93.60126. However, in order to account for possible underestimates of sequence divergence, and medium/large structural variants leading locally to greater divergence, we recommend setting an even smaller mapping identity, like -p 90.

Inter-chromosome estimations

The YPRP panel presents known structural inter-chromosome rearrangements, for example between chromosomes chrVII and chrVIII (see the Sequence partitioning tutorial for more information). This can explain why those sets present a higher intra-chromosome divergence.

To estimate the sequence divergence between chrVII and chrVIII chromosomes, execute:

mash triangle scerevisiae7.community.0.fa.gz -s 10000 > scerevisiae7.community.0.mash_triangle.txt
cat scerevisiae7.community.0.mash_triangle.txt | column -t

14
DBVPG6044#1#chrVII
S288C#1#chrVII         0.00811043
SK1#1#chrVII           0.00160605  0.00759986
Y12#1#chrVII           0.00721034  0.00710972  0.00740957
YPS128#1#chrVII        0.00641651  0.00659004  0.00659004  0.00505504
DBVPG6044#1#chrVIII    0.243761    0.222464    0.23782     0.215016    0.221466
S288C#1#chrVIII        0.222464    0.203604    0.217673    0.194429    0.214163    0.00928959
SK1#1#chrVIII          0.250557    0.221466    0.242205    0.215016    0.216771    0.000911163  0.00912584
UWOPS034614#1#chrVII   0.0625263   0.0623939   0.0627391   0.0625528   0.0610962   0.0635756    0.0649949   0.0635211
UWOPS034614#1#chrVIII  0.0512793   0.0517248   0.0514142   0.049118    0.0496087   0.0358693    0.0373731   0.0359997   0.258493
Y12#1#chrVIII          0.236445    0.222464    0.236445    0.216771    0.231311    0.00815958   0.0081861   0.00759251  0.0625528  0.0354817
YPS128#1#chrVIII       0.239237    0.199716    0.213325    0.208588    0.191209    0.00728616   0.00790009  0.006864    0.0628727  0.0345966  0.00614217
DBVPG6765#1#chrVII     0.00890041  0.00474338  0.0079113   0.00866976  0.00768463  0.26803      0.256381    0.270747    0.0628995  0.0512793  0.26546     0.23011
DBVPG6765#1#chrVIII    0.273629    0.26546     0.273629    0.26803     0.270747    0.00861155   0.0060471   0.00853035  0.0656244  0.0360829  0.00873594  0.00826588  0.26546

The scerevisiae7.community.0.fa.gz file contains the sequences of chrVII and chrVIII sets in FASTA format (follow the Sequence partitioning tutorial to obtain the FASTA files for all the communities detectable in the YPRP panel). The -s 10000 value in mash triangle specifies a bigger sketch size for each sequence to compare: a higher value allows for more accurate estimates (see here how the distance estimation works).

The output shows that, generally, sequences from different chromosomes present a very high sequence divergence (greater than ~0.20). However, for example, the UWOPS034614#1#chrVIII sequence presents a much lower divergence with respect to the other chrVII sequences;, as shown by the following row of the lower-triangular distance matrix:

UWOPS034614#1#chrVIII  0.0512793   0.0517248   0.0514142   0.049118    0.0496087   0.0358693    0.0373731   0.0359997   0.258493

Similar considerations hold true for the DBVPG6765#1#chrVII sequence. Such lower sequence divergences are due to the structural rearrangements between these chromosomes (Yue et al., 2016).