Small variants evaluation

Author: Andrea Guarracino

Synopsis

The major histocompatibility complex (MHC) is a large locus in vertebrate DNA containing a set of closely linked polymorphic genes that code for cell surface proteins essential for the adaptive immune system. In humans, the MHC region occurs on chromosome 6.

Steps

Data collection

We map HPRC chromosome 6 contigs to the MHC ALT haplotypes from GRCh38:

wfmash -t 16 -m -s 5000 -p 98 MHC.fa.gz chr6.pan.fa > HPRCy1.pan.chr6_vs_MHC.paf

Then, we make a BED file for the mappings and merge the regions in it to make a single BED record per matching contig:

awk '{ print $1, $3, $4 }' HPRCy1.pan.chr6_vs_MHC.paf | tr ' ' '\t' | sort -V > HPRCy1.MHC.bed
bedtools merge -d 1000000 -i HPRCy1.MHC.bed > HPRCy1.MHC.merge.bed

Finally, we extract the FASTA file for the regions:

bedtools getfasta -fi /lizardfs/erikg/HPRC/year1/parts/chr6.pan+refs.fa -bed HPRCy1.MHC.merge.bed | bgzip -c -@ 16 > HPRCy1.MHC.fa.gz
samtools faidx HPRCy1.MHC.fa.gz

Pangenome Sequence Naming

Contigs already respect the Pangenome Sequence Naming specification (PanSN-spec).

cut -f 1 HPRCy1.MHC.fa.gz.fai | head -n 6
HG00438#1#JAHBCB010000040.1:22870000-27725000
HG00438#2#JAHBCA010000042.1:22875000-27895000
HG00621#1#JAHBCD010000020.1:22865000-27905000
HG00621#2#JAHBCC010000005.1:28460000-33400000
HG00673#1#JAHBBZ010000030.1:28480000-33310000
HG00673#2#JAHBBY010000018.1:0-2900000

Divergence estimation

The human MHC locus presents a high sequence divergence, so we will set the mapping identity (-p parameter) in pggb to 95.

Sequence partitioning

As the dataset is small, we do not need to partition sequences (see the Sequence partitioning tutorial for more information).

Pangenome graph building

To build the MHC pangenome graph, execute:

pggb -i HPRCy1.MHC.fa.gz -p 95 -n 90 -t 16 -G 13117,13219 -o HPRCy1.MHC.s10k.p95.output

Graph statistics

To collect basic graph statistics, execute:

odgi stats -i HPRCy1.MHC.s10k.p95.output/*.final.gfa -t 16 -S
#length     nodes   edges   paths
5315371     309186  429323  126

Identify variants with vg

To call variants for each contig, execute:

vg deconstruct -P chm13 -H '?' -e -a -t 48 HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.39ffa23.e34d4cd.be6be64.smooth.final.gfa | \
    bgzip -c -@ 16 > HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.39ffa23.e34d4cd.be6be64.smooth.final.chm13.vcf.gz

-H '?' avoids managing the path name hierarchy when calling variants, then emitting variants for each contig.

To filter variants (by using nesting information from the pangenome graph) and realign reference and alternate alleles, execute:

vcfbub -l 0 -a 100000 --input HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.39ffa23.e34d4cd.be6be64.smooth.final.chm13.vcf.gz | \
    vcfwave -I 1000 -t 48 | bgzip -@ 16 \
    > HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.39ffa23.e34d4cd.be6be64.smooth.final.chm13.vcfbub.a100k.wave.vcf.gz

Take SNPs from the PGGB VCF file:

REF=chm13#chr6:28380000-33300000.fa
NAMEREF=chm13

cut -f 1 HPRCy1.MHC.fa.gz.fai | grep chm13 -v | while read CONTIG; do
    echo $CONTIG

    bash vcf_preprocess.sh \
        HPRCy1.MHC.s10k.p95.output/*.vcfbub.a100k.wave.vcf.gz \
        $CONTIG \
        1 \
        $REF
done

Identify variants with nucmer

Prepare the reference FASTA file:

samtools faidx HPRCy1.MHC.fa.gz chm13#chr6:28380000-33300000 > chm13#chr6:28380000-33300000.fa

Align each sequence against the reference with nucmer:

REF=chm13#chr6:28380000-33300000.fa
NAMEREF=chm13

mkdir -p nucmer/

cut -f 1 HPRCy1.MHC.fa.gz.fai | grep chm13 -v | while read CONTIG; do
    echo $CONTIG

    PREFIX=nucmer/${CONTIG}_vs_${NAMEREF}
    samtools faidx HPRCy1.MHC.fa.gz $CONTIG > $CONTIG.fa

    nucmer $REF $CONTIG.fa --prefix "$PREFIX"

    show-snps -THC "$PREFIX".delta > "$PREFIX".var.txt

    rm $CONTIG.fa
done

Using the nucmer2vcf.R script, generate VCF files for each sequence with respect to the reference with nucmer:

REF=chm13#chr6:28380000-33300000.fa
NAMEREF=chm13

NUCMER_VERSION="4.0.0beta2"
cut -f 1 HPRCy1.MHC.fa.gz.fai | grep chm13 -v | while read CONTIG; do
    echo $CONTIG

    PREFIX=nucmer/${CONTIG}_vs_${NAMEREF}
    Rscript nucmer2vcf.R "$PREFIX".var.txt $CONTIG $REF $NUCMER_VERSION $PREFIX.vcf
    bgzip -@ 16 $PREFIX.vcf
    tabix $PREFIX.vcf.gz
done

Variants evaluation

Prepare the reference in SDF format for variant evaluation with rtg vcfeval:

rtg format -o chm13#chr6:28380000-33300000.sdf chm13#chr6:28380000-33300000.fa

Compare nucmer-based SNPs with PGGB-based SNPs:

REFSDF=chm13#chr6:28380000-33300000.sdf
NAMEREF=chm13

cut -f 1 HPRCy1.MHC.fa.gz.fai | grep chm13 -v | while read CONTIG; do
    echo $CONTIG

    PREFIX=nucmer/${CONTIG}_vs_${NAMEREF}

    PATH_PGGB_VCF=HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.*.smooth.final.chm13.vcfbub.a100k.wave.${CONTIG}.max1.vcf.gz

    # Merge regions closer than 1000 bps to define the callable regions where to evaluate the variants
    dist=1000

    rtg vcfeval \
        -t $REFSDF \
        -b $PREFIX.vcf.gz \
        -c $PATH_PGGB_VCF \
        -T 16 \
        -e <(bedtools intersect -a <(bedtools merge -d $dist -i $PREFIX.vcf.gz ) -b <(bedtools merge -d $dist -i $PATH_PGGB_VCF)) \
        -o vcfeval/${CONTIG}
done

Collect statistics:

cd vcfeval
(echo contig precision recall f1.score; grep None */*txt | sed 's,/summary.txt:,,' | tr -s ' ' | cut -f 1,7,8,9 -d ' ' ) | tr ' ' '\t' > statistics.tsv

Plot statistics:

require(ggplot2)
require(tidyr)

stat_df <- read.table('statistics.tsv', sep = '\t', header = T, comment.char = '?')

stat_df <- pivot_longer(stat_df,precision:f1.score,"Metric")

ggplot(stat_df,aes(x=contig,y=value,fill=contig))+
  geom_bar(stat="identity") +
  facet_wrap(~Metric, ncol = 1)+
  theme_bw() +
  theme(
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank()
  ) +
  theme(legend.position="none")
../../_images/MHC.nucmer_vs_pggb.precision_recall_f1score.png