.. _small-variants-evaluation: #################### Small variants evaluation #################### **Author**: `Andrea Guarracino `_ ======== Synopsis ======== The `major histocompatibility complex `_ (MHC) is a large locus in vertebrate DNA containing a set of closely linked polymorphic genes that code for cell surface proteins essential for the adaptive immune system. In humans, the MHC region occurs on chromosome 6. ===== Steps ===== ------------------------- Data collection ------------------------- We map HPRC chromosome 6 contigs to the MHC ALT haplotypes from GRCh38: .. code-block:: bash wfmash -t 16 -m -s 5000 -p 98 MHC.fa.gz chr6.pan.fa > HPRCy1.pan.chr6_vs_MHC.paf Then, we make a BED file for the mappings and merge the regions in it to make a single BED record per matching contig: .. code-block:: bash awk '{ print $1, $3, $4 }' HPRCy1.pan.chr6_vs_MHC.paf | tr ' ' '\t' | sort -V > HPRCy1.MHC.bed bedtools merge -d 1000000 -i HPRCy1.MHC.bed > HPRCy1.MHC.merge.bed Finally, we extract the FASTA file for the regions: .. code-block:: bash bedtools getfasta -fi /lizardfs/erikg/HPRC/year1/parts/chr6.pan+refs.fa -bed HPRCy1.MHC.merge.bed | bgzip -c -@ 16 > HPRCy1.MHC.fa.gz samtools faidx HPRCy1.MHC.fa.gz ------------------------- Pangenome Sequence Naming ------------------------- Contigs already respect the Pangenome Sequence Naming specification (`PanSN-spec `_). .. code-block:: bash cut -f 1 HPRCy1.MHC.fa.gz.fai | head -n 6 .. code-block:: none HG00438#1#JAHBCB010000040.1:22870000-27725000 HG00438#2#JAHBCA010000042.1:22875000-27895000 HG00621#1#JAHBCD010000020.1:22865000-27905000 HG00621#2#JAHBCC010000005.1:28460000-33400000 HG00673#1#JAHBBZ010000030.1:28480000-33310000 HG00673#2#JAHBBY010000018.1:0-2900000 ------------------------- Divergence estimation ------------------------- The human MHC locus presents a high sequence divergence, so we will set the mapping identity (``-p`` parameter) in ``pggb`` to ``95``. ------------------------- Sequence partitioning ------------------------- As the dataset is small, we do not need to partition sequences (see the :ref:`sequence_partitioning` tutorial for more information). ------------------------- Pangenome graph building ------------------------- To build the MHC pangenome graph, execute: .. code-block:: bash pggb -i HPRCy1.MHC.fa.gz -p 95 -n 90 -t 16 -G 13117,13219 -o HPRCy1.MHC.s10k.p95.output ------------------------- Graph statistics ------------------------- To collect basic graph statistics, execute: .. code-block:: bash odgi stats -i HPRCy1.MHC.s10k.p95.output/*.final.gfa -t 16 -S .. code-block:: none #length nodes edges paths 5315371 309186 429323 126 ------------------------- Identify variants with vg ------------------------- To call variants for each contig, execute: .. code-block:: bash vg deconstruct -P chm13 -H '?' -e -a -t 48 HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.39ffa23.e34d4cd.be6be64.smooth.final.gfa | \ bgzip -c -@ 16 > HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.39ffa23.e34d4cd.be6be64.smooth.final.chm13.vcf.gz ``-H '?'`` avoids managing the path name hierarchy when calling variants, then emitting variants for each contig. To filter variants (by using nesting information from the pangenome graph) and realign reference and alternate alleles, execute: .. code-block:: bash vcfbub -l 0 -a 100000 --input HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.39ffa23.e34d4cd.be6be64.smooth.final.chm13.vcf.gz | \ vcfwave -I 1000 -t 48 | bgzip -@ 16 \ > HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.39ffa23.e34d4cd.be6be64.smooth.final.chm13.vcfbub.a100k.wave.vcf.gz Take SNPs from the PGGB VCF file: .. code-block:: bash REF=chm13#chr6:28380000-33300000.fa NAMEREF=chm13 cut -f 1 HPRCy1.MHC.fa.gz.fai | grep chm13 -v | while read CONTIG; do echo $CONTIG bash vcf_preprocess.sh \ HPRCy1.MHC.s10k.p95.output/*.vcfbub.a100k.wave.vcf.gz \ $CONTIG \ 1 \ $REF done ------------------------- Identify variants with nucmer ------------------------- Prepare the reference FASTA file: .. code-block:: bash samtools faidx HPRCy1.MHC.fa.gz chm13#chr6:28380000-33300000 > chm13#chr6:28380000-33300000.fa Align each sequence against the reference with `nucmer <10.1186/gb-2004-5-2-r12>`_: .. code-block:: bash REF=chm13#chr6:28380000-33300000.fa NAMEREF=chm13 mkdir -p nucmer/ cut -f 1 HPRCy1.MHC.fa.gz.fai | grep chm13 -v | while read CONTIG; do echo $CONTIG PREFIX=nucmer/${CONTIG}_vs_${NAMEREF} samtools faidx HPRCy1.MHC.fa.gz $CONTIG > $CONTIG.fa nucmer $REF $CONTIG.fa --prefix "$PREFIX" show-snps -THC "$PREFIX".delta > "$PREFIX".var.txt rm $CONTIG.fa done Using the ``nucmer2vcf.R`` script, generate VCF files for each sequence with respect to the reference with `nucmer <10.1186/gb-2004-5-2-r12>`_: .. code-block:: bash REF=chm13#chr6:28380000-33300000.fa NAMEREF=chm13 NUCMER_VERSION="4.0.0beta2" cut -f 1 HPRCy1.MHC.fa.gz.fai | grep chm13 -v | while read CONTIG; do echo $CONTIG PREFIX=nucmer/${CONTIG}_vs_${NAMEREF} Rscript nucmer2vcf.R "$PREFIX".var.txt $CONTIG $REF $NUCMER_VERSION $PREFIX.vcf bgzip -@ 16 $PREFIX.vcf tabix $PREFIX.vcf.gz done ------------------------- Variants evaluation ------------------------- Prepare the reference in ``SDF`` format for variant evaluation with ``rtg vcfeval``: .. code-block:: bash rtg format -o chm13#chr6:28380000-33300000.sdf chm13#chr6:28380000-33300000.fa Compare nucmer-based SNPs with PGGB-based SNPs: .. code-block:: bash REFSDF=chm13#chr6:28380000-33300000.sdf NAMEREF=chm13 cut -f 1 HPRCy1.MHC.fa.gz.fai | grep chm13 -v | while read CONTIG; do echo $CONTIG PREFIX=nucmer/${CONTIG}_vs_${NAMEREF} PATH_PGGB_VCF=HPRCy1.MHC.s10k.p95.output/HPRCy1.MHC.fa.gz.*.smooth.final.chm13.vcfbub.a100k.wave.${CONTIG}.max1.vcf.gz # Merge regions closer than 1000 bps to define the callable regions where to evaluate the variants dist=1000 rtg vcfeval \ -t $REFSDF \ -b $PREFIX.vcf.gz \ -c $PATH_PGGB_VCF \ -T 16 \ -e <(bedtools intersect -a <(bedtools merge -d $dist -i $PREFIX.vcf.gz ) -b <(bedtools merge -d $dist -i $PATH_PGGB_VCF)) \ -o vcfeval/${CONTIG} done Collect statistics: .. code-block:: bash cd vcfeval (echo contig precision recall f1.score; grep None */*txt | sed 's,/summary.txt:,,' | tr -s ' ' | cut -f 1,7,8,9 -d ' ' ) | tr ' ' '\t' > statistics.tsv Plot statistics: .. code-block:: R require(ggplot2) require(tidyr) stat_df <- read.table('statistics.tsv', sep = '\t', header = T, comment.char = '?') stat_df <- pivot_longer(stat_df,precision:f1.score,"Metric") ggplot(stat_df,aes(x=contig,y=value,fill=contig))+ geom_bar(stat="identity") + facet_wrap(~Metric, ncol = 1)+ theme_bw() + theme( axis.ticks.x = element_blank(), axis.text.x = element_blank() ) + theme(legend.position="none") .. image:: /img/MHC.nucmer_vs_pggb.precision_recall_f1score.png