Escherichia coli
Author: Andrea Guarracino
Synopsis
Escherichia coli (E. coli) is a gram-negative bacillus that is part of the normal intestinal flora but can also cause intestinal and extraintestinal illness in humans (Gill et al., 2006). Here we study E. coli genomic diversity by analyzing a pangenome graph built from 2224 assemblies.
Steps
Download the assemblies
Assuming that your current working directory is the root of the pggb repository, execute the following to download 2224 assemblies of E. coli:
mkdir -p assemblies/e_coli
cd assemblies/e_coli
cat ../../docs/data/ecoli.urls | parallel -j 4 'wget -q {} && echo got {}'
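Because wget runs quietly here, an interrupted download can go unnoticed. The following is a minimal sketch of a post-download check, not part of the original workflow: the helper name list_corrupt_downloads is hypothetical, and it assumes the downloaded files end in .fna.gz, as the renaming step expects.

```shell
# Hypothetical helper (not part of pggb): print every .fna.gz file in the
# given directory (default: current directory) that fails gzip's integrity test.
list_corrupt_downloads() {
    dir="${1:-.}"
    for f in "$dir"/*.fna.gz; do
        [ -e "$f" ] || continue              # no matching files: skip the literal glob
        gzip -t "$f" 2>/dev/null || echo "$f"
    done
}

# Usage, from inside assemblies/e_coli:
#   ls *.fna.gz | wc -l       # number of files downloaded so far
#   list_corrupt_downloads    # prints nothing if every archive is intact
```

If all downloads succeeded, the file count should match the 2224 assemblies listed in ecoli.urls; any file printed by the helper can be downloaded again.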
Pangenome Sequence Naming
To change the sequence names according to PanSN-spec, we use fastix:
for f in *.fna.gz; do
    # The sample name is the first two '_'-separated fields of the file name
    sample_name=$(echo "$f" | cut -f 1,2 -d '_')
    echo "${sample_name}"
    # 'cut -f 1' to trim the headers
    fastix -p "${sample_name}#1#" <(zcat "$f" | cut -f 1) | bgzip -@ 4 -c > "${sample_name}.fa.gz"
done
We set haplotype_id to 1 for all the assemblies. Most bacteria, including E. coli, carry a single copy of their one chromosome and are therefore considered haploid.
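As a sanity check on the renaming, the headers can be tested against the PanSN pattern sample#haplotype#contig, with the haplotype fixed to 1. This is a minimal sketch under stated assumptions: the function name count_bad_pansn_headers is hypothetical, and the pattern assumes that neither sample nor contig names contain '#' or whitespace.

```shell
# Hypothetical helper (not part of pggb or fastix): read FASTA from stdin and
# print how many headers do NOT look like ">sample#1#contig".
count_bad_pansn_headers() {
    # grep -c exits non-zero when the count is 0, hence the '|| true'
    grep '^>' | grep -E -v -c '^>[^#[:space:]]+#1#[^#[:space:]]+$' || true
}

# Small demonstration with one well-formed and one malformed header
printf '>GCA_000005845.2#1#U00096.3\nACGT\n>bad_header extra\nACGT\n' \
    | count_bad_pansn_headers

# Usage on the renamed assemblies:
#   zcat *.fa.gz | count_bad_pansn_headers
```

A result of 0 on the renamed files means every sequence name carries the expected prefix. A non-zero count would suggest headers that were not trimmed or prefixed as intended.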