Create a list of accessions. For example, I could save the following accessions to a file called accession_list.txt, which looks like:
GCA_020413405.1
GCA_029095565.1
GCA_020587255.1
Access the HPC (via ssh, VNC, Putty)
Use the NCBI datasets tool to download assemblies (saved to accession_list.txt)
/programs/ncbi_datasets/datasets download genome accession --inputfile accession_list.txt --include gff3,genome
Here we’re using the --include gff3,genome flag to downloading annotation (gff3) and DNA sequences in fasta format (genome)
You should have downloaded a directory called ncbi_dataset.zip. To access this directory, unzip it:
unzip ncbi_dataset.zip
Once unzipped you should see a directory called “data” which contains separate directories for each accession ID which contains fasta files and gffs. Note gffs are often downloaded without unique IDs (for example, genomic.gff). We will change the file names and move files to format for panaroo in the next step
Format input files for panaroo
Enter the ncbi_dataset directory
cd ncbi_dataset
make a new directory for fasta files
mkdir genomes
find fasta files and move them to the new directoy you made:
find . -name "*genomic.fna" -exec cp {} ./genomes \\;
I like to rename fasta files because I think the default NCBI names are too long. You can do this using a script I wrote. Note: this will only work for names that are in the following format: Accession_otherID_genomic.fna (for example: GCA_020587255.1_PDT001158129.2_genomic.fna). If your files are formatted differently, skip this step
cd genomes
ls *.fna > fasta_name_list_long.txt
/workdir/kc649/scripts/rename_fastas.sh fasta_name_list_long.txt
Next, we will rename gffs so they each have unique names:
# first move back to the ncbi_datasets/data directory
cd ../data
/workdir/kc649/scripts/copy_and_rename_gff.sh ../../accession_list.txt ../genomes/
nNow we can use a script already created by panaroo to reformat input files:
cd ../genomes/
/workdir/kc649/scripts/run_convert_refseq_to_prokka.sh ../../accession_list.txt
If run successfully, you will have new _prokka.gff files
Running panaroo
export PATH=/programs/panaroo-1.5.0/bin:$PATH
export PYTHONPATH=/programs/panaroo-1.5.0/lib/python3.9/site-packages:/programs/panaroo-1.5.0/lib64/python3.9/site-packages
panaroo -i *prokka.gff -o panaroo_results --clean-mode strict --remove-invalid-genes -t 15
Creating a core genome alignment using MAFFT through panaroo
# go to the directory where your panaroo results are stored
cd panaroo_results
# run mafft via the panaroo-msa function
panaroo-msa --aligner mafft -a core --core_threshold 0.95 -t 10 --verbose -o .
Run gblocks to remove poorly aligned positions
/programs/Gblocks_0.91b/Gblocks core_gene_alignment_filtered.aln -t=d
Using IQTREE to create a core phylogeny on the gblocks filtered alignment
/programs/iqtree-2.2.2.6-Linux/bin/iqtree2 -s core_gene_alignment_filtered.aln-gb -m TEST -T 4 -B
Scoary (and scoary2) are useful to identify genes associated with a trait. Scoary2 is a faster implementation of the original scoary software
https://github.com/MrTomRod/scoary-2
format input data:
traits:
| Trait | trait-1 | trait-2 | trait-3 |
|---|---|---|---|
| isolate-1 | 1 | 1 | 0 |
| isolate-2 | 1 | 0 | 1 |
| isolate-3 | 0 | 0 | 0 |
gene presence absence matrix:
| Gene | isolate-1 | isolate-2 | isolate-3 |
|---|---|---|---|
| gene-1 | 0 | 1 | 1 |
| gene-2 | 1 | 1 | 0 |
| gene-3 | 1 | 2 | 0 |
load environment for scoary2:
module load python/3.10.6-r9
export PYTHONPATH=/programs/scoary-2/lib/python3.10/site-packages:/programs/scoary-2/lib64/python3.10/site-packages
export PATH=/programs/scoary-2/bin:$PATH
run scoary2
scoary2 \\
--genes Gene_presence_absence.csv \\
--gene-data-type 'gene-count:,' \\
--traits Tetracycline_resistance.csv \\
--outdir out \\
--n-permut 1000 \\
--newicktree core_tree.nwk