Pangenome workflow

Create a list of accessions. For example, I could save the following accessions to a file called accession_list.txt, which looks like:
```
GCA_020413405.1
GCA_029095565.1	
GCA_020587255.1	
```
Access the HPC (via ssh, VNC, Putty)
Use the NCBI datasets tool to download assemblies (saved to accession_list.txt)
```
/programs/ncbi_datasets/datasets download genome accession --inputfile accession_list.txt --include gff3,genome
```
1. Here we’re using the --include gff3,genome flag to downloading annotation (gff3) and DNA sequences in fasta format (genome)
2. You should have downloaded a directory called ncbi_dataset.zip. To access this directory, unzip it:
```
unzip ncbi_dataset.zip 
```
3. Once unzipped you should see a directory called “data” which contains separate directories for each accession ID which contains fasta files and gffs. Note gffs are often downloaded without unique IDs (for example, genomic.gff). We will change the file names and move files to format for panaroo in the next step
Format input files for panaroo
1. Enter the ncbi_dataset directory
```
cd ncbi_dataset
```
2. make a new directory for fasta files
```
mkdir genomes
```
3. find fasta files and move them to the new directoy you made:
```
find . -name "*genomic.fna" -exec cp {} ./genomes \\;
```
4. I like to rename fasta files because I think the default NCBI names are too long. You can do this using a script I wrote. Note: this will only work for names that are in the following format: Accession_otherID_genomic.fna (for example: GCA_020587255.1_PDT001158129.2_genomic.fna). If your files are formatted differently, skip this step
```
cd genomes
ls *.fna > fasta_name_list_long.txt
/workdir/kc649/scripts/rename_fastas.sh fasta_name_list_long.txt
```
5. Next, we will rename gffs so they each have unique names:
```
# first move back to the ncbi_datasets/data directory
cd ../data
/workdir/kc649/scripts/copy_and_rename_gff.sh ../../accession_list.txt ../genomes/
```
6. nNow we can use a script already created by panaroo to reformat input files:
```
cd ../genomes/
/workdir/kc649/scripts/run_convert_refseq_to_prokka.sh ../../accession_list.txt 
```
7. If run successfully, you will have new _prokka.gff files
Running panaroo
```
 
export PATH=/programs/panaroo-1.5.0/bin:$PATH
export PYTHONPATH=/programs/panaroo-1.5.0/lib/python3.9/site-packages:/programs/panaroo-1.5.0/lib64/python3.9/site-packages
panaroo -i *prokka.gff -o panaroo_results --clean-mode strict --remove-invalid-genes -t 15
```
1. This will run panaroo on all files that end with prokka.gff, and will use the strict clean mode and removes invalid genes. the -t 15 option indicates 15 threads. You can read about more options here: https://gthlab.au/panaroo/#/gettingstarted/quickstart
Creating a core genome alignment using MAFFT through panaroo
```
# go to the directory where your panaroo results are stored
cd panaroo_results

# run mafft via the panaroo-msa function
panaroo-msa --aligner mafft -a core --core_threshold 0.95 -t 10 --verbose -o . 
```
1. this will output a core genome alignment using a core threshold of 95% (meaning a core gene is defined as a gene present in at least 95% of all genomes) and ten threads.

Run gblocks to remove poorly aligned positions

/programs/Gblocks_0.91b/Gblocks core_gene_alignment_filtered.aln -t=d

Using IQTREE to create a core phylogeny on the gblocks filtered alignment
```
/programs/iqtree-2.2.2.6-Linux/bin/iqtree2 -s core_gene_alignment_filtered.aln-gb -m TEST -T 4 -B 
```
1. this will automatically select the best substitution model (-m TEST) and will run an ultrafast bootstrap step to test for node reliability (-B option)

Pangenome association analysis - scoary

Scoary (and scoary2) are useful to identify genes associated with a trait. Scoary2 is a faster implementation of the original scoary software

https://github.com/MrTomRod/scoary-2

format input data:
- traits:
  
  Trait trait-1 trait-2 trait-3
  
  isolate-1 1 1 0
  
  isolate-2 1 0 1
  
  isolate-3 0 0 0
- gene presence absence matrix:
  
  Gene isolate-1 isolate-2 isolate-3
  
  gene-1 0 1 1
  
  gene-2 1 1 0
  
  gene-3 1 2 0

Trait	trait-1	trait-2	trait-3
isolate-1	1	1	0
isolate-2	1	0	1
isolate-3	0	0	0

Gene	isolate-1	isolate-2	isolate-3
gene-1	0	1	1
gene-2	1	1	0
gene-3	1	2	0

load environment for scoary2:

module load python/3.10.6-r9

export PYTHONPATH=/programs/scoary-2/lib/python3.10/site-packages:/programs/scoary-2/lib64/python3.10/site-packages

export PATH=/programs/scoary-2/bin:$PATH

run scoary2

required inputs are gene presence/absence matrix and traits. A tree is optional

scoary2 \\
    --genes Gene_presence_absence.csv \\
    --gene-data-type 'gene-count:,' \\
    --traits Tetracycline_resistance.csv \\
    --outdir out \\
    --n-permut 1000 \\
    --newicktree core_tree.nwk