1. Create a list of accessions. For example, I could save the following accessions to a file called accession_list.txt, which looks like:

    GCA_020413405.1
    GCA_029095565.1	
    GCA_020587255.1	
    
    
  2. Access the HPC (via ssh, VNC, Putty)

  3. Use the NCBI datasets tool to download assemblies (saved to accession_list.txt)

    /programs/ncbi_datasets/datasets download genome accession --inputfile accession_list.txt --include gff3,genome
    
    1. Here we’re using the --include gff3,genome flag to downloading annotation (gff3) and DNA sequences in fasta format (genome)

    2. You should have downloaded a directory called ncbi_dataset.zip. To access this directory, unzip it:

      unzip ncbi_dataset.zip 
      
    3. Once unzipped you should see a directory called “data” which contains separate directories for each accession ID which contains fasta files and gffs. Note gffs are often downloaded without unique IDs (for example, genomic.gff). We will change the file names and move files to format for panaroo in the next step

  4. Format input files for panaroo

    1. Enter the ncbi_dataset directory

      cd ncbi_dataset
      
    2. make a new directory for fasta files

      mkdir genomes
      
    3. find fasta files and move them to the new directoy you made:

      find . -name "*genomic.fna" -exec cp {} ./genomes \\;
      
    4. I like to rename fasta files because I think the default NCBI names are too long. You can do this using a script I wrote. Note: this will only work for names that are in the following format: Accession_otherID_genomic.fna (for example: GCA_020587255.1_PDT001158129.2_genomic.fna). If your files are formatted differently, skip this step

      cd genomes
      ls *.fna > fasta_name_list_long.txt
      /workdir/kc649/scripts/rename_fastas.sh fasta_name_list_long.txt
      
    5. Next, we will rename gffs so they each have unique names:

      # first move back to the ncbi_datasets/data directory
      cd ../data
      /workdir/kc649/scripts/copy_and_rename_gff.sh ../../accession_list.txt ../genomes/
      
    6. nNow we can use a script already created by panaroo to reformat input files:

      cd ../genomes/
      /workdir/kc649/scripts/run_convert_refseq_to_prokka.sh ../../accession_list.txt 
      
    7. If run successfully, you will have new _prokka.gff files

  5. Running panaroo

     
    export PATH=/programs/panaroo-1.5.0/bin:$PATH
    export PYTHONPATH=/programs/panaroo-1.5.0/lib/python3.9/site-packages:/programs/panaroo-1.5.0/lib64/python3.9/site-packages
    panaroo -i *prokka.gff -o panaroo_results --clean-mode strict --remove-invalid-genes -t 15
    
    1. This will run panaroo on all files that end with prokka.gff, and will use the strict clean mode and removes invalid genes. the -t 15 option indicates 15 threads. You can read about more options here: https://gthlab.au/panaroo/#/gettingstarted/quickstart
  6. Creating a core genome alignment using MAFFT through panaroo

    # go to the directory where your panaroo results are stored
    cd panaroo_results
    
    # run mafft via the panaroo-msa function
    panaroo-msa --aligner mafft -a core --core_threshold 0.95 -t 10 --verbose -o . 
    
    1. this will output a core genome alignment using a core threshold of 95% (meaning a core gene is defined as a gene present in at least 95% of all genomes) and ten threads.
  7. Run gblocks to remove poorly aligned positions

    /programs/Gblocks_0.91b/Gblocks core_gene_alignment_filtered.aln -t=d
    
  8. Using IQTREE to create a core phylogeny on the gblocks filtered alignment

    /programs/iqtree-2.2.2.6-Linux/bin/iqtree2 -s core_gene_alignment_filtered.aln-gb -m TEST -T 4 -B 
    
    
    1. this will automatically select the best substitution model (-m TEST) and will run an ultrafast bootstrap step to test for node reliability (-B option)

Pangenome association analysis - scoary

Scoary (and scoary2) are useful to identify genes associated with a trait. Scoary2 is a faster implementation of the original scoary software

https://github.com/MrTomRod/scoary-2