This example shows how to create reference-based assemblies of viral transcriptomes.

  1. Quality-check and trim the transcriptome reads:

    1. fastqc (for example, fastqc -o qc_reports R1.fq.gz R2.fq.gz; the qc_reports directory must already exist)
    2. trimmomatic (see viral sequence detection)
  2. Download the reference genomes (assembly IDs listed in a text file called acc_list.txt) and unzip the output from NCBI

    1. Jingmen tick virus: GCF_000919875.1
    2. Alongshan virus: GCA_027256855.1

    # download the assemblies listed in acc_list.txt, then unpack the archive
    programs/ncbi_datasets/datasets download genome accession --inputfile acc_list.txt --include gff3,genome,gbff
    unzip ncbi_dataset.zip
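
The download step reads its accessions from acc_list.txt. A minimal sketch of creating that file for the two assemblies listed above, assuming the one-accession-per-line layout that the --inputfile option is documented to read:

```shell
# acc_list.txt: one assembly accession per line (assumed layout)
printf '%s\n' GCF_000919875.1 GCA_027256855.1 > acc_list.txt

# show the file before passing it to the download command
cat acc_list.txt
```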
    
  3. Index the reference genomes (repeat for each reference)

    # load hisat2
    source /programs/HISAT2/hisat2.sh
    
    #index reference
    hisat2-build GCA_027256855.1_ASM2725685v1_genomic.fna GCA_027256855.1_index
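
Since indexing is repeated for each reference, a small loop can save some typing. The sketch below only prints one hisat2-build command per genome so the commands can be reviewed first. The helper name index_commands is hypothetical, and the ncbi_dataset/data/<accession>/ layout is an assumption about where the unzipped FASTA files sit; pass a different directory if the references were moved.

```shell
# index_commands (hypothetical helper): print one hisat2-build command
# for every *_genomic.fna found one level below the given directory.
index_commands() {
    ref_dir=$1
    for fna in "$ref_dir"/*/*_genomic.fna; do
        [ -e "$fna" ] || continue          # glob matched nothing
        acc_dir=$(dirname "$fna")
        acc=$(basename "$acc_dir")         # directory name = assembly accession
        echo "hisat2-build $fna $acc_dir/${acc}_index"
    done
}

# review the commands first (assumed location of the unzipped genomes)
index_commands ncbi_dataset/data

# then run them, once hisat2 is loaded:
# index_commands ncbi_dataset/data | sh
```

Printing the commands instead of running them directly keeps the loop safe to try before hisat2 is loaded.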
    
  4. Map reads to reference

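     # -S writes the alignments to a SAM file (without it, hisat2 prints them to stdout)
     # optional: --dta makes hisat2 report alignments tailored for transcript assemblers such as stringtie
     hisat2 -p 4 -x /path/to/index -1 /path/to/R1 -2 /path/to/R2 -S alns.sam

     # for example,
     hisat2 -p 4 -x ./references/GCA_027256855.1/GCA_027256855.1_index -1 ../HaeL_reads/CM7_2022_93_R1_paired.fastq -2 ../HaeL_reads/CM7_2022_93_R2_paired.fastq -S alns.sam

     # sort the alignments by coordinate and write them as BAM
     samtools sort -o alnst.sorted.bam alns.sam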
    
  5. Assemble transcripts using the aligned reads

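    mkdir assembly

    # convert the reference GFF annotation to GTF
    programs/bin/cufflinks/gffread ../../references/GCA_027256855.1/GCA_027256855.1.gff -T -o ../../references/GCA_027256855.1/genomic.gtf

    # assemble transcripts; the input must be the coordinate-sorted BAM from step 4
    stringtie mapped_reads/aligned_reads.bam -o stringtie_out/transcripts.gtf -p 4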
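
A quick sanity check after assembly is to count how many transcripts were written. The sketch below counts GTF rows whose third column is "transcript". The helper name count_transcripts is hypothetical, and the path is the output path used in the stringtie command above.

```shell
# count_transcripts (hypothetical helper): count "transcript" features
# in a GTF file, skipping comment lines that start with '#'.
count_transcripts() {
    awk -F '\t' '!/^#/ && $3 == "transcript" { n++ } END { print n + 0 }' "$1"
}

# report the count if the assembly output exists
if [ -f stringtie_out/transcripts.gtf ]; then
    count_transcripts stringtie_out/transcripts.gtf
fi
```

A count of 0 usually means no reads aligned to that reference, so the mapping step is worth checking first.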