<aside> 💡 For how to do Ampliseq, please see AmpliSeq. If you know Python, please go directly to the “Code” section. Disclaimer: I don’t actually know Python, and coding was never my strong suit. This code relied heavily on ChatGPT 4.0, but since it streamlined a week’s work into 15 minutes and does not generate new data, I’ll give myself a pass on that.

</aside>

Why do I need to “clean” the Ampliseq data?

Good question! The short answer is that while we advocate for standardization of data in the scientific field, we don’t really enforce it in real life. There are gaps between what Illumina gives us and what we need for analysis.

I once went to a workshop that spent the majority of the time emphasizing the importance of having a universal standard for entering metadata. Yeah, it’s THAT bad.

The long answer is that Illumina’s data, in its original organization, is hard to analyze for several reasons: 1) It gives you multiple coverage metrics, most of which we don’t need. We only need the “Coverage” column. 2) Each sample is in its own CSV file, which is fine if you’re only analyzing one sample, but more often than not we are analyzing multiple samples across the board. 3) The “Region” column, which names the gene detected (or not), uses the gene’s original name, which doesn’t tell you which antibiotic class the gene confers resistance to. Imagine googling EVERY gene (there are thousands of them, please don’t try it) to match it to an antibiotic class. Thankfully, we have an Excel sheet that contains all the annotations.

What does the code do?

Process each CSV file in a specified directory, keeping only certain columns and adding a 'Sample' column.
Merge all these processed files into a single DataFrame and save it as 'merged_data.csv'.
Load the 'merged_data.csv' and an annotation CSV file.
Merge these two DataFrames on the column they share, which has a different name in each file.
Reshape the merged DataFrame similar to pivot_wider in R, focusing on the 'Coverage' values.
Save the reshaped DataFrame to a new CSV file.
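
To make the steps above concrete, here is a minimal sketch on toy in-memory data rather than real files. It is not the actual script from the “Code” section. The 'Region', 'Coverage', and 'Sample' column names come from the description above; the 'OtherMetric', 'Gene', and 'Class' column names, the sample names, and all values are made up for illustration.

```python
import pandas as pd


def tidy_sample(df, sample_name):
    """Keep only the columns needed for analysis and tag each row with its sample."""
    out = df[["Region", "Coverage"]].copy()
    out["Sample"] = sample_name
    return out


# Toy stand-ins for two per-sample coverage files (hypothetical values).
raw = {
    "S1": pd.DataFrame({"Region": ["geneX", "geneY"], "Coverage": [120, 0], "OtherMetric": [1.0, 0.0]}),
    "S2": pd.DataFrame({"Region": ["geneX", "geneY"], "Coverage": [45, 300], "OtherMetric": [0.5, 2.0]}),
}

# Stack all samples into one long table.
merged = pd.concat([tidy_sample(df, name) for name, df in raw.items()], ignore_index=True)

# Toy annotation table; its gene column has a different name ('Gene', assumed).
annotation = pd.DataFrame({"Gene": ["geneX", "geneY"], "Class": ["classA", "classB"]})

# Merge on the shared column, which is named differently in each table.
annotated = merged.merge(annotation, left_on="Region", right_on="Gene", how="left")

# Reshape to one row per gene and one column per sample (like pivot_wider in R).
wide = annotated.pivot_table(
    index=["Region", "Class"], columns="Sample", values="Coverage", aggfunc="first"
).reset_index()
wide.columns.name = None
print(wide)
```

With real data, each toy DataFrame would come from reading one coverage CSV, and the final table would be written out with `to_csv`.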

<aside> 💡 At the moment, this protocol has only been tested on macOS. I have not tried it on Windows.

</aside>

Select files

Please open “Terminal” on your Mac and make sure that you have a recent version of Python 3 installed.

Run these commands in Terminal first:

pip install pandas

cd ~/Desktop
# I set my working directory to my Desktop, but you can choose wherever you want.
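
If you are not sure whether the setup worked, a small check like the one below can help. This is only a sketch: the file name `check_setup.py` is a made-up example, and it assumes you run it with the same Python you will use for the rest of the protocol (for example, `python3 check_setup.py`).

```python
import importlib.util
import sys


def environment_report():
    """Collect the Python version and whether pandas can be found."""
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "is_python3": sys.version_info.major >= 3,
        # find_spec returns None when the package is not installed.
        "pandas_installed": importlib.util.find_spec("pandas") is not None,
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `pandas_installed` prints `False`, the install step did not reach this Python installation.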

Open TextEdit, go to “Format” in the menu bar, select “Make Plain Text”, then copy the following code and save it as “select_files.py” (or any other name you like). You need to edit 2 things in the code: the path to the folder where the original Ampliseq data is saved, and the path to a new empty folder that will receive the copied files. Save the .py file in the same directory that you chose in the cd command above.

import os
import glob
import shutil

root_directory = '/Users/astra207/Desktop/20230815_Run224_Ampliseq/Ampliseq_08152023-395566682'  # Replace with your main folder path
destination_directory = '/Users/astra207/Desktop/Ampliseq'  # Replace with your destination folder path

# Create the destination directory if it does not exist
os.makedirs(destination_directory, exist_ok=True)

# Walk through all subdirectories of the root directory
for subdir, dirs, files in os.walk(root_directory):
    coverage_files = glob.glob(os.path.join(subdir, '*.coverage.csv'))

    # Copy each found file to the destination directory
    for file in coverage_files:
        shutil.copy(file, destination_directory)
        print(f"Copied file: {file} to {destination_directory}")

print("All files have been copied to the destination folder.")
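
One thing to watch for: `shutil.copy` overwrites a file in the destination folder if one with the same name is already there. If two subfolders each contain a coverage file with the same name, only the last one copied survives. The sketch below is an optional check, not part of the original script. It lists coverage files and flags repeated names, and it runs on a throwaway folder tree with made-up sample names so that it can be tried safely. To check real data, call `find_coverage_files` on your own root folder instead.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def find_coverage_files(root):
    """Return every '*.coverage.csv' file under root, searching subfolders too."""
    return sorted(Path(root).rglob("*.coverage.csv"))


def find_name_collisions(paths):
    """Group paths by file name and keep only names that occur more than once."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[path.name].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}


# Demo on a temporary folder tree (hypothetical folder and sample names).
with tempfile.TemporaryDirectory() as tmp:
    layout = [
        ("run1", "S1.coverage.csv"),
        ("run2", "S1.coverage.csv"),
        ("run2", "S2.coverage.csv"),
    ]
    for folder, name in layout:
        target = Path(tmp) / folder
        target.mkdir(exist_ok=True)
        (target / name).write_text("Region,Coverage\n")

    found = find_coverage_files(tmp)
    collisions = find_name_collisions(found)
    print(f"Found {len(found)} coverage files")
    print(f"Names that would be overwritten: {sorted(collisions)}")
```

If the list of overwritten names is empty for your data, the number of files in the destination folder should match the number of coverage files found.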