0
votes

I've currently split my data into CSV files that need all rows of the column "sequence" combined into one string.

Each CSV looks something like this:

1773.csv
ID      Order    Sequence
1773     1        'AAGG'
1773     2        'TTGG'
1773     3        'GGAA'

1775.csv
ID      Order    Sequence
1775     1         'GGTT'
1775     2         'AAGT'
1775     3         'TGAA'

1331.csv
ID      Order    Sequence
1331     1         'CCGT'
1331     2         'CATT'
1331     3         'GTTA'

I need each CSV to merge each sequence row into one value like this:

ID       Sequence
1773   'AAGGTTGGGGAA'

Then make a master CSV of all the combined sequences from each CSV file.

Something like this:

ID        Sequence
1773    'AAGGTTGGGGAA'
1775    'GGTTAAGTTGAA'
1331    'CCGTCATTGTTA'

I wouldn't worry too much about the order column since the rows are already in order. +Each CSV in the folder is just it's ID

I've found this but it seems to combine all data from all csv files into a single cell/value:

def return_contents(file_name):
    with open(file_name) as infile:
        reader = csv.reader(infile)
        return list(reader)

all_files = os.listdir('C:\\Users\\CAAVR\\Desktop\\res_csv')

combined_output = []

for file in all_files:
    data = return_contents('C:\\Users\\CAAVR\\Desktop\\res_csv\\{}'.format(file))
    for row in data:
        combined_output.extend(row)

with open('csv_out.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(combined_output)

Thanks ahead of time and let me know if you need more info.

2
Well, that details what you need but says nothing of what you tried and the issues you faced. It also doesn't give a minimal reproducible example because "GGTT" isn't in the input sequences - roganjosh
I've added some code that I tried. I'm not 100% on your second comment though. If you are referring to the master csv that would be another combined sequence - Scott Valentine
Thank you for the edit. The second part of my comment is that examples should be self-consistent. How could you expect 'GGTTAAGTTGAA' from what's listed in the first code block? "GGTT" happened to be a sequence that stood out, but there are other examples. - roganjosh
I've added the inputs. - Scott Valentine

2 Answers

0
votes

I would use pandas for this task

pandas.read_csv("input.dat").groupby("id")['Sequence'].agg('sum').to_csv('output.dat')

I think would do what you ask

0
votes

I've found this but it seems to combine all data into a single row:

[...]
combined_output.extend(row)

extend adds all items from row individually to combined_output, losing the information that they once belonged together and creating a single flat list.

Instead, you should use append. It adds row to combined_output while keeping the items of row in their own list. The result is a list of lists, where each sublist represents one row in the output CSV.

See What is the difference between Python's list methods append and extend?.