I'm trying to compare two CSV files (and many more like these below). I tried many ways, using lists, dictreader and more but nothing gave me the output I require. I want to compare all those rows that have same !Sample_title and !Sample_geo_accession values (whose positions vary). I've been struggling with this for three days now and couldn't come to a solution. I highly appreciate any help.
CSV1:
!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk
CSV2:
!Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
AsDW,ribonucleic acid,FB_0,GSM501444
foreign,ribonucleic acid,FB,GSM501449
HJCENV,RNA,12wk,GSM501446
Desired output (with respect to CSV2):
Added:
{!Sample_status:{HJCENV:Public on july 22 2010,AsDW:Public on july 22 2010}} #Added columns, not rows.
Deleted:
{} #Since nothing's deleted with respect to CSV2
Changed:
{!Sample_title:AsDW,!Sample_source_name_ch1:(FB_0_12wk,FB_0),!Sample_geo_accession:GSM501444
!Sample_title:HJCENV,!Sample_type:(ribonucleic acid,RNA),!Sample_source_name_ch1:(FB_50_12wk,12wk),!Sample_geo_accession:GSM501446}
#foreign,ribonucleic acid,FB,GSM501449 doesn't come here since the !Sample_geo_accession column values didn't match.
EDIT:
Here Added dictionary should give any additional columns and their values for each !Sample_title (when !Sample_title and !Sample_geo_accession match in CSV1 and CSV2) that are found in CSV1 (if it has more number of columns than CSV2)
Deleted dictionary does similar thing as Added, except that it looks for deleted columns.
Changed gives the values that are different in both the files and their header.
So basically it should compare apples and apples (when header names match), not apples and oranges (by column position)
index
to find what positions you are looking for in a row ? – kiran.koduru!Sample_title
or!Sample_source_name_chi1
in same dictionary unless you run an iteration to update. And also, this is a good case to use Python's Pandas module. You could import both csv files as data frames, merge by the !Sample_title and output, aggregate, pivot corresponding columns. – Parfait