0
votes

I'm trying to compare two CSV files (and many more like these below). I tried many ways, using lists, dictreader and more but nothing gave me the output I require. I want to compare all those rows that have same !Sample_title and !Sample_geo_accession values (whose positions vary). I've been struggling with this for three days now and couldn't come to a solution. I highly appreciate any help.

CSV1:

!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk

CSV2:

!Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
AsDW,ribonucleic acid,FB_0,GSM501444
foreign,ribonucleic acid,FB,GSM501449
HJCENV,RNA,12wk,GSM501446

Desired output (with respect to CSV2):

Added:

{!Sample_status:{HJCENV:Public on july 22 2010,AsDW:Public on july 22 2010}} #Added columns, not rows.

Deleted:

{} #Since nothing's deleted with respect to CSV2

Changed:

{!Sample_title:AsDW,!Sample_source_name_ch1:(FB_0_12wk,FB_0),!Sample_geo_accession:GSM501444
!Sample_title:HJCENV,!Sample_type:(ribonucleic acid,RNA),!Sample_source_name_ch1:(FB_50_12wk,12wk),!Sample_geo_accession:GSM501446}
#foreign,ribonucleic acid,FB,GSM501449 doesn't come here since the !Sample_geo_accession column values didn't match. 

EDIT:

Here Added dictionary should give any additional columns and their values for each !Sample_title (when !Sample_title and !Sample_geo_accession match in CSV1 and CSV2) that are found in CSV1 (if it has more number of columns than CSV2)

Deleted dictionary does similar thing as Added, except that it looks for deleted columns.

Changed gives the values that are different in both the files and their header.

So basically it should compare apples and apples (when header names match), not apples and oranges (by column position)

1
Can you use the index to find what positions you are looking for in a row ?kiran.koduru
by index, if you mean hardcoding column numbers, no, since they might change from file to file. @kiran.koduruabn
possible duplicate of python compare dict reader elements from two csv filesuser4512135
Please note dictionary keys must be unique. You would not be able to reuse !Sample_title or !Sample_source_name_chi1 in same dictionary unless you run an iteration to update. And also, this is a good case to use Python's Pandas module. You could import both csv files as data frames, merge by the !Sample_title and output, aggregate, pivot corresponding columns.Parfait
@abn: if you don't get an answer, pay attention to and address any requests for clarification; edit your question as many times as needed to improve it. Never ask the same question multiple times; it's a violation, it degrades SO, and people will (rightly) avoid you.smci

1 Answers

1
votes

Your question is still very badly-defined. First we had to decode the question. You said "diff two CSV files", which normally means row-wise diffs, perhaps first with a row-wise reordering by index columns ['!Sample_title','!Sample_geo_accession']

But you actually wanted column-wise diffs. Specifically you want to know which columns were added in csv2, which columns were deleted, and for the common columns, which entries (rows) were changed in csv2. Now, do you want those diffs computed and presented by individual series, or simultaneously across all columns?

Something like the following:

import pandas as pd
pd.options.display.width = 200

df1 = pd.read_csv('1.csv', index_col=['!Sample_title','!Sample_geo_accession'])
df2 = pd.read_csv('2.csv', index_col=['!Sample_title','!Sample_geo_accession'])

cols_common  = (df1.columns & df2.columns).tolist()
cols_added   = (df2.columns - df1.columns).tolist()
cols_deleted = (df1.columns - df2.columns).tolist()

print "\nAdded",   df2.ix[:, cols_added]
print "\nDeleted", df1.ix[:, cols_deleted]
print "\nChanged", df2.ix[:, cols_common]

Output:

Added:
[(AsDW, GSM501444), (foreign, GSM501449), (HJCENV, GSM501446)]

Deleted                                              !Sample_status
!Sample_title !Sample_geo_accession                        
body          GSM501443              Public on july 22 2010
foreign       GSM501445              Public on july 22 2010
HJCENV        GSM501446              Public on july 22 2010
AsDW          GSM501444              Public on july 22 2010

Changed                                          !Sample_type !Sample_source_name_ch1
!Sample_title !Sample_geo_accession                                          
AsDW          GSM501444              ribonucleic acid                    FB_0
foreign       GSM501449              ribonucleic acid                      FB
HJCENV        GSM501446                           RNA                    12wk

It seems like you also need us to reorder the columns so df1, df2 are in the same order. But you haven't told us how we should compare '!Sample_source_name_ch1', because 'FB_0_12wk' != '12wk'.

I'm not going to work on this anymore until you fix the clarity of what you're asking for.