Calculating sum of a combination of columns in pandas, row-wise, with output file with the name of said combination

Question

I am looking for a way of generating a csv file for a specific combination of data from columns in a dataframe.

My data looks like this (except with 200 more rows)

+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|            Species            | OGT |  Domain  |       A       |      C       |      D       |      E       |      F       |      G       |      H       |      I       |      K       |       L       |      M       |      N       |      P       |      Q       |      R       |      S       |      T       |      V       |      W       |      Y       |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix              |  95 | Archaea  |  9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 |   2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 |  6.266067034 | 4.2052190807 | 9.2692433532 |  1.318690698 | 3.5614200159 |
| Argobacterium fabrum          |  26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 |  9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans |  27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 |  2.940143349 | 2.3473650439 |  10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus              |  85 | Bacteria |  5.8730327277 |  0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus        |  83 | Archaea  |  7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 |  4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 |  9.4826333048 | 2.6014466253 |  3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+

What I want is to generate a csv containing Species, OGT, and the row-wise sum of the percentages for a chosen combination of the other columns, say A, C, E and G.

The output would look something like this (these sums are just made up):

ACEG.csv

             Species              OGT   Sum of percentage  
 ------------------------------- ----- ------------------- 
  Aeropyrum pernix                 95             23.4353  
  Anaeromyxobacter dehalogenans    26             20.3232  
  Argobacterium fabrum             27             14.2312  
  Aquifex aeolicus                 85             15.0403  
  Archaeoglobus fulgidus           83             34.0532

The aim is to do this for each of the 10 million combinations of the columns (A-Y), but I figure that's a simple for loop. I initially tried to achieve this in R, but on reflection pandas in Python is probably a better bet.

While there are many ways to do what you ask, I can think of few applications that actually require considering all combinations. You may get better recommendations if you say what you are planning to do with the data. — hilberts_drinking_problem
I plan on correlating the different combinations of these amino acid percentages and seeing which combination of AAs are the best indicators of optimal growth temperature in prokaryotic organisms, whilst accounting for other signals too. An early study was done a decade ago, but I don't believe they took into account other factors! — Biomage
So you are trying to pick a combination that maximizes some objective? If you can define that objective mathematically in a way that a layperson can understand, I bet you'd get vastly more efficient solutions. — hilberts_drinking_problem

N. P. · Accepted Answer · 2018-06-07T12:48:19

Something like this?

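def subset_to_csv(cols):
    # cols is a string of single-letter column names, e.g. 'ACEG'
    # Copy the identifying columns so the original frame is not modified
    out = your_data[['Species', 'OGT']].copy()
    # Row-wise sum of the selected columns; list('ACEG') -> ['A', 'C', 'E', 'G']
    out['Sum of percentage'] = your_data[list(cols)].sum(axis=1)
    # Name the file after the combination, e.g. ACEG.csv, without the index column
    out.to_csv(cols + '.csv', index=False)

for c in your_list_of_combinations:
    subset_to_csv(c)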

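Here cols is a string naming the columns you want to sum, e.g. 'ABC'. This works because every column name is a single character, so list(cols) splits the string into the individual column labels.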
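The list of combinations itself could be built with itertools.combinations from the standard library. The following is only a sketch under assumptions: the helper name combination_sums and the tiny frame with made-up values are hypothetical, and it yields the frames in memory instead of writing files so that it is easy to check.

```python
from itertools import combinations

import pandas as pd


def combination_sums(data, value_cols, size):
    """Yield (name, frame) for every combination of `size` columns.

    Hypothetical helper: `name` is the joined column labels (e.g. 'AC'),
    `frame` holds Species, OGT and the row-wise sum of those columns.
    """
    base = data[["Species", "OGT"]]
    for combo in combinations(value_cols, size):
        name = "".join(combo)
        # assign() returns a new frame, so `base` is never modified
        frame = base.assign(**{"Sum of percentage": data[list(combo)].sum(axis=1)})
        yield name, frame


# Made-up stand-in for the real data, for illustration only
your_data = pd.DataFrame({
    "Species": ["sp1", "sp2"],
    "OGT": [95, 26],
    "A": [1.0, 2.0],
    "C": [0.5, 0.25],
    "E": [3.0, 4.0],
})

results = dict(combination_sums(your_data, ["A", "C", "E"], 2))
```

To produce the files, loop over the generator and call frame.to_csv(name + '.csv', index=False) for each pair. Repeating this for every size from 1 to the number of columns covers all combinations, though that means a very large number of files, so correlating each sum against OGT in memory may be more practical than writing them all out.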

