0
votes

I am looking for a way of generating a csv file for a specific combination of data from columns in a dataframe.

My data looks like this (except with about 200 more rows):

+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|            Species            | OGT |  Domain  |       A       |      C       |      D       |      E       |      F       |      G       |      H       |      I       |      K       |       L       |      M       |      N       |      P       |      Q       |      R       |      S       |      T       |      V       |      W       |      Y       |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix              |  95 | Archaea  |  9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 |   2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 |  6.266067034 | 4.2052190807 | 9.2692433532 |  1.318690698 | 3.5614200159 |
| Argobacterium fabrum          |  26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 |  9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans |  27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 |  2.940143349 | 2.3473650439 |  10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus              |  85 | Bacteria |  5.8730327277 |  0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus        |  83 | Archaea  |  7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 |  4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 |  9.4826333048 | 2.6014466253 |  3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+

What I want is a way to generate a CSV containing Species, OGT, and the sum of the percentages for a chosen combination of the other columns, say A, C, E and G.

So output looking something like this: (these sums are just made up)

ACEG.csv

             Species              OGT   Sum of percentage  
 ------------------------------- ----- ------------------- 
  Aeropyrum pernix                 95             23.4353  
  Anaeromyxobacter dehalogenans    27             20.3232  
  Argobacterium fabrum             26             14.2312  
  Aquifex aeolicus                 85             15.0403  
  Archaeoglobus fulgidus           83             34.0532  

The aim is to do this for each of the 10 million combinations of the columns (A-Y), but I figure that's a simple for loop. I initially tried to do this in R, but on reflection pandas in Python is probably a better bet.

3
Could you clarify what you mean by sum of percentages? - P.Tillmann
Just adding them together! - Biomage
While there are many ways to do what you ask, I can think of few applications that actually require considering all combinations. You may get better recommendations if you say what you are planning to do with the data. - hilberts_drinking_problem
I plan on correlating the different combinations of these amino acid percentages and seeing which combination of AAs are the best indicators of optimal growth temperature in prokaryotic organisms, whilst accounting for other signals too. An early study was done a decade ago, but I don't believe they took into account other factors! - Biomage
So you are trying to pick a combination that maximizes some objective? If you can define that objective mathematically in a way that a layperson can understand, I bet you'd get vastly more efficient solutions. - hilberts_drinking_problem

3 Answers

2
votes

Something like this?

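def subset_to_csv(cols):
    # cols is a string of single-letter column names, e.g. 'ACEG'
    out = df.copy()
    out['Sum of percentage'] = your_data[list(cols)].sum(axis=1)
    out.to_csv(cols + '.csv', index=False)

# copy so that df is an independent frame, not a view of your_data
df = your_data[['Species', 'OGT']].copy()

for c in your_list_of_combinations:
    subset_to_csv(c)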

Here cols is a string naming the columns you want to sum, e.g. 'ACEG'.
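
For reference, here is a self-contained sketch of the same idea on a tiny made-up frame. The helper name `subset_sum` and all the values are hypothetical; the helper returns the frame instead of writing it, so saving the file is left to the caller.

```python
import pandas as pd

# Made-up stand-in for the real table (values are illustrative only).
your_data = pd.DataFrame({
    "Species": ["sp1", "sp2"],
    "OGT": [95, 26],
    "A": [9.5, 11.5],
    "C": [0.5, 1.0],
    "E": [7.5, 6.0],
    "G": [9.0, 8.5],
})

def subset_sum(data, cols):
    """Return Species, OGT and the row-wise sum of the columns named in cols."""
    out = data[["Species", "OGT"]].copy()
    out["Sum of percentage"] = data[list(cols)].sum(axis=1)
    return out

aceg = subset_sum(your_data, "ACEG")
# aceg.to_csv("ACEG.csv", index=False) would write the file.
```

Returning the frame keeps the summing step separate from file output, which makes it easy to check a few combinations before writing anything to disk.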

1
votes

Here is what you could try:

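from itertools import product
from string import ascii_uppercase
import pandas as pd

# keep only the letters that are actual columns (there is no B, J, O, U, X or Z)
letters = [c for c in ascii_uppercase if c in df.columns]
combinations = [''.join(i) for i in product(letters, repeat=4)]

for combination in combinations:
    # copy so the new column is not written to a view of df
    new_df = df[['Species', 'OGT']].copy()
    new_df['Sum of percentage'] = df[list(combination)].sum(axis=1)
    new_df.to_csv(combination + '.csv')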

====

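Edit, following Yakym Pirozhenko's comment: the combinations should be built with itertools.combinations instead, which avoids repeats such as 'AAAA' and reorderings of the same letters such as 'ACEG' and 'GECA':

import itertools
letters = [c for c in ascii_uppercase if c in df.columns]
combinations = [''.join(i) for i in itertools.combinations(letters, r=4)]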
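
If the aim is every subset of the 20 amino-acid columns rather than only those of size 4, a lazy generator avoids building one huge list in memory. This is a minimal sketch: `all_combinations` is a hypothetical helper name, and the letters are the column headers from the question's table.

```python
from itertools import chain, combinations

# The 20 single-letter columns from the question's table.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def all_combinations(letters, min_size=1):
    """Lazily yield every subset of letters as a string, smallest first."""
    sizes = range(min_size, len(letters) + 1)
    return ("".join(c) for c in chain.from_iterable(
        combinations(letters, r) for r in sizes))

small = list(all_combinations("ACEG"))
total = sum(1 for _ in all_combinations(AMINO_ACIDS))
```

With 20 columns, `total` is 2**20 - 1 = 1,048,575 subsets, so one CSV per subset means about a million files. Computing each sum in memory may be more practical than writing them all out.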
1
votes

Not an answer to the original question, but this might be useful given the discussion.

The goal is to find a combination of columns such that the column sum has the maximal correlation with OGT. This can be easy because covariance is bilinear:

  • cov(OGT, A+B) = cov(OGT, A) + cov(OGT, B).
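
The identity can be checked numerically. The sketch below uses random made-up data, and the helper `cov` is a hypothetical wrapper around `np.cov`.

```python
import numpy as np

# Three made-up samples of length 50 standing in for OGT and two columns.
rng = np.random.default_rng(0)
ogt, a, b = rng.normal(size=(3, 50))

def cov(x, y):
    """Sample covariance of two 1-D arrays."""
    return np.cov(x, y)[0, 1]

lhs = cov(ogt, a + b)
rhs = cov(ogt, a) + cov(ogt, b)
print(np.isclose(lhs, rhs))
```

The two sides agree up to floating-point rounding whatever the data, because the identity does not depend on A and B being independent.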

I am relying on three simplifying assumptions:

  1. Factors A, B, C, etc. are independent.
  2. Species are weighted equally.
  3. The variance of each factor is 1.

The idea:

  1. Normalize all columns to have unit variance (i.e. assumption 3).
  2. Compute covariances of OGT with each column.
  3. Sort factors A, B, C in order of decreasing covariance. An optimal combination will occur as a prefix of this arrangement.
  4. Which prefix should we choose? The one where sum over standard deviation is the greatest. Because of the normalization in step 1, each standard deviation of the sum of each prefix is just sqrt(n) for a prefix of size n. It remains to find a maximal index in a series, which is easy.
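
Why a prefix is enough can be spelled out as follows. This is a sketch under the three assumptions above; the symbols $X_i$ (the normalized columns), $c_i = \operatorname{cov}(\mathrm{OGT}, X_i)$, $P$ (a set of columns) and $n = |P|$ are notation introduced here, not taken from the original answer.

```latex
\[
\operatorname{Var}\Big(\sum_{i \in P} X_i\Big)
  = \sum_{i \in P} \operatorname{Var}(X_i) = n,
\qquad
\operatorname{corr}\Big(\mathrm{OGT}, \sum_{i \in P} X_i\Big)
  = \frac{\sum_{i \in P} \operatorname{cov}(\mathrm{OGT}, X_i)}
         {\sigma_{\mathrm{OGT}} \sqrt{n}}
  = \frac{1}{\sigma_{\mathrm{OGT}}} \cdot \frac{\sum_{i \in P} c_i}{\sqrt{n}} .
\]
```

For a fixed size n the numerator is largest when P holds the n columns with the largest c_i, so only the prefixes of the sorted list need to be compared.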

This may be a tad bit faster than checking all possible combinations.


import pandas as pd
import numpy as np

# set up fake data
import string

df = pd.DataFrame(np.random.rand(3, 26), columns=list(string.ascii_uppercase))

df["species"] = ["dog", "cat", "human"]
df["OGT"] = np.random.randint(0, 100, 3)
df = df.set_index("species")

# actual work
alpha_cols = list(string.ascii_uppercase)
# normalize standard deviations of each column
df = df[alpha_cols + ["OGT"]].div(df.std(0), axis=1)
# compute correlations (= covariances) of OGT with each column
corrs = df.corrwith(df.OGT).sort_values(ascending=False)
del corrs["OGT"]

# sort covariances in order from the greatest to the smallest
# compute cumulative sums
# divide by standard deviation of a group (i.e. sqrt(n) at index n-1)
cutoff = (corrs.cumsum() / np.sqrt(np.arange(corrs.shape[0]) + 1)).idxmax()
answer = sorted(corrs.loc[:cutoff].index.values)
print(answer)

# e.g.
# ['B', 'I', 'K', 'O', 'Q', 'S', 'U', 'V', 'Y']