Pandas or Dask dataframe, fill in values based on missing grouping variable combinations

Question

Whether this is a Dask or a Pandas dataframe may not matter here, except that Dask has no multiindex support. I have a Dask dataframe like the following (built here with Pandas for illustration):

import pandas as pd

dd = pd.DataFrame({
    'name': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2'],
    'key1': ['A',  'A',  'B',  'B',  'A' , 'A',  'B' ],
    'key2': ['C',  'D',  'C',  'D',  'C',  'D',  'C' ],
    'val1': [0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7 ],
    'val2': [0.9,  0.8,  0.7,  0.6,  0.5,  0.4,  0.3 ],
})
print(dd)

  name key1 key2  val1  val2
0   a1    A    C   0.1   0.9
1   a1    A    D   0.2   0.8
2   a1    B    C   0.3   0.7
3   a1    B    D   0.4   0.6
4   a2    A    C   0.5   0.5
5   a2    A    D   0.6   0.4
6   a2    B    C   0.7   0.3

For 'name' = 'a2', the combination 'key1' = 'B' and 'key2' = 'D' is missing. How would I add a new row for it, with 'val1' and 'val2' set to NaN or some other value, without using a multiindex (which Dask doesn't support)? I'm also interested in a Pandas solution.

Note that this is only an example; the real data has multiple missing key combinations that would all need to be filled in.

The expected output would be:

  name key1 key2  val1  val2
0   a1    A    C   0.1   0.9
1   a1    A    D   0.2   0.8
2   a1    B    C   0.3   0.7
3   a1    B    D   0.4   0.6
4   a2    A    C   0.5   0.5
5   a2    A    D   0.6   0.4
6   a2    B    C   0.7   0.3
7   a2    B    D   NaN   NaN

I had the exact same problem. I called dd.compute() and then handled it the same way we would in Pandas — Rajnish kumar
@sammywemmy, is it? What if the dataframe doesn't fit into memory? — bill_e

Kate Kate · Accepted Answer · 2020-04-24T19:44:54

You could create a new data frame containing every key combination you want, then left-merge the original data frame onto it.

from itertools import product

fixed_keys = product(['a1', 'a2'], ['A', 'B'], ['C', 'D'])
key_frame = pd.DataFrame(fixed_keys, columns=['name', 'key1', 'key2'])

new_frame = pd.merge(key_frame, dd, on=['name', 'key1', 'key2'], how='left')
print(new_frame)

  name key1 key2  val1  val2
0   a1    A    C   0.1   0.9
1   a1    A    D   0.2   0.8
2   a1    B    C   0.3   0.7
3   a1    B    D   0.4   0.6
4   a2    A    C   0.5   0.5
5   a2    A    D   0.6   0.4
6   a2    B    C   0.7   0.3
7   a2    B    D   NaN   NaN
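The key values above are typed in by hand. As a minimal sketch, assuming every value that should appear in the grid occurs at least once in the data, the combinations could instead be derived from the columns themselves. The names key_cols, levels, missing and filled are illustrative, and the fill value 0.0 is an arbitrary placeholder for "some other value":

```python
import pandas as pd
from itertools import product

dd = pd.DataFrame({
    'name': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2'],
    'key1': ['A',  'A',  'B',  'B',  'A',  'A',  'B'],
    'key2': ['C',  'D',  'C',  'D',  'C',  'D',  'C'],
    'val1': [0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7],
    'val2': [0.9,  0.8,  0.7,  0.6,  0.5,  0.4,  0.3],
})

key_cols = ['name', 'key1', 'key2']

# Assumption: each key's full set of values can be read off the data.
levels = [sorted(dd[col].unique()) for col in key_cols]
key_frame = pd.DataFrame(list(product(*levels)), columns=key_cols)

# indicator=True adds a '_merge' column marking rows found only in key_frame.
new_frame = key_frame.merge(dd, on=key_cols, how='left', indicator=True)
missing = new_frame.loc[new_frame['_merge'] == 'left_only', key_cols]
new_frame = new_frame.drop(columns='_merge')

# Optional: replace the NaN placeholders with another value (0.0 is arbitrary).
filled = new_frame.fillna({'val1': 0.0, 'val2': 0.0})

print(missing)
print(filled)
```

Note that fillna would also overwrite any NaN values that were already present in the original data, so the missing frame is the safer way to tell which rows were added.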

If key_frame is too big to build in one go, you could instead group by the key with the most unique values and, within each group, merge against a smaller key frame built from the remaining keys.

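fixed_keys_sub = product(['A', 'B'], ['C', 'D'])
key_frame_sub = pd.DataFrame(fixed_keys_sub, columns=['key1', 'key2'])

def func(sub):
    # Left-merge one group's rows onto the full key1/key2 grid
    sub = pd.merge(key_frame_sub, sub, on=['key1', 'key2'], how='left')
    # 'name' is NaN on the filled-in rows; drop it if present and restore it from the group index
    sub = sub.drop(columns='name', errors='ignore')
    return sub

# Move the group label back into a column and discard the inner index level
dd.groupby('name').apply(func).reset_index(level='name').reset_index(drop=True)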
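The per-group idea can also be written as an explicit loop, which avoids relying on how groupby.apply treats the grouping column. This is only a sketch in plain Pandas, assuming the key1/key2 values can be read off the data; grid, pieces and result are illustrative names, and whether it carries over unchanged to a Dask dataframe is not verified here:

```python
import pandas as pd
from itertools import product

dd = pd.DataFrame({
    'name': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2'],
    'key1': ['A',  'A',  'B',  'B',  'A',  'A',  'B'],
    'key2': ['C',  'D',  'C',  'D',  'C',  'D',  'C'],
    'val1': [0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7],
    'val2': [0.9,  0.8,  0.7,  0.6,  0.5,  0.4,  0.3],
})

key_cols = ['key1', 'key2']

# Small grid of key1/key2 combinations, shared by every group.
grid = pd.DataFrame(
    list(product(*(sorted(dd[col].unique()) for col in key_cols))),
    columns=key_cols,
)

pieces = []
for name, sub in dd.groupby('name'):
    # Left-merge this group's rows onto the grid; absent combinations become NaN.
    filled = grid.merge(sub.drop(columns='name'), on=key_cols, how='left')
    # Re-attach the group label as the first column.
    filled.insert(0, 'name', name)
    pieces.append(filled)

result = pd.concat(pieces, ignore_index=True)
print(result)
```

Only one group and the small grid are merged at a time, so the full three-way key frame is never built.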
