7
votes

I have a data frame that looks like the following

enter image description here

I was wondering if there exist a fastest way to create a python dict in pandas that would hold data like following

table = {2: [4, 5, 6, 7, 8 ...], 4: [1, 2, 3, 4, ...]}

Here the keys are users ids and the values are uniques list of dates.

This can be done early in core python but was wondering if there is a pandas or numpy based method to compute this fast. I needed a fast solution that scales well when this data frame grows bigger.

Edit 1: Performances

Time taken: 14.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})

table = levels.drop_duplicates()\
              .groupby(0)[1].apply(list)\
              .to_dict()

print(table)

Time Taken: 17.4 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

res.reset_index().drop_duplicates(['user_id','date']).groupby('user_id')['date'].apply(list).to_dict()

Time Taken: 294 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

a = {k: list(pd.unique(list(zip(*g))[1])) 
     for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print (a)

Time Taken: 15 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

pd.Series(res.index.get_level_values(1), index=res.index.get_level_values(0)).groupby(level=0).apply(set).to_dict()

Edit 2: Benchmarking again

Wrong Result

idx = df.index.droplevel(-1).drop_duplicates()
l1, l2 = idx.levels
mapping = defaultdict(list)
for i, j in zip(l1, l2):
    mapping[i].append(j)

Improved Timing: 14.6 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

a = {k: list(set(list(zip(*g))[1])) 
     for k, g in groupby(res.index.values.tolist(), itemgetter(0))}
3
Is possible add timings for pure python solutions?jezrael
I think Brad Solomon solutions and first of mine, I am really curious if faster or not. thank you.jezrael
@jezrael Updated in the question.Mayukh Sarkar

3 Answers

4
votes

Here's one solution using drop_duplicates + groupby.

levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})

table = levels.drop_duplicates()\
              .groupby(0)[1].apply(list)\
              .to_dict()

print(table)

{1: [2, 3], 2: [8, 9]}

Setup

df = pd.DataFrame([[1, 2, 0, 3], [1, 2, 1, 4], [1, 3, 1, 5],
                   [2, 8, 1, 3], [2, 8, 1, 4], [2, 9, 2, 5]],
                  columns=['col1', 'col2', 'col3', 'col4'])

df = df.set_index(['col1', 'col2', 'col3'])

print(df)

                col4
col1 col2 col3      
1    2    0        3
          1        4
     3    1        5
2    8    1        3
          1        4
     9    2        5
4
votes

Data from Jz

pd.Series(df.index.get_level_values(0),index=df.index.get_level_values(1)).groupby(level=0).apply(set).to_dict()
Out[92]: {4: {'a', 'b'}, 5: {'a', 'b'}}

If you just need list , you can add apply(list) PS : Personally do not think this step is needed

pd.Series(df.index.get_level_values(0),index=df.index.get_level_values(1)).groupby(level=0).apply(set).apply(list).to_dict()
Out[93]: {4: ['b', 'a'], 5: ['b', 'a']}
2
votes

I think if need better performance, use itertools.groupby with unique for return lists in same ordering as original data. If order is not important use set:

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')}).set_index(['F','B', 'A'])

print (df)
       C  D  E
F B A         
a 4 a  7  1  5
  5 b  8  3  3
  4 c  9  5  6
b 5 d  4  7  9
    e  2  1  2
  4 f  3  0  4

from  itertools import groupby
from operator import itemgetter

a = {k: list(set(list(zip(*g))[1])) 
     for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print (a)
{'a': [4, 5], 'b': [5, 4]}

Another pandas solution:

d = df.reset_index().drop_duplicates(['F','B']).groupby('F')['B'].apply(list).to_dict()
print (d)
{'a': [4, 5], 'b': [5, 4]}