I have a data frame that looks like the following
I was wondering if there is a fast way to create a Python dict with pandas that would hold data like the following:
table = {2: [4, 5, 6, 7, 8 ...], 4: [1, 2, 3, 4, ...]}
Here the keys are user ids and the values are unique lists of dates.
This can be done easily in core Python, but I was wondering whether there is a pandas- or numpy-based method that computes it quickly. I need a solution that scales well as this data frame grows.
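For concreteness, this is the kind of plain-Python baseline I mean. The sample frame below is hypothetical, standing in for the one above, and assumes a MultiIndex whose first two levels are user_id and date:

import pandas as pd

# hypothetical stand-in for the frame above: a MultiIndex
# of (user_id, date) pairs, possibly with duplicates
df = pd.DataFrame(
    {'value': range(6)},
    index=pd.MultiIndex.from_tuples(
        [(2, 4), (2, 5), (2, 5), (4, 1), (4, 2), (4, 3)],
        names=['user_id', 'date'],
    ),
)

# one pass over the index tuples, keeping each user's
# dates unique and in first-seen order
table = {}
for user_id, date in df.index:
    dates = table.setdefault(user_id, [])
    if date not in dates:
        dates.append(date)
print(table)  # {2: [4, 5], 4: [1, 2, 3]}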
Edit 1: Performance
Time taken: 14.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
import pandas as pd

# df is the frame from the question; levels 0 and 1 of its
# MultiIndex hold the user id and the date
levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})
table = (levels.drop_duplicates()  # unique (user_id, date) pairs
               .groupby(0)[1]      # group dates (column 1) by user id (column 0)
               .apply(list)
               .to_dict())
print(table)
Time taken: 17.4 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(res.reset_index().drop_duplicates(['user_id', 'date'])
    .groupby('user_id')['date'].apply(list).to_dict())
Time taken: 294 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
from itertools import groupby
from operator import itemgetter

a = {k: list(pd.unique(list(zip(*g))[1]))
     for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print(a)
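One caveat with the itertools.groupby approach: groupby only merges adjacent equal keys, so it assumes the index is already sorted by the first level. A tiny illustration with hypothetical pairs:

from itertools import groupby
from operator import itemgetter

pairs = [(2, 'a'), (4, 'b'), (2, 'c')]  # first element not sorted
out = {k: [v for _, v in g] for k, g in groupby(pairs, itemgetter(0))}
print(out)  # {2: ['c'], 4: ['b']}: the first group for key 2 is silently overwritten

pairs.sort(key=itemgetter(0))  # sorting first gives the intended grouping
out = {k: [v for _, v in g] for k, g in groupby(pairs, itemgetter(0))}
print(out)  # {2: ['a', 'c'], 4: ['b']}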
Time taken: 15 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(pd.Series(res.index.get_level_values(1), index=res.index.get_level_values(0))
   .groupby(level=0).apply(set).to_dict())
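Note that this last variant produces set values rather than the lists asked for; if lists are required, the conversion is cheap. table_sets below is a hypothetical name for the dict it returns:

table_sets = {2: {4, 5, 6}, 4: {1, 2, 3}}  # e.g. the output of the line above
table = {k: list(v) for k, v in table_sets.items()}
print(table)  # {2: [4, 5, 6], 4: [1, 2, 3]} (set iteration order is arbitrary)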
Edit 2: Benchmarking again
Wrong result: idx.levels returns the unique values of each level independently, so zipping them does not pair each user id with its own dates.
from collections import defaultdict

idx = df.index.droplevel(-1).drop_duplicates()
l1, l2 = idx.levels  # unique values per level, not row-aligned pairs
mapping = defaultdict(list)
for i, j in zip(l1, l2):
    mapping[i].append(j)
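For what it's worth, the pairing bug can be avoided by iterating the deduplicated index itself rather than its levels. A sketch, assuming a hypothetical three-level index whose trailing level is what droplevel(-1) discards:

from collections import defaultdict
import pandas as pd

# hypothetical frame with a (user_id, date, extra) MultiIndex
df = pd.DataFrame(
    {'value': range(4)},
    index=pd.MultiIndex.from_tuples(
        [(2, 4, 'x'), (2, 5, 'y'), (4, 1, 'x'), (4, 1, 'y')],
        names=['user_id', 'date', 'extra'],
    ),
)

idx = df.index.droplevel(-1).drop_duplicates()  # unique (user_id, date) pairs
mapping = defaultdict(list)
for user_id, date in idx:  # iterate the pairs themselves, not idx.levels
    mapping[user_id].append(date)
print(dict(mapping))  # {2: [4, 5], 4: [1]}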
Improved timing: 14.6 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = {k: list(set(list(zip(*g))[1]))
     for k, g in groupby(res.index.values.tolist(), itemgetter(0))}
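For reference, the "mean ± std. dev. of 7 runs" figures above are IPython %timeit output; something similar can be reproduced with the standard library, assuming the snippet is wrapped in a function and res is the frame being measured:

import timeit
from itertools import groupby
from operator import itemgetter

def build_table(res):
    # hypothetical wrapper around the snippet above
    return {k: list(set(list(zip(*g))[1]))
            for k, g in groupby(res.index.values.tolist(), itemgetter(0))}

times = timeit.repeat(lambda: build_table(res), repeat=7, number=100)
print(f'{min(times) / 100 * 1e3:.2f} ms per loop (best of 7 runs, 100 loops each)')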
Brad Solomon's solutions and the first of mine: I am really curious whether they are faster or not. Thank you. – jezrael