For anyone who's looking to apply tqdm to their own custom parallel pandas-apply code.

(I tried some of the parallelization libraries over the years, but I never found a 100% solution, mainly for the apply function, and I always had to come back to my "manual" code.)
df_multi_core - this is the one you call. It accepts:
- Your df object
- The name of the df method you'd like to call (as a string, e.g. 'apply')
- The subset of columns the method should run on (helps reduce time / memory)
- The number of jobs to run in parallel (-1 or omit for all cores)
- Any other kwargs the df's method accepts (like "axis"); see the example call right after this list
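For example, a minimal call (a sketch; my_f stands in for any per-row function you'd normally pass to apply):

```python
# Parallel version of df[['c1']].apply(my_f, axis=1): 'func' and 'axis' are
# forwarded to each chunk's 'apply' method. my_f is a placeholder.
res = df_multi_core(df=df, df_f_name='apply', subset=['c1'], njobs=-1,
                    func=my_f, axis=1)
```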
_df_split - this is an internal helper function that has to be defined at the top level of the running module (Pool.map can only pickle functions it can import by name); otherwise I'd define it inside df_multi_core.
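A quick sketch of why (standard pickle behavior, nothing specific to this code): Pool.map pickles the callable and arguments it sends to the workers, and pickle can only serialize a function it can import by name from module level.

```python
import multiprocessing

def top_level(x):
    # Picklable: workers can look it up as <module>.top_level.
    return x * 2

def demo():
    def nested(x):  # local function: pickle can't import it by name
        return x * 2
    pool = multiprocessing.Pool(2)
    print(pool.map(top_level, [1, 2, 3]))  # works: [2, 4, 6]
    # pool.map(nested, [1, 2, 3])  # fails: "Can't pickle local object 'demo.<locals>.nested'"
    pool.close()
    pool.join()

if __name__ == '__main__':
    demo()
```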
Here's the code from my gist (I'll add more pandas function tests there):
```python
import multiprocessing
from functools import partial

import numpy as np
import pandas as pd


def _df_split(tup_arg, **kwargs):
    # Unpack (chunk index, df chunk, method name) and run the method on the chunk.
    split_ind, df_split, df_f_name = tup_arg
    return (split_ind, getattr(df_split, df_f_name)(**kwargs))


def df_multi_core(df, df_f_name, subset=None, njobs=-1, **kwargs):
    if njobs == -1:
        njobs = multiprocessing.cpu_count()
    # Use only the requested columns if given (saves time and memory).
    if subset is not None:
        df = df[subset]
    splits = np.array_split(df, njobs)
    pool_data = [(split_ind, df_split, df_f_name)
                 for split_ind, df_split in enumerate(splits)]
    pool = multiprocessing.Pool(processes=njobs)
    try:
        results = pool.map(partial(_df_split, **kwargs), pool_data)
    finally:
        pool.close()
        pool.join()
    # Reorder the chunks before concatenating (pool.map keeps order, but be safe).
    results = sorted(results, key=lambda x: x[0])
    return pd.concat([split[1] for split in results])
```
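One detail worth noting: np.array_split (unlike np.split) accepts a row count that doesn't divide evenly by njobs, so no chunk boundaries need to be computed by hand. A quick illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': range(10)})
for chunk in np.array_split(df, 4):  # 10 rows over 4 jobs -> sizes 3, 3, 2, 2
    print(len(chunk), list(chunk.index))
```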
Below is test code for a parallelized apply with tqdm's "progress_apply".
```python
from time import time

from tqdm import tqdm

tqdm.pandas()

if __name__ == '__main__':
    sep = '-' * 50

    # tqdm progress_apply test
    def apply_f(row):
        return row['c1'] + 0.1

    N = 1000000
    np.random.seed(0)
    df = pd.DataFrame({'c1': np.arange(N), 'c2': np.arange(N)})
    print('testing pandas apply on {}\n{}'.format(df.shape, sep))

    t1 = time()
    res = df.progress_apply(apply_f, axis=1)
    t2 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for native implementation {}\n{}'.format(round(t2 - t1, 2), sep))

    t3 = time()
    # res = df_multi_core(df=df, df_f_name='apply', subset=['c1'], njobs=-1, func=apply_f, axis=1)
    res = df_multi_core(df=df, df_f_name='progress_apply', subset=['c1'],
                        njobs=-1, func=apply_f, axis=1)
    t4 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for multi core implementation {}\n{}'.format(round(t4 - t3, 2), sep))
```
In the output you can see one progress bar for the run without parallelization, and per-core progress bars for the parallelized run.
There is a slight hiccup, and sometimes the rest of the cores' bars appear at once, but even then I think it's useful since you get progress stats per core (it/sec and total records, for example).
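If the worker bars ever fail to show up at all (e.g. when multiprocessing uses the "spawn" start method instead of fork), one option is to make sure tqdm's pandas integration is registered inside each worker via the pool's initializer. A minimal sketch (the _init_worker name is mine, not part of the code above):

```python
import multiprocessing
from tqdm import tqdm

def _init_worker():
    # tqdm.pandas() monkey-patches pandas per process; calling it in the
    # initializer guarantees progress_apply exists in every worker.
    tqdm.pandas()

pool = multiprocessing.Pool(processes=4, initializer=_init_worker)
```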

Thank you @abcdaa for this great library!