3 votes

I'm writing code using pandas 0.18/Python 3.5 on an Intel i3 (four cores).

I have read this: https://www.continuum.io/content/pandas-releasing-gil

Some of my work is I/O bound (parsing CSV files into dataframes), and I also have to do a lot of calculation, mostly multiplying dataframes.

My code currently runs these jobs in parallel using concurrent.futures.ThreadPoolExecutor.
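Roughly, the current setup looks like this (the file names and the multiply step are placeholders for my actual workload):

    from concurrent.futures import ThreadPoolExecutor

    import pandas as pd

    CSV_FILES = ["data_a.csv", "data_b.csv", "data_c.csv"]  # placeholder file names

    def load_and_multiply(path):
        # Parsing the CSV is the IO-bound part; the multiplication is the CPU-bound part.
        df = pd.read_csv(path)
        return df * df  # stand-in for the real dataframe multiplication

    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(load_and_multiply, CSV_FILES))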

My question is:

  • In general, should I be using threads to run pandas jobs in parallel, or does pandas make effective use of all cores without my having to tell it to explicitly (in which case I would just run my jobs serially)?

1 Answer

3 votes

Best I can tell from reading the docs, pandas simply releases the GIL for certain operations:

We are releasing the global-interpreter-lock (GIL) on some cython operations. This will allow other threads to run simultaneously during computation, potentially allowing performance improvements from multi-threading. Notably groupby, nsmallest, value_counts and some indexing operations benefit from this.

All this means is that other threads can be executed by the Python interpreter while the calculations being done by pandas continue. It doesn't mean that pandas automatically scales the calculations across many threads. They sort of mention this in the docs as well:

Releasing of the GIL could benefit an application that uses threads for user interactions (e.g. QT), or performing multi-threaded computations.

To get any parallelization benefit, you need to actually create and run multiple threads in your own code. So you should keep using the ThreadPoolExecutor if you want parallel execution in your application.
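For example, here is a sketch of running one of the GIL-releasing operations (groupby) across a thread pool; the frames are just synthetic stand-ins for your parsed CSVs:

    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    import pandas as pd

    # Synthetic stand-ins for the frames you parsed from CSV.
    frames = [
        pd.DataFrame({"key": np.random.randint(0, 100, 1000000),
                      "value": np.random.randn(1000000)})
        for _ in range(4)
    ]

    def summarise(df):
        # groupby is one of the operations the docs list as releasing the GIL,
        # so several of these calls can make progress at the same time.
        return df.groupby("key")["value"].sum()

    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(summarise, frames))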

Keep in mind that pandas only releases the GIL for some operations, so you may not see any performance improvement from multiple threads if you're not calling methods that actually release it.
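A quick way to check whether your particular operations benefit is to time the serial and threaded versions of the same work, something along these lines (the dot product here is just a stand-in for your dataframe multiplication):

    import time
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    import pandas as pd

    frames = [pd.DataFrame(np.random.randn(2000, 2000)) for _ in range(4)]

    def work(df):
        return df.dot(df)  # stand-in for the dataframe multiplication in your jobs

    start = time.time()
    serial = [work(df) for df in frames]
    print("serial:   %.2fs" % (time.time() - start))

    start = time.time()
    with ThreadPoolExecutor(max_workers=4) as executor:
        threaded = list(executor.map(work, frames))
    print("threaded: %.2fs" % (time.time() - start))

If the threaded version isn't noticeably faster, the methods you're calling probably aren't among the ones that release the GIL.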