
I've built a Python solution for SARIMAX (and time series in general) grid search.

It's a Python class.

After preparing training & testing sets, the class stores them as object attributes.

Later, the class builds a list in which each item is a set of parameters for statsmodels SARIMAX.

Then each of those items is passed to the class's sarimax method, which fits a model. Each fitted model is stored in a list for later selection based on the user-selected scoring method.

The sarimax method, built within the class, accesses the training set through the object attribute (self.df_train).
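
A stripped-down sketch of the idea (the real class builds a much larger grid and supports several scoring methods; apart from my_sarimax and df_train, the names and parameter ranges here are just illustrative):

import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

class SarimaxGridSearch:
    def __init__(self, df_train, df_test):
        # training and testing sets stored as object attributes
        self.df_train = df_train
        self.df_test = df_test

    def build_param_grid(self):
        # one dict per candidate (p, d, q)(P, D, Q, s) combination
        p = d = q = range(0, 2)
        seasonal = [(P, D, Q, 12) for P, D, Q in itertools.product(p, d, q)]
        return [{"order": order, "seasonal_order": s}
                for order in itertools.product(p, d, q)
                for s in seasonal]

    def my_sarimax(self, params):
        # fit one candidate model on the training set held in self.df_train
        model = SARIMAX(self.df_train,
                        order=params["order"],
                        seasonal_order=params["seasonal_order"])
        return model.fit(disp=False)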

To train every set of parameters in parallel, I'm calling Spark as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sca = spark.sparkContext

# one element per candidate parameter set
rdd = sca.parallelize(list_of_parameters)
all_models = rdd.map(self.my_sarimax).collect()

It was perfect for a monthly time series starting in 2016. However, if I try to feed it a longer time series, let's say starting in 2014, the Spark job simply won't start: it takes an eternity 'starting' and then fails.

The questions are:

1 - As I'm running everything inside the class, is Spark able to understand how to distribute this task?

2 - Can each node (worker) on the cluster easily find the object self.df_train when needed? If not, why does it work for the shorter time series? I mean, the thing is a beauty: on average, it takes 10 seconds to train more than 9,300 candidate models.

3 - How do I make it work with a longer time series?

1 Answer

  1. Is Spark able to understand how to distribute this task?

    • Yes. Each Spark worker operates on the JVM, but when you distribute a Python function to the workers (my_sarimax in your case), each worker opens a separate Python process to run your code.
    • I don't see your complete code snippet, but based on my understanding of the question, you are preparing an RDD of candidate parameters, shipping the model object and training dataset to the partitions, and then fitting all parameter sets in parallel.
  2. Can each node (worker) on the cluster easily find the object self.df_train when needed? If not, why does it work for the shorter series?

    • If you ship the class instance to all partitions (which is what mapping self.my_sarimax does), the instance, including self.df_train, will live on each worker node.
    • But depending on the training data, that instance might be too big and take a long time to serialize and deserialize, so the program can't get going.
    • Your application either fails with an OOM error, or the data transfer takes so long that the workers miss their heartbeats and get killed (which might explain why your approach works fine with the smaller dataset). One workaround is sketched below.
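
A common way around this (and a possible answer to your question 3) is to broadcast the training data once with sc.broadcast and map a plain function instead of the bound method, so only the small parameter dicts travel with each task. This is only a rough sketch under that assumption: fit_one and bc_train are illustrative names, and it returns just the parameters and AIC rather than the whole fitted model, to keep the collect() small.

from pyspark.sql import SparkSession
from statsmodels.tsa.statespace.sarimax import SARIMAX

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# ship the training set to the executors once, not once per task
bc_train = sc.broadcast(df_train)

def fit_one(params):
    # runs on the worker: read the broadcast copy of the training data
    train = bc_train.value
    model = SARIMAX(train,
                    order=params["order"],
                    seasonal_order=params["seasonal_order"])
    result = model.fit(disp=False)
    # return a small record (params + score) instead of the fitted model
    return params, result.aic

scores = sc.parallelize(list_of_parameters).map(fit_one).collect()
best_params, best_aic = min(scores, key=lambda x: x[1])

If you want to keep your class structure, you could also make the fitting method a staticmethod that takes the broadcast variable as an argument, so mapping it no longer drags self (and self.df_train) along with every task.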