I've built a Python solution for SARIMAX (and time-series in general) grid search.
It's a Python class.
After preparing the training and testing sets, the class stores them as object attributes.
The class then builds a list in which each item is one set of parameters for statsmodels SARIMAX.
Each of those items is passed to the class's sarimax method, which fits a model. Every fitted model is stored in a list for later selection based on the user-selected scoring method.
The sarimax method, defined within the class, accesses the training set through the object attribute (self.df_train).
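
To make the setup concrete, here is a minimal sketch of that structure (only my_sarimax and df_train match my real code; the class name, build_param_grid, test_size, and the AIC-based score are placeholder choices for illustration):

import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

class SarimaxGridSearch:  # hypothetical name
    def __init__(self, series, test_size=12):
        # split into training and testing sets, stored as object attributes
        self.df_train = series[:-test_size]
        self.df_test = series[-test_size:]

    def build_param_grid(self, p, d, q, P, D, Q, s):
        # one list item per (order, seasonal_order) combination to try
        return [((pi, di, qi), (Pi, Di, Qi, s))
                for pi, di, qi, Pi, Di, Qi
                in itertools.product(p, d, q, P, D, Q)]

    def my_sarimax(self, params):
        # fit one candidate model on the training set held in self.df_train
        order, seasonal_order = params
        model = SARIMAX(self.df_train, order=order,
                        seasonal_order=seasonal_order)
        fitted = model.fit(disp=False)
        return (params, fitted.aic)  # score used for later model selection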
To train every set of parameters in parallel, I'm calling Spark as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sca = spark.sparkContext
rdd = sca.parallelize(list_of_parameters)  # one task per parameter set
all_models = rdd.map(self.my_sarimax).collect()  # fit on the workers
This worked perfectly for a monthly time series starting in 2016. However, if I feed it a longer series, let's say starting in 2014, the Spark job simply won't start: it sits in the 'starting' state for an eternity and then fails.
The questions are:
1 - Since I'm running everything inside the class, is Spark able to understand how to distribute this task?
2 - Can each node (worker) on the cluster easily find the object self.df_train when needed? If not, why does it work for shorter series? I mean, the thing is a beauty: on average, it takes 10 seconds to train more than 9,300 candidate models.
3 - How can I make it work with longer time series?
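
In case it helps frame question 2: one variant I've been wondering about (untested sketch; fit_one and bc_train are hypothetical names) is to broadcast the training set explicitly instead of letting it ride along inside self when the bound method is serialized:

from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_one(params, train):
    # standalone function: no reference to self, only to its arguments
    order, seasonal_order = params
    fitted = SARIMAX(train, order=order,
                     seasonal_order=seasonal_order).fit(disp=False)
    return (params, fitted.aic)

bc_train = sca.broadcast(self.df_train)  # ship the training set to workers once
all_models = rdd.map(lambda p: fit_one(p, bc_train.value)).collect()

Would something along these lines be the right direction, or is the hang with longer series caused by something else entirely?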