2 votes

I'm trying to make my Julia code faster using parallelization. The code has nested serial for-loops and performs value function iteration (as described in http://www.parallelecon.com/vfi/).

The following link shows the serial and parallelized versions of the code I wrote: https://github.com/minsuc/MyProject/blob/master/VFI_parallel.ipynb (you can also find the functions defined in DefinitionPara.jl on the GitHub page). The serial code is defined in main() and the parallel code in main_paral().

The third for-loop in main() is the step where I find the maximizer given (nCapital, nProductivity). As suggested in the official parallel documentation, I distribute the work over the nCapital grid, which consists of many points.

When I do @time for the serial and the parallel code, I get

Serial: 0.001041 seconds

Parallel: 0.004515 seconds

My questions are as follows:

1) I added two workers, and they work for 0.000714 seconds and 0.000640 seconds respectively, as you can see in the notebook. Is the parallel code slower because of the overhead cost?

2) I increased the number of grid points by changing

    vGridCapital = collect(0.5*capitalSteadyState:0.000001:1.5*capitalSteadyState)

Even though each worker does a significant amount of work, the serial code is way faster than the parallel code. When I add more workers, the serial code is still faster. I think something is wrong, but I haven't been able to figure out what. Could it be related to the fact that I pass too many arguments to the parallelized function

final_shared(mValueFunctionNew, mPolicyFunction, pparams, vGridCapital, mOutput, expectedValueFunction)?
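For context, the parallel step has roughly this shape. This is a heavily simplified sketch, with the actual maximization replaced by placeholders and pparams/expectedValueFunction unused; the real version is in the notebook above (newer Julia uses @distributed where older versions used @parallel):

    using Distributed
    addprocs(2)
    @everywhere using SharedArrays

    # Heavily simplified sketch of the parallel step; the real maximization
    # over vGridCapital (using pparams, mOutput and expectedValueFunction)
    # is replaced by placeholders here.
    function final_shared(mValueFunctionNew, mPolicyFunction, pparams,
                          vGridCapital, mOutput, expectedValueFunction)
        @sync @distributed for nCapital in 1:length(vGridCapital)
            for nProductivity in 1:size(mOutput, 2)
                # ... find the maximizer given (nCapital, nProductivity) ...
                mValueFunctionNew[nCapital, nProductivity] = 0.0  # placeholder
                mPolicyFunction[nCapital, nProductivity] = 0.0    # placeholder
            end
        end
    end

    # Tiny driver with dummy inputs, just so the sketch runs end to end.
    nK, nP = 100, 5
    vGridCapital = collect(range(0.5, 1.5, length = nK))
    mOutput = ones(nK, nP)
    mValueFunctionNew = SharedArray{Float64}(nK, nP)
    mPolicyFunction = SharedArray{Float64}(nK, nP)
    final_shared(mValueFunctionNew, mPolicyFunction, nothing, vGridCapital,
                 mOutput, nothing)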

I would really appreciate your comments and suggestions!

There could also be a hardware limitation: if you have only one processor, your parallel code is actually still running serially, with extra overhead. – TheoretiCAL

I have four cores (8 logical cores). I suspect that using SharedArray is making the code slower than the serial code. Could that be the case? – Minsu Chang

1 Answer

2 votes

If the amount of work between synchronizations is really small, the task synchronization overhead may dominate. Remember that a common OS time-slicing quantum is 10 ms, and you are measuring in the 1 ms range, so with a bit of load, a 4 ms latency to get all workers synced is perfectly reasonable.
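A quick way to see that overhead in isolation is to time a distributed loop whose per-iteration work is trivial. This is a hypothetical micro-benchmark, not code from your notebook:

    using Distributed
    addprocs(2)                     # two workers, as in your setup
    @everywhere using SharedArrays

    function serial_fill!(out)
        for i in 1:length(out)
            out[i] = sqrt(i)        # trivial per-element work
        end
    end

    function parallel_fill!(out)
        # @sync blocks until every worker has finished its chunk; when each
        # iteration is this cheap, scheduling the chunks and waiting for the
        # sync costs far more than the computation itself.
        @sync @distributed for i in 1:length(out)
            out[i] = sqrt(i)
        end
    end

    out = SharedArray{Float64}(10_000)
    @time serial_fill!(out)
    @time parallel_fill!(out)       # typically slower: overhead dominates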

If all tasks access the same shared data structure and that structure is thread safe, the overhead of locking it on each access may well be the culprit, even with longer parallel tasks.

In some cases it may be possible to use non-thread-safe shared arrays for both input and output, but then you must ensure that the workers don't clobber each other's results.
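A Julia SharedArray is exactly such a non-thread-safe shared array. A minimal sketch (hypothetical names, not your final_shared) that avoids clobbering by giving each worker its own disjoint block of capital-grid indices could look like this:

    using Distributed
    addprocs(2)
    @everywhere using SharedArrays

    # Each call writes only to rows in its own krange, so unsynchronized
    # writes from different workers can never hit the same element.
    @everywhere function update_block!(vNew, krange, nProductivity)
        for iK in krange, iP in 1:nProductivity
            vNew[iK, iP] = iK + iP          # placeholder for the real update
        end
    end

    nCapital, nProductivity = 1_000, 5
    vNew = SharedArray{Float64}(nCapital, nProductivity)

    # One contiguous, non-overlapping block of capital indices per worker.
    blocks = Iterators.partition(1:nCapital, cld(nCapital, nworkers()))
    @sync for (w, kblock) in zip(workers(), blocks)
        @async remotecall_wait(update_block!, w, vNew, kblock, nProductivity)
    end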

Depending on what exactly the workers are doing (for example, if they write to the same output elements), it might be necessary to give each worker its own output array and merge them at the end, but that does not seem to be the case for your task.
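If they did overlap, the usual fix is the pattern below (again only a sketch): each worker fills a private ordinary array and the pieces are merged on the master process at the end, for example with pmap:

    using Distributed
    addprocs(2)

    # Every call gets its own private output vector; nothing is shared.
    @everywhere function values_for_block(krange)
        out = zeros(length(krange))
        for (j, iK) in enumerate(krange)
            out[j] = sqrt(iK)       # placeholder for the real per-point work
        end
        return out
    end

    blocks = collect(Iterators.partition(1:1_000, 250))
    pieces = pmap(values_for_block, blocks)  # one private array per block
    merged = reduce(vcat, pieces)            # merge the results at the end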