I am developing a code to perform a few very large computations by my standards. Based on single-CPU estimates, expected run-time is ~10 CPU years, and memory requirements are ~64 GB. Little to no IO is required. My serial version of the code in question (written in C) is working well enough and I have to start thinking about how to best parallelize the code.
I have access to clusters with ~64 GB RAM and 16 cores per node. I will probably limit myself to using e.g. <= 8 nodes. I'm imagining a setup where memory is shared between threads on a single node, with separate memory used on different nodes and relatively little communication between nodes.
From what I've read so far, the solution I have come up with is to use a hybrid OpenMP + OpenMPI design, using OpenMP to manage threads on individual compute nodes, and OpenMPI to pass information between nodes, like this: https://www.rc.colorado.edu/crcdocs/openmpi-openmp
My question is whether this is the "best" way to implement this parallelization. I'm an experienced C programmer but have very limited experience in parallel programming (a little bit with OpenMP, none with OpenMPI; most of my jobs in the past were embarrassingly parallel). As an alternative suggestion, is it possible with OpenMPI to efficiently share memory on a single host? If so then I could avoid using OpenMP, which would make things slightly simpler (one API instead of two).