The answer is not simple or straightforward: splitting the task into one programme per CPU is likely to be sub-optimal, and may indeed be poor or even extremely poor.
First, as I understand it, you have seven quad-core CPUs (presumably there are eight, but you're saving one for the OS?). If you run a single-threaded process on each CPU, you will be using a single thread on a single core. The other three cores, and all of the hyperthreads, will not be used.
The hardware and OS cannot split a single thread over multiple cores.
You could, however, run four single-threaded processes per CPU (one per core), or even eight (one per hyperthread). Whether or not this is optimal depends on the work being done by the processes - in particular, their working set size and memory access patterns - and upon the hardware cache arrangements: the number of levels of cache, their sizes and their sharing. The NUMA arrangement of the cores also needs to be considered.
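To make the "one process per core" idea concrete, here's a minimal Python sketch that spawns one worker per available core and pins each worker to its core with `os.sched_setaffinity` (a Linux-only call; the fork start context is also Linux-specific). The trivial `sum(range(n))` workload is just a stand-in for your real single-threaded job.

```python
import os
import multiprocessing as mp

ctx = mp.get_context("fork")  # fork start method: Linux-only, avoids re-import issues

def worker(core_id, n, out):
    # Pin this process to exactly one core (Linux-only scheduler call).
    os.sched_setaffinity(0, {core_id})
    # Stand-in for the real single-threaded workload.
    out.put(sum(range(n)))

cores = sorted(os.sched_getaffinity(0))  # the cores this process is allowed to use
queue = ctx.Queue()
procs = [ctx.Process(target=worker, args=(c, 1000, queue)) for c in cores]
for p in procs:
    p.start()
totals = [queue.get() for _ in procs]  # one result per pinned process
for p in procs:
    p.join()
```

Note that `os.sched_getaffinity` reports logical CPUs (hyperthreads), not physical cores, so on a hyperthreaded machine this launches one process per hyperthread; to get one per physical core you'd filter the ID list yourself.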
Basically speaking, an extra thread has to give you quite a bit of speed-up to outweigh what it can cost you in cache utilization, main memory accesses and the disruption of pre-fetching.
Furthermore, because the effects of the working set exceeding certain caching limits are profound, what seems fine for, say, one or two cores may be appalling for four or eight; so you can't even experiment with one core and assume the results carry over to eight.
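You can see the working-set effect directly with a crude timing sketch like the one below: it walks buffers of increasing size at a 64-byte stride (one touch per typical cache line) and reports the cost per touch. The sizes are hypothetical, chosen to straddle typical L2 and L3 capacities; Python's interpreter overhead will mute the cliff compared with C, but the trend as the buffer outgrows each cache level is usually still visible.

```python
import time
from array import array

def time_per_touch(n_bytes, stride=64):
    # Walk the buffer at a fixed stride so each access lands on a new cache line.
    buf = array("b", bytes(n_bytes))
    indices = range(0, n_bytes, stride)
    t0 = time.perf_counter()
    s = 0
    for i in indices:
        s += buf[i]
    elapsed = time.perf_counter() - t0
    return elapsed / len(indices)  # seconds per touched cache line

# Hypothetical working-set sizes: well inside L2, around L3, and well past L3.
for size in (128 * 1024, 4 * 1024 * 1024, 64 * 1024 * 1024):
    print(f"{size // 1024:>8} KiB: {time_per_touch(size):.3e} s/touch")
```

Run one copy per core simultaneously and you'd expect the effective per-process cache share, and hence the cliff, to move to smaller sizes.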
Having a quick look, I see the i7 has a small L2 cache and a huge L3 cache. Given your data set, I wouldn't be surprised if there's a lot of data being processed. The question is whether or not it is sequentially processed (i.e. whether prefetching will be effective). If the data is not sequentially processed, you may do better by reducing the number of concurrent processes, so their combined working sets tend to fit inside the L3 cache. I suspect if you run eight or sixteen processes, the L3 cache will be hammered, i.e. overflowed. OTOH, if your data access is non-sequential, the L3 cache probably isn't going to save you anyway.
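The sequential-versus-non-sequential distinction is easy to check empirically. The sketch below touches exactly the same set of elements twice, once in ascending order (prefetcher-friendly) and once in shuffled order; any gap between the two timings is down to caching and prefetching, since the work done is identical. The buffer size is an assumption you'd tune to sit past your L3 capacity.

```python
import random
import time
from array import array

def timed_walk(indices, buf):
    # Sum the same elements; only the visiting order varies between callers.
    t0 = time.perf_counter()
    s = 0
    for i in indices:
        s += buf[i]
    return time.perf_counter() - t0, s

N = 1 << 22                      # 4 Mi longs; pick a size larger than your L3
buf = array("l", range(N))
seq = list(range(0, N, 8))       # ascending order: prefetch works
rnd = seq[:]
random.shuffle(rnd)              # same indices, randomised order: prefetch defeated

t_seq, sum_seq = timed_walk(seq, buf)
t_rnd, sum_rnd = timed_walk(rnd, buf)
# sum_seq == sum_rnd by construction; compare t_seq against t_rnd.
```

In a compiled language the random walk is commonly several times slower once the buffer exceeds L3; in Python the ratio is smaller because interpreter overhead dominates, but the direction of the effect is the same.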