No language has direct access to the CPU cache (I would cite sources here, but I don't have enough rep to post more links). This in turn means there is no way for OpenCL to keep private memory pinned in cache.
In this presentation from AMD
they simply describe the memory model as a series of memory objects abstracted by the context (page 16). As long as a buffer is available to the devices in the context, it will be readable. As for the different types of kernel memory (global, local, private): on a GPU these map to physically different kinds of memory, but on a CPU they all end up in the same main memory, so you can safely assume there will be no performance difference between them.
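To make the distinction concrete, here is a minimal kernel sketch (the kernel name and arguments are my own invention) showing the three address-space qualifiers. The qualifiers still matter for correctness on a CPU, even though they no longer select different physical memories:

```
// Illustrative kernel: the three OpenCL address spaces.
// On a GPU these typically map to distinct physical memories;
// on a CPU they all map to ordinary RAM.
__kernel void scale(__global const float *in,  // global: device memory
                    __global float *out,
                    __local float *scratch,    // local: shared per workgroup
                    const float factor)        // private: per work-item
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    float v = in[gid] * factor;   // v lives in private memory
    scratch[lid] = v;             // staged through local memory
    barrier(CLK_LOCAL_MEM_FENCE); // make it visible to the workgroup

    out[gid] = scratch[lid];
}
```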
Keep in mind, however, that host memory and device memory would still differ if you are computing on a cluster, so you would still need to take transfer rates into account.
On the second part of your question, please see this article on memory models in OpenCL. There is performance to be gained by structuring your program so that work-items only need to communicate within their own workgroup.
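A classic example of this pattern is a workgroup-level reduction: all communication goes through `__local` memory within one workgroup, and only one value per workgroup touches global memory. This is a sketch under the assumption that the workgroup size is a power of two; the kernel name is hypothetical:

```
// Sketch: per-workgroup sum reduction using only local communication.
// Assumes the workgroup size is a power of two.
__kernel void reduce_sum(__global const float *in,
                         __global float *partial,  // one result per group
                         __local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction entirely inside local memory.
    for (size_t s = lsz / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)  // only one global write per workgroup
        partial[get_group_id(0)] = scratch[0];
}
```

The host then sums the (much smaller) `partial` array, or runs the kernel again on it.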
For further reading, please see http://software.intel.com/sites/landingpage/opencl/optimization-guide/index.htm