I'm trying to understand Apache Spark's internals. I'd like to know whether Spark uses any mechanism to ensure data locality when reading from an InputFormat or writing to an OutputFormat (or to other formats natively supported by Spark and not derived from MapReduce).
In the reading case, my understanding is that, when using an InputFormat, each split is associated with the host (or hosts, in the case of replication?) that contains its data, so Spark tries to assign tasks to executors on those hosts in order to reduce network transfer as much as possible. For example, I can inspect this with the sketch below.
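This is what I mean (run in the spark-shell, where `sc` is already provided; the HDFS path is made up for illustration):

```scala
// the HDFS path is just an example
val rdd = sc.textFile("hdfs:///data/events.log")

// each partition wraps a Hadoop input split; preferredLocations returns the
// hosts holding the corresponding HDFS blocks, which the scheduler uses when
// it tries to place tasks (NODE_LOCAL / RACK_LOCAL / ANY)
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}
```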
In the writing case, how would such a mechanism work? I know that technically a file in HDFS can be written to any node locally and replicated to two others (so the network is used for two out of three replicas). But if you consider writing to other systems, such as a NoSQL database (Cassandra, HBase, others), those systems have their own way of distributing data. Is there a way to tell Spark to partition an RDD so that data locality is optimized according to the distribution of data expected by the output sink (the target NoSQL database, accessed natively or through an OutputFormat)?
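To make the question concrete, I'm imagining something along the lines of what the DataStax spark-cassandra-connector seems to offer with `repartitionByCassandraReplica` (the keyspace, table and `Event` schema below are made up, and I'm assuming `userId` is the table's partition key):

```scala
// in spark-shell with the spark-cassandra-connector on the classpath and
// spark.cassandra.connection.host configured
import com.datastax.spark.connector._

case class Event(userId: String, ts: Long, payload: String)

val events = sc.parallelize(Seq(
  Event("u1", 1L, "a"),
  Event("u2", 2L, "b")
))

// repartitionByCassandraReplica shuffles the RDD so that each row lands on a
// Spark partition whose preferred location is a Cassandra replica for that
// row's partition key; the subsequent write should then be mostly node-local
// when Spark and Cassandra share the same machines
events
  .repartitionByCassandraReplica("my_keyspace", "events")
  .saveToCassandra("my_keyspace", "events")
```

Is this kind of sink-aware repartitioning something Spark itself supports in a general way, or is it always left to the individual connector/OutputFormat?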
I'm referring to an environment in which the Spark nodes and the NoSQL nodes live on the same physical machines.