I'm working with Apache Spark on a cluster using HDFS. As far as I understand, HDFS distributes files across the data nodes in blocks. So if I put a "file.txt" on the filesystem, it will be split into partitions. Now I'm calling
rdd = SparkContext().textFile("hdfs://.../file.txt")
from Apache Spark. Does rdd now automatically have the same partitions as "file.txt" has on the filesystem? What happens when I call
rdd.repartition(x)
where x is greater than the number of partitions used by HDFS? Will Spark physically rearrange the data on HDFS to work locally?
Example: I put a 30 GB text file on HDFS, which distributes it across 10 nodes. Will Spark a) use the same 10 partitions, and b) shuffle 30 GB across the cluster when I call repartition(1000)?