In general none of these conditions will be satisfied, and zip is not a good tool for an operation like this. Both the number of partitions and the number of elements per partition depend not only on the number of lines but also on the total file size, the sizes of the individual files, and the configuration.
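For illustration, assuming two hypothetical input files with the same number of lines:

lines1 = sc.textFile("file1.txt")  # hypothetical paths
lines2 = sc.textFile("file2.txt")

# The split into partitions is driven by byte sizes and input-format
# settings, not line counts, so these can differ:
lines1.getNumPartitions(), lines2.getNumPartitions()

# and this can fail at runtime unless every partition pairs up exactly:
lines1.zip(lines2).count()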
zip is useful when you combine RDDs which share a common ancestor and are not separated by a shuffle, for example:
parent = sc.parallelize(range(100))

# Both children are narrow transformations of the same parent, so they
# keep identical partitioning and per-partition element counts:
child1 = parent.map(some_func)
child2 = parent.map(other_func)

child1.zip(child2)
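You can convince yourself the layouts line up (a quick, illustrative check):

child1.getNumPartitions() == child2.getNumPartitions()  # True
child1.zip(child2).count()                              # 100, no error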
To merge RDDs line by line, you can do something like this:
def index_and_sort(rdd):
    def swap(xy):
        x, y = xy
        return y, x

    return rdd.zipWithIndex().map(swap).sortByKey()
index_and_sort(lines1).join(index_and_sort(lines2)).values()
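A quick sanity check with in-memory data and deliberately mismatched partitioning (the inputs here are illustrative):

lines1 = sc.parallelize(["a", "b", "c", "d"], 2)
lines2 = sc.parallelize(["1", "2", "3", "4"], 3)

pairs = index_and_sort(lines1).join(index_and_sort(lines2)).values()
sorted(pairs.collect())
# [('a', '1'), ('b', '2'), ('c', '3'), ('d', '4')]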
It should be safe to zip after indexing and sorting:
from pyspark import RDD

# After sortByKey both RDDs are range-partitioned over the same keys,
# so their layouts should line up:
RDD.zip(*(index_and_sort(rdd).values() for rdd in [lines1, lines2]))
but why even bother?
Scala equivalent:
import org.apache.spark.rdd.RDD
def indexAndSort(rdd: RDD[String]) = rdd.zipWithIndex.map(_.swap).sortByKey()
indexAndSort(lines1).join(indexAndSort(lines2)).values