Unexpected errors when converting a SparseVector to a DenseVector in PySpark 1.4.1:
from pyspark.mllib.linalg import SparseVector, DenseVector
DenseVector(SparseVector(5, {4: 1.}))
On Ubuntu (running pyspark), this works as expected and returns:
DenseVector([0.0, 0.0, 0.0, 0.0, 1.0])
On RedHat (running pyspark), the same call fails with:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/mllib/linalg.py", line 206, in __init__
    ar = np.array(ar, dtype=np.float64)
  File "/usr/lib/spark/python/pyspark/mllib/linalg.py", line 673, in __getitem__
    raise ValueError("Index %d out of bounds." % index)
ValueError: Index 5 out of bounds.
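The ValueError itself can be reproduced by indexing the SparseVector directly past its size, so my (unverified) guess is that np.array(ar, dtype=np.float64) in DenseVector.__init__ iterates the SparseVector element by element through SparseVector.__getitem__ and, on RedHat, asks for one element too many:

from pyspark.mllib.linalg import SparseVector
sv = SparseVector(5, {4: 1.})
len(sv)  # 5
sv[4]    # 1.0 -- the stored entry is returned correctly
sv[5]    # ValueError: Index 5 out of bounds. -- the same check seen in the traceback above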
Also, on both platforms, evaluating the following results in an error:
DenseVector(SparseVector(5, {0: 1.}))
I would expect:
DenseVector([1.0, 0.0, 0.0, 0.0, 0.0])
but get:
- Ubuntu:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/skander/spark-1.4.1-bin-hadoop2.6/python/pyspark/mllib/linalg.py", line 206, in __init__
    ar = np.array(ar, dtype=np.float64)
  File "/home/skander/spark-1.4.1-bin-hadoop2.6/python/pyspark/mllib/linalg.py", line 676, in __getitem__
    row_ind = inds[insert_index]
IndexError: index out of bounds
Note: this error message is different from the previous one, although the error occurs in the same function (code at https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg.html)
- RedHat: the same command results in a segmentation fault, which crashes Spark.
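As a possible workaround (a sketch I have not yet verified on both platforms), converting the SparseVector to a numpy array first and building the DenseVector from that sidesteps SparseVector.__getitem__ entirely:

from pyspark.mllib.linalg import SparseVector, DenseVector
sv = SparseVector(5, {0: 1.})
# toArray() materializes the sparse vector as a numpy array, so
# DenseVector.__init__ never has to iterate the SparseVector itself
dv = DenseVector(sv.toArray())
# expected: DenseVector([1.0, 0.0, 0.0, 0.0, 0.0])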