I am not expert in RDD and looking for some answers to get here, I was trying to perform few operations on pyspark RDD but could not achieved , specially with substring.I know I can do that by converting RDD to DF, but wondering how this was being done earlier before pre DF era ? Are companies still prefer to do work in RDD or dataframes ?
My code:
rdd= sc.textFile("Sales.txt")
##Taking only required columns and changing the data types
rdd_map = rdd.map(lambda line: (int((line.split("|")[0])),int((line.split("|")[1])),line.split("|")[4]))
##Filtering the data
rdd_filter = rdd_map.filter(lambda x: (x[0] > 43668) & ('-' in x[2]))
## Trying to perform substring
rdd_clean = rdd_filter.map(lambda x: x.substr(x[2],1,3))
Data sample:
43665|63|OLD ORDER|Sport-100 Helmet, Re|HL-U509-R
43668|87|OLD ORDER|Sport-100 Helmet, Re|HL-U509-R
Complete error message:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 50.0 failed 1 times, most recent failure: Lost task 0.0 in stage 50.0 (TID 152, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Sales.txt
. – cronoik