Would you be able to help in this spark prob statement
Data -
empno|ename|designation|manager|hire_date|sal|deptno
7369|SMITH|CLERK|9902|2010-12-17|800.00|20
7499|ALLEN|SALESMAN|9698|2011-02-20|1600.00|30
Code:
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|") (fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4).toDate,fields(5).toFloat,fields(6).toInt)
})
Problem Statement - This is not working -fields(4).toDate , whats is the alternative or what is the usage ?
What i have tried ?
- tried replacing it to - to_date(col(fields(4)) , "yyy-MM-dd") - Not working
2.
Step 1.
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|")
(fields(0),fields(1),fields(2),fields(3),fields(4),fields(5),fields(6))
})
Now this tuples are all strings
Step 2.
mySchema = StructType(StructField(empno,IntegerType,true), StructField(ename,StringType,true), StructField(designation,StringType,true), StructField(manager,IntegerType,true), StructField(hire_date,DateType,true), StructField(sal,DoubleType,true), StructField(deptno,IntegerType,true))
Step 3. converting the string tuples to Rows
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This is also not working and gives error related to types. to solve this i changed the step 1 as
(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4),fields(5).toFloat,fields(6).toInt)
Now this is giving error for the date type column and i am again at the main problem.
Use Case - use textFile Api, convert this to a dataframe using custom schema (StructType) on top of it.
This can be done using the case class but in case class also i would be stuck where i would need to do a fields(4).toDate (i know i can cast string to date later in code but if the above problem solutionis possible)
val df = spark.read .option("delimiter", "\\|") .option("header", true) .option("inferSchema", "true") .csv(path)
Should be Should be enough to read a dataframe. – koiralo