Environment: Scala, Spark, IntelliJ IDEA.

I have a DataFrame (multiple rows, multiple columns) loaded from a CSV file, and I want to map it to a different, specific set of columns.
I was thinking of a Scala class (not a case class, because there are more than 22 columns) or map(), but I don't know how to do the conversion.

Example

A DataFrame loaded from the CSV file:

----------------------
| No  |  price| name |
----------------------
|  1  |  100  |  "A" |
----------------------
|  2  |  200  |  "B" |
----------------------

The target column layout:

 => {product_id, product_name, seller}

First, product_id maps to 'No'. Second, product_name maps to 'name'. Third, seller is null or "" (an empty string).

So, in the end, I want a DataFrame with the new column layout:

-----------------------------------------
| product_id  |  product_name  | seller |
-----------------------------------------
|      1      |       "A"      |        |
-----------------------------------------
|      2      |       "B"      |        |
-----------------------------------------
2 Answers

0
votes

If you already have a DataFrame (e.g. old_df):

val new_df=old_df.withColumnRenamed("No","product_id").
                  withColumnRenamed("name","product_name").
                  drop("price").
                  withColumn("seller", ... )
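The answer leaves the value of the seller column open. As a minimal sketch, assuming Spark 2.0 or later (for SparkSession) and assuming an empty string is an acceptable placeholder, the chain could be completed with lit(""). The object name RenameAndAddSeller, the helper toProductColumns, and the in-memory sample rows are hypothetical stand-ins for the CSV data in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object RenameAndAddSeller {
  // Rename No -> product_id and name -> product_name, drop price,
  // and add a seller column filled with an empty string (an assumption;
  // use lit(null).cast("string") instead if a null column is preferred).
  def toProductColumns(oldDf: DataFrame): DataFrame =
    oldDf
      .withColumnRenamed("No", "product_id")
      .withColumnRenamed("name", "product_name")
      .drop("price")
      .withColumn("seller", lit(""))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rename-and-add-seller")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the DataFrame loaded from the CSV file in the question.
    val oldDf = Seq((1, 100, "A"), (2, 200, "B")).toDF("No", "price", "name")

    toProductColumns(oldDf).show()
    spark.stop()
  }
}
```

withColumn needs a Column expression rather than a plain Scala value, which is why the constant is wrapped in lit.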
0
votes

Let's say your CSV file is "products.csv".

First you have to load it into Spark, which you can do like this:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
     .format("com.databricks.spark.csv")
     .option("header", "true") // Use first line of all files as header
     .option("inferSchema", "true") // Automatically infer data types
     .load("products.csv")
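
The snippet above relies on the external spark-csv package. On Spark 2.0 and later the CSV reader is built in, so the same load could be sketched as below. The object name LoadCsvSketch, the helper loadCsv, and the temporary sample file are hypothetical and only there to keep the example self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadCsvSketch {
  // Read a CSV file with a header row and let Spark infer column types.
  def loadCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // infer column types from the data
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("load-csv-sketch")
      .getOrCreate()

    // Write a tiny sample file that mirrors the table in the question.
    val path = Files.createTempFile("products", ".csv")
    Files.write(path, "No,price,name\n1,100,A\n2,200,B\n".getBytes("UTF-8"))

    val df = loadCsv(spark, path.toString)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```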

Once the data is loaded, the DataFrame df takes its column names from the header row. In your example the column names will be "No", "price" and "name".

To rename a column, use the withColumnRenamed method of the DataFrame API.

val renamedDf = df.withColumnRenamed("No","product_id").
   withColumnRenamed("name","product_name")

renamedDf will then have the column names you assigned.
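
When there are many columns (the question mentions more than 22), chaining one withColumnRenamed call per column gets long. One possible alternative, sketched here under the assumption of Spark 2.0 or later, is to describe the target layout as data and build a single select from it. The names ColumnMappingSketch, mapping, and remap are hypothetical, and filling unmapped columns with an empty string is an assumption taken from the question's "null or empty string" wording.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ColumnMappingSketch {
  // Target column name paired with its source column.
  // None means there is no source column, so a placeholder is used.
  val mapping: Seq[(String, Option[String])] = Seq(
    ("product_id", Some("No")),
    ("product_name", Some("name")),
    ("seller", None)
  )

  // Build one select that renames mapped columns, fills unmapped ones
  // with an empty string, and drops everything not listed (e.g. price).
  def remap(df: DataFrame): DataFrame = {
    val projected: Seq[Column] = mapping.map {
      case (target, Some(source)) => col(source).as(target)
      case (target, None)         => lit("").as(target)
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-mapping-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the DataFrame loaded from the CSV file.
    val df = Seq((1, 100, "A"), (2, 200, "B")).toDF("No", "price", "name")

    remap(df).show()
    spark.stop()
  }
}
```

Because the result stays a DataFrame, no case class is needed, so the 22-field concern from the question does not come into play.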