2 votes

I have two CSV files (datasets), file1 and file2.

File1 consists of the following columns:

Orders | Requests | Book1  | Book2
Varchar| Integer  | Integer| Integer

File2 consists of the following columns:

Book3 | Book4 | Book5  | Orders
String| String| Varchar| Varchar

How can I combine the data from the two CSV files in Scala to find:

  • how many Orders, Book1 (ignoring rows where Book1 = 0), Book3 and Book4 values are present in both files for each Orders value?
  • Note: the Orders column is common to both files.
What does Orders represent? What is the logical relationship between the two files? What is the purpose of this join? - Yaron
Sorry for the confusion. I just picked random column names. I want to combine two CSV files that share a common column (say column1) and display that common column together with columns from both files (say column2 from the first file and column5, column6 from the second file). So the output will be column1, column2, column5, column6. Also, rows where column2 = 0 should be ignored. - SolakiR

1 Answer

0 votes

You can join the two CSV files by making pair RDDs and joining them on the common key.

  // This snippet comes from the answerer's own codebase: `sc` is a SparkContext,
  // `LineParser` is a custom line-splitting class, and `job`/`leftFile` hold the
  // job configuration (key indexes, delimiter, file paths).
  val rightFile = job.patch.get.file
  val rightFileByKeys = sc.textFile(rightFile).map { line =>
    new LineParser(line, job.patch.get.patchKeyIndex, job.delimRegex, Some(job.patch.get.patchValueIndex))
  }.keyBy(_.getKey())   // pair RDD keyed by the join column

  val leftFileByKeys = sc.textFile(leftFile).map { line =>
    new LineParser(line, job.patch.get.fileKeyIndex, job.delimRegex)
  }.keyBy(_.getKey())

  // Inner join on the key, then append the right-hand value to the left line
  leftFileByKeys.join(rightFileByKeys).map { case (key, (left, right)) =>
    (job, left.line + job.delim + right.getValue())
  }
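
For the files in the question, a minimal self-contained sketch of the same pair-RDD join might look like the following. The file paths, the comma delimiter, the absence of a header row, and the column positions are assumptions; adjust them to the real data. It keys file1 and file2 by Orders, drops rows where Book1 = 0, and keeps Book1, Book3 and Book4 for every Orders value present in both files.

  import org.apache.spark.{SparkConf, SparkContext}

  object JoinCsvFiles {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("JoinCsvFiles").setMaster("local[*]"))

      // file1: Orders | Requests | Book1 | Book2 -> (Orders, Book1), skipping rows where Book1 = 0
      val file1ByOrders = sc.textFile("file1.csv")
        .map(_.split(","))
        .filter(cols => cols(2).trim != "0")
        .map(cols => (cols(0).trim, cols(2).trim))

      // file2: Book3 | Book4 | Book5 | Orders -> (Orders, (Book3, Book4))
      val file2ByOrders = sc.textFile("file2.csv")
        .map(_.split(","))
        .map(cols => (cols(3).trim, (cols(0).trim, cols(1).trim)))

      // Inner join keeps only the Orders values present in both files
      val joined = file1ByOrders.join(file2ByOrders)
        .map { case (orders, (book1, (book3, book4))) =>
          s"$orders,$book1,$book3,$book4"
        }

      joined.collect().foreach(println)
      println(s"Matching rows: ${joined.count()}")
      sc.stop()
    }
  }

Note that splitting on a plain comma will break if any field contains quoted commas; for real data a proper CSV parser (or Spark's DataFrame CSV reader) is a safer choice.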