2 votes

I can read the JSON and call printSchema, but running any action fails with java.io.IOException: No input paths specified in job.

import org.apache.spark.sql.SQLContext

val sc = new org.apache.spark.SparkContext("local[*]", "shell")
val sqlCtx = new SQLContext(sc)
// Point jsonFile at the parent directory holding the tweet batches
val input = sqlCtx.jsonFile("../data/tweets/")
input.printSchema

root
 |-- contributorsIDs: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- createdAt: string (nullable = true)
...

input.first
java.io.IOException: No input paths specified in job

The folder structure looks like this:

  • tweets
    • tweets_1444576960000
      • _SUCCESS
      • part-00000
    • tweets_1444577070000
      • _SUCCESS
      • part-00000

Notes:

  • I am using Spark and Spark SQL version 1.5.0
  • Executors are local[*] on same machine
  • I tried replacing the relative path with an absolute path; same error
  • The JSON tweets were fetched using Databricks' example app here
1 · If you want to try recursively fetching directories, there seems to be a solution here. – Rohan Aletty
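The linked suggestion presumably refers to Hadoop's recursive input listing. A minimal sketch of that approach, assuming the flag below is honored by the input format jsonFile uses in Spark 1.5 (the mapreduce.input.fileinputformat.input.dir.recursive key exists in Hadoop 2.x, but whether this code path picks it up is not confirmed here):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "shell")
// By default, Hadoop's FileInputFormat lists only the files directly
// under each input directory, so ../data/tweets/ itself contains no
// part files. This flag (assumption: honored here) enables descending
// into the tweets_* subdirectories.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.input.dir.recursive", "true")
val sqlCtx = new SQLContext(sc)
val input = sqlCtx.jsonFile("../data/tweets/")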

1 Answer

5 votes

OK, problem solved by specifying the path like this:

val input = sqlCtx.jsonFile("../data/tweets/tweets_*/*")
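This works because the glob expands to the part files one level below ../data/tweets/, which the default, non-recursive directory listing never reached. Since jsonFile is deprecated as of Spark 1.4, the same glob can also be passed to the DataFrameReader API; a minimal equivalent sketch:

import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
// read.json is the 1.4+ replacement for the deprecated jsonFile
// and accepts the same glob patterns.
val input = sqlCtx.read.json("../data/tweets/tweets_*/*")
input.first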