I have a JSON data file and I want to apply a schema to its columns programmatically.

pets.json

{"id":"311","species":"canine","color":"golden","weight":"75","name":"Captain"}
{"id":"928","species":"feline","color":"gray","weight":"8","name":"Oscar"}


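    import org.apache.spark.sql.DataFrameReader;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;

    SparkSession session = SparkSession.builder().appName("SparkSQLTests").master("local[*]").getOrCreate();
    DataFrameReader dataFrameReader = session.read();

    // Create the DataFrame with an explicit schema.
    // buildSchema() (not shown) returns the StructType printed in the output below.
    Dataset<Row> pets = dataFrameReader.schema(buildSchema()).json("input/pets.json");

    // Print the schema and the first 10 rows
    pets.printSchema();
    pets.show(10);

    // SELECT *
    // FROM pets
    // WHERE species='canine'
    System.out.println("=== Display Canines ===");
    pets.filter(col("species").equalTo("canine")).show();

    session.stop();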

When I run the program I get nulls for my columns. What am I doing incorrectly? Thanks


    root
     |-- id: integer (nullable = true)
     |-- species: string (nullable = true)
     |-- color: string (nullable = true)
     |-- weight: double (nullable = true)
     |-- name: string (nullable = true)

    +----+-------+-----+------+----+
    |  id|species|color|weight|name|
    +----+-------+-----+------+----+
    |null|   null| null|  null|null|
    |null|   null| null|  null|null|
    +----+-------+-----+------+----+

    === Display Canines ===
    +---+-------+-----+------+----+
    | id|species|color|weight|name|
    +---+-------+-----+------+----+
    +---+-------+-----+------+----+

1 Answer

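It turns out that I had quotes around the numeric values in my JSON data. Those values are JSON strings, so they did not match the integer and double columns in my schema, and the rows came back as nulls. It worked once I changed the data to: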

{"id":311,"species":"canine","color":"golden","weight":75,"name":"Captain"}
{"id":928,"species":"feline","color":"gray","weight":8,"name":"Oscar"}
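
If the file is too large to fix by hand, the same change can be scripted. The following is only a minimal sketch in plain Java, assuming that `id` and `weight` are the only numeric fields (as in the printed schema). `UnquoteNumbers` and `unquote` are hypothetical names made up for illustration and are not part of Spark. It uses a regular expression rather than a JSON parser, so it would also rewrite matching text that happens to appear inside a string value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not Spark API): rewrites quoted numeric values as JSON numbers.
final class UnquoteNumbers {

    // Matches e.g. "id":"311" or "weight":"7.5".
    // Assumption: only the fields named here should be numeric.
    private static final Pattern QUOTED_NUMBER =
            Pattern.compile("\"(id|weight)\"\\s*:\\s*\"(-?\\d+(?:\\.\\d+)?)\"");

    // Returns the line with the quotes removed from the numeric values.
    static String unquote(String jsonLine) {
        Matcher m = QUOTED_NUMBER.matcher(jsonLine);
        return m.replaceAll("\"$1\":$2");
    }

    public static void main(String[] args) {
        String[] lines = {
            "{\"id\":\"311\",\"species\":\"canine\",\"color\":\"golden\",\"weight\":\"75\",\"name\":\"Captain\"}",
            "{\"id\":\"928\",\"species\":\"feline\",\"color\":\"gray\",\"weight\":\"8\",\"name\":\"Oscar\"}"
        };
        // Print each record with id and weight as bare numbers.
        for (String line : lines) {
            System.out.println(unquote(line));
        }
    }
}
```

Running `main` on the two original records prints them with `id` and `weight` unquoted, which is the form that loaded correctly with the schema.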