0
votes

I'm trying to create StructType Schema for below JSON

{ 
   "countries":{ 
      "country":[ 
         { 
            "area":9596960,
            "cities":{ 

            },
            "name":"China",
            "population":1210004992
         },
         { 
            "area":3287590,
            "cities":{ 

            },
            "name":"India",
            "population":952107712
         },
         { 
            "area":9372610,
            "cities":{ 
               "city":[ 
                  { 
                     "name":"New York",
                     "population":7380906
                  },
                  { 
                     "name":"Los Angeles",
                     "population":3553638
                  },
                  { 
                     "name":"Chicago",
                     "population":2721547
                  },
                  { 
                     "name":"Detroit",
                     "population":1000272
                  }
               ]
            },
            "name":"United States",
            "population":266476272
         },
         { 
            "area":1919440,
            "cities":{ 
               "city":[ 
                  { 
                     "name":"Jakarta",
                     "population":8259266
                  },
                  { 
                     "name":"Surabaya",
                     "population":2483871
                  },
                  { 
                     "name":"Bandung",
                     "population":2058649
                  },
                  { 
                     "name":"Medan",
                     "population":1730752
                  },
                  { 
                     "name":"Semarang",
                     "population":1250971
                  },
                  { 
                     "name":"Palembang",
                     "population":1144279
                  }
               ]
            },
            "name":"Indonesia",
            "population":206611600
         }
      ]
   }
}

I'm doing below code to get all countries name

DataTypes.createStructField("countries", (new StructType()).add(DataTypes.createStructField("country",
                    (new StructType()).add(DataTypes.createStructField("name", DataTypes.StringType, true)), true)), true)

But when am running below to get all country names

Dataset<Row> namesDF = spark.sql("SELECT countries FROM country");
        namesDF.show();

I'm getting nulls, May I know please how to parse Json fields to get values using StructType ..?

The am doing, Is it the correct way of doing it ..? I'm trying to get country names from above JSON

Update:

Code:

    static final StructType SCHEMA = new StructType(new StructField[] {

            DataTypes.createStructField("countries",
                    new StructType().add(DataTypes.createStructField("country",
                            new ArrayType(new StructType()
                                    .add(DataTypes.createStructField("name", DataTypes.StringType, true)), true),
                            true)),
                    true) }

    );

}

Entry Point

Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
                .json(context.getProperty(GlobalConstants.ReadyJsonFile));

        ds.printSchema();

        ds.createOrReplaceTempView("country_data");
        ds.sqlContext().sql("SELECT country.name FROM country_data lateral view explode(countries.country) t as country").show(false);

Output

root
 |-- countries: struct (nullable = true)
 |    |-- country: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)

+----+
|name|
+----+
+----+

Why Its showing empty name..? I'm using spark 2.4.4

Schema Discovery

root
 |-- countries: struct (nullable = true)
 |    |-- country: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- area: double (nullable = true)
 |    |    |    |-- cities: struct (nullable = true)
 |    |    |    |    |-- city: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- population: long (nullable = true)
1

1 Answers

0
votes

In your json country field contains array not struct, so it leads to schema mismatch. You should create schema with ArrayType like this:

DataTypes.createStructField("countries", 
    new StructType().add(DataTypes.createStructField("country",
                    new ArrayType(new StructType().add(DataTypes.createStructField("name", 
    DataTypes.StringType, true)), true), true)), true)

With this schema you will get countries as follows:

df.registerTempTable("country_data");
spark.sql("SELECT countries FROM country_data").show();
+--------------------------------------------------------------+
|countries                                                     |
+--------------------------------------------------------------+
|[WrappedArray([China], [India], [United States], [Indonesia])]|
+--------------------------------------------------------------+

In case you want to list all countries inside the array you should use explode:

spark.sql("SELECT country.name FROM country_data lateral view explode(countries.country) t as country").show(false)
+-------------+
|name         |
+-------------+
|China        |
|India        |
|United States|
|Indonesia    |
+-------------+