
I have JSON data that looks like this (one object per row):

{
  "id": "c428c2e2-c30c-4864-8c12-458ead4b17f5",
  "weight": 73,
  "topics": {
    "type": 1,
    "values": [
      1,
      2,
      3
    ]
  }
}

When I read in the data without a specified schema, Spark infers topics.values to be an ArrayType but I need it to be a VectorUDT for doing ML tasks. So I am trying to read in the data set using a schema as follows:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from pyspark.ml.linalg import VectorUDT

    schema = StructType([
        StructField("id", StringType()),
        StructField("weight", IntegerType()),
        StructField("topics", StructType([
            StructField("type", IntegerType()),
            StructField("values", VectorUDT())
        ]))
    ])

When I do this I see the types of the data frame (using dtypes) as follows:

[('id', 'string'), ('weight', 'int'), ('topics', 'struct<type:int,values:vector>')]

But there seems to be no actual data in the data frame, as shown by calling first():

Row(id=None, weight=None, topics=None)

And when I write the data frame to disk, I just see empty braces on each line. Seems odd! What am I doing wrong?

It is not odd. You pass schema which is not applicable for JSON document. - user6022341
@LostInOverflow Can you elaborate? Obviously I am here asking a question because I don't know that. - Evan Zamir
@LostInOverflow Well, your comment did make me realize how to do this correctly. So thanks for that. - Evan Zamir
Glad it was helpful, and sorry I didn't have a more definitive suggestion. - user6022341

1 Answer


Well, I figured it out. I just needed to change the schema a bit:

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType
    from pyspark.ml.linalg import VectorUDT

    schema = StructType([
        StructField("id", StringType()),
        StructField("weight", DoubleType()),
        StructField("topics", VectorUDT())
    ])

Now it makes sense: VectorUDT is serialized as a struct with fields type, size, indices and values, so the JSON object under topics (where type 1 means a dense vector) lines up with that layout directly, instead of being forced into a hand-written struct that the JSON parser can't reconcile.