
I'm trying to transform the JSON output of the `aws glue get-tables` command into a PySpark DataFrame. I read the JSON with:


    df = spark.read.option("inferSchema", "true") \
                   .option("multiline", "true") \
                   .json("tmp/my_json.json")

and `printSchema()` shows the nested structure was inferred correctly:


    root
     |-- TableList: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- CatalogId: string (nullable = true)
     |    |    |-- CreateTime: string (nullable = true)
     |    |    |-- CreatedBy: string (nullable = true)
     |    |    |-- DatabaseName: string (nullable = true)
     |    |    |-- IsRegisteredWithLakeFormation: boolean (nullable = true)
     |    |    |-- LastAccessTime: string (nullable = true)
     |    |    |-- Name: string (nullable = true)
     |    |    |-- Owner: string (nullable = true)
     |    |    |-- Parameters: struct (nullable = true)
     |    |    |    |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
     |    |    |    |-- CrawlerSchemaSerializerVersion: string (nullable = true)
     |    |    |    |-- UPDATED_BY_CRAWLER: string (nullable = true)
     |    |    |    |-- averageRecordSize: string (nullable = true)
     |    |    |    |-- classification: string (nullable = true)
     |    |    |    |-- compressionType: string (nullable = true)
     |    |    |    |-- objectCount: string (nullable = true)
     |    |    |    |-- recordCount: string (nullable = true)
     |    |    |    |-- sizeKey: string (nullable = true)
     |    |    |    |-- spark.sql.create.version: string (nullable = true)
     |    |    |    |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
     |    |    |    |-- spark.sql.sources.schema.numParts: string (nullable = true)
     |    |    |    |-- spark.sql.sources.schema.part.0: string (nullable = true)
     |    |    |    |-- spark.sql.sources.schema.part.1: string (nullable = true)
     |    |    |    |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
     |    |    |    |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
     |    |    |    |-- typeOfData: string (nullable = true)
     |    |    |-- PartitionKeys: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- Name: string (nullable = true)
     |    |    |    |    |-- Type: string (nullable = true)
     |    |    |-- Retention: long (nullable = true)
     |    |    |-- StorageDescriptor: struct (nullable = true)
     |    |    |    |-- BucketColumns: array (nullable = true)
     |    |    |    |    |-- element: string (containsNull = true)
     |    |    |    |-- Columns: array (nullable = true)
     |    |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |    |-- Name: string (nullable = true)
     |    |    |    |    |    |-- Type: string (nullable = true)
     |    |    |    |-- Compressed: boolean (nullable = true)
     |    |    |    |-- InputFormat: string (nullable = true)
     |    |    |    |-- Location: string (nullable = true)
     |    |    |    |-- NumberOfBuckets: long (nullable = true)
     |    |    |    |-- OutputFormat: string (nullable = true)
     |    |    |    |-- Parameters: struct (nullable = true)
     |    |    |    |    |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
     |    |    |    |    |-- CrawlerSchemaSerializerVersion: string (nullable = true)
     |    |    |    |    |-- UPDATED_BY_CRAWLER: string (nullable = true)
     |    |    |    |    |-- averageRecordSize: string (nullable = true)
     |    |    |    |    |-- classification: string (nullable = true)
     |    |    |    |    |-- compressionType: string (nullable = true)
     |    |    |    |    |-- objectCount: string (nullable = true)
     |    |    |    |    |-- recordCount: string (nullable = true)
     |    |    |    |    |-- sizeKey: string (nullable = true)
     |    |    |    |    |-- spark.sql.create.version: string (nullable = true)
     |    |    |    |    |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
     |    |    |    |    |-- spark.sql.sources.schema.numParts: string (nullable = true)
     |    |    |    |    |-- spark.sql.sources.schema.part.0: string (nullable = true)
     |    |    |    |    |-- spark.sql.sources.schema.part.1: string (nullable = true)
     |    |    |    |    |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
     |    |    |    |    |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
     |    |    |    |    |-- typeOfData: string (nullable = true)
     |    |    |    |-- SerdeInfo: struct (nullable = true)
     |    |    |    |    |-- Parameters: struct (nullable = true)
     |    |    |    |    |    |-- serialization.format: string (nullable = true)
     |    |    |    |    |-- SerializationLibrary: string (nullable = true)
     |    |    |    |-- SortColumns: array (nullable = true)
     |    |    |    |    |-- element: string (containsNull = true)
     |    |    |    |-- StoredAsSubDirectories: boolean (nullable = true)
     |    |    |-- TableType: string (nullable = true)
     |    |    |-- UpdateTime: string (nullable = true)

But `df` ends up with just a single column containing the whole JSON:


    +--------------------+
    |           TableList|
    +--------------------+
    |[[903342277921, 2...|
    +--------------------+

Is there a way to programmatically (and dynamically) build a DataFrame whose columns match the fields shown by `printSchema()`?

Thanks in advance!