
I have a set of Avro files saved in AWS S3, with a known schema defined in an .avsc file. Is there a way to create a Dataset of typed objects in Spark using that schema?

The schema looks like this:

{
  "type" : "record",
  "name" : "NameRecord",
  "namespace" : "com.XXX.avro",
  "doc" : "XXXXX",
  "fields" : [ {
    "name" : "Metadata",
    "type" : [ "null", {
      "type" : "record",
      "name" : "MetaNameRecord",
      "doc" : "XXXX",
      "fields" : [ {
        "name" : "id",
        "type" : "int"
      }, {
        "name" : "name",
        "type" : [ "null", "string" ],
        "default" : null
      }]
    } ]
  } ]
}

I would like to create a dataset of NameRecord objects: Dataset[NameRecord].
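For reference, the .avsc parses cleanly into an Avro Schema object (a sketch; the file path is a placeholder):

import org.apache.avro.Schema
import java.io.File

// Parse the .avsc into an Avro Schema object; the path below is a placeholder
val avroSchema: Schema = new Schema.Parser().parse(new File("/path/to/NameRecord.avsc"))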


1 Answer


Avro object container files, by definition, already carry their schema embedded in the file header, so you don't need the .avsc just to read them.

You should just need to do this:

// Requires the external spark-avro module (org.apache.spark:spark-avro) on the classpath
val df = spark.read.format("avro").load("s3://path")
df.schema  // the schema is read from the metadata embedded in the Avro files

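If you want Spark to enforce the .avsc rather than rely only on the embedded schema, the same data source accepts a user-provided schema through its avroSchema option. A sketch, reusing the parsed schema from the question (the file path is a placeholder):

import org.apache.avro.Schema
import java.io.File

// Pass the .avsc contents as a JSON string; Spark reconciles it with the file schema
val avscJson = new Schema.Parser().parse(new File("/path/to/NameRecord.avsc")).toString
val dfWithAvsc = spark.read.format("avro")
  .option("avroSchema", avscJson)
  .load("s3://path")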
The avroSchema option and the rest of the Avro data source are documented here: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
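To get the Dataset[NameRecord] asked for, one common approach is to map the DataFrame onto case classes with as[]. A minimal sketch, assuming hypothetical case classes that mirror the .avsc (nullable union fields become Option):

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes mirroring the .avsc; names must match the Avro fields
case class MetaNameRecord(id: Int, name: Option[String])
case class NameRecord(Metadata: Option[MetaNameRecord])

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._  // Encoders for the case classes

val ds: Dataset[NameRecord] = spark.read.format("avro")
  .load("s3://path")
  .as[NameRecord]

Note that the case class field names (including the capitalised Metadata) have to match the Avro record's field names for as[] to resolve them.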