0
votes

I cannot manage to read multiline JSON input files in an Apache Beam pipeline (written in Python).

I understand that ReadFromText with a JSON coder reads JSONL files, but how do I handle files in the following format:

[{
  "name": "name1",
  "value": "val1"
},
{
  "name": "name2",
  "value": "val2"
}]

I came across the FileSystem module, whose open() function allows reading the entire file (not line by line), but it returns a file handle (as per the documentation).

But what should I do afterwards? This might not be the right way to do it, so any ideas?

1 Answer

1
votes

Beam has nothing native to read JSON files.

However, you can write a ParDo that takes a filename and parses the file (outputting whatever you want from the file). In this ParDo you can use whatever libraries exist to parse a JSON file.
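As a minimal sketch of the parsing step alone, assuming the file holds a single top-level JSON array as in the question: the helper name iter_json_records is hypothetical, and it only relies on the handle having a read() method, which is what open() returns. The body of the ParDo could call something like this:

```python
import io
import json


def iter_json_records(handle):
    """Yield the records found in a file handle containing one JSON document.

    Hypothetical helper: `handle` is any object with read(), in binary or
    text mode. The whole file is loaded into memory, so this assumes each
    file is small enough for one worker.
    """
    data = json.load(handle)  # json.load accepts str or UTF-8 bytes content
    if isinstance(data, list):
        # Top-level array: emit one record per array item.
        for record in data:
            yield record
    else:
        # Single top-level object: emit it as the only record.
        yield data


# Example with an in-memory handle standing in for a real file.
sample = io.BytesIO(b'[{"name": "name1", "value": "val1"}]')
records = list(iter_json_records(sample))
```

Because the full document is parsed in one call, this suits files of modest size; very large arrays would need a streaming JSON parser instead.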

To generate the list of files from a filepattern: in Beam Java, this can be done by using FileIO.match() and FileIO.readMatches() to get a PCollection of readable files.

For Python you will unfortunately have to implement the match yourself. You'll want to do a Create(filepattern), a ParDo that expands the filepattern, a Reshuffle, and then a ParDo that reads each file.