I need to load a large number of JSON files into BQ using Apache Beam in Python. The JSONs have a fairly complex schema (multiple levels of nesting), and more importantly, it is not consistent: some fields are so rare that they appear in only 0.01% of the records. I can't let BQ infer the schema in the WriteToBigQuery method using AUTO_DETECT, because it only examines 100 rows - nowhere near enough. I tried building a schema from 0.1% of the data using the Python generate-schema utility, but again, some fields are so rare that the load still fails with:
No such field: FIELD_NAME.
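For context, my schema-building step was conceptually something like the sketch below (this is a simplified illustration, not the actual generate-schema internals; the field names and sample records are made up). It unions the fields seen across a sample of records, which is exactly why rare fields get missed - they have to show up in the sample to make it into the schema:

```python
def merge_schema(schema, record, prefix=""):
    """Union the field paths/types seen so far with those in one record."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            schema.setdefault(path, "RECORD")
            merge_schema(schema, value, prefix=f"{path}.")
        else:
            schema.setdefault(path, type(value).__name__)
    return schema

# Rare fields only enter the schema if they happen to appear in the sample.
sample = [
    {"id": 1, "meta": {"source": "a"}},
    {"id": 2, "rare_field": "present in ~0.01% of records"},
]
schema = {}
for record in sample:
    merge_schema(schema, record)
# schema now contains "id", "meta", "meta.source", and "rare_field" -
# but only because "rare_field" happened to be in this tiny sample.
```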
I looked for a way to load each file regardless of errors and route the failing rows to an error table that I could handle separately, but I couldn't find any way to do that in the WriteToBigQuery module. I also tried validating each JSON before sending it into the pipeline, but that was extremely slow. I also tried "filtering" each JSON down to a specified schema, but again, that requires walking the entire JSON - just as slow, since each JSON is about 13 KB.
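For reference, my filtering attempt looked roughly like this (a sketch with hypothetical field names; the real schema is much deeper). It drops any field the schema doesn't know about so BQ never sees an unknown field, but it has to recurse through every value of every record, which is where the time goes:

```python
def filter_to_schema(record, schema):
    """Keep only the keys the schema knows about, recursing into nested
    dicts. Unknown fields are dropped instead of failing the insert."""
    filtered = {}
    for key, value in record.items():
        if key not in schema:
            continue  # unknown field: drop it rather than break the load
        if isinstance(value, dict) and isinstance(schema[key], dict):
            filtered[key] = filter_to_schema(value, schema[key])
        else:
            filtered[key] = value
    return filtered

# Schema represented as a nested dict of known field names.
schema = {"id": None, "meta": {"source": None}}
row = {"id": 1, "meta": {"source": "a", "debug": "x"}, "rare_field": 123}
clean = filter_to_schema(row, schema)
# clean == {"id": 1, "meta": {"source": "a"}}
```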
Has anyone come across anything that can help? It seems odd that there is no max_rejected-style attribute to use when writing to BQ with Apache Beam. Any idea on how to handle this will be appreciated.