I am trying to read a multi-line JSON file in a pipeline, but beam.io.ReadFromText(somefile.json)
reads one line at a time.
I want to read the content of the file as JSON so that I can apply a map
on each category to download the relevant products file.
This is what my JSON file (productindex.json) looks like:
{
  "productcategories" : {
    "category1" : {
      "productfile" : "http://products.somestore.com/category1/products.json"
    },
    "category2" : {
      "productfile" : "http://products.somestore.com/category2/products.json"
    },
    "category3" : {
      "productfile" : "http://products.somestore.com/category3/products.json"
    },
    "category4" : {
      "productfile" : "http://products.somestore.com/category4/products.json"
    }
  }
}
This is what the beginning of my pipeline looks like:
with beam.Pipeline(options=pipeline_options) as p:
    rows = (
        p | beam.io.ReadFromText(
            "http://products.somestore.com/allproducts/productindex.json")
    )
I am using the apache-beam[gcp] module.
How do I achieve this?
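One possible approach, sketched here under some assumptions: since ReadFromText emits one element per line, a pretty-printed JSON document arrives as fragments, so the whole file has to be read as a single string before parsing. Whether ReadFromText can open an http:// URL at all depends on the filesystems registered with Beam, so this sketch fetches the index with urllib instead. The helper names fetch_text, categories_from_index, and build_categories are hypothetical, and the file is assumed to be UTF-8 encoded.

```python
import json
import urllib.request


def fetch_text(url):
    """Download a URL and return its body as a string (assumes UTF-8)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def categories_from_index(index_text):
    """Parse the whole index document and return (category, productfile) pairs."""
    index = json.loads(index_text)
    return [
        (name, entry["productfile"])
        for name, entry in index["productcategories"].items()
    ]


def build_categories(p, index_url):
    """Attach the fetch-and-parse steps to pipeline p (hypothetical helper)."""
    # Imported here so the parsing helpers above can be used without Beam installed.
    import apache_beam as beam

    return (
        p
        | "IndexUrl" >> beam.Create([index_url])
        | "FetchIndex" >> beam.Map(fetch_text)
        | "SplitCategories" >> beam.FlatMap(categories_from_index)
    )
```

Inside the existing with beam.Pipeline(options=pipeline_options) as p: block, calling build_categories(p, "http://products.somestore.com/allproducts/productindex.json") should yield one element per category, and a further beam.Map over those elements could download each products file.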
jsonString.replaceAll("\\R", " "). That regex will detect newline and return characters. This replacement will flatten your JSON into a single line. In Python it would be something like jsonString.replace("\n\r", " "). – rocksNwaves
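A note on the Python variant of that suggestion: str.replace("\n\r", " ") matches only that exact two-character sequence, so files with plain "\n" or "\r\n" line endings would be left unchanged. A minimal sketch of two ways to flatten the text follows; the function names are hypothetical, and the regex covers only the common line endings rather than everything Java's \R matches.

```python
import json
import re


def flatten_newlines(json_string):
    """Replace each line break (\\r\\n, \\r, or \\n) with a single space."""
    return re.sub(r"\r\n|\r|\n", " ", json_string)


def to_single_line(json_string):
    """Parse and re-serialize; json.dumps emits no line breaks by default."""
    return json.dumps(json.loads(json_string))
```

Either version only helps if the file can be rewritten before the pipeline reads it, since ReadFromText would still split the original file line by line.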