I need to parse JSON data from compressed files in GCS. Since the files have a .gz extension, they should be recognized and decompressed automatically by Dataflow; however, the job log printed unreadable characters and the data was not processed. When I process uncompressed data it works fine. I used the following method to map/parse the JSON:
ObjectMapper mapper = new ObjectMapper();
Map<String, String> eventDetails = mapper.readValue(c.element(),
    new TypeReference<Map<String, String>>() {});
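For context, the snippet above runs inside a ParDo over the lines that TextIO produces. A minimal sketch of that DoFn is below; the class name, imports, and output handling are illustrative only, not the actual Pimpression code:

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import java.util.Map;

// Illustrative DoFn: parses each input line (one JSON object per line) into a Map.
// Jackson 2.x imports assumed; adjust if using org.codehaus.jackson (Jackson 1.x).
public class ParseJsonFn extends DoFn<String, Map<String, String>> {
  private static final ObjectMapper mapper = new ObjectMapper();

  @Override
  public void processElement(ProcessContext c) throws Exception {
    // c.element() is one line of the file as handed over by TextIO.Read.
    Map<String, String> eventDetails = mapper.readValue(c.element(),
        new TypeReference<Map<String, String>>() {});
    // Dataflow needs to infer or be given a coder for Map<String, String>.
    c.output(eventDetails);
  }
}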
Any idea what could be the cause?
===================================
To add more details about how the input files are read:
To create the pipeline:
Poptions pOptions = PipelineOptionsFactory.fromArgs(args).withValidation().as(Poptions.class);
Pipeline p = Pipeline.create(pOptions);
p.apply(TextIO.Read.named("ReadLines").from(pOptions.getInput()))
 .apply(new Pimpression())
 .apply(BigQueryIO.Write
     .to(pOptions.getOutput())
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run();
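As I understand it, the read step could also spell out the compression explicitly rather than relying on the .gz extension. A rough sketch of that variant, assuming the SDK version in use exposes TextIO.CompressionType:

// Sketch only: forces gzip decompression instead of relying on extension-based auto-detection.
// Assumes TextIO.CompressionType is available in the Dataflow SDK version being used.
p.apply(TextIO.Read.named("ReadLines")
        .from(pOptions.getInput())
        .withCompressionType(TextIO.CompressionType.GZIP))
 .apply(new Pimpression())
 .apply(BigQueryIO.Write
     .to(pOptions.getOutput())
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));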
Configuration at run time:
PROJECT="myProjectId"
DATASET="myDataSetId"
INPUT="gs://foldername/input/*"
STAGING1="gs://foldername/staging"
TABLE1="myTableName"
mvn exec:java -pl example \
 -Dexec.mainClass=com.google.cloud.dataflow.examples.Example1 \
 -Dexec.args="--project=${PROJECT} --output=${PROJECT}:${DATASET}.${TABLE1} --input=${INPUT} --stagingLocation=${STAGING1} --runner=BlockingDataflowPipelineRunner"
An example input file name is file.gz. The output of gsutil ls -L gs://bucket/input/file.gz | grep Content- is:
Content-Length:    483100
Content-Type:      application/octet-stream
We may have to follow up privately, because this doesn't look like intended behavior. – MattL