2
votes

Hi, I have a scenario where the incoming message is JSON with a header field, say tableName, and a data part that holds the table's column data. I want to write this out as parquet into separate folders per table, say /emp and /dept. In regular Spark Streaming I can achieve this by aggregating rows based on the tableName, but in Structured Streaming I am unable to split the stream this way. How can I achieve this in Structured Streaming?

{"tableName":"employee","data":{"empid":1","empname":"john","dept":"CS"} {"tableName":"employee","data":{"empid":2","empname":"james","dept":"CS"} {"tableName":"dept","data":{"dept":"1","deptname":"CS","desc":"COMPUTER SCIENCE DEPT"}

1
Not a great solution but I've solved this same problem by creating multiple streams, filtering each, and then writing the filtered data to the location that I specify. I wasn't able to find anything about streaming to different locations dynamically. - Jeremy
Thanks a lot, Jeremy. So in your case, were you reading from Kafka? And did you create separate topics for each stream? - Ajith Kannan
I was reading the same topic from multiple streams. - Jeremy

1 Answer

5
votes

I got this working by looping through the list of expected tables and, for each of them, filtering the matching records out of the DataFrame, applying the schema and encoder specific to that table, and then writing to the sink. So the read happens only once, writeStream is called once per table, and it works fine. Thanks for all the help.

This also takes care of dynamically partitioning the parquet output into per-table folders.
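For reference, here is a minimal sketch of that approach, assuming a Kafka source and a fixed list of expected tables. The topic name, bootstrap servers, output paths, checkpoint locations, and schemas below are illustrative placeholders, not anything from the original setup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

object SplitByTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SplitByTable").getOrCreate()

    // Schema of the "data" part for each expected table (assumed for this sketch).
    val tableSchemas: Map[String, StructType] = Map(
      "employee" -> new StructType()
        .add("empid", StringType)
        .add("empname", StringType)
        .add("dept", StringType),
      "dept" -> new StructType()
        .add("dept", StringType)
        .add("deptname", StringType)
        .add("desc", StringType)
    )

    // Read the source once; every per-table query below filters this DataFrame.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "ingest")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // For each expected table: parse with that table's schema, keep only its
    // rows, and write them to a table-specific parquet folder.
    tableSchemas.foreach { case (table, dataSchema) =>
      val msgSchema = new StructType()
        .add("tableName", StringType)
        .add("data", dataSchema)

      raw.select(from_json(col("json"), msgSchema).as("msg"))
        .filter(col("msg.tableName") === table)
        .select("msg.data.*")
        .writeStream
        .format("parquet")
        .option("path", s"/output/$table")                  // e.g. /output/employee, /output/dept
        .option("checkpointLocation", s"/checkpoints/$table")
        .start()
    }

    spark.streams.awaitAnyTermination()
  }
}
```

Each per-table query needs its own checkpointLocation, since the parquet sink requires one.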