0
votes

I am tasked with redesigning an existing catalog processor and the requirement goes as belowRequirement I have 5 to 10 vendors(each vendor can have multiple stores) who would provide me with 'XML' file per store. Basically, 1 products xml file per Store, and multiple Store files per Vendor. Max file size can be 500 MB and min can be 100 MB Avg products per file could be 100,000.

Sample xml format could be like this ... ... ...

It doesnt take more than 30 mins to download the file per store, and these files are updated once per day or every 3 to 6 hours.

Now priority requirement is that, the product details are highly unorganized and these files have to organized, processed(10+ processes) and converted to another common object(json) and then file stored in Cassandra.

My technology head advised me to design with Apache Flink and Kafka on top of HDFS, where flink directly stream the files from the vendor servers and start processing them while streaming.

My view was that, either case the files are of finite size and there is not much need to stream them. So thought of having a standalone scheduler come downloader to download and load the files to HDFS. As soon as the files are loaded to HDFS, I can trigger the Flink processing and store the same in Cassandra.

My question here is that, knowing the files are of finite size and finite counts irrespsective of the number of vendors, Is stream processing a overkill or a Batch processing would be a latency burden later?

1

1 Answers

0
votes

The question is highly dependent on the tool you will use. If you go for Flink I believe that using the stream is fine and won't create problem in the long run. If you write your functions and jobs properly, moving from DataStream API to DataSet API would be easy, if needed. Batch here introduces an useless delay and without further informations doesn't seem the appropriate approach. I believe it would work fine anyway but it's not clear if latency is a strict requirement.

That said, I believe Flink in itself is an overkill. In this particular use case a more traditional like Spark would be a better option in terms of usability but if you want to invest on Flink, it's totally fine and given the use case, I don't think you will need any particular library that is present/integrated with spark but missing on Flink.