What I have
I have a Spark Streaming application (reading from Kafka) on a Hadoop cluster that every 5 minutes aggregates users' clicks and other actions done on a web site and converts them into metrics.
I also have a table in GreenPlum (on its own cluster) with user data that may get updated. This table is populated using Logical Log Streaming Replication via Kafka. The table holds about 100 million users.
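For concreteness, here is a minimal sketch of the ingestion side in PySpark 2.1 Structured Streaming (it needs the spark-sql-kafka-0-10 package); the broker address, topic name and event schema below are placeholders, not my actual setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("click-metrics").getOrCreate()

# Assumed schema of a click/action event; the real payload will differ.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

# Read raw events from Kafka (broker and topic are placeholders).
clicks = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "user_clicks")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))
```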
What I want
I want to join the Spark stream with the static data from GreenPlum every 1 or 5 minutes, and then aggregate the data using attributes from the static table, e.g. user age.
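Conceptually (continuing from the clicks stream sketched above), the join and aggregation I am after would look something like this; users_df, the age column and the console sink are placeholders, and this does not yet solve the refresh problem:

```python
from pyspark.sql.functions import col, window

# Static user dimension loaded from a snapshot on HDFS (path and columns are placeholders).
users_df = spark.read.parquet("hdfs:///dim/users_snapshot")  # e.g. user_id, age

# Enrich each event with user attributes, then aggregate per 5-minute window and age.
metrics = (clicks
    .join(users_df, "user_id")
    .groupBy(window(col("ts"), "5 minutes"), col("age"))
    .count())

query = (metrics.writeStream
    .outputMode("complete")   # simplest output mode for an aggregation sketch
    .format("console")
    .start())
```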
Notes
I definitely don't need to read all records from the users table: there is a fairly stable core segment plus a number of new users registering every minute. Currently I use PySpark 2.1.0.
My solutions
- Copy data from the GreenPlum cluster to the Hadoop cluster and save it as ORC/Parquet files. Every 5 minutes add new files for new users; once a day reload all files (a minimal sketch of such a snapshot job appears after this list).
- Create a new DB on Hadoop and set up log replication via Kafka, as is done for GreenPlum. Read data from that DB and use the built-in Spark Streaming joins.
- Read data from GreenPlum into a Spark cache and join the stream data with that cache.
- Every 5 minutes save/append new user data to a file and ignore old user data. Store an extra column, e.g. last_action, to truncate this file if a user wasn't active on the web site during the last 2 weeks. Then join this file with the stream.
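A minimal sketch of the snapshot job for the first solution (GreenPlum speaks the PostgreSQL wire protocol, so I assume the stock postgresql JDBC driver; host, database, credentials and column names are placeholders):

```python
# Periodic batch job (e.g. every 5 minutes for new users, daily for a full reload)
# that refreshes the user snapshot on HDFS which the streaming job joins against.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("users-snapshot").getOrCreate()

# Pull only the "active" slice instead of all 100 million rows, per the note above.
active_users_query = """
    (SELECT user_id, age, last_action
       FROM users
      WHERE last_action > now() - interval '14 days') AS active_users
"""

users = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://greenplum-master:5432/analytics")
    .option("dbtable", active_users_query)
    .option("user", "etl_user")
    .option("password", "***")
    .load())

# Overwrite the snapshot that the streaming job reads.
users.write.mode("overwrite").parquet("hdfs:///dim/users_snapshot")
```

In practice the overwrite would probably go to a fresh directory that the streaming job is then repointed to, so readers never see a half-written snapshot; for the third solution the same DataFrame could instead be cached in the streaming application and periodically unpersisted and reloaded.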
Questions
- Which of these solutions is more suitable for an MVP? For production?
- Are there better solutions/best practices for this sort of problem? Any literature?