What I have
I have a Spark Streaming application (reading from Kafka) on a Hadoop cluster that every 5 minutes aggregates users' clicks and other actions done on a web site and converts them into metrics.
I also have a table in GreenPlum (on its own cluster) with user data that may get updated. This table is populated using Logical Log Streaming Replication via Kafka. The table holds about 100 million users.
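For concreteness, here is a minimal sketch of the ingestion side in PySpark 2.1 Structured Streaming (it needs the spark-sql-kafka-0-10 package); the broker address, topic name and event schema below are placeholders, not my actual setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("click-metrics").getOrCreate()

# Assumed schema of a click/action event; the real payload will differ.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

# Read raw events from Kafka (broker and topic are placeholders).
clicks = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "user_clicks")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))
```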
What I want
I want to join the Spark stream with the static data from GreenPlum every 1 or 5 minutes, and then aggregate the data using attributes from the static table, e.g. user age.
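Conceptually (continuing from the clicks stream sketched above), the join and aggregation I am after would look something like this; users_df, the age column and the console sink are placeholders, and this does not yet solve the refresh problem:

```python
from pyspark.sql.functions import col, window

# Static user dimension loaded from a snapshot on HDFS (path and columns are placeholders).
users_df = spark.read.parquet("hdfs:///dim/users_snapshot")  # e.g. user_id, age

# Enrich each event with user attributes, then aggregate per 5-minute window and age.
metrics = (clicks
    .join(users_df, "user_id")
    .groupBy(window(col("ts"), "5 minutes"), col("age"))
    .count())

query = (metrics.writeStream
    .outputMode("complete")   # simplest output mode for an aggregation sketch
    .format("console")
    .start())
```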
Notes
I definitely don't need to read all records from the users table: there is a fairly stable core segment plus a number of new users registering every minute. Currently I use PySpark 2.1.0.
My solutions
- Copy data from the GreenPlum cluster to the Hadoop cluster and save it as ORC/Parquet files. Every 5 minutes add new files for new users; once a day reload all files (a minimal sketch of such a snapshot job appears after this list).
- Create a new DB on Hadoop and set up log replication via Kafka, as is done for GreenPlum. Read data from that DB and use the built-in Spark Streaming joins.
- Read data from GreenPlum into a Spark cache and join the stream data with that cache.
- Every 5 minutes save/append new user data to a file and ignore old user data. Store an extra column, e.g. last_action, to truncate this file if a user wasn't active on the web site during the last 2 weeks. Then join this file with the stream.
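A minimal sketch of the snapshot job for the first solution (GreenPlum speaks the PostgreSQL wire protocol, so I assume the stock postgresql JDBC driver; host, database, credentials and column names are placeholders):

```python
# Periodic batch job (e.g. every 5 minutes for new users, daily for a full reload)
# that refreshes the user snapshot on HDFS which the streaming job joins against.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("users-snapshot").getOrCreate()

# Pull only the "active" slice instead of all 100 million rows, per the note above.
active_users_query = """
    (SELECT user_id, age, last_action
       FROM users
      WHERE last_action > now() - interval '14 days') AS active_users
"""

users = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://greenplum-master:5432/analytics")
    .option("dbtable", active_users_query)
    .option("user", "etl_user")
    .option("password", "***")
    .load())

# Overwrite the snapshot that the streaming job reads.
users.write.mode("overwrite").parquet("hdfs:///dim/users_snapshot")
```

In practice the overwrite would probably go to a fresh directory that the streaming job is then repointed to, so readers never see a half-written snapshot; for the third solution the same DataFrame could instead be cached in the streaming application and periodically unpersisted and reloaded.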
Questions
- Which of these solutions is more suitable for an MVP? For production?
- Are there better solutions/best practices for this sort of problem? Any literature?