I have data that I have to join at the record level. For example, data about users comes in from different source systems, but there is no common primary key or user identifier.
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
- There are about 100 rules I can use to compare one record to another to see if the customer in source system 1 is the same as the customer in source system 2.
- Some rules may be able to infer record values and add data to a master record about a customer.
- Because some rules may infer or add data to any particular record, the rules must be re-applied when a record changes (a rough sketch of how I think about the rules follows this list).
- We have millions of records per day that we'd have to unify.
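Roughly how I picture the records and the rules, just to make the above concrete; `SourceRecord`, `MatchRule`, and the fixpoint loop below are made-up names for illustration, not an existing library:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MatchingSketch {

    // A record is just a bag of attributes keyed by column name.
    static class SourceRecord {
        final Map<String, String> fields = new HashMap<>();
    }

    // Each of the ~100 rules can (a) decide whether two records refer to the same
    // customer and (b) optionally infer extra fields onto the master record.
    interface MatchRule {
        boolean matches(SourceRecord a, SourceRecord b);

        // Returns true if it added or changed anything, so we know to re-run rules.
        boolean enrich(SourceRecord master, SourceRecord candidate);
    }

    // Because rules can add data, they have to be re-applied until nothing changes
    // (a naive fixpoint loop over one master record and its candidate matches).
    static void applyUntilStable(SourceRecord master,
                                 List<SourceRecord> candidates,
                                 List<MatchRule> rules) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (SourceRecord candidate : candidates) {
                for (MatchRule rule : rules) {
                    if (rule.matches(master, candidate)) {
                        changed |= rule.enrich(master, candidate);
                    }
                }
            }
        }
    }
}
```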
Apache Beam / Dataflow implementation
- An Apache Beam DAG is by definition acyclic, but I could just republish the data through Pub/Sub back into the same DAG to make the algorithm cyclic.
- I could create a PCollection of hashmaps that continuously does a self join against all other elements, but this seems like an inefficient method (a rough sketch of a keyed version of this idea follows this list).
- Immutability of a PCollection is a problem if I want to keep modifying records as they move through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
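For what it's worth, here is a rough sketch of a keyed version of that self join in Beam's Java SDK: instead of comparing every record against every other record, each record is keyed by a cheap blocking attribute (a normalized last name here) and only records inside the same group are compared pairwise. The CSV-ish record format, the choice of blocking key, and the `sameCustomer()` stand-in are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BlockedSelfJoin {

  // Stand-in for the ~100 real comparison rules (records are "source,id,last,first").
  static boolean sameCustomer(String a, String b) {
    return a.split(",")[2].equalsIgnoreCase(b.split(",")[2]);
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of(
            "src1,123,Smith,John",
            "src2,EFCBA-09DA0,Smith,J."))
        // Key each record by a cheap blocking attribute so the pairwise comparison
        // only happens inside small groups, not across the whole PCollection.
        .apply(WithKeys.of((String rec) -> rec.split(",")[2].toLowerCase())
            .withKeyType(TypeDescriptors.strings()))
        .apply(GroupByKey.<String, String>create())
        .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            List<String> block = new ArrayList<>();
            c.element().getValue().forEach(block::add);
            // Compare records pairwise only within this block.
            for (int i = 0; i < block.size(); i++) {
              for (int j = i + 1; j < block.size(); j++) {
                if (sameCustomer(block.get(i), block.get(j))) {
                  c.output(block.get(i) + " <-> " + block.get(j));
                }
              }
            }
          }
        }));

    p.run().waitUntilFinish();
  }
}
```

That handles the pairwise comparison more cheaply than a full self join, but it still doesn't address re-applying the rules after a record is enriched, which is where the cyclic part comes in.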
Is there any way you may know of to process such a problem efficiently in Dataflow?
Other thoughts
- Prolog: I tried running on a subset of this data with a subset of the rules, but SWI-Prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
- Drools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many, many rules to individual records, rather than inferring record information from possibly related records.
- Graph database: Something like Neo4j or Datomic would be nice, since joins are at the record level rather than row/column scans, but I don't know if it's possible in Beam to do something similar (a rough sketch of what a record-level lookup could look like is after this list).
- BigQuery or Spanner: Brute-forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concatenate all columns and run multiple compare-and-update passes across all columns.
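To make the graph database idea concrete, this is roughly what a record-level lookup could look like with the Neo4j Java driver (assuming the 4.x driver; the connection details, the `Record` label, and the property names are all made up, and the real comparison would use the ~100 rules rather than a single last-name match):

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

import static org.neo4j.driver.Values.parameters;

public class Neo4jLookupSketch {
  public static void main(String[] args) {
    try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
             AuthTokens.basic("neo4j", "password"));
         Session session = driver.session()) {

      // Load each source record as a node keyed by its source-local id.
      session.run("MERGE (r:Record {source: $src, id: $id}) "
              + "SET r.firstName = $fn, r.lastName = $ln",
          parameters("src", "src2", "id", "EFCBA-09DA0", "fn", "J.", "ln", "Smith"));

      // A record-level "join": start from one record and walk to candidates that
      // share an attribute, instead of scanning whole columns.
      Result result = session.run(
          "MATCH (a:Record {source: $src, id: $id}), (b:Record) "
              + "WHERE a <> b AND a.lastName = b.lastName "
              + "RETURN b.source AS source, b.id AS id",
          parameters("src", "src2", "id", "EFCBA-09DA0"));

      while (result.hasNext()) {
        Record candidate = result.next();
        System.out.println(candidate.get("source").asString()
            + "/" + candidate.get("id").asString());
      }
    }
  }
}
```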
Or maybe there's a more standard way of solving this class of problems.