I have data I have to join at the record level. For example, data about users comes in from different source systems, but there is no common primary key or user identifier.

Example Data

Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}

Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
  • There are about 100 rules I can use to compare one record to another to see if the customer in source system 1 is the same as the customer in source system 2.
  • Some rules may be able to infer record values and add data to a master record about a customer (a rough sketch of what such a rule might look like follows this list).
  • Because some rules may infer/add data to any particular record, the rules must be re-applied when a record changes.
  • We have millions of records per day that we'd have to unify.
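
For illustration, one of these rules might look something like the interface below; the names (MatchRule, matches, enrich) are just placeholders, not an existing library:

    import java.util.Map;

    /** Hypothetical shape of one matching/inference rule; names are illustrative only. */
    public interface MatchRule {

      /** Returns true if the two records appear to describe the same customer. */
      boolean matches(Map<String, String> a, Map<String, String> b);

      /**
       * Optionally infers values for the master record from a matched candidate,
       * e.g. expanding fname "J." to "John" when last names and other fields agree.
       * Because this can change the master record, the rules must be re-applied to it.
       */
      Map<String, String> enrich(Map<String, String> master, Map<String, String> candidate);
    }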

Apache Beam / Dataflow implementation

  • An Apache Beam DAG is, by definition, acyclic, but I could just republish the data through Pub/Sub to the same DAG to make the algorithm effectively cyclic (a rough sketch follows this list).
  • I could create a PCollection of hashmaps that continuously self-joins against all other elements, but this is probably an inefficient method.
  • The immutability of a PCollection is a problem if I want to be constantly modifying records as they go through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
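
Roughly what I have in mind for the republishing idea is sketched below. The topic names and the ApplyRules transform (with its two outputs) are hypothetical, and something like a hop counter on each message would be needed to keep records from cycling forever:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;

    public class CyclicUnifyPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> incoming =
            p.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/records"));

        // ApplyRules is hypothetical: one output for records whose inferred fields changed
        // (and therefore need another pass), one output for records that are now stable.
        PCollectionTuple results = incoming.apply(new ApplyRules());

        // Changed records go back to the same topic, which makes the acyclic DAG behave
        // like a loop at the cost of an extra Pub/Sub round trip per pass.
        results.get(ApplyRules.CHANGED)
            .apply(PubsubIO.writeStrings().to("projects/my-project/topics/records"));

        results.get(ApplyRules.STABLE)
            .apply(PubsubIO.writeStrings().to("projects/my-project/topics/unified"));

        p.run();
      }
    }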

Is there any way you know of to process this kind of problem efficiently in Dataflow?

Other thoughts

  • Prolog: I tried running a subset of this data with a subset of the rules, but SWI-Prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
  • Drools/Jess/Rete: Forward chaining would be perfect for the inference and for efficient partial application, but this algorithm is more about applying many, many rules to individual records, rather than inferring record information from possibly related records.
  • Graph database: Something like Neo4j or Datomic would be nice, since joins are at the record level rather than row/column scans, but I don't know if it's possible to do something similar in Beam.
  • BigQuery or Spanner: Brute-forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in memory. We could also try to concatenate all columns and run multiple compares and updates across all columns.

Or maybe there's a more standard way of solving this class of problems.

1 Answer

It is hard to say which solution would work best for you from what I can read so far. I would try to split the problem further and tackle the different aspects separately.

From what I understand, the goal is to combine the matching records that represent the same thing in different sources:

  • records come from a number of sources:
    • it is logically the same data but formatted differently;
  • there are rules to tell if the records represent the same entity:
    • the collection of rules is static;

So the logic probably goes roughly like this (a minimal Beam sketch follows the list):

  • read a record;
  • try to find existing matching records;
  • if matching record found:
    • update it with new data;
  • otherwise save the record for future matching;
  • repeat;
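
As a very rough sketch of that loop in Beam terms, a stateful DoFn keyed by some cheap blocking key could hold the records seen so far under that key and try to match each new one against them. Rules.matches and Rules.enrich are placeholders for your rule set, and since BagState has no single-element removal, "update it with new data" means clearing and rewriting the bag:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.beam.sdk.coders.MapCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.state.BagState;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    /** Keyed by a blocking key; keeps everything seen under that key and matches against it. */
    class MatchFn extends DoFn<KV<String, Map<String, String>>, Map<String, String>> {

      @StateId("seen")
      private final StateSpec<BagState<Map<String, String>>> seenSpec =
          StateSpecs.bag(MapCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()));

      @ProcessElement
      public void process(
          @Element KV<String, Map<String, String>> element,
          @StateId("seen") BagState<Map<String, String>> seen,
          OutputReceiver<Map<String, String>> out) {

        Map<String, String> incoming = element.getValue();
        List<Map<String, String>> kept = new ArrayList<>();
        Map<String, String> master = null;

        // Try to find an existing matching record under the same blocking key.
        for (Map<String, String> candidate : seen.read()) {
          if (master == null && Rules.matches(candidate, incoming)) {  // hypothetical rule engine
            master = Rules.enrich(candidate, incoming);                // update it with new data
          } else {
            kept.add(candidate);
          }
        }
        if (master == null) {
          master = incoming;  // no match: save the record for future matching
        }

        // BagState cannot remove one element, so rewrite the whole bag.
        seen.clear();
        for (Map<String, String> r : kept) {
          seen.add(r);
        }
        seen.add(master);

        out.output(master);
      }
    }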

To me this looks very high level and there's probably no single 'correct' solution at this level of detail.

I would probably try to approach this by first understanding it in more detail (maybe you already do); a few thoughts:

  • what are the properties of the data?
    • are there patterns? E.g. when one system publishes something, do you expect related records from the other systems?
  • what are the requirements in general?
    • latency, consistency, availability, etc;
  • how is the data read from the sources?
    • can all the systems publish the records in batches of files, submit them to Pub/Sub, or does your solution need to poll them, etc.?
    • can the data be read in parallel or is it a single stream?
  • then the main question of how you can efficiently match a record will probably look different under different assumptions and requirements as well. For example, I would think about:
    • can you fit all the data in memory;
    • are your rules dynamic; do they change at all, and what happens when they do;
    • can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by an id field, some other things by a hash of something, etc. (see the blocking sketch after this list);
    • do you need to match against all of historical/existing data?
    • can you have some quick elimination logic to avoid doing the expensive checks?
  • what is the output of the solution? What are the requirements for the output?
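
On the "split into categories" point, this is usually called blocking in record-linkage terms: emit one or more cheap candidate keys per record, group by key, and only run the expensive rules within each group. A minimal sketch, with guessed keys you would tune to your actual rules (the email field is made up; your examples only show names and ids):

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    /** Emits one output per cheap candidate key so expensive rules only run within a group. */
    class BlockingKeysFn extends DoFn<Map<String, String>, KV<String, Map<String, String>>> {

      @ProcessElement
      public void process(@Element Map<String, String> record,
                          OutputReceiver<KV<String, Map<String, String>>> out) {
        // Guessed keys: tune these to whatever your 100 rules actually compare.
        String lastName = record.getOrDefault("last_name", record.get("lname"));
        if (lastName != null) {
          out.output(KV.of("lname:" + lastName.toLowerCase(), record));
        }
        String email = record.get("email");
        if (email != null) {
          out.output(KV.of("email:" + email.toLowerCase(), record));
        }
      }
    }

    // Usage inside a pipeline (records is a PCollection<Map<String, String>>):
    //   records
    //       .apply(ParDo.of(new BlockingKeysFn()))
    //       .apply(GroupByKey.<String, Map<String, String>>create());
    //   // each group is now a small candidate set for the full rule comparison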