My team and I are curious to determine the best way we can match Two Different sets of data. There are no keys that can be joined on as this data as is coming from two separate sources that know nothing about each other. We import this data into two oracle tables and once that is done we can begin to look for matches.
Both Tables contain a full list of Properties(As in Real estate). We are needing to match up the Properties in Table1 to any potential matching Properties found in Table2. For each and every record in Table1 search Table2 for a potential match and determine the probability of the match. My team and I have decided that the best way to do this would be to compare the Address fields from each of the two tables.
The one catch is that Table1 provides the Address in a Parsed format and allocates the address number, address Street and even the Address_type into separate columns while Table2 only contains one column to hold the Address. Each table has City, State and Zip columns that can be compared individually.
For Example - See Below Table1 and Table2:
Notice that the Primary Keys in my pseudo tables below are Key1 and Key2 matching the tables they are in.
+---------------+---------------+---------------+---------------+---------------+-------+-------+
+ + TABLE1 + + + + + +
+---------------+---------------+---------------+---------------+---------------+-------+-------+
| Key1 | Addr_Number | Addr_Street | Addr_Type | City | State | Zip |
+---------------+---------------+---------------+---------------+---------------+-------+-------+
| 1001 | 148 | Panas | Road | Robinson | CA | 76050 |
| 1005 | 110 | 48th | Street | San Juan | NJ | 8691 |
| 1009 | 8571 | Commerce | Loop | Vallejo | UT | 83651 |
| 1059 | 714 | Nettleton | Avenue | Vista | TX | 29671 |
| 1185 | 1587 | Orchard | Drive | Albuquerque | PA | 77338 |
+---------------+---------------+---------------+---------------+---------------+-------+-------+
+---------------+----------------------+---------------+---------------+---------------+
+ + TABLE2 + + + +
+---------------+----------------------+---------------+---------------+---------------+
| Key2 | Address | City | State | Zip |
+---------------+----------------------+---------------+---------------+---------------+
| Ax89f | 148 Panas Road | Robinson | CA | 76050 |
| B184a | 110 48th Street | San Juan | NJ | 08691 |
| B99ff | 8571 Commerce Lp | Vallejo | UT | 83651 |
| D81bc | 714 Nettleton Ave | Vista | TX | 29671 |
| F84a2 | 1587 Orachard Dr | Albuquerqu | PA | 77338 |
+---------------+----------------------+---------------+---------------+---------------+
The goal here is to provide an output to the user that simply displays ALL of the records from Table1 and the highest matched record found in Table2. There could of course be many records that are found that could be a potential match but we want to keep this a one to one relationship and not produce Duplicates in this initial output. The output should just be One Record out of Table one matched to the best find in Table2.
See below an example of the Desired output I am attempting to create:
+--------+-------+----------------+---------------------------+
+ + + Matched_Output + +
+--------+-------+----------------+---------------------------+
| Key1 | Key2 | Percent_Match | num_Matched_Records > 90% |
+--------+-------+----------------+---------------------------+
| 1001 | Ax89f | 100% | 5 | --All Parsed Values Match
| 1005 | B184a | 98% | 4 | --Zip Code prefixed with Zero in Table 2
| 1009 | B99ff | 95% | 3 | --Loop Vs Lp
| 1059 | D81bc | 95% | 2 | --Avenue Vs Ave
| 1185 | F84a2 | 97% | 2 | --City Spelled Wrong in Table 2 and Drive vs Dr
+--------+-------+----------------+---------------------------+
In the output I want to see Key1 from Table1 and the matched record right next to it showing that it matches to the record in Table2 to Key2. Next we are needing to know how well these two records match. There could be many records in Table2 that show a probability to matching a records in Table1. In fact every single record in Table2 can be assigned a percentage all the way from 0% up to a 100% match.
So now to the main question: How does one obtain this percentage?
How do I Parse the Address column in Table2 so that I can compare each of the individual columns that make up the address in Table1 and then apply comparison algorithm on each parsed value?
So far this is what my team and myself have come up with (Brainstorming, Spitballin, whatever you want to call it).
We have taken a look at a couple of the built in Oracle Functions to obtain the percentages we are looking for as well as trying to utilize Regular Expressions. If I could hit up Google and get some of their Search Algorithms I would. Obviously I don't have that luxury and must design my own.
regexp_count(table2_city,'(^| )'||REPLACE(table1_city,' ','|')||'($| )') city_score,
regexp_count(table2_city,'(^| )') city_max,
to_char((city_score/city_max)*100, '999G999G999G999G990D00')||'%' city_perc,
The above was just what my team and I used as a proof of concept. We have simply selected these values out of the two tables and run the 'regexp_count' function against that columns. Here are a few other functions that we have taken a look at:
SOUNDEX
REGEXP_LIKE
REGEXP_REPLACE
These functions are great but I'm not sure they can be used in a Single Query between both tables to produce the desired output.
Another idea is that we could create a Function() that takes as its parameters the Address fields we are wanting to use to compare. That function would then search Table2 for the highest probable match and return back to the user the Key2 value out of Table2.
Function(Addr_Number, Addr_Street, Addr_type, City, State) RETURN table2.key2
For example maybe something like this 'could' work:
Select tb1.key1, table2Function(tb1.Addr_Number, tb1.Addr_Street, tb1.Addr_type, tb1.City, tb1.State) As Key2
From Table1 tb1;
Lastly, just know that there is roughly 15k records currently in Table1 and 20k records in Table2. Again... each record in Table 1 needs to be checked against each record in Table 2 for a potential match.
I'm all ears. And thank you in advance for your feedback.