0
votes

I am totally desperate!

I am using Apache Flink with Java, and I would like to know whether it is possible to modify the keyBy method so that it keys by similarity rather than by the exact name.

I have two different DataStreams and I am doing a union. In the first stream, the value of the field that I want to key by is "John Locke", while in the second DataStream the field value is "John L".

I have an algorithm that gives me a similarity score between two strings. My idea is: if the score between two strings is higher than, for example, 0.80, then those two strings are considered the same, and when I apply keyBy("name") the similar strings are keyed as if they had exactly the same name.

Visual example:

datastream1----- John Locke, Mickey Micke, Will Williams

datastream2----- Mickey M., John L., Anthony Brown

DataStream d3 = datastream1.union(datastream2)

d3.keyBy the score / the similarity, not the exact name.

I hope you understand, thanks!

2
This is not supported. Maybe you can build a custom solution for it, but I am not sure how... After you do the union and keyBy, what would be the next step to process your records? - Matthias J. Sax
Yes, after that I want to process the records. It was just an example - iñaki
Sure. But what do you want to do specifically? - Matthias J. Sax
Well, actually I want to make a connected stream. I want to create a coflatmap - iñaki
But a CoFlatMap requires two input streams, not one. So why the union? - Matthias J. Sax

2 Answers

0
votes

I think your requirement will be hard to implement efficiently. The reason is the following situation:

  • sim(A,B) = 0.9
  • sim(A,D) = sim(B,D) = 0.7
  • sim(A,C) = 0.9
  • sim(C,D) = 0.9

If the order of the elements is A,B,D,C, you have to repartition on the arrival of event C. In general, the groups can change with every element that arrives.
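To make the repartitioning problem concrete, here is a minimal plain-Java sketch (no Flink involved). It assumes a threshold of 0.8, takes the similarity values from the list above, and treats the pair B,C, which is not given, as 0.0. The class name SimilarityGroups and its methods are hypothetical, and the grouping rule (an element joins every group that contains a sufficiently similar member, merging those groups) is just one possible interpretation of "key by similarity".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class SimilarityGroups {
    // Assumed threshold: two elements count as "the same" above this score.
    static final double THRESHOLD = 0.8;

    // Symmetric similarity table with the values from the answer.
    static final Map<String, Double> SIM = new HashMap<>();
    static {
        put("A", "B", 0.9);
        put("A", "D", 0.7);
        put("B", "D", 0.7);
        put("A", "C", 0.9);
        put("C", "D", 0.9);
    }

    static void put(String x, String y, double score) {
        SIM.put(x + "|" + y, score);
        SIM.put(y + "|" + x, score);
    }

    // Pairs that are not listed (here only B,C) are assumed to score 0.0.
    static double sim(String x, String y) {
        return SIM.getOrDefault(x + "|" + y, 0.0);
    }

    // Adds one element: every existing group with a similar member is merged
    // into the new element's group; all other groups are kept unchanged.
    static List<TreeSet<String>> add(List<TreeSet<String>> groups, String element) {
        TreeSet<String> merged = new TreeSet<>();
        merged.add(element);
        List<TreeSet<String>> result = new ArrayList<>();
        for (TreeSet<String> group : groups) {
            boolean similar = group.stream().anyMatch(m -> sim(m, element) > THRESHOLD);
            if (similar) {
                merged.addAll(group);
            } else {
                result.add(group);
            }
        }
        result.add(merged);
        return result;
    }

    public static void main(String[] args) {
        List<TreeSet<String>> groups = new ArrayList<>();
        for (String element : new String[] {"A", "B", "D", "C"}) {
            groups = add(groups, element);
            System.out.println("after " + element + ": " + groups);
        }
    }
}
```

With the arrival order A, B, D, C, element D first ends up in a group of its own, and only when C arrives do {A, B} and {D} collapse into a single group. In a keyed stream, D would already have been routed under a different key by then.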

What you could do alternatively is to use a KeySelector, which does some kind of stemming, regularization and to key on the n

0
votes

As long as the keys are deterministic, you can use a KeySelector. Here is a basic example, assuming that the last name always follows the first name.

A KeySelector converts a value, or a set of values, into a key that identifies a group of records in a DataStream.

Place this inside the keyBy call, or create a class for it:

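new KeySelector<String, String>() {
            @Override
            public String getKey(String value) throws Exception {
                // Split the full name into its whitespace-separated parts.
                String[] fullNameArr = value.trim().split("\\s+");
                // Last part is the surname or its initial, e.g. "Locke" or "L.".
                String lastPart = fullNameArr[fullNameArr.length - 1];
                String initial = lastPart.isEmpty() ? "" : lastPart.substring(0, 1);

                // Key = first name + first letter of the last part, e.g. "JohnL".
                return fullNameArr[0] + initial;
            }
}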

So the names will result in keys such as JohnL, TomT, CarlS, TonyI: deterministic keys.
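To check what such a key does to the names from the question, the same logic can be tried out without Flink. The following is only a sketch: the class NameKeys and the methods normalize and groupByKey are hypothetical names, and the grouping uses a plain map rather than a keyed stream. In a Flink job, the body of normalize would go into the getKey method of the KeySelector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameKeys {
    // First name plus the first letter of the last part, e.g. "John L." -> "JohnL".
    static String normalize(String fullName) {
        String[] parts = fullName.trim().split("\\s+");
        String last = parts[parts.length - 1];
        String initial = last.isEmpty() ? "" : last.substring(0, 1);
        return parts[0] + initial;
    }

    // Groups names by their normalized key, keeping first-seen key order.
    static Map<String, List<String>> groupByKey(List<String> names) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : names) {
            groups.computeIfAbsent(normalize(name), k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        // The union of the two example streams from the question.
        List<String> union = Arrays.asList(
                "John Locke", "Mickey Micke", "Will Williams",
                "Mickey M.", "John L.", "Anthony Brown");
        System.out.println(groupByKey(union));
    }
}
```

Here "John Locke" and "John L." both map to JohnL, and "Mickey Micke" and "Mickey M." both map to MickeyM. The trade-off is that unrelated names sharing a first name and a last initial also collide, so this is a coarse substitute for a similarity score, not an equivalent.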