0
votes

I have two separate fields (from two separated excel file inputs) to compare within Pentaho Kettle. The fieldname of the first excel file is called "location" and the fieldname of the second excel input is called "direction". Both fields have the datatype "String".

The entries of the "location" field are unique whereas the entries of the "direction" field are NOT unique. For example, they may contain the string "New York" several times.

What I want to do is to compare EACH entry of the "direction"-field with ALL entries of the "location"-field. As the location-field serves as a lookup table, I want to know which direction-entries are NOT listed in the location-entries.

The result should be a number of strings (directions) which do not appear in the location-lookup-table. In order to do this, I will need two loops (which are nested) for comparing each direction-entry with each location-entry. How can I achieve this using Pentaho Kettle? I have already tried it with Merge Rows (diff)- and the Modified Java Script Value-Steps but wasn't successful. Any suggestions on how to solve this?

1

1 Answers

1
votes

Pentaho works using streams, loops are usually not needed.

In the image below I've constructed a simple transformation that should do what you want:

  1. The two Excel inputs should get you your directions and locations, plus any additional fields you need.
  2. The Stream lookup loads all of the records from Locations into memory for comparing (configure Locations Excel Input as the lookup step) and then looks up each row from Directions based on the fields you configure (direction and location).
  3. You should enter at least one field to return from the Locations lookup, an id or Location itself.
  4. The Filter rows then checks if the new field (Location_id, Location, what you have) has some value. If filled, the location exists, if empty, it goes to Not found for further processing.

enter image description here