0
votes

I have two large Hive tables, say TableA and TableB (which get loaded from different sources).

These two tables have almost identical table structure / columns with same partition column, a date stored as string.

I need to filter records from each table based on certain (identical) filter criteria.

These tables have some columns containing "codes", which need to be looked up to get its corresponding "values".

There are eight to ten such lookup tables, say, LookupA, LookupB, LookupC, etc.,

Now, I need to:

  1. do a union of those filtered records from TableA and TableB.
  2. do a lookup into the lookup tables and replace those "codes" from the filtered records with their respective "values". If a "code" or "value" is unavailable in the filtered records or lookup table respectively, I need to substitute it with zero or an empty string
  3. transform the dates in the filtered records from one format to another

I am a beginner in Hive. Please let know how I can do it. Thanks.

Note: I can manage till union of the tables. Need some guidance on lookup and transformation.

2
@shankarsh15 has shared the hive-UDF approach towards the lookup. This involves disk I/O contribution from the developer's side. How about using correlated sub queries and let Hive handle the same? Any comments are welcome. Thanks. - Marco99

2 Answers

1
votes

To basically do a lookup Please follow these steps below,

  1. You have to create a custom User Defined function(UDF) which basically does the look up work,meaning you have to create a Java Program internally for looking up, jar it and add it to Hive something like below:

    ADD JAR /home/ubuntu/lookup.jar

  2. You then have to add lookup file containing keyvalue pair as follows:

    ADD FILE /home/ubuntu/lookupA;

  3. You then have to create a temporary lookup function such as

    CREATE TEMPORARY FUNCTION getLookupValueA AS 'com.LookupA';

  4. Finally you have to call this lookup function in the Select query which will basically populate lookup value for the given lookup key.

    Same thing can be achieved using JOIN but that will take a hit on the performance.

Taking a join approach you can very well join by the lookupcode for source and lookup tables something like

select a.key,b.lookupvalue
table a join lookuptable b
where a.key=b.lookupKey

Now for Date Transformation, you can use Date functions in Hive.

0
votes

For the above problem follow the following steps:

  1. Use union schema to union two tables(schema must be same).
  2. For the above scenario you can try pig script. script would look like(jn table A and tableB with lookup table and generate the appropriate columns):
    • a = join TableA by codesA left outer, lookupA by codesA.
    • b = join a by codesB left outer, lookupB by codesB.

Similarly for Table B.

Suppose some value of codesA does not have a value in the lookup table, then:

z = foreach b generate codesA as codesA, valueA is null ? '0' as valuesA.

(will replace all null values from value with 0).

  1. If you are using Pig 0.12 or later, you can use ToString(CurrentTime(),'yyyy-MM-dd')

I hope it will solve your problem. Let me know in case of any concern.