0
votes

In my Hive Table, I have a column called YearMonth which has values like 'Jul1998'. I need to split the string in each row into Month ('Jul') and Year ('1998) and add those separately in new columns in Hive. Unfortunately, my code doesn't do it correctly and adds a NULL instance in every other row like shown below:

enter image description here

I took the below steps to get the above output:

Created a split.py file to split the string 'Jul1998' into 'Jul' and '1998

#!/usr/bin/python
import sys

for line in sys.stdin:
    Month= line[:3]
    Year = line[3:]
    print '\t'.join([Month, Year])

Then I go into Hive and add python file into hive directory Finally, I used the transform function to split the column and created two new columns.

ADD FILE /home/ec2-user/split.py;
INSERT OVERWRITE TABLE data
SELECT TRANSFORM (yearmonth) USING 'split.py' AS (month,year) FROM data;
1
Why don't you use SUBSTR? For example: SELECT SUBSTR(yearmonth,1,3) AS month, SUBSTR(yearmonth,4,3) AS year FROM data - pault

1 Answers

4
votes

Assuming your data format is consistent, you can simply use regexp_extract in Hive without the need for a UDF.

SELECT regexp_extract(yearmonth,([^0-9]+)) as mth,regexp_extract(yearmonth,([0-9]+)) as yr
FROM data;

Or a simple SUBSTRING.

select substring(yearmonth,1,3) as mth,substring(yearmonth,4) as yr
from data