I have JSON files totaling approximately 500 TB. I have loaded the complete set into a Hive data warehouse.

How would I validate or test the data that was loaded into the Hive warehouse? What should my testing strategy be?

The client wants us to validate the JSON data: is the data loaded into Hive correct or not? Is anything missing? If so, which field?

Please help.

What test areas are you planning to cover? Can you please explain in more detail? - Mahesh Madushanka
I have updated my question. Please check. - Ajay
A full test will not be feasible with a data set this size, so you will have to go for random sample testing. You can write some Hive queries and verify the sample. - Mahesh Madushanka

1 Answer

How is your data being stored in the Hive tables?

One option is to create a Hive UDF that receives the JSON string, validates the data, and returns a string with the error message, or an empty string if the JSON string is well formed.
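To make the idea concrete, here is a rough sketch of the validation logic such a function might implement. It is written in Python rather than as a Java UDF, purely to illustrate the "error message or empty string" contract. The name `validate_json` and the `REQUIRED_FIELDS` list are hypothetical, so substitute the fields your own schema requires.

```python
import json

# Hypothetical list of fields every record is expected to carry;
# replace with the fields from your own schema.
REQUIRED_FIELDS = ("id", "timestamp")


def validate_json(raw, required_fields=REQUIRED_FIELDS):
    """Return "" if raw is a well-formed JSON object containing all
    required fields, otherwise a short message describing the problem."""
    if raw is None:
        return "null input"
    try:
        record = json.loads(raw)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        return "malformed JSON: %s" % exc
    if not isinstance(record, dict):
        return "expected a JSON object"
    # A field that is absent or explicitly null counts as missing here.
    missing = [f for f in required_fields if record.get(f) is None]
    if missing:
        return "missing fields: " + ",".join(missing)
    return ""
```

Because the message names the missing fields, it also answers the "which field was it?" part of the question. If you prefer not to write Java, a script built around a function like this could probably be run through Hive's `TRANSFORM ... USING` streaming clause instead of a UDF, though I have not tested that at this volume.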

Here is a Hive UDF tutorial: http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html

With the Hive UDF in place, you can execute queries like:

select strjson, validateJson(strjson) from jsonTable where validateJson(strjson) != "";