2
votes

I use the Apache Hue (User interface) to interact with Hadoop and Hive.

I saved the result of a Hive query to an HDFS directory. (The result set is really large.)

Then I downloaded the result file with the Hue file browser.

Everything looked fine, but when I opened the CSV file, I found that the separator is an unreadable control character, as in this screenshot:

[image: screenshot of the downloaded CSV file]

How can I solve the separator problem?


2 Answers

2
votes

SOH (start of heading), which is the same character as Ctrl+A (^A, \001), is the default field delimiter used by Hive. Each \N represents a NULL value.

The solution depends on the version of Hive you are using.
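To make the default format concrete, here is a minimal Python sketch of how one line of such output could be split into fields. The sample row and the `parse_hive_line` helper name are made up for illustration and do not come from the question.

```python
SOH = "\x01"        # Ctrl+A, the default Hive field delimiter
HIVE_NULL = r"\N"   # the two characters backslash + N, Hive's NULL marker

def parse_hive_line(line):
    """Split one line of Hive text output, mapping the NULL marker to None."""
    fields = line.rstrip("\n").split(SOH)
    return [None if field == HIVE_NULL else field for field in fields]

# Hypothetical row with three columns, the last one NULL.
sample = "1\x01alice\x01\\N\n"
print(parse_hive_line(sample))
```

All non-NULL values come back as strings; any type conversion is left to the caller.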

As of Hive 0.11.0 the separator can be specified; in earlier versions it was always the ^A character (\001). However, custom separators are only supported for LOCAL writes in Hive versions 0.11.0 to 1.1.0; this bug is fixed in version 1.2.0.

If using Hive >= 1.2.0, you can specify the FIELDS TERMINATED BY clause in your INSERT OVERWRITE statements to choose your delimiter.

INSERT OVERWRITE DIRECTORY hdfs_directory
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ...
  SELECT statement ...

Note that the ROW FORMAT clause goes before the SELECT statement. Refer to HIVE-3682 and HIVE-5672.

2
votes

I suggest replacing the SOH characters with "," and removing the '\N' markers directly.

If you use Python with pandas, it's just a one-liner (plus the import):

import pandas as pd
pd.read_csv("your_file.csv", sep="\001", na_values=[r"\N"], header=None).to_csv("your_new_file.csv", index=False, header=False)
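Since the question mentions that the result set is really large, loading the whole file with pandas may not fit in memory. Below is a minimal sketch of a row-by-row alternative using only the standard `csv` module. The function name `convert_hive_file` and its parameters are hypothetical, and the sketch assumes the file is UTF-8 text with no SOH characters inside field values.

```python
import csv

def convert_hive_file(src_path, dst_path, delimiter="\x01", null_marker=r"\N"):
    """Rewrite a Hive text file as comma-separated CSV, one row at a time."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE: treat quote characters in the Hive output as plain data.
        reader = csv.reader(src, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        # The default writer quotes any field that itself contains a comma.
        writer = csv.writer(dst)
        for row in reader:
            # Replace Hive's NULL marker with an empty field.
            writer.writerow(["" if field == null_marker else field for field in row])
```

It could be called as `convert_hive_file("your_file.csv", "your_new_file.csv")`. Unlike a plain search-and-replace of SOH with ",", this keeps values that already contain commas intact by quoting them.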