
Below is my table creation and a sample from my CSV:

DROP TABLE IF EXISTS xxx.fbp;
CREATE TABLE IF NOT EXISTS xxx.fbp (id bigint, p_name string, h_name string, ufi int, city string, country string)
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

74905,xxx,xyz,-5420642,City One,France
74993,xxx,zyx,-874432,City,Germany
75729,xxx,yzx,-1284248,City Two Long Name,France

I then load the data into a hive table with the following query:

LOAD DATA
    INPATH '/user/xxx/hdfs_import/fbp.csv'
    INTO TABLE xxx.fbp;

It seems that data is leaking from the 5th CSV column into the 6th column of the table, so I'm seeing city data in my country column.

SELECT country, count(country) FROM xxx.fbp GROUP BY country;

+----------+-------+
| country  |  _c1  |
+----------+-------+
| Germany  | 1143  |
| City     | 1     |
+----------+-------+

I'm not sure why city data is occasionally being imported into the country column. The CSV was downloaded from Google Sheets, and I've removed the header.


2 Answers


The reason could be that your line termination is not '\n': Windows-based tools add extra characters (a carriage return, '\r') that create this kind of issue. It may also be that some fields contain the column separator (a comma), which would cause the same shift.

Solution:

1. Print the lines that have the issue with a WHERE country = 'City' clause; this will give you some idea of how Hive created the record (a sketch follows below).
2. Try a binary storage format to be 100% sure about the data Hive processed.
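A minimal sketch of that check, assuming the table and the leaked value from the question; the second query uses length() to expose a trailing '\r' that would otherwise be invisible:

-- Show the full rows whose country column holds leaked city data,
-- to see exactly how Hive split each line into columns.
SELECT *
FROM xxx.fbp
WHERE country = 'City';

-- A hidden carriage return sticks to the last column; a length()
-- larger than the visible text reveals it.
SELECT country, length(country)
FROM xxx.fbp
GROUP BY country, length(country);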

Hope it helps.


The issue was within the CSV itself. Some columns, such as p_name, contained commas in several fields. This caused a field to end sooner than expected, shifting the remaining values one column to the right. I had to clean the data and strip the commas out of those fields; after that, it imported correctly. Quickly done with Python:

import csv

with open("fbp.csv", newline="") as infile, open("outfile.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for row in csv.reader(infile):  # csv.reader keeps quoted fields together
        # Drop commas inside each field; the delimiters between fields survive.
        writer.writerow([f.replace(",", "") for f in row])
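Alternatively, since Google Sheets wraps comma-containing fields in quotes on export, Hive's OpenCSVSerde can usually parse the file as-is, with no pre-processing. A sketch, with the caveat that this SerDe maps every column to string, so id and ufi would need a cast when queried:

DROP TABLE IF EXISTS xxx.fbp;

-- OpenCSVSerde understands quoted fields, so embedded commas stay in one column.
CREATE TABLE IF NOT EXISTS xxx.fbp (id string, p_name string, h_name string, ufi string, city string, country string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
    STORED AS TEXTFILE;

The trade-off is the all-string schema, but it avoids touching the source file at all.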