My ultimate goal is to load the Google Ngram data set into internal load tables in my Greenplum database.
I have a server running gpfdist.
I've created an external table on Greenplum:
CREATE EXTERNAL TABLE ng_schema.fre2(
id bigserial,
ngram text,
year int4,
match_count int4,
page_count int4,
volume_count int4)
LOCATION ('gpfdist://10.1.8.4:8081/ngram_2009h1/fre-all/2/*.csv')
FORMAT 'TEXT' (HEADER)
LOG ERRORS INTO ng_schema.load_e_fre2 SEGMENT REJECT LIMIT 500 rows;
The data sets are all named ".csv", but they don't contain commas; they're tab-delimited.
#head -n20 *92.csv
! 144 1836 2 2 2
....
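Before changing the table definition, it may help to confirm how many tab-separated fields each line actually has. A quick check (the file name is taken from the error log below; adjust the path to wherever the files live):

```shell
# Print the distinct tab-separated field counts across the file.
# A clean 2-gram file should print a single value: 5
# (ngram, year, match_count, page_count, volume_count).
awk -F'\t' '{ print NF }' googlebooks-fre-all-2gram-20090715-0.csv | sort -u
```

If this prints anything other than a single `5`, some lines are malformed and the reject limit will keep tripping regardless of the DDL.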
Next, I test the external table:
SELECT * FROM fre2;
ERROR: Segment reject limit reached. Aborting operation. Last error was
missing data for column "volume_count"...
I go into my log error table and see this:
2016-07-21 20:51:49.05593+00 | fre2 | gpfdist://10.1.8.4:8081/ngram_2009h1/fre-all/2/*.csv [/mnt2/ngram_2009h1/fre-all/2/googlebooks-fre-all-2gram-20090715-0.csv] | | | missing data for column "volume_count" | ! Giscard 1979 2 2 2 |
I've tried playing around with FORMAT 'TEXT' (HEADER): I've added DELIMITER ' ', and I've switched to CSV, but I just end up with missing data for a different column.
I think the problem is that "! Giscard" is being counted as two columns, since the ngram itself contains spaces.
Or maybe the problem is the id bigserial column?
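If the files really do contain five tab-separated fields, both suspicions point the same way: an external table maps file columns positionally, so an extra leading id column shifts every field over by one, and a serial column has no meaning for a read-only external source anyway. One possible adjustment (a sketch, untested against your cluster; it drops id, matches the five file fields, and makes the tab delimiter explicit rather than relying on the TEXT default):

```sql
CREATE EXTERNAL TABLE ng_schema.fre2 (
    ngram        text,
    year         int4,
    match_count  int4,
    page_count   int4,
    volume_count int4)
LOCATION ('gpfdist://10.1.8.4:8081/ngram_2009h1/fre-all/2/*.csv')
FORMAT 'TEXT' (DELIMITER E'\t' HEADER)
LOG ERRORS INTO ng_schema.load_e_fre2 SEGMENT REJECT LIMIT 500 ROWS;
```

The surrogate id could then live on the internal load table instead, populated during the INSERT ... SELECT from the external table.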
Any ideas on what to change?