My ultimate goal is to load the Google Ngram data set into internal load tables in my Greenplum database.
I have a server running gpfdist.
I've created an external table on Greenplum:
CREATE EXTERNAL TABLE ng_schema.fre2(
id bigserial,
ngram text,
year int4,
match_count int4,
page_count int4,
volume_count int4)
LOCATION ('gpfdist://10.1.8.4:8081/ngram_2009h1/fre-all/2/*.csv')
FORMAT 'TEXT' (HEADER)
LOG ERRORS INTO ng_schema.load_e_fre2 SEGMENT REJECT LIMIT 500 rows;
The data sets are all named ".csv", but they don't contain commas; they're tab-delimited.
#head -n20 *92.csv
! 144 1836 2 2 2
....
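Before changing the table definition, it may help to confirm how many tab-separated fields each line actually has. A quick check (the file name is taken from the error log below; adjust the path to wherever the files live):

```shell
# Print the distinct tab-separated field counts across the file.
# A clean 2-gram file should print a single value: 5
# (ngram, year, match_count, page_count, volume_count).
awk -F'\t' '{ print NF }' googlebooks-fre-all-2gram-20090715-0.csv | sort -u
```

If this prints anything other than a single `5`, some lines are malformed and the reject limit will keep tripping regardless of the DDL.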
Next, I test the external table:
SELECT * FROM fre2;
ERROR: Segment reject limit reached. Aborting operation. Last error was
missing data for column "volume_count"...
I go into my log error table and see this:
2016-07-21 20:51:49.05593+00 | fre2 | gpfdist://10.1.8.4:8081/ngram_2009h1/fre-all/2/*.csv [/mnt2/ngram_2009h1/fre-all/2/googlebooks-fre-all-2gram-20090715-0.csv] | | | missing data for column "volume_count" | ! Giscard 1979 2 2 2 |
I've tried playing around with FORMAT 'TEXT' (HEADER): I've added DELIMITER ' ', and I've switched to CSV, but I just end up with missing data for a different column.
I think the problem is that "! Giscard" is being counted as two columns, since the ngram itself contains spaces.
Or maybe the problem is the id bigserial column?
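If the files really do contain five tab-separated fields, both suspicions point the same way: an external table maps file columns positionally, so an extra leading id column shifts every field over by one, and a serial column has no meaning for a read-only external source anyway. One possible adjustment (a sketch, untested against your cluster; it drops id, matches the five file fields, and makes the tab delimiter explicit rather than relying on the TEXT default):

```sql
CREATE EXTERNAL TABLE ng_schema.fre2 (
    ngram        text,
    year         int4,
    match_count  int4,
    page_count   int4,
    volume_count int4)
LOCATION ('gpfdist://10.1.8.4:8081/ngram_2009h1/fre-all/2/*.csv')
FORMAT 'TEXT' (DELIMITER E'\t' HEADER)
LOG ERRORS INTO ng_schema.load_e_fre2 SEGMENT REJECT LIMIT 500 ROWS;
```

The surrogate id could then live on the internal load table instead, populated during the INSERT ... SELECT from the external table.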
Any ideas on what to change?