1
votes

I recently installed Greenplum Database and now I am attempting to import data using the GPLOAD utility program and a yaml control file.

My table looks like this (pseudo):

Table1
- column1 integer
- column2 integer
- column3 integer
- column4 character(6)

So the columns in my yaml control file looks like this:

- COLUMNS
       - column1: integer
       - column2: integer
       - column3: integer
       - column4: text

But my CSV content looks like this:

1, 2, 3, "test 1", , ,<br>
3, 2, 0, "test 2", , ,<br>
4, 0, 2, "test 3", , ,<br>

Notice the trailing "empty" fields in the CSV file. Now imagine having 300 fields in the CSV file and only 100 columns in the table. My CSV files will always have 300 or more fields and my columns will always be 100.
I DON'T WANT TO SPECIFY ALL 300 columns in the control file or the table.
I would like to ignore the trailing fields because there will be no values or columns defined for those fields...

I would like to IGNORE TRAILING NULLCOLS much like Oracle Loader does.

I am getting the following error:

"LAST ERROR: Extra data after last expected column"

Any suggestions would be appreciated!

2

2 Answers

2
votes

If you want to use gpload, use the "mapping" feature in the yml file. Here is an example.

First, create a table with 2 columns:

[gpadmin@gpdbsne ~]$ psql
SET
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# create table public.test (col1 text, col2 text) distributed randomly;
CREATE TABLE
Time: 16.494 ms
gpadmin=# \q

Here is a test file with 5 columns of data but my table only has 2.

[gpadmin@gpdbsne ~]$ cat testfile.txt 
"col1"|"col2"|"col3"|"col4"|"col5"
"col1"|"col2"|"col3"|"col4"|"col5"
"col1"|"col2"|"col3"|"col4"|"col5"
"col1"|"col2"|"col3"|"col4"|"col5"
"col1"|"col2"|"col3"|"col4"|"col5"
"col1"|"col2"|"col3"|"col4"|"col5"
"col1"|"col2"|"col3"|"col4"|"col5"
"col1"|"col2"|"col3"|"col4"|"col5"
"col1"|"col2"|"col3"|"col4"|"col5"
"col1"|"col2"|"col3"|"col4"|"col5"

Now, create a yaml file with the MAPPING of col1 to col1 and col2 to col2 while ignoring the trailing columns.

[gpadmin@gpdbsne ~]$ cat test.yml
---
VERSION: 1.0.0.1 
DATABASE: gpadmin
USER: gpadmin
HOST: gpdbsne
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
    LOCAL_HOSTNAME:
      - gpdbsne
    PORT: 8999
    FILE:
      - /home/gpadmin/testfile.txt
    - FORMAT: text 
    - DELIMITER: '|'
    - QUOTE: '"'
    - COLUMNS:
       - col1: text
       - col2: text
       - col3: text
       - col4: text
       - col5: text
   OUTPUT:
     - TABLE: public.test
     - MODE: insert 
     - MAPPING:
           col1: col1
           col2: col2

Use gpload to load the data.

[gpadmin@gpdbsne ~]$ gpload -f test.yml
2017-01-12 12:25:48|INFO|gpload session started 2017-01-12 12:25:48
2017-01-12 12:25:48|INFO|started gpfdist -p 8999 -P 9000 -f "/home/gpadmin/testfile.txt" -t 30
2017-01-12 12:25:48|INFO|running time: 0.12 seconds
2017-01-12 12:25:48|INFO|rows Inserted          = 10
2017-01-12 12:25:48|INFO|rows Updated           = 0
2017-01-12 12:25:48|INFO|data formatting errors = 0
2017-01-12 12:25:48|INFO|gpload succeeded

Verify the data is there:

[gpadmin@gpdbsne ~]$ psql -c "select * from public.test"
  col1  |  col2  
--------+--------
 "col1" | "col2"
 "col1" | "col2"
 "col1" | "col2"
 "col1" | "col2"
 "col1" | "col2"
 "col1" | "col2"
 "col1" | "col2"
 "col1" | "col2"
 "col1" | "col2"
 "col1" | "col2"
(10 rows)

Alternatively, you could specify an external table and do the insert. This is what gpload is doing too. It is just a wrapper for creating the gpfdist process and loading the data with an external table.

[gpadmin@gpdbsne ~]$ gpfdist -p 8999 > load.log 2>&1 < load.log &
[1] 12840
[gpadmin@gpdbsne ~]$ psql
SET
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# create external table public.ext_testfile (col1 text, col2 text, col3 text, col4 text, col5 text) location ('gpfdist://gpdbsne:8999/testfile.txt') format 'text' (delimiter '|');
CREATE EXTERNAL TABLE
Time: 7.843 ms
gpadmin=# insert into public.test select col1, col2 from public.ext_testfile;
INSERT 0 10
Time: 36.925 ms
gpadmin=# 
1
votes

Try define 1 column to import the whole data, then use regex or substring to parse it into another table. for example: insert into table second_table select substring(c1,.....) from first_table;