1
votes

we are processing multiple files using external table. Is there any way I can get the file name being processed in external tables and stored it in database table?

Only workaround I can find is appending the file name to every record in the flat file which isn't ideal when huge dataset and multiple files.

Can anyone help on this

Thanks

2

2 Answers

2
votes

No, the file name is simply never passed from the gpfdist daemon back to Greenplum. So you have to append the file name to each line - you can use gpfdist transformation for doing so

1
votes

I was struggling with this as well, here's my solution. Please note I'm not an expert in linux, so there may be a one liner solution.

So I wanted to add a filename column in front of my records.

That can be done in sed, I've created a transform.sh file, with the following content:

#/bin/sh
filename=$1
#echo $filename >> transform.txt
sed -e "s|^|$filename\v|" $filename

Please note that I was using vertical tab as a delimiter, \v. Also in the filename you could have / hence using | . In order to have the value of $filename we have to use double quites for sed.

Test it, it looks good.

 ./transform.sh countersamples-2016-03-02--11-51-10.csv
countersamples-2016-03-02--11-51-10.csv
                                       timestamp
                                                machine
                                                       category
                                                               instance
                                                                       name
                                                                           value
countersamples-2016-03-02--11-51-10.csv
                                       2016-03-02 11:51:10.064
                                                              DESKTOP-4PLQKVL
                                                                             Memory

                                                                                   % Committed Bytes In Use
                                                                                                           74.8485488891602

This part is done, lets continue with gpfdist. We need a yaml file that can be passed to gpfdist, I named this transform.yaml Content:

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  add_filename:
     TYPE: input
     CONTENT: data
     COMMAND: /bin/bash transform.sh %filename%

Please note that we have the %filename% value here. It seems that gpfdist prefilters the files that needs to be handled, and passes them 1 by 1 to our transform.

Lets fire up gpfdist:

gpfdist -c transform.yaml -v

Now go into greenplum and create an external table such as:

CREATE READABLE EXTERNAL TABLE "ext_transform"
(
    "filename" text,
  "timestamp" timestamp without time zone  ,
  "machine" text  ,
  "category" text  ,
  "instance" text  ,
  "name" text  ,
  "value" double precision 
) 
  LOCATION ('gpfdist://localhost:8080/*/countersamples*.csv#transform=add_filename') 
 FORMAT 'TEXT' 
 ( HEADER DELIMITER '\013' NULL AS '\\N' ESCAPE AS '\\' )

And when we select data from it:

select * from "ext_transform";

We see: result in greenplum

I've created 2 folders to see how it reacts if the files are not in the same folder as the transform. This way I can distinguish between the 2 files, even if their data is identical.