Greenplum: gpfdist file serving

0

votes

I'm working through the Greenplum tutorial.

I'm having trouble understanding how gpfdist works. What does this mean: "gpfdist: Serves data files to or writes data files out from Greenplum Database segments."

What does it mean to "serve a file"? I thought it read external tables. Is gpfdist running on both the client and the server? How does it work in parallel? Is gpfdist started on several hosts?

I just need help understanding the big picture. In this tutorial http://greenplum.org/gpdb-sandbox-tutorials/ we call it twice, why? (It's confusing because the server and client are on the same machine.)

greenplum

3

votes

gpfdist can run on any host. It is basically a lightweight web server, much like lighttpd: you point it at a directory and it sits there listening for connections on the port you specify.

On the Greenplum server/database side, you create an external table definition whose LOCATION setting points to your gpfdist instance.

You can then query this table and gpfdist will "serve the file" to the database engine.
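To make those two steps concrete, here is a minimal sketch. The host name, port, directory, database, table, and column names are all made up for illustration, and the commands that need a real gpfdist or database are left commented out so the script only prints the SQL it would run.

```shell
#!/bin/sh
# Hypothetical values: adjust to your own environment.
ETL_HOST="etl-host"
GPFDIST_PORT=8081
DATA_DIR="/data/staging"

# 1. On the host that holds the files, start gpfdist (uncomment to run):
# gpfdist -d "$DATA_DIR" -p "$GPFDIST_PORT" -l /tmp/gpfdist.log &

# 2. Build an external table definition whose LOCATION points at that gpfdist.
#    The table and column names are assumptions, not taken from the tutorial.
EXT_TABLE_SQL="CREATE EXTERNAL TABLE ext_expenses (
    name text,
    spent_on date,
    amount numeric
)
LOCATION ('gpfdist://${ETL_HOST}:${GPFDIST_PORT}/expenses*.txt')
FORMAT 'TEXT' (DELIMITER '|');"

printf '%s\n' "$EXT_TABLE_SQL"

# 3. On the database side, create the table and query it (uncomment to run):
# psql -d mydb -c "$EXT_TABLE_SQL"
# psql -d mydb -c "SELECT count(*) FROM ext_expenses;"
```

Nothing is read when the table is created. The files are only fetched from gpfdist when the SELECT runs.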

Read: http://gpdb.docs.pivotal.io/4380/utility_guide/admin_utilities/gpfdist.html and http://gpdb.docs.pivotal.io/4380/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html

1

votes

An external table is made up of a few things. The two most important are the location where the data is read from (or written to) and the format that says how to parse that data into table records. When you create the external table you are only creating the definition of how it should work.

Only when you execute a query against an external table do the segments go out and do what has been set up in that definition. Note that they don't create a persistent connection or cache the data. Each time you execute the query, the cluster looks at its definitions, moves the data across the wire, and uses it for the length of that query.

In the case of gpfdist as an endpoint, it is really just a web server. People frequently run one on an ETL node. When the location is gpfdist and you create a readable external table, each segment reaches out to gpfdist, asks for a chunk of the file, and processes it. This is the parallelism: multiple segments reach out to gpfdist and get chunks, which they parse into tuples according to the table definition, and the results are then assembled into a table of data for your query.
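As a rough mental model of that parallel read, here is a toy simulation in plain shell. It does not talk to gpfdist or Greenplum at all: a temp file stands in for the served file, four background jobs stand in for segments, and each one is simply handed a fixed slice of lines. That fixed slicing is my simplification; real gpfdist serves data over HTTP and hands out chunks as segments ask for them.

```shell
#!/bin/sh
# Toy model only: no gpfdist, no database. All names are made up.
WORK_DIR=$(mktemp -d)

# The "served file": the numbers 1..20, one per line.
awk 'BEGIN { for (i = 1; i <= 20; i++) print i }' > "$WORK_DIR/data.txt"

SEGMENTS=4
LINES_PER_CHUNK=5

# One "segment": fetch its own chunk of the file, parse it, keep a partial result.
segment_worker() {
    id=$1
    start=$(( id * LINES_PER_CHUNK + 1 ))
    end=$(( start + LINES_PER_CHUNK - 1 ))
    sed -n "${start},${end}p" "$WORK_DIR/data.txt" |
        awk -v id="$id" '{ sum += $1 } END { print id, NR, sum }' \
        > "$WORK_DIR/seg_$id.out"
}

# All segments pull their chunks at the same time.
seg=0
while [ "$seg" -lt "$SEGMENTS" ]; do
    segment_worker "$seg" &
    seg=$(( seg + 1 ))
done
wait

# Assemble the partial results into one answer for the "query".
TOTAL=$(cat "$WORK_DIR"/seg_*.out | awk '{ rows += $2; sum += $3 } END { print rows, sum }')
echo "$TOTAL"   # prints: 20 210  (row count, sum of values)

rm -rf "$WORK_DIR"
```

The point is only that no single process reads the whole file: each worker handles its own chunk and the results are combined at the end.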

gpfdist can also be the endpoint for a writable external table. In this case the segments all push the data they have to that remote location, and gpfdist writes the data it receives to disk. The thing to note here is that no sort order is promised: the data is written to disk as it is streamed from multiple segments.
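For the writable direction, a sketch might look like the following. Again the host, port, database, table, and column names are hypothetical, and the psql calls are commented out so the script only prints the statement.

```shell
#!/bin/sh
# Hypothetical values: adjust to your own environment.
ETL_HOST="etl-host"
GPFDIST_PORT=8081

# A writable external table that pushes rows out to a file behind gpfdist.
# Table and column names are assumptions for illustration.
WRITABLE_SQL="CREATE WRITABLE EXTERNAL TABLE ext_expenses_out (
    name text,
    spent_on date,
    amount numeric
)
LOCATION ('gpfdist://${ETL_HOST}:${GPFDIST_PORT}/expenses_out.txt')
FORMAT 'TEXT' (DELIMITER '|');"

printf '%s\n' "$WRITABLE_SQL"

# With gpfdist running on the ETL host (uncomment to run):
# psql -d mydb -c "$WRITABLE_SQL"
# psql -d mydb -c "INSERT INTO ext_expenses_out SELECT name, spent_on, amount FROM expenses;"
```

Because every segment streams its rows independently, the lines in expenses_out.txt can land in any order. If you need a particular order, sort the file afterwards.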

1

votes

Yes, gpfdist is a file distribution service used for external tables. With it, a Greenplum database can directly query a file in a directory (Unix or Windows) as if it were a table.

You can select from the flat-file data and process it further. Unicode and other unusual characters can also be handled by specifying the encoding.

gpfdist is what makes this external table approach work.

Syntax to set it up on Windows:

gpfdist -d ${FLAT_FILES_DIR} -p 8081 -l /tmp/gpfdist.8081.log

Just make sure you have gpfdist.exe on your source machine.
