What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?

Question

I would like to write data from flume-ng to Google Cloud Storage. It is a little bit complicated, because I observed a very strange behavior. Let me explain:

Introduction

I've launched a hadoop cluster on google cloud (one click) set up to use a bucket.

When I ssh on the master and add a file with hdfs command, I can see it immediately in my bucket

$ hadoop fs -ls /
14/11/27 15:01:41 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2
Found 1 items
-rwx------   3 hadoop hadoop         40 2014-11-27 13:45 /test.txt

But when I try to add then read from my computer, it seems to use some other HDFS. Here I added a file called jp.txt, and it doesn't show my previous file test.txt

$ hadoop fs -ls hdfs://ip.to.my.cluster/
Found 1 items
-rw-r--r--   3 jp supergroup          0 2014-11-27 14:57 hdfs://ip.to.my.cluster/jp.txt

That's also the only file I see when I explore HDFS on http://ip.to.my.cluster:50070/explorer.html#/

When I list files in my bucket with the web console (https://console.developers.google.com/project/my-project-id/storage/my-bucket/), I can only see test.txt and not jp.txt.

I read Hadoop cannot connect to Google Cloud Storage and I configured my hadoop client accordingly (pretty hard stuff) and now I can see items in my bucket. But for that, I need to use a gs:// URI

$ hadoop fs -ls gs://my-bucket/
14/11/27 15:57:46 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop2
Found 1 items
-rwx------   3 jp jp         40 2014-11-27 14:45 gs://my-bucket/test.txt

Observation / Intermediate conclusion

So it seems here there are 2 different storages engine in the same cluster: "traditional HDFS" (starting with hdfs://) and a Google storage bucket (starting with gs://).

Users and rights are different, depending on where you are listing files from.

Question(s)

The main question is: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume ?

My flume configuration

a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem

# Describe/configure the source
a1.sources.http.type =  org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000

# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = hdfs://ip.to.my.cluster:8020/%{env}/%{tenant}/%{type}/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S_
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000
a1.channels.mem.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem

Does the line a1.sinks.hdfs_sink.hdfs.path accept a gs:// path ?

What setup would it need in that case (additional jars, classpath) ?

Thanks

Dennis Huo Dennis Huo · Accepted Answer · 2014-11-28T17:43:50

As you observed, it's actually fairly common to be able to access different storage systems from the same Hadoop cluster, based on the scheme:// of the URI you use with hadoop fs. The cluster you deployed on Google Compute Engine also has both filesystems available, it just happens to have the "default" set to gs://your-configbucket.

The reason you had to include the gs://configbucket/file instead of just plain /file on your local cluster is that in your one-click deployment, we additionally included a key in your Hadoop's core-site.xml, setting fs.default.name to be gs://configbucket/. You can achieve the same effect on your local cluster to make it use GCS for all the schemeless paths; in your one-click cluster, check out /home/hadoop/hadoop-install/core-site.xml for a reference of what you might carry over to your local setup.

To explain the internals of Hadoop a bit, the reason hdfs:// paths work normally is actually because there is a configuration key which in theory can be overridden in Hadoop's core-site.xml file, which by default sets:

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>

Similarly, you may have noticed that to get gs:// to work on your local cluster, you provided fs.gs.impl. This is because DistribtedFileSystem and GoogleHadoopFileSystem both implement the same Hadoop Java interface FileSystem, and Hadoop is built to be agnostic to how an implementation chooses to actually implement the FileSystem methods. This also means that at the most basic level, anywhere you could normally use hdfs:// you should be able to use gs://.

So, to answer your questions:

The same minimal setup you'd use to get Flume working with a typical HDFS-based setup should work for using GCS as a sink.
You don't need to launch the cluster on Google Compute Engine, though it'd be easier, as you experienced with the more difficult manual instructions for using the GCS connector on your local setup. But since you already have a local setup running, it's up to you whether Google Compute Engine will be an easier place to run your Hadoop/Flume cluster.
Yes, as mentioned above, you should experiment with replacing hdfs:// paths with gs:// paths instead, and/or setting fs.default.name to be your root gs://configbucket path.
Having the two storage engines allows you to more easily switch between the two in case of incompatibilities. There are some minor differences in supported features, for example GCS won't have the same kinds of posix-style permissions you have in HDFS. Also, it doesn't support appends to existing files or symlinks.

What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?

Introduction

Observation / Intermediate conclusion

Question(s)

Related questions

My flume configuration

1 Answers