Loading file to HDFS with custom chunk structure

Question

I want to load a SegY file onto HDFS of a 3-node Apache Hadoop cluster.

To summarize, a SegY file consists of:

3200 bytes textual header
400 bytes binary header
Variable bytes data

Almost all of the file's size (about 99.99%) comes from the variable-length data section, which is a collection of thousands of contiguous traces. For any SegY file to make sense, it must contain the textual header, the binary header, and at least one trace of data. What I want to achieve is to split a large SegY file across the Hadoop cluster so that a smaller, valid SegY file is available on each node for local processing.

The scenario is as follows:

The SegY file is large (above 10 GB) and resides on the local file system of the NameNode machine.
The file is to be split across the nodes in such a way that each node has a small SegY file with a strict structure: 3200-byte textual header + 400-byte binary header + variable-length data.

Obviously, I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal, as neither guarantees that the chunks of the larger file end up in the required format.

Charles Menguy · Accepted Answer · 2013-01-16T06:58:36

There seems to be a GitHub project that does something similar:

The load command to suhdp will take SEG-Y or SU formatted files on the local machine, format them for use with Hadoop, and copy them to the Hadoop cluster.

suhdp load -input <local SEG-Y/SU files> -output <HDFS target> [-cwproot <path>]

That may not be exactly what you need, but that seems to be the easiest way I could find to load SEG-Y files into HDFS.
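
If that tool does not fit, another option is to split the file locally into smaller, self-contained SegY files before uploading them. Below is a minimal sketch of that idea, not a tested production tool. It assumes every trace has the same length, that there are no extended textual headers, and that the sample count and format code can be read from the binary header at file bytes 3221-3222 and 3225-3226 (big-endian). The function name split_segy and the output naming scheme are hypothetical.

```python
import struct

TEXT_HDR = 3200   # textual header size in bytes
BIN_HDR = 400     # binary header size in bytes
TRACE_HDR = 240   # per-trace header size in bytes

# Bytes per sample for common format codes; other codes are not handled here.
BYTES_PER_SAMPLE = {1: 4, 2: 4, 3: 2, 5: 4, 8: 1}


def split_segy(src_path, out_prefix, traces_per_part):
    """Split a fixed-trace-length SegY file into smaller valid SegY files.

    Each part gets a copy of the 3600 header bytes followed by up to
    traces_per_part whole traces. Returns the list of part file paths.
    """
    parts = []
    with open(src_path, "rb") as src:
        headers = src.read(TEXT_HDR + BIN_HDR)
        if len(headers) < TEXT_HDR + BIN_HDR:
            raise ValueError("file too short to contain SegY headers")

        # Samples per trace and sample format code from the binary header.
        (n_samples,) = struct.unpack(">H", headers[3220:3222])
        (fmt_code,) = struct.unpack(">H", headers[3224:3226])
        if fmt_code not in BYTES_PER_SAMPLE:
            raise ValueError(f"unsupported sample format code {fmt_code}")
        trace_len = TRACE_HDR + n_samples * BYTES_PER_SAMPLE[fmt_code]

        while True:
            # Read a whole number of traces so no trace is cut in half.
            chunk = src.read(trace_len * traces_per_part)
            if not chunk:
                break
            if len(chunk) % trace_len:
                raise ValueError("input ends with a truncated trace")
            part = f"{out_prefix}.{len(parts):05d}.sgy"
            with open(part, "wb") as dst:
                dst.write(headers)  # every part starts with both headers
                dst.write(chunk)
            parts.append(part)
    return parts
```

Each resulting part could then be copied with hadoop fs -put. Choosing traces_per_part so that a part is no larger than the HDFS block size would keep each part within a single block, although which node stores that block is still decided by HDFS.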
