
In Hadoop, the hdfs dfs -text and hdfs dfs -getmerge commands make it easy to read the contents of compressed files in HDFS from the command line, including piping to other commands for processing (e.g. wc -l <(hdfs dfs -getmerge /whatever 2>/dev/null)).

Is there a reciprocal of these commands, allowing one to push content to HDFS from the command line while supporting the same compression and format features? hdfs dfs -put will seemingly just make a raw copy of a local file to HDFS, without compressing it or changing its container format.

Answers suggesting command-line tools for manipulating such formats and compression algorithms are welcome too. I typically see Snappy-compressed data in CompressedStreams but can't figure out how to convert a plain old text file (one datum per line) into such a file from the command line. I tried snzip (as suggested in this askubuntu question) as well as this snappy command-line tool, but couldn't use either of them to generate Hadoop-friendly Snappy files (or to read the contents of Snappy files ingested into HDFS using Apache Flume).


2 Answers


There is seemingly no reciprocal to hdfs dfs -text, and WebHDFS has no support for (de)compression whatsoever, so I ended up writing my own command-line tool in Java that compresses standard input to standard output in Hadoop-friendly Snappy. (As far as I can tell, generic Snappy tools such as snzip don't produce usable files here because Hadoop's SnappyCodec wraps the data in its own block framing, which matches neither raw Snappy nor the Snappy framing format.)

Code goes like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SnappyCompressor {
    public static void main(String[] args)
    {
        try {
            // Look up the Snappy codec through Hadoop's codec factory
            Configuration conf = new Configuration();
            CompressionCodecFactory ccf = new CompressionCodecFactory(conf);
            CompressionCodec codec =
                ccf.getCodecByClassName(SnappyCodec.class.getName());
            Compressor comp = CodecPool.getCompressor(codec);
            // Wrap stdout in a compressing stream using Hadoop's block framing
            CompressionOutputStream compOut =
                codec.createOutputStream(System.out, comp);
            BufferedReader in =
                new BufferedReader(new InputStreamReader(System.in));
            // Copy stdin to the compressed stream, line by line
            String line;
            while( (line=in.readLine()) != null ) {
                compOut.write( line.getBytes() );
                compOut.write( '\n' );
            }
            compOut.finish();
            compOut.close();
            CodecPool.returnCompressor(comp);
        }
        catch( Exception e ) {
            System.err.print("An exception occurred: ");
            e.printStackTrace(System.err);
        }
    }
}

Run using hadoop jar <jar path> <class name>.
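For example (the jar and path names below are just illustrative), you can compress a local file and push it to HDFS in one pipeline: cat data.txt | hadoop jar snappy-tool.jar SnappyCompressor | hdfs dfs -put - /data/data.txt.snappy. Note that hdfs dfs -put reads from standard input when the source is given as -.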

Text data compressed this way can be pushed to HDFS (e.g. with hdfs dfs -put or through WebHDFS) and then read back with hdfs dfs -text.
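For the reverse direction (reading such a file back locally, similar to what hdfs dfs -text does), the same codec API can be used with a decompressing input stream. The following is a minimal, untested sketch under the same assumptions as above (Hadoop classpath and native Snappy library available); the SnappyDecompressor class name is mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SnappyDecompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodecFactory ccf = new CompressionCodecFactory(conf);
        CompressionCodec codec =
            ccf.getCodecByClassName(SnappyCodec.class.getName());
        Decompressor decomp = CodecPool.getDecompressor(codec);
        // Wrap stdin in a decompressing stream and copy raw bytes to stdout
        CompressionInputStream compIn =
            codec.createInputStream(System.in, decomp);
        byte[] buf = new byte[4096];
        int n;
        while ((n = compIn.read(buf)) != -1) {
            System.out.write(buf, 0, n);
        }
        compIn.close();
        System.out.flush();
        CodecPool.returnDecompressor(decomp);
    }
}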


You can use the HDFS NFS Gateway: mount HDFS as a drive, and you should then be able to use ordinary Linux commands to interact with it.

https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
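According to the linked guide, mounting looks something like mount -t nfs -o vers=3,proto=tcp,nolock <nfs_server>:/ <mount_point> (host and mount point are placeholders), after which ordinary commands such as cp work against the mount. Note that, like hdfs dfs -put, this copies bytes verbatim, so by itself it does not address the compression/container-format part of the question.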