I have a file containing text with "^" characters in between:

SOME TEXT^GOES HERE^
AND A FEW^MORE
GOES HERE

I am writing a custom input format to delimit the rows using the "^" character, i.e. the output of the mapper should look like this:

SOME TEXT
GOES HERE
AND A FEW
MORE GOES HERE

I have written a custom input format that extends FileInputFormat, and a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with it: I am having trouble with the while loop in the nextKeyValue() method. How should I read the data from a split and generate my custom key-value pairs? I am using the new mapreduce package rather than the old mapred package.

public class MyRecordReader extends RecordReader<LongWritable, Text>
    {
        long start, current, end;
        Text value;
        LongWritable key;
        LineReader reader;
        FileSplit split;
        Path path;
        FileSystem fs;
        FSDataInputStream in;
        Configuration conf;

        @Override
        public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException
        {
            conf = cont.getConfiguration();
            split = (FileSplit)inputSplit;
            path = split.getPath();
            fs = path.getFileSystem(conf);
            in = fs.open(path);
            reader = new LineReader(in, conf);
            start = split.getStart();
            current = start;
            end = split.getLength() + start;
        }

        @Override
        public boolean nextKeyValue() throws IOException
        {
            if(key==null)
                key = new LongWritable();

            key.set(current);
            if(value==null)
                value = new Text();

            long readSize = 0;
            while(current<end)
            {
                Text tmpText = new Text(); 
                readSize = read //here how should i read data from the split, and generate key-value?

                if(readSize==0)
                    break;

                current+=readSize;              
            }

            if(readSize==0)
            {
                key = null;
                value = null;
                return false;
            }

            return true;

        }

        @Override
        public float getProgress() throws IOException
        {

        }

        @Override
        public LongWritable getCurrentKey() throws IOException
        {

        }

        @Override
        public Text getCurrentValue() throws IOException
        {

        }

        @Override
        public void close() throws IOException
        {

        }


    }
1 Answer

There is no need to implement that yourself. You can simply set the configuration value textinputformat.record.delimiter to be the circumflex character.

conf.set("textinputformat.record.delimiter", "^");

This should work fine with the normal TextInputFormat.
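
One thing to watch for: once a custom record delimiter is set, line breaks are no longer treated as record boundaries, so they stay inside the record text. For the sample above, the last record would arrive in the mapper as "MORE", a newline, then "GOES HERE", rather than "MORE GOES HERE". The plain-Java sketch below does not use Hadoop at all; it only imitates the splitting on the sample data to show what the records would roughly look like and one way a mapper could clean them up. The class and method names (CaretRecordDemo, splitRecords, normalize) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: imitates delimiter-based record splitting in plain Java.
// None of these names come from Hadoop; they are hypothetical helpers.
final class CaretRecordDemo {

    // Split content on a literal delimiter. A trailing delimiter does not
    // produce an extra empty record; line breaks are left untouched.
    static List<String> splitRecords(String content, String delimiter) {
        List<String> records = new ArrayList<>();
        int from = 0;
        int at;
        while ((at = content.indexOf(delimiter, from)) >= 0) {
            records.add(content.substring(from, at));
            from = at + delimiter.length();
        }
        if (from < content.length()) {
            records.add(content.substring(from)); // remainder after the last delimiter
        }
        return records;
    }

    // Replace each embedded line break (and surrounding whitespace) with a
    // single space, then trim. A mapper could apply the same idea to its value.
    static String normalize(String record) {
        return record.replaceAll("\\s*\\R\\s*", " ").trim();
    }

    public static void main(String[] args) {
        String content = "SOME TEXT^GOES HERE^\nAND A FEW^MORE\nGOES HERE\n";
        for (String record : splitRecords(content, "^")) {
            System.out.println(normalize(record));
        }
    }
}
```

In a real job the equivalent of normalize would be applied to value.toString() inside the mapper's map() method, assuming the embedded line breaks are unwanted.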