I am trying to create a map-reduce job in Java on a table from an HBase database. Using the examples from here and other material from the internet, I managed to write a simple row counter. However, my attempt to write a job that actually does something with the data from a column failed, because the bytes I receive are always null.

Part of the job's driver looks like this:

/* Set main, map and reduce classes */
job.setJarByClass(Driver.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);

/* Get data only from the last 24h */
Timestamp timestamp = new Timestamp(System.currentTimeMillis());
try {
    long now = timestamp.getTime();
    scan.setTimeRange(now - 24 * 60 * 60 * 1000, now);
} catch (IOException e) {
    e.printStackTrace();
}

/* Initialize the table mapper job */
TableMapReduceUtil.initTableMapperJob(
        "dnsr",
        scan,
        Map.class,
        Text.class,
        Text.class,
        job);

/* Set output parameters */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);

As you can see, the table is called dnsr. My mapper looks like this:

@Override
    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws InterruptedException, IOException {
        byte[] columnValue = value.getValue("d".getBytes(), "fqdn".getBytes());
        if (columnValue == null)
            return;

        byte[] firstSeen = value.getValue("d".getBytes(), "fs".getBytes());
        // if (firstSeen == null)
        //     return;

        String fqdn = new String(columnValue).toLowerCase();
        String fs = (firstSeen == null) ? "empty" : new String(firstSeen);

        context.write(new Text(fqdn), new Text(fs));
    }

Some notes:

  • the column family from the dnsr table is just d. There are multiple columns, some of them being called fqdn and fs (firstSeen);
  • although the fqdn values appear correctly, the fs values are always the "empty" string (I added this check after getting errors saying that null can't be converted to a new String);
  • if I replace the fs column name with something else, for example ls (lastSeen), it works;
  • the reducer doesn't do anything, just outputs everything it receives.

I created a simple table scanner in JavaScript that queries the exact same table and columns, and I can clearly see the values are there. Running queries manually from the command line, I can also see that the fs values are not null: they are bytes that can later be converted into a string (representing a date).

What could be the reason I'm always getting null?

Thanks!

Update: If I get all the columns in a specific column family, I don't receive fs. However, a simple scanner implemented in JavaScript returns fs as a column from the dnsr table.

@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
    byte[] columnValue = value.getValue(columnFamily, fqdnColumnName);
    if (columnValue == null)
        return;
    String fqdn = new String(columnValue).toLowerCase();

    /* Getting all the columns */
    String[] cns = getColumnsInColumnFamily(value, "d");
    StringBuilder sb = new StringBuilder();
    for (String s : cns) {
        sb.append(s).append(";");
    }

    context.write(new Text(fqdn), new Text(sb.toString()));
}

I used an answer from here to get all the column names.

1 Answer

In the end, I managed to find the 'problem'. HBase is a column-oriented datastore: data is stored and retrieved by column, so a query can read only the columns it actually needs. Every column family has one or more column qualifiers (columns), and each column can hold multiple cells. The interesting part is that every cell has its own timestamp.
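
To make that data model concrete, here is a minimal plain-Java sketch of a single column family, where each qualifier maps to its own set of timestamped versions. This is not the HBase API: the class CellVersionsSketch, its methods, and the sample timestamps and values are all made up for illustration.

```java
import java.util.TreeMap;

/** Toy model of one column family: qualifier -> (timestamp -> value). Not the HBase API. */
class CellVersionsSketch {

    // Each qualifier keeps its own versions, sorted by timestamp (ascending).
    private final TreeMap<String, TreeMap<Long, String>> columns = new TreeMap<>();

    /** Writes one cell; every write carries its own timestamp. */
    void put(String qualifier, long timestamp, String value) {
        columns.computeIfAbsent(qualifier, q -> new TreeMap<>()).put(timestamp, value);
    }

    /** Returns the newest value stored for the qualifier, or null if it has no cells. */
    String latest(String qualifier) {
        TreeMap<Long, String> versions = columns.get(qualifier);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    /** Returns the timestamp of the newest cell for the qualifier, or null if it has no cells. */
    Long latestTimestamp(String qualifier) {
        TreeMap<Long, String> versions = columns.get(qualifier);
        return versions == null ? null : versions.lastKey();
    }

    public static void main(String[] args) {
        CellVersionsSketch family = new CellVersionsSketch();
        // Hypothetical data: fs is written once, ls is rewritten on every sighting.
        family.put("fs", 1000L, "2015-01-01");
        family.put("ls", 1000L, "2015-01-01");
        family.put("ls", 2000L, "2015-02-01");

        System.out.println("fs: " + family.latest("fs") + " @ " + family.latestTimestamp("fs"));
        System.out.println("ls: " + family.latest("ls") + " @ " + family.latestTimestamp("ls"));
    }
}
```

The point of the sketch is that two columns of the same row can carry very different timestamps, because each cell is stamped when it is written, not when the row was created.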

Why was this the problem? When you run a scan with a time range, only the cells whose timestamps fall within that range are returned, so you may end up with a row with "missing cells". In my case, I had a DNS record with fields such as firstSeen and lastSeen. lastSeen is updated every time I see that domain, whereas firstSeen remains unchanged after the first occurrence, so its cell was usually older than 24 hours and fell outside the range. As soon as I changed the ranged map-reduce job to a simple map-reduce job (using all-time data), everything was fine (but the job took longer to finish).
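
The effect can be reproduced without a cluster. The following is a rough plain-Java simulation, not the HBase API: TimeRangeScanSketch, Cell, scanWithTimeRange and sampleRow are hypothetical names, and the sample values and ages are invented. It assumes the same half-open interval that Scan.setTimeRange uses, with the minimum inclusive and the maximum exclusive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a time-range scan over one row. Plain Java, not the HBase API. */
class TimeRangeScanSketch {

    /** One cell: a value plus the timestamp at which it was written. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Keeps only the cells with minStamp <= timestamp < maxStamp. */
    static Map<String, Cell> scanWithTimeRange(Map<String, Cell> row, long minStamp, long maxStamp) {
        Map<String, Cell> visible = new LinkedHashMap<>();
        for (Map.Entry<String, Cell> e : row.entrySet()) {
            long ts = e.getValue().timestamp;
            if (ts >= minStamp && ts < maxStamp) {
                visible.put(e.getKey(), e.getValue());
            }
        }
        return visible;
    }

    /** Hypothetical row: fqdn and ls were written an hour ago, fs thirty days ago. */
    static Map<String, Cell> sampleRow(long now) {
        long hour = 60L * 60 * 1000;
        long day = 24 * hour;
        Map<String, Cell> row = new LinkedHashMap<>();
        row.put("fqdn", new Cell("example.com", now - hour));
        row.put("ls", new Cell("last seen", now - hour));
        row.put("fs", new Cell("first seen", now - 30 * day));
        return row;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long day = 24L * 60 * 60 * 1000;

        // Same window as the driver in the question: the last 24 hours.
        Map<String, Cell> recent = scanWithTimeRange(sampleRow(now), now - day, now);
        Cell fs = recent.get("fs");

        System.out.println("columns in range: " + recent.keySet());
        System.out.println("fs = " + (fs == null ? "null (cell is outside the range)" : fs.value));
    }
}
```

In this model the row still comes back because fqdn and ls are recent, but looking up fs yields null, which matches what the mapper saw. Widening the range to cover all time brings fs back, at the cost of scanning more data.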

Cheers!