Hadoop n00b here.
I have installed Hadoop 2.6.0 on a server where I have stored twelve JSON files I want to perform MapReduce operations on. These files are large, ranging from 2 to 5 gigabytes each.
The structure of the JSON files is an array of JSON objects. Snippet of two objects below:
[{"campus":"Gløshaugen","building":"Varmeteknisk og Kjelhuset","floor":"4. etasje","timestamp":1412121618,"dayOfWeek":3,"hourOfDay":2,"latitude":63.419161638078066,"salt_timestamp":1412121602,"longitude":10.404867443910122,"id":"961","accuracy":56.083199914753536},{"campus":"Gløshaugen","building":"IT-Vest","floor":"2. etasje","timestamp":1412121612,"dayOfWeek":3,"hourOfDay":2,"latitude":63.41709424828986,"salt_timestamp":1412121602,"longitude":10.402167488838765,"id":"982","accuracy":7.315199988880896}]
I want to perform MapReduce operations based on the fields building and timestamp, at least in the beginning until I get the hang of this, e.g. map/reduce the data where building equals a parameter and timestamp is greater than X and less than Y. The relevant fields I need after the reduce step are latitude and longitude.
I know there are different tools (Hive, HBase, Pig, Spark, etc.) you can use with Hadoop that might make this easier, but my boss wants an evaluation of the MapReduce performance of standalone Hadoop.
So far I have created the main class that triggers the map and reduce classes, and implemented what I believe is a start in the map class, but I'm stuck on the reduce class. Below is what I have so far.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Hadoop {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // new Job(conf, name) is deprecated in 2.6.0; Job.getInstance is the replacement
        Job job = Job.getInstance(conf, "maze");
        job.setJarByClass(Hadoop.class);
        job.setMapperClass(Map.class);
        // this must resolve to my own Reducer class (same package),
        // not org.apache.hadoop.mapreduce.Reducer
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // 50070 is the NameNode web UI port, not the filesystem RPC port, so I now
        // let fs.defaultFS from the cluster config supply the scheme/host/port
        FileInputFormat.addInputPath(job, new Path("/data.json"));
        // the job fails without an output path; the directory must not exist yet
        FileOutputFormat.setOutputPath(job, new Path("/maze-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mapper:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

public class Map extends Mapper<Text, Text, Text, Text> {

    @Override
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        try {
            JSONObject jo = new JSONObject(value.toString());
            // latitude, longitude and timestamp are numbers in the file, so the typed
            // getters are used here; getString() would throw a JSONException on them
            double latitude = jo.getDouble("latitude");
            double longitude = jo.getDouble("longitude");
            long timestamp = jo.getLong("timestamp");
            String building = jo.getString("building");
            String outKey = latitude + "/" + longitude + "/" + timestamp + "/" + building;
            context.write(new Text(outKey), value);
        } catch (JSONException e) {
            e.printStackTrace();
        }
    }
}
Reducer:
import java.io.IOException;
import org.apache.hadoop.io.Text;

public class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // this is where I'm stuck
    }
}
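For what it's worth, the simplest reduce I could come up with is a pass-through that forwards every value grouped under a key. This is only my guess at a starting point, assuming the filtering already happened on the map side:

@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // pass-through sketch: one output line per record that survived the map-side filter
    for (Text value : values) {
        context.write(key, value);
    }
}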
UPDATE

I changed the mapper to parse the whole JSON array and to filter on a building and a timestamp window passed in through the configuration. (With KeyValueTextInputFormat the text before the first tab becomes the key, and my records contain no tabs, so the entire line arrives as the key; that is why the array is parsed from key rather than value.)
// the filter parameters live on the class, not inside map()
private static String BUILDING;
private static long tsFrom;
private static long tsTo;

@Override
protected void setup(Context context) {
    // configure(JobConf) belongs to the old mapred API; with the mapreduce API
    // the parameters are read from the configuration in setup()
    Configuration conf = context.getConfiguration();
    BUILDING = conf.get("BUILDING");
    tsFrom = Long.parseLong(conf.get("TSFROM"));
    tsTo = Long.parseLong(conf.get("TSTO"));
}

@Override
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    try {
        JSONArray ja = new JSONArray(key.toString());
        for (int n = 0; n < ja.length(); n++) {
            JSONObject jo = ja.getJSONObject(n);
            long timestamp = jo.getLong("timestamp");
            String building = jo.getString("building");
            if (BUILDING.equals(building) && timestamp > tsFrom && timestamp < tsTo) {
                String coords = jo.getDouble("latitude") + "/" + jo.getDouble("longitude");
                context.write(new Text(coords), value);
            }
        }
    } catch (JSONException e) {
        e.printStackTrace();
    }
}
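These three values have to come from the driver, so I set them on the Configuration before building the job. The parameter names and the example values below are just the ones I picked for testing:

// in main(), before Job.getInstance(conf, "maze")
conf.set("BUILDING", "IT-Vest");   // example building from the data above
conf.set("TSFROM", "1412121600");  // example window around the timestamps above
conf.set("TSTO", "1412125200");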
This works for a small data set, but with the LARGE JSON files I get a Java heap space exception. Since I am not familiar with Hadoop, I'm having trouble understanding how MapReduce can read the data without running into an OutOfMemoryError.
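My suspicion is that each file is one enormous line holding a single JSON array, so KeyValueTextInputFormat hands the entire 2-5 GB line to a single map() call, which cannot fit in the heap. If that is right, one way out might be to rewrite each file as one JSON object per line before loading it into HDFS, so that the line-oriented input format can split it. A sketch of that conversion using Jackson's streaming parser (the file names are placeholders, and I have not benchmarked this):

import java.io.FileInputStream;
import java.io.FileWriter;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ArrayToLines {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonFactory factory = mapper.getFactory();
        try (JsonParser parser = factory.createParser(new FileInputStream("data.json"));
             FileWriter out = new FileWriter("data-lines.json")) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("expected a top-level JSON array");
            }
            // stream one object at a time; only one object is in memory at once
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                JsonNode record = mapper.readTree(parser);
                out.write(record.toString());
                out.write('\n');
            }
        }
    }
}

With one object per line, the mapper would go back to parsing a single JSONObject per call, the way my first version did. Is that the right way to think about this, or is there a built-in way to make Hadoop split these files?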