
I was trying to cluster data in Mahout, but an error is thrown. Here is the error:

java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.mahout.clustering.classify.ClusterClassificationMapper.populateClusterModels(ClusterClassificationMapper.java:129)
    at org.apache.mahout.clustering.classify.ClusterClassificationMapper.setup(ClusterClassificationMapper.java:74)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
13/03/07 19:29:31 INFO mapred.JobClient:  map 0% reduce 0%
13/03/07 19:29:31 INFO mapred.JobClient: Job complete: job_local_0010
13/03/07 19:29:31 INFO mapred.JobClient: Counters: 0
java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
    at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
    at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.clusterData(KMeansDriver.java:260)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:152)
    at com.ifm.dataclustering.SequencePrep.<init>(SequencePrep.java:95)
    at com.ifm.dataclustering.App.main(App.java:8)

Here is my code:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// write the input vectors to a sequence file
Path vector_path = new Path("E:/Thesis/Experiments/Mahout dataset/input/vector_input");
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, vector_path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector outputVec : vector) {
    vec.set(outputVec);
    writer.append(new Text(outputVec.getName()), vec);
}
writer.close();

// create the initial clusters
Path cluster_path = new Path("E:/Thesis/Experiments/Mahout dataset/clusters/part-00000");
SequenceFile.Writer cluster_writer = new SequenceFile.Writer(fs, conf, cluster_path, Text.class, Kluster.class);

// number of clusters k
int k = 4;
for (int i = 0; i < k; i++) {
    NamedVector outputVec = vector.get(i);
    Kluster cluster = new Kluster(outputVec, i, new EuclideanDistanceMeasure());
    cluster_writer.append(new Text(cluster.getIdentifier()), cluster);
}
cluster_writer.close();

// set the cluster output path
Path output = new Path("E:/Thesis/Experiments/Mahout dataset/output");
HadoopUtil.delete(conf, output);

KMeansDriver.run(conf, new Path("E:/Thesis/Experiments/Mahout dataset/input"),
                 new Path("E:/Thesis/Experiments/Mahout dataset/clusters"),
                 output, new EuclideanDistanceMeasure(), 0.001, 10,
                 true, 0.0, false);

// read the clustered points back
SequenceFile.Reader output_reader = new SequenceFile.Reader(fs,
        new Path("E:/Thesis/Experiments/Mahout dataset/output/" + Kluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
IntWritable key = new IntWritable();
WeightedVectorWritable value = new WeightedVectorWritable();
while (output_reader.next(key, value)) {
    System.out.println(value.toString() + " belongs to cluster " + key.toString());
}
output_reader.close();

1 Answer


The paths to your input/output data seem incorrect. The MapReduce job runs on a cluster, so the data is read from HDFS, not from your local hard disk.

The error message:

java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
    at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)

gives you a hint about the incorrect path.

Before running the job, make sure that you first upload the input data to HDFS:

hadoop fs -mkdir input
hadoop fs -copyFromLocal E:\\file input
...
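If you prefer to do the upload from your Java program rather than the command line, the same `FileSystem` object you already create can copy local files into HDFS via `copyFromLocalFile`. A minimal sketch, assuming the Hadoop client libraries are on the classpath and `fs.default.name`/`fs.defaultFS` in your configuration points at the cluster (otherwise `FileSystem.get(conf)` silently returns the local file system, which is exactly the problem above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Returns HDFS only if fs.defaultFS (fs.default.name in older Hadoop)
        // is set to an hdfs:// URI; otherwise this is the local file system.
        FileSystem fs = FileSystem.get(conf);

        Path hdfsInput = new Path("input"); // resolved relative to /user/<username>
        fs.mkdirs(hdfsInput);
        // first argument: local source; second argument: HDFS destination
        fs.copyFromLocalFile(
                new Path("E:/Thesis/Experiments/Mahout dataset/input/vector_input"),
                hdfsInput);
    }
}
```

Printing `fs.getUri()` after `FileSystem.get(conf)` is a quick way to confirm which file system you are actually talking to.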

then instead of:

new Path("E:/Thesis/Experiments/Mahout dataset/input")

you should use the HDFS path:

new Path("input")

or

new Path("/user/<username>/input")
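The relative form resolves against your HDFS home directory, which defaults to `/user/<username>`. Pure string logic illustrating the equivalence (a hypothetical helper, not a Hadoop API):

```java
public class HdfsHome {
    /** Resolve a relative HDFS path against the default home directory /user/<username>. */
    static String resolve(String username, String relative) {
        return "/user/" + username + "/" + relative;
    }

    public static void main(String[] args) {
        // new Path("input") and new Path("/user/alice/input") name the same location
        System.out.println(resolve("alice", "input")); // → /user/alice/input
    }
}
```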

EDIT:

Use FileSystem#exists(Path path) to check whether a Path is valid.
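A defensive version of the driver setup would verify each path before kicking off the job, so a missing directory fails fast with a clear message instead of an `ArrayIndexOutOfBoundsException` deep inside a mapper. A sketch (the `requireExists` helper is hypothetical; `FileSystem#exists` and `FileSystem#getUri` are real Hadoop API calls):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathCheck {
    /** Fail fast if the given path does not exist on the target file system. */
    static void requireExists(FileSystem fs, Path p) throws IOException {
        if (!fs.exists(p)) {
            throw new IOException("Path not found on " + fs.getUri() + ": " + p);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        requireExists(fs, new Path("input"));    // input vectors
        requireExists(fs, new Path("clusters")); // initial cluster seeds
        // ...then call KMeansDriver.run(...) as before
    }
}
```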