
Hi, I was trying to run the example from Mahout in Action, Chapter 7 (k-means clustering). Can somebody guide me on how to run that example on a Hadoop cluster (single-node CDH 4.2.1) with Mahout 0.7?

These are the steps I followed:

  1. Copied the code (from GitHub) into my Eclipse IDE on my local machine.

  2. Included these jars in my Eclipse project:

hadoop-common-2.0.0-cdh4.2.1.jar

hadoop-hdfs-2.0.0-cdh4.2.1.jar

hadoop-mapreduce-client-core-2.0.0-cdh4.2.1.jar

mahout-core-0.7-cdh4.3.0.jar

mahout-core-0.7-cdh4.3.0-job.jar

mahout-math-0.7-cdh4.3.0.jar

  3. Built a jar of this project and copied that jar onto my Hadoop cluster.

  4. Executed this command:

user@INFPH01463U:~$ hadoop jar /home/user/apurv/Kmean.jar tryout.SimpleKMeansClustering

which gave me the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: FileSystem
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
        at java.lang.Class.getMethod0(Class.java:2670)
        at java.lang.Class.getMethod(Class.java:1603)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:202)
Caused by: java.lang.ClassNotFoundException: FileSystem
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 5 more

Can anyone tell me what I'm missing, or whether my way of executing it is wrong?

Secondly, I would like to know how I can run k-means clustering on a CSV file.

Thanks in advance :)

Can you run the examples included with Hadoop and Mahout? Maybe the "hadoop" command you are using is broken and doesn't set the classpath correctly. - Has QUIT--Anony-Mousse
I'm able to run MapReduce code on that Hadoop cluster, and I'm even able to run the Mahout synthetic control data example. - user2454360

1 Answer


The given code is misleading. This code:

Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
    writer.append(new Text(cluster.getIdentifier()), cluster);
}
writer.close();

KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
  new Path("output"), new EuclideanDistanceMeasure(), 0.001, 10,
  true, false);

SequenceFile.Reader reader = new SequenceFile.Reader(fs,
    new Path("output/" + Cluster.CLUSTERED_POINTS_DIR
             + "/part-m-00000"), conf);

should be replaced by

Kluster cluster = new Kluster(vec, i, new EuclideanDistanceMeasure());
    writer.append(new Text(cluster.getIdentifier()), cluster);
}
writer.close();

KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
  new Path("output"), new EuclideanDistanceMeasure(), 0.001, 10,
  true, false);

SequenceFile.Reader reader = new SequenceFile.Reader(fs,
    new Path("output/" + Kluster.CLUSTERED_POINTS_DIR
             + "/part-m-00000"), conf);

Cluster is an interface, whereas Kluster is a class, so only Kluster can be instantiated with new. Please check the Mahout API Javadoc for more information.
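To illustrate the interface/class distinction without needing Mahout on the classpath, here is a minimal sketch. ClusterLike and KlusterLike are hypothetical stand-ins invented for this example (they are not Mahout types), and the "CL-" identifier format is likewise just an assumption for illustration.

```java
public class InterfaceVsClass {
    // Hypothetical stand-in for an interface such as Mahout's Cluster.
    public interface ClusterLike {
        String getIdentifier();
    }

    // Hypothetical stand-in for a concrete class such as Mahout's Kluster.
    public static class KlusterLike implements ClusterLike {
        private final int id;

        public KlusterLike(int id) {
            this.id = id;
        }

        @Override
        public String getIdentifier() {
            return "CL-" + id; // illustrative format only
        }
    }

    public static void main(String[] args) {
        // ClusterLike c = new ClusterLike(0); // does not compile: an interface cannot be instantiated
        ClusterLike c = new KlusterLike(0);    // declare by interface, construct the concrete class
        System.out.println(c.getIdentifier());
    }
}
```

The variable can still be declared with the interface type; only the expression after new has to name a concrete class.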

To run k-means on a CSV file, you first have to create a SequenceFile to pass as an argument to KMeansDriver. The following code reads each line of the CSV file "points.csv", converts it into a vector, and writes it to the SequenceFile "points.seq":

// Convert each row of points.csv into a Vector and append it to points.seq.
// fs and conf are the same FileSystem and Configuration objects used in the code above.
try (
    BufferedReader reader = new BufferedReader(new FileReader("testdata2/points.csv"));
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
            new Path("testdata2/points.seq"), LongWritable.class, VectorWritable.class)
) {
    String line;
    long counter = 0;
    while ((line = reader.readLine()) != null) {
        String[] c = line.split(",");
        if (c.length > 1) { // skip blank or single-field lines
            double[] d = new double[c.length];
            for (int i = 0; i < c.length; i++) {
                d[i] = Double.parseDouble(c[i]);
            }
            Vector vec = new RandomAccessSparseVector(c.length);
            vec.assign(d);

            VectorWritable writable = new VectorWritable();
            writable.set(vec);
            writer.append(new LongWritable(counter++), writable);
        }
    }
    // No explicit writer.close() is needed: try-with-resources closes both resources.
}
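If you want to check the CSV parsing on its own before involving Hadoop or Mahout, a plain-Java sketch like the one below may help. The class and method names (CsvPoints, parseLine, parseAll) are made up for this example. It assumes purely numeric, comma-separated fields with no header row and no quoted values, and it skips lines with fewer than two fields, as the code above does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvPoints {
    /** Parses one CSV line into doubles; returns null for lines with fewer than two fields. */
    public static double[] parseLine(String line) {
        String[] fields = line.split(",");
        if (fields.length < 2) {
            return null; // blank or single-field line, skipped by the caller
        }
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    /** Parses every usable line and returns the points in input order. */
    public static List<double[]> parseAll(List<String> lines) {
        List<double[]> points = new ArrayList<>();
        for (String line : lines) {
            double[] point = parseLine(line);
            if (point != null) {
                points.add(point);
            }
        }
        return points;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1.0,2.0", "", "3.5, 4.5");
        List<double[]> points = parseAll(lines);
        System.out.println(points.size() + " points parsed"); // prints "2 points parsed"
    }
}
```

Each double[] returned here corresponds to the array d that gets assigned to the vector in the code above. A non-numeric field (for example a header row) makes Double.parseDouble throw NumberFormatException, so such rows would need to be filtered out first.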

Hope it helps!!