
I am trying to use the KDDCup 99 data in my machine learning project. I decided to use Spark MLlib and am trying out Random Forest first. I am referring to the Random Forest examples here.

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestClassificationExample.java
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestRegressionExample.java

Now, the data they have used is in the following format.

0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252 186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224 211:252 212:253 213:252 214:202 215:84 216:252 217:253 218:122 236:163 237:252 238:252 239:252 240:253 241:252 242:252 243:96 244:189 245:253 246:167 263:51 264:238 265:253 266:253 267:190 268:114 269:253 270:228 271:47 272:79 273:255 274:168 290:48 291:238 292:252 293:252 294:179 295:12 296:75 297:121 298:21 301:253 302:243 303:50 317:38 318:165 319:253 320:233 321:208 322:84 329:253 330:252 331:165 344:7 345:178 346:252 347:240 348:71 349:19 350:28 357:253 358:252 359:195 372:57 373:252 374:252 375:63 385:253 386:252 387:195 400:198 401:253 402:190 413:255 414:253 415:196 427:76 428:246 429:252 430:112 441:253 442:252 443:148 455:85 456:252 457:230 458:25 467:7 468:135 469:253 470:186 471:12 483:85 484:252 485:223 494:7 495:131 496:252 497:225 498:71 511:85 512:252 513:145 521:48 522:165 523:252 524:173 539:86 540:253 541:225 548:114 549:238 550:253 551:162 567:85 568:252 569:249 570:146 571:48 572:29 573:85 574:178 575:225 576:253 577:223 578:167 579:56 595:85 596:252 597:252 598:252 599:229 600:215 601:252 602:252 603:252 604:196 605:130 623:28 624:199 625:252 626:252 627:253 628:252 629:252 630:233 631:145 652:25 653:128 654:252 655:253 656:252 657:141 658:37

And the KDDCup data is in the following format.

0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.

Q1. Can someone please explain what the fields in the data used in the Spark example represent? I understand that the first column (0) is the label (True/False) and that the data is space-separated, but I don't understand what the colon-separated pairs of numbers mean.

Q2. I also don't know exactly how to convert the strings in my KDDCup 99 data into numerical values for the Random Forest. Is there any built-in function for this in Spark MLlib?

Q3. Any example of Spark MLlib using real-world data would be very helpful.

Thank you

I am relatively new to ML, but to my knowledge the data format you mentioned is called LIBSVM format; you can find a lot of references on it. One link that could be useful: csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f306 - Aditya
One more useful link for understanding the LIBSVM format: souledsole.wordpress.com/2016/01/13/… Hope this is useful. - Aditya
OK Aditya, this helps a lot. Now I can move in one direction rather than being stuck and spending too much time figuring it out from the pattern of the data. However, there seem to be three options: "loadLabeledPoints", "loadLibSVMFile" and "loadVectors". Please let me know if you have any idea how those functions differ in usage. - Keyur Golani
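
For readers with the same question as Q1: in LIBSVM format each line is a label followed by `index:value` pairs, where the index is the (1-based) feature number and any feature not listed is taken to be zero. A minimal Python sketch of how one such line can be read is below; the function name `parse_libsvm_line` is made up for illustration and is not part of Spark.

```python
def parse_libsvm_line(line):
    """Split one LIBSVM line into (label, {feature_index: value})."""
    parts = line.split()
    label = float(parts[0])          # first token is the label
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # Indices are 1-based in the file; features that are absent are zero.
        features[int(index)] = float(value)
    return label, features


# First few tokens of the sample line from the question.
label, features = parse_libsvm_line("0 128:51 129:159 130:253")
print(label, features)  # 0.0 {128: 51.0, 129: 159.0, 130: 253.0}
```

So `128:51` in the sample above reads as "feature number 128 has value 51".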

1 Answer


So I was able to figure out a way to work with the KDDCup 99 dataset in Spark after all. I converted the string fields into integer indexes, reading the mapping from a properties file, and moved the label to the beginning of each record. I then used a Python script to convert the final KDDCup data into LIBSVM format. My code for the whole process can be found in the GitHub repository below. https://github.com/keyurgolani/RandomForestSpark.git
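
The steps described above could look roughly like the following Python sketch. It is not the code from the repository; it is a simplified illustration under a few assumptions: the string-valued columns are at positions 1, 2 and 3 (as in the sample record in the question), the label is collapsed to binary (`normal` becomes 0, anything else 1), and an in-memory dictionary stands in for the properties file. The names `CATEGORICAL_COLUMNS` and `kdd_to_libsvm` are hypothetical.

```python
# Assumed 0-based positions of the string columns (protocol, service, flag).
CATEGORICAL_COLUMNS = (1, 2, 3)


def kdd_to_libsvm(record, vocab):
    """Convert one comma-separated KDD Cup 99 record into a LIBSVM line.

    vocab maps column position -> {string: integer code}. It is filled in
    as new strings are seen and plays the role of the properties file.
    """
    fields = record.strip().split(",")
    # Assumption: binary label, "normal." -> 0 and every attack type -> 1.
    label = 0 if fields[-1].rstrip(".") == "normal" else 1
    tokens = [str(label)]  # label goes first, as LIBSVM expects
    for pos, field in enumerate(fields[:-1]):
        if pos in CATEGORICAL_COLUMNS:
            codes = vocab.setdefault(pos, {})
            # Give each new string the next unused integer code (0, 1, 2, ...).
            value = float(codes.setdefault(field, len(codes)))
        else:
            value = float(field)
        if value != 0.0:  # sparse format: zero-valued features are omitted
            text = str(int(value)) if value.is_integer() else repr(value)
            tokens.append(f"{pos + 1}:{text}")  # LIBSVM indices are 1-based
    return " ".join(tokens)


vocab = {}
print(kdd_to_libsvm("0,tcp,http,SF,215,45076,normal.", vocab))
print(vocab)
```

Because the format is sparse, a categorical value that happens to receive code 0 is simply left out of the line. The same `vocab` has to be reused for every record so that a given string always maps to the same integer.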

I was able to figure all this out thanks to the contribution of Addy.