
I am in the process of doing a POC on retail transaction data, using a few machine learning algorithms to come up with a prediction model for out-of-stock analysis. My questions might sound stupid, but I would really appreciate it if you or anyone else could answer them.

So far I have been able to: get a data set ==> convert the features into (LabeledPoint, feature vectors) ==> train an ML model ==> run the model on a test data set ==> get the predictions.

Problem 1:

Since I have no experience with any of Java, Python, or Scala, I am building my features in the database and saving that data as a CSV file for my machine learning algorithm.

How do we create features from raw data using Scala?

Problem 2:

The source data set consists of many features for each (Store, Product, Date) combination and their recorded OOS events (the target):

StoreID (text column), ProductID (text column), TranDate, (Label/Target), Feature1, Feature2, ..., FeatureN

Since the features can only contain numeric values, I create features out of the numeric columns and not the text ones (which form the natural key for me). When I run the model on a validation set I get a (Prediction, Label) array back.

Now how do I link this result set back to the original data set and see which specific (Store, Product, Date) combinations might have a possible out-of-stock event?

I hope the problem statement was clear enough.

MJ

If you spend a few minutes looking at other Stack Overflow questions, you will see they are short, succinct, and usually include some code and/or error messages. Questions that would require an entire multipage article, book, or course to answer are too broad. A question very similar to what you pose is answered in a week's [week 4] lab and lecture in the edX MOOC Scalable Machine Learning, but using Apache Spark and Python instead of Scala. – Paul
Basically, you can post a question here on Spark, with source code/error messages, when you get stuck, and someone will be glad to help. But if you need someone to write your entire ML model/system, or demonstrate one, that is more in the realm of blog articles, courses, and paid consulting. – Paul
And the missing step to combine the predictions and the validation or test data is called a JOIN. It's the same concept as joining two DB tables: the predictions and the validation data need to have (or be mapped to have) a common key. If your rows of data don't have something that can be used as a unique key for the row, you could use the row number; in Spark there's zipWithIndex for that. – Paul
Thanks Paul. I understand your concern, and my bad if I was not able to send the right message through my question. I have been working on complex SQL for the last 13 years and luckily :) know the JOIN concept and the need to have a common key in two data sets that are to be joined. – Mjas
I have actually already written all the code for this and it's running perfectly fine. All I wanted to know is that, the way the MLlib algorithm runs, it only operates on the features of the data, and the feature vector matrix does not include the natural key (StoreId, ProdId, Date in my case) of the original data set. In that case, how do we join the predictions back to the data set? Does the result set preserve the order of the original data? – Mjas

1 Answer


Spark's Linear Regression Example

Here's a snippet from the Spark docs' Linear Regression example that is fairly instructive and easy to follow.

It solves both your "Problem 1" and "Problem 2".

It doesn't need a JOIN and doesn't even rely on RDD order.
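
For completeness, the code below uses Spark's RDD-based MLlib API and assumes the standard imports:

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.MulticlassMetrics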

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")

Here data is an RDD of text lines.

val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

Problem 1: Parsing the Features

This is data dependent. Here we see that lines are being split on ',' into fields. It appears this data was a CSV of entirely numeric data.

The first field is treated as the label of a labeled point (the dependent variable), and the rest of the fields are converted from text to double (floating point) and packed into a vector. This vector holds the features, or independent variables.

In your own projects, the part you need to remember is the goal of parsing into an RDD of LabeledPoints, where the first parameter of LabeledPoint, the label, is the true dependent numeric value, and the second parameter, the features, is a Vector of numbers.

Getting the data into this condition requires knowing how to code. Python may be easiest for data parsing. You can always use other tools to create a purely numeric CSV, with the dependent variable in the first column, and the numeric features in the other columns, and no header line -- and then duplicate the example parsing function.
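
To make that concrete for your layout, here is a rough sketch (not tested against your data) of parsing a CSV whose columns are StoreID, ProductID, TranDate, the label, and then the numeric features. The file name and column positions are assumptions; adjust them to your actual extract. The trick for your Problem 2 is to keep the natural key alongside each LabeledPoint instead of throwing it away:

// Sketch only: assumes a hypothetical file "oos_features.csv" laid out as
// storeId,productId,tranDate,label,feature1,...,featureN with no header line.
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

val raw = sc.textFile("data/oos_features.csv")

// Keep the natural key (StoreID, ProductID, TranDate) next to the LabeledPoint
// so each prediction can be traced back to a specific row later.
val keyedPoints = raw.map { line =>
  val fields   = line.split(',')
  val key      = (fields(0), fields(1), fields(2))            // StoreID, ProductID, TranDate
  val label    = fields(3).toDouble                           // the OOS target
  val features = Vectors.dense(fields.drop(4).map(_.toDouble))
  (key, LabeledPoint(label, features))
}.cache()

// The model only ever sees the LabeledPoints:
// val model = LinearRegressionWithSGD.train(keyedPoints.values, numIterations)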

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

At this point we have a trained model object. The model object has a predict method that operates on feature vectors and returns estimates of the dependent variable.
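
For instance, calling it on the features of a single point from the parsed RDD returns one Double:

// One prediction for one row; the vector must have the same dimension as the training features.
val firstPoint = parsedData.first()
val estimate: Double = model.predict(firstPoint.features)

There is also an overload of predict that takes an RDD[Vector] and returns an RDD[Double] if you want to score a whole set at once.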

Encoding Text features

The ML routines typically want numeric feature vectors, but you can often translate free text or categorical features (color, size, brand name) into numeric vectors in some space. There are a variety of ways to do this, such as Bag-of-Words for text, or One-Hot Encoding for categorical data, where you code a 1.0 or 0.0 for membership in each possible category (watch out for multicollinearity, though). These methodologies can create large feature vectors, which is why Spark has iterative methods available for training models. Spark also has a SparseVector() class, where you can easily create vectors with all but certain feature dimensions set to 0.0.
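
As a small illustration of the one-hot idea (the store IDs and index below are made up, not from your data), a categorical StoreID can be mapped to a SparseVector with a single 1.0:

// Hand-rolled one-hot encoding sketch; build the category-to-index map from your own data.
val storeIndex = Map("S001" -> 0, "S002" -> 1, "S003" -> 2)

def oneHotStore(storeId: String) =
  Vectors.sparse(storeIndex.size, Seq((storeIndex(storeId), 1.0)))

oneHotStore("S002")   // SparseVector of length 3 with a 1.0 at position 1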

Problem 2: Comparing model Predictions to the True values

Next they test this model with the training data, but the calls would be the same with external test data, provided that the test data is an RDD of LabeledPoint(dependent value, Vector(features)). The input could be changed by changing the variable parsedData to some other RDD.

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

Notice that this returns tuples of the true dependent variable (previously stored in point.label) and the model's prediction computed from point.features, for each row or LabeledPoint.

Now we are ready to compute the Mean Squared Error, since the valuesAndPreds RDD contains tuples (v, p) of the true value v and the prediction p, both of type Double.

The MSE is a single number: the tuples are first mapped to an RDD of squared errors (v - p)^2 and then averaged.

val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()

println("training Mean Squared Error = " + MSE)

Spark's Logistic Example

This is similar, but here you can see data is already parsed and split into training and test sets.

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

Here the model is trained against the training set.

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)

And tested (compared) against the test set. Notice that even though this is a different model (logistic instead of linear), there is still a model.predict method that takes a point's features vector as a parameter and returns the prediction for that point.

Once again the prediction is paired with the true value, from the label, in a tuple for comparison in a performance metric.

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision)

What about JOIN? RDD.join comes in when you have two RDDs of (key, value) pairs and you need an RDD corresponding to the intersection of keys, with both values. But we didn't need that here.
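
If you do end up needing to tie predictions back to (Store, Product, Date), here is a rough sketch of two options; the RDD names originalRows and predictions are placeholders, not variables from the example above:

// Option 1: carry the natural key through scoring (keyedPoints as in the
// parsing sketch earlier: RDD[((String, String, String), LabeledPoint)]).
// No join and no reliance on row order.
val keyedPredictions = keyedPoints.map { case (key, point) =>
  (key, (point.label, model.predict(point.features)))
}

// Option 2: two RDDs that only correspond by position -- pair each with its
// row number and join on it.
val keyedOriginal   = originalRows.zipWithIndex().map(_.swap)   // (rowIdx, originalRow)
val keyedByRowPreds = predictions.zipWithIndex().map(_.swap)    // (rowIdx, prediction)
val joined          = keyedOriginal.join(keyedByRowPreds)       // (rowIdx, (originalRow, prediction))

Note that Option 2 assumes both RDDs are in the same order, which is why carrying the key through scoring, as in Option 1, is the safer route.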