
I have started using scikit-learn and I am trying to train and use a Gaussian Naive Bayes classifier for prediction. I don't really know what I'm doing and would appreciate some help.

PROBLEM: I feed the classifier items of type 1, but it always predicts that they are of type 0.

HOW I DID IT: To generate the training data I do this:

    # this is of type 1
    ganado = {
        "Hora": "16:43:35",
        "Fecha": "19/06/2015",
        "Tiempo": 10,
        "Brazos": "der",
        "Sentado": "no",
        "Puntuacion Final Pasteles": 50,
        "Nombre": "usuario1",
        "Puntuacion Final Botellas": 33
    }
    # this is type 0
    perdido = {
        "Hora": "16:43:35",
        "Fecha": "19/06/2015",
        "Tiempo": 10,
        "Brazos": "der",
        "Sentado": "no",
        "Puntuacion Final Pasteles": 4,
        "Nombre": "usuario1",
        "Puntuacion Final Botellas": 3
    }

    train = []
    for repeticion in range(400):
        train.append(ganado)
    for repeticion in range(1):
        train.append(perdido)

I label the data using this weak condition:

    listLabel = []
    for data in train:
        condition = data["Puntuacion Final Pasteles"] + data["Puntuacion Final Botellas"]
        if condition < 20:
            listLabel.append(0)
        else:
            listLabel.append(1)

And I generate the test data like this:

    # this should be type 1
    pruebaGanado = {
        "Hora": "16:43:35",
        "Fecha": "19/06/2015",
        "Tiempo": 10,
        "Brazos": "der",
        "Sentado": "no",
        "Puntuacion Final Pasteles": 10,
        "Nombre": "usuario1",
        "Puntuacion Final Botellas": 33
    }
    # this should be type 0
    pruebaPerdido = {
        "Hora": "16:43:35",
        "Fecha": "19/06/2015",
        "Tiempo": 10,
        "Brazos": "der",
        "Sentado": "no",
        "Puntuacion Final Pasteles": 2,
        "Nombre": "usuario1",
        "Puntuacion Final Botellas": 3
    }

    test = []
    for repeticion in range(420):
        test.append(pruebaGanado)
        test.append(pruebaPerdido)

After that, I use train and listLabel to train the classifier:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import GaussianNB

    vec = DictVectorizer()
    X = vec.fit_transform(train)
    gnb = GaussianNB()
    trained = gnb.fit(X.toarray(), listLabel)

Once I have trained the classifier, I use the test data:

    testX = vec.transform(test)  # transform, not fit_transform, so test columns match training
    predicted = trained.predict(testX.toarray())

The results are always 0. Could you tell me what I did wrong and how to fix it, please?


1 Answer


First of all, since your data has features that are not informative (they have the same value for every sample), I cleaned it up a bit:

    ganado = {
        "a": 50,
        "b": 33
    }
    perdido = {
        "a": 4,
        "b": 3
    }
    pruebaGanado = {
        "a": 10,
        "b": 33
    }
    pruebaPerdido = {
        "a": 2,
        "b": 3
    }

Everything else is unimportant, and cleaning your code will help you focus on what counts.

Now, Gaussian Naive Bayes is all about probability: as you may notice, the classifier tries to tell you that:

P((a,b)=(10,33)|class=0)*P(class=0)   >   P((a,b)=(10,33)|class=1)*P(class=1)

Because it assumes that both a and b are normally distributed, and the likelihoods in this case are extremely small, the priors you gave it (400 vs 1) become negligible. You can see the formula itself here. By the way, you can get the exact probabilities:

    t = [pruebaGanado, pruebaPerdido]
    t = vec.fit_transform(t)
    print(trained.predict_proba(t.toarray()))
    # prints:
    # [[ 1.  0.]
    #  [ 1.  0.]]

So the classifier is sure that 0 is the right class. Now, let's change the test data a bit:

    pruebaGanado = {
        "a": 20,
        "b": 33
    }

Now we have:

    [[ 0.  1.]
     [ 1.  0.]]

So you did nothing wrong; it is all a matter of calculation. By the way, I challenge you to replace GaussianNB with MultinomialNB and see how the priors change everything.
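To see the effect, here is a sketch of that swap, rebuilding the 400-to-1 training set from the cleaned points. With MultinomialNB the heavy class-1 prior now dominates, and even the "losing" test point gets pulled to class 1:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# reconstruct the training matrix: 400 copies of ganado, 1 copy of perdido
X = np.array([[50, 33]] * 400 + [[4, 3]])
y = np.array([1] * 400 + [0])

mnb = MultinomialNB().fit(X, y)

# pruebaGanado and pruebaPerdido as feature rows
print(mnb.predict(np.array([[10, 33], [2, 3]])))  # → [1 1]
```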

Also, unless you have a very good reason to use GaussianNB here, I would consider some kind of tree-based classifier, which in my opinion may suit your problem better.
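As a sketch of that suggestion, using DecisionTreeClassifier as one concrete choice (same reconstructed training data as above): a tree learns an axis-aligned threshold split instead of assuming any distribution, and it separates these two training points perfectly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# reconstruct the training matrix: 400 copies of ganado, 1 copy of perdido
X = np.array([[50, 33]] * 400 + [[4, 3]])
y = np.array([1] * 400 + [0])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict(np.array([[10, 33], [2, 3]])))
```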