
I have a doubt about Naive Bayes with numeric and non-numeric features. For example, I have 5 independent parameters, and I want to classify data based on them.

Male,Suspicion of Alcohol,Weekday,12am-4am,75,30-39
Male,Moving Traffic Violation,Weekday,12am-4am,0,20-24
Male,Suspicion of Alcohol,Weekend,4am-8am,12,40-49
Male,Suspicion of Alcohol,Weekday,12am-4am,0,50-59
Female,Road Traffic Collision,Weekend,12pm-4pm,0,20-24
Male,Road Traffic Collision,Weekday,12pm-4pm,0,25-29
Male,Road Traffic Collision,Weekday,8pm-12pm,0,Other
Male,Other,Weekday,8am-12pm,23,60-69
Male,Moving Traffic Violation,Weekend,12pm-4pm,26,30-39
Female,Road Traffic Collision,Weekend,4am-8am,61,16-19
Male,Moving Traffic Violation,Weekend,4pm-8pm,74,25-29
Male,Road Traffic Collision,Weekday,12am-4am,0,Other
Male,Moving Traffic Violation,Weekday,8pm-12pm,0,16-19
Male,Road Traffic Collision,Weekday,8pm-12pm,0,Other
Male,Moving Traffic Violation,Weekend,4am-8am,0,30-39

As you can see, some parameters are numeric and some are non-numeric. Does anyone know how to convert non-numeric data to numeric data?

  1. How do I convert non-numeric values to numeric ones?
  2. If I use TF, is that right or not?
  3. If TF is right, then why?

1 Answer


You can start with the following:

  • Convert each non-numeric feature to an integer code by applying a factorizer. An example:

  • Feature1: Male = 0, Female = 1

  • Feature2: Alcohol = 0, Collision = 1, Moving-Violation = 2
  • ...

and so on.
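
The factorizing step described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `factorize` and the three sample rows (copied from the question) are choices made for the example, and codes are assigned in order of first appearance, which is one reasonable convention among several. Libraries offer the same idea ready-made, for instance `pandas.factorize`.

```python
def factorize(values):
    """Map each distinct value to an integer code (0, 1, 2, ...).

    Codes are assigned in order of first appearance. Returns the list of
    codes plus the value -> code mapping so it can be reused later.
    """
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


# Three rows taken from the question, split into fields.
rows = [
    "Male,Suspicion of Alcohol,Weekday,12am-4am,75,30-39",
    "Male,Moving Traffic Violation,Weekday,12am-4am,0,20-24",
    "Female,Road Traffic Collision,Weekend,12pm-4pm,0,20-24",
]
records = [row.split(",") for row in rows]

# Transpose so that each tuple holds all values of one column.
columns = list(zip(*records))

gender_codes, gender_map = factorize(columns[0])
reason_codes, reason_map = factorize(columns[1])
```

The returned mapping should be kept and reused for any new data, so that the same value always receives the same code.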

Each distinct value in a "column" should get its own integer code in your factorized result. Hopefully ranges like 4pm-8pm do not overlap; if they do, you can ignore that detail at first and do some more careful manual featurization later if time allows.

Each entry/line in your input consists of a handful of "features" (six fields per line in your sample). You can then create a feature vector out of each line. The results are now tf-idf ready (TM). You can apply the NB algorithm to your newly minted feature vectors and classify them by comparing the relative class probabilities.
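
As a rough sketch of that last step, here is a small categorical Naive Bayes with Laplace smoothing written in plain Python. Several things are assumptions made purely for illustration: the question does not say which column is the class, so the reason for the stop is used as the label; only two features (day type and time slot) are included; and the names `train_nb` and `predict_nb` are hypothetical. The integer codes are treated as category labels, not as magnitudes.

```python
import math
from collections import Counter


def train_nb(X, y, alpha=1.0):
    """Count class and per-class feature-value frequencies.

    X is a list of integer-coded feature vectors, y the list of class codes.
    alpha is the Laplace smoothing constant.
    """
    n_features = len(X[0])
    class_counts = Counter(y)
    # value_counts[c][j][v] = times feature j had value v within class c
    value_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            value_counts[c][j][v] += 1
    # Number of distinct values per feature, needed for smoothing.
    n_values = [len({xi[j] for xi in X}) for j in range(n_features)]
    return class_counts, value_counts, n_values, alpha


def predict_nb(model, x):
    """Return the class with the highest log posterior for vector x."""
    class_counts, value_counts, n_values, alpha = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for j, v in enumerate(x):
            # Smoothed estimate of P(feature j = v | class c)
            p = (value_counts[c][j][v] + alpha) / (n_c + alpha * n_values[j])
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Assumed coding for the first six rows of the question:
# day:   Weekday=0, Weekend=1
# time:  12am-4am=0, 4am-8am=1, 12pm-4pm=2
# label: Suspicion of Alcohol=0, Moving Traffic Violation=1, Road Traffic Collision=2
X = [[0, 0], [0, 0], [1, 1], [0, 0], [1, 2], [0, 2]]
y = [0, 1, 0, 0, 2, 2]

model = train_nb(X, y)
weekday_night = predict_nb(model, [0, 0])
weekend_afternoon = predict_nb(model, [1, 2])
```

The truly numeric column is left out here; one option is to bin it into ranges and factorize those, another is to model it with a Gaussian likelihood instead of counts.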