I have a mixed dataset of numerical and categorical variables. I tried to run k-prototypes clustering (kproto) based on https://www.kaggle.com/rahultej/k-prototypes-correlation-randomforest and https://journal.r-project.org/archive/2018/RJ-2018-048/RJ-2018-048.pdf

I removed the columns containing NA from my data frame and ran kproto without doing any data transformation on the categorical variables.

The columns containing categorical data do not all have the same number of categories. For example, column X has 4 categories and column Y has 15. I'm not sure whether kproto works in such a scenario.

I'm getting the error below:

Error in Ops.data.frame(x[, j], rep(protos[i, j], nrows)) : list of length 1043 not meaningful

I also tried converting the categorical variables to numeric, although I have not used the scale function. When I convert the categorical variables to numeric, it throws the error "No factor variables in x! Try using kmeans()...". This is the call I'm running:

data_kproto <- kproto(data, k = 4)

1 Answer

Turn all factors with more than 2 levels into individual dummy columns. Scale the numeric data to z-scores. Make sure the data is a data frame.

# Turn to dummies
library(caret)
dummies <- dummyVars(" ~ .", data)
data <- data.frame(predict(dummies, newdata = data))

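# Scale the numeric columns in place (assigning the result of scale() to
# `data` directly would drop every other column)
num_cols <- c("numeric_1", "numeric_2")
data[, num_cols] <- scale(data[, num_cols])

# Check data frame
data <- as.data.frame(data)

# kproto stops with "No factor variables in x!" when no column is a factor,
# so turn the 0/1 dummy columns back into factors before clustering
library(clustMixType)
dummy_cols <- setdiff(names(data), num_cols)
data[dummy_cols] <- lapply(data[dummy_cols], factor)
data_kproto <- kproto(data, k = 4)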
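
As an alternative, here is a minimal sketch that skips the dummy coding and passes the categorical columns to kproto as factors. Columns with different numbers of categories (4 vs. 15) should not be a problem in themselves. The error in the question is raised by Ops.data.frame on x[, j], which means x[, j] came back as a data frame rather than a vector; that can happen, for example, when the data is a tibble instead of a plain data frame. The toy data, its column names, and the helper prep_for_kproto are made up for illustration and are not part of clustMixType.

```r
# Hypothetical toy data standing in for the asker's `data`
set.seed(42)
n <- 200
toy <- data.frame(
  numeric_1 = rnorm(n),
  numeric_2 = runif(n, 0, 100),
  cat_x = sample(c("a", "b", "c", "d"), n, replace = TRUE),  # 4 categories
  cat_y = sample(letters[1:15], n, replace = TRUE),          # 15 categories
  stringsAsFactors = FALSE
)

# Hypothetical helper: plain data frame, characters to factors, numerics to z-scores
prep_for_kproto <- function(df) {
  df <- as.data.frame(df)  # drops tibble / data.table classes
  for (nm in names(df)) {
    if (is.character(df[[nm]])) {
      df[[nm]] <- factor(df[[nm]])
    } else if (is.numeric(df[[nm]])) {
      df[[nm]] <- as.numeric(scale(df[[nm]]))
    }
  }
  df
}

prepped <- prep_for_kproto(toy)
str(prepped)

# Cluster only if clustMixType is installed
if (requireNamespace("clustMixType", quietly = TRUE)) {
  fit <- clustMixType::kproto(prepped, k = 4)
  print(table(fit$cluster))
}
```

Running str() on your own data before calling kproto is a quick way to confirm that it contains at least one numeric and at least one factor column.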