I have the following data:
Rank  Platforms       Technology
high  Windows||Linux  Unity
high  Linux
low   Windows         Unreal
low   Linux||MacOs    GameMakerStudio||Unity||Unreal
low                   GameMakerStudio
Both Platforms and Technology are categorical variables. The issue is that each cell can hold one value, no value (empty), or, most importantly, multiple values such as GameMakerStudio||Unity||Unreal. I am building a logistic regression model to predict Rank.
I am trying to encode these variables for my model, but I have not found a solution for list-type categorical values. I have read the page Encoding Categorical Variables and found that one-hot encoding is the closest match, but it still does not address my issue.
I could, of course, encode it manually. For example, the Platforms column has around 7 distinct values, so for Platforms = Windows||Linux I could set two columns, is_windows = true and is_linux = true. But the Technology column has 21 distinct values.
Is there a way to encode it automatically?