0
votes

I have a dataset with 8 mixed features (6 numeric and 2 categorical). Since the numeric values have different ranges, I need to normalize the dataset as a whole before performing further steps such as machine learning algorithms and dimensionality reduction (feature extraction).

My original dataset:

time          v1      v2     v3    ...    v7      v8
00:00:01     15435    0.7     13   ...    High    True
00:00:06     24356    3.6     23   ...    High    True
00:00:11     25567    8.3     82   ...    LOW     False
00:00:16     12345    5.4    110   ...    LOW     True
00:00:21     43246    1.7     93   ...    High    False
................................................
23:23:59     23456    3.8     45   ...    LOW     False

where v1 to v6 are numerical variables whose values lie in different ranges, as can be seen above. Moreover, v7 and v8 are categorical variables that each take only two values ({High, LOW} for v7 and {True, False} for v8).

I applied label encoding to the categorical variables (v7 and v8), encoding High and True as 1, and LOW and False as 0.
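The encoding described above can be sketched with pandas as follows (a minimal example with made-up rows; the column names v7/v8 follow the question):

```python
import pandas as pd

# small sample mirroring the question's categorical columns
df = pd.DataFrame({'v7': ['High', 'High', 'LOW'],
                   'v8': [True, True, False]})

# map the two category labels to 1/0
df['v7'] = df['v7'].map({'High': 1, 'LOW': 0})
# booleans cast directly to 1/0
df['v8'] = df['v8'].astype(int)

print(df)
```

An explicit `map` keeps control over which label becomes 1; `sklearn.preprocessing.LabelEncoder` would work too but assigns codes alphabetically.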

The following illustrates how the dataset looks after label encoding:

time          v1      v2     v3    ...    v7      v8
00:00:01     15435    0.7     13   ...     1       1
00:00:06     24356    3.6     23   ...     1       1
00:00:11     25567    8.3     82   ...     0       0
00:00:16     12345    5.4    110   ...     0       1
00:00:21     43246    1.7     93   ...     1       0
................................................
23:23:59     23456    3.8     45   ...     0       0

My question is as follows: it is easy to standardize the numerical features v1 to v6. However, I am not sure whether to standardize the encoded categorical variables as well, and if so, what would be the best way to do it?
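For reference, the "easy" part, standardizing only the numeric columns, might be sketched like this (using a small made-up slice of v1 to v3 from the table above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# three sample rows of the numeric features v1..v3
X = np.array([[15435, 0.7, 13],
              [24356, 3.6, 23],
              [25567, 8.3, 82]], dtype=float)

# each column is shifted to zero mean and scaled to unit variance
scaled = StandardScaler().fit_transform(X)
print(scaled)
```

In practice one would fit the scaler on the training split only and reuse it to transform the test split.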


2 Answers

0
votes

You can use UNIX time, for example:

import pandas as pd
import numpy as np

date = pd.date_range('1/1/2011', periods=10, freq='H')
df = pd.DataFrame({'date': date})
# nanoseconds since the epoch, integer-divided down to seconds
df['unix_time'] = df['date'].astype(np.int64) // 10**9

df

output:

                 date   unix_time
0 2011-01-01 00:00:00  1293840000
1 2011-01-01 01:00:00  1293843600
2 2011-01-01 02:00:00  1293847200
3 2011-01-01 03:00:00  1293850800
4 2011-01-01 04:00:00  1293854400
5 2011-01-01 05:00:00  1293858000
6 2011-01-01 06:00:00  1293861600
7 2011-01-01 07:00:00  1293865200
8 2011-01-01 08:00:00  1293868800
9 2011-01-01 09:00:00  1293872400

Now your machine learning algorithms can compare dates; you can also convert them back:

pd.to_datetime(df['unix_time'], unit='s')

output:

0   2011-01-01 00:00:00
1   2011-01-01 01:00:00
2   2011-01-01 02:00:00
3   2011-01-01 03:00:00
4   2011-01-01 04:00:00
5   2011-01-01 05:00:00
6   2011-01-01 06:00:00
7   2011-01-01 07:00:00
8   2011-01-01 08:00:00
9   2011-01-01 09:00:00
Name: unix_time, dtype: datetime64[ns]
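Applied to the question's time-of-day format, the same idea, converting each timestamp to a single number, could look like this (seconds since midnight; the sample values are taken from the question's table):

```python
import pandas as pd

# time-of-day strings as in the question's `time` column
t = pd.Series(['00:00:01', '00:00:06', '23:23:59'])

# parse as timedeltas and reduce to seconds since midnight
seconds = pd.to_timedelta(t).dt.total_seconds().astype(int)
print(seconds.tolist())  # [1, 6, 84239]
```

Note that seconds-since-midnight wraps around at the end of the day, so for cyclic time-of-day features a sine/cosine encoding is sometimes preferred.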
0
votes

Min-max normalization rescales values to the range 0 to 1. Your encoded categorical values are already in this range; you would only need to transform the categorical variables further if their cardinality were very high, so for now you can keep them as they are. I also suggest normalizing the whole dataset: then all features lie in the same range, and the algorithm will not erroneously give preference to features with larger numeric values. scikit-learn provides both normalization and scaling.

from sklearn.preprocessing import MinMaxScaler

X = your_data
# rescales each column to the [0, 1] range
# (preprocessing.normalize would instead scale each ROW to unit norm,
#  which is not the 0-to-1 rescaling described above)
normalized_X = MinMaxScaler().fit_transform(X)