0
votes

I have a dataset with 8 mixed features (6 numeric and 2 categorical). Since the numeric values have different ranges, I need to normalize the dataset as a whole before performing further steps such as machine learning algorithms and dimensionality reduction (feature extraction).

My original dataset:

time          v1      v2     v3    ...    v7      v8
00:00:01     15435    0.7     13   ...    High    True
00:00:06     24356    3.6     23   ...    High    True
00:00:11     25567    8.3     82   ...    LOW     False
00:00:16     12345    5.4    110   ...    LOW     True
00:00:21     43246    1.7     93   ...    High    False
................................................
23:23:59     23456    3.8     45   ...    LOW     False

where v1 to v6 are numerical variables whose values lie in different ranges, as can be seen above. Moreover, v7 and v8 are categorical variables that each take only two values ({High, LOW} for v7 and {True, False} for v8).

I applied label encoding to the categorical variables (v7 and v8), encoding High and True as 1, and LOW and False as 0.
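The encoding described above can be sketched with pandas as follows (a minimal example with made-up rows; the column names v7/v8 follow the question):

```python
import pandas as pd

# small sample mirroring the question's categorical columns
df = pd.DataFrame({'v7': ['High', 'High', 'LOW'],
                   'v8': [True, True, False]})

# map the two category labels to 1/0
df['v7'] = df['v7'].map({'High': 1, 'LOW': 0})
# booleans cast directly to 1/0
df['v8'] = df['v8'].astype(int)

print(df)
```

An explicit `map` keeps control over which label becomes 1; `sklearn.preprocessing.LabelEncoder` would work too but assigns codes alphabetically.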

The following illustrates how the dataset looks after label encoding:

time          v1      v2     v3    ...    v7      v8
00:00:01     15435    0.7     13   ...     1       1
00:00:06     24356    3.6     23   ...     1       1
00:00:11     25567    8.3     82   ...     0       0
00:00:16     12345    5.4    110   ...     0       1
00:00:21     43246    1.7     93   ...     1       0
................................................
23:23:59     23456    3.8     45   ...     0       0

My question is as follows: it is easy to standardize the numerical features v1 to v6. However, I am not sure whether to standardize the encoded categorical variables as well, and if so, what would be the best way to do it?
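For reference, the "easy" part, standardizing only the numeric columns, might be sketched like this (using a small made-up slice of v1 to v3 from the table above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# three sample rows of the numeric features v1..v3
X = np.array([[15435, 0.7, 13],
              [24356, 3.6, 23],
              [25567, 8.3, 82]], dtype=float)

# each column is shifted to zero mean and scaled to unit variance
scaled = StandardScaler().fit_transform(X)
print(scaled)
```

In practice one would fit the scaler on the training split only and reuse it to transform the test split.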


2 Answers

0
votes

You can use UNIX time, for example:

import pandas as pd
import numpy as np

date = pd.date_range('1/1/2011', periods=10, freq='H')
df = pd.DataFrame({'date': date})
# nanoseconds since the epoch, integer-divided down to seconds
df['unix_time'] = df['date'].astype(np.int64) // 10**9

df

output:

                 date   unix_time
0 2011-01-01 00:00:00  1293840000
1 2011-01-01 01:00:00  1293843600
2 2011-01-01 02:00:00  1293847200
3 2011-01-01 03:00:00  1293850800
4 2011-01-01 04:00:00  1293854400
5 2011-01-01 05:00:00  1293858000
6 2011-01-01 06:00:00  1293861600
7 2011-01-01 07:00:00  1293865200
8 2011-01-01 08:00:00  1293868800
9 2011-01-01 09:00:00  1293872400

Now your machine learning algorithms can compare dates; you can also convert them back:

pd.to_datetime(df['unix_time'], unit='s')

output:

0   2011-01-01 00:00:00
1   2011-01-01 01:00:00
2   2011-01-01 02:00:00
3   2011-01-01 03:00:00
4   2011-01-01 04:00:00
5   2011-01-01 05:00:00
6   2011-01-01 06:00:00
7   2011-01-01 07:00:00
8   2011-01-01 08:00:00
9   2011-01-01 09:00:00
Name: unix_time, dtype: datetime64[ns]
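Applied to the question's time-of-day format, the same idea, converting each timestamp to a single number, could look like this (seconds since midnight; the sample values are taken from the question's table):

```python
import pandas as pd

# time-of-day strings as in the question's `time` column
t = pd.Series(['00:00:01', '00:00:06', '23:23:59'])

# parse as timedeltas and reduce to seconds since midnight
seconds = pd.to_timedelta(t).dt.total_seconds().astype(int)
print(seconds.tolist())  # [1, 6, 84239]
```

Note that seconds-since-midnight wraps around at the end of the day, so for cyclic time-of-day features a sine/cosine encoding is sometimes preferred.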
0
votes

Min-max normalization rescales values to the range 0 to 1. Your encoded categorical values are already in this range; you would only need to transform the categorical variables further if their cardinality were very high, so for now you can keep them as they are. I also suggest normalizing the whole dataset: then all features lie in the same range, and the algorithm will not erroneously give preference to features with larger numeric values. scikit-learn provides both normalization and scaling.

from sklearn.preprocessing import MinMaxScaler

X = your_data
# rescales each column to the [0, 1] range
# (preprocessing.normalize would instead scale each ROW to unit norm,
#  which is not the 0-to-1 rescaling described above)
normalized_X = MinMaxScaler().fit_transform(X)