3
votes

I am trying to create a machine learning model using DecisionTreeClassifier. To train and test on my data, I imported the train_test_split function from scikit-learn, but I cannot understand one of its arguments, random_state.

What is the significance of assigning a numeric value to random_state in model_selection.train_test_split, and how do I know which value to assign to random_state for my decision tree?

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

4 Answers

6
votes

As the docs mention, random_state seeds the random number generator used in train_test_split (and, similarly, in other scikit-learn methods). Since there are many different ways to split a dataset, fixing the seed ensures that you can call the method several times on the same dataset (e.g. in a series of experiments) and always get the same result (here, the exact same train and test sets), i.e. it is there for reproducibility. Its exact value is not important and is not something you have to worry about.

Using the example in the docs, setting random_state=42 ensures that you get the exact same result shown there (the code below was actually run on my machine, not copy-pasted from the docs):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

X_train
# array([[4, 5],
#        [0, 1],
#        [6, 7]])

y_train
# [2, 0, 3]

X_test
# array([[2, 3],
#        [8, 9]])

y_test
# [1, 4]

You should experiment with different values for random_state (or without specifying it at all) in the above snippet to get a feel for how it behaves.

3
votes

Providing a value for random_state lets you reproduce the same split every time you re-run the program.

If you don't provide a value for random_state, you will get a different set of train and test values on each run. In that case, if you encounter an error, it will be hard to debug because you cannot reproduce the split that caused it.

Example:

Setup:

from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.read_csv("diabetes.csv")
X = data.iloc[:, 0:8]  # first eight columns are the features
X.head()
y = data.iloc[:, -1]   # last column is the target
y.head()

Loop with random_state:

for _ in range(2):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    print(X_train.head())
    print(X_test.head())
  • Note the data is the same for both iterations

Loop without random_state:

for _ in range(2):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    print(X_train.head())
    print(X_test.head())
  • Note the data is not the same for both iterations

If you run the code and compare the output, you will see that when random_state is the same, you get the same train / test sets, but when random_state is not provided, the values in the train / test sets differ each time.

2
votes

If you don't specify random_state, you will get a different (random) split every time you execute your code. If you give random_state a value, the split will always be the same. It is often used to make experiments reproducible.

For example:

from sklearn.model_selection import train_test_split

X = [[1, 5], [2, 6], [3, 2], [4, 7], [5, 5], [6, 2], [7, 1], [8, 6]]
y = [1, 2, 3, 4, 5, 6, 7, 8]

# One split without a seed, one with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(X, y, test_size=0.33, random_state=324)

# Arguments are passed in the same order as the labels in the format string
print("WITH RANDOM STATE: ")
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train_rs, y_train_rs, X_test_rs, y_test_rs))
print("WITHOUT RANDOM STATE: ")
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train, y_train, X_test, y_test))

If you run this code several times, you can see that the split without random_state changes on every run, while the split with random_state stays the same.

As explained in the sklearn documentation, random_state can be an integer if you want to specify the random number generator seed (the most frequent case), or directly an instance of the numpy.random.RandomState class.
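To illustrate the two options, here is a small sketch on synthetic data. The helper split_test_labels and the 100-row arrays are my own assumptions for demonstration. It suggests that an integer seed gives the same split on every call, whereas a single RandomState instance that is reused advances its internal state between calls, so the second split will normally differ from the first.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, assumed purely for illustration
X = np.arange(200).reshape((100, 2))
y = np.arange(100)

def split_test_labels(random_state):
    """Return the test labels for one split (hypothetical helper)."""
    _, _, _, y_test = train_test_split(X, y, test_size=0.33, random_state=random_state)
    return y_test.tolist()

# Integer seed: every call starts from the same generator state.
a = split_test_labels(324)
b = split_test_labels(324)
print("int seed, two calls identical:", a == b)

# A freshly created RandomState with the same seed behaves like the integer.
c = split_test_labels(np.random.RandomState(324))
print("fresh RandomState matches int seed:", a == c)

# Reusing ONE instance: the first call consumes random numbers,
# so the second call shuffles from a different state.
rng = np.random.RandomState(324)
d = split_test_labels(rng)
e = split_test_labels(rng)
print("reused instance, two calls identical:", d == e)
```

In practice, passing an integer is the simpler choice when all you want is the same split on every run.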

0
votes

The random_state argument just seeds the random shuffling. If you give a different random_state, the dataset will be split differently. If you provide the same random_state every time, the split will be the same.

If you want your dataset to be split the same way every time, provide the same random_state.