
I am trying to perform feature selection for a logistic regression classifier. Originally there are 4 variables: name, location, gender, and the label = ethnicity. The three predictor variables, chiefly the name, give rise to tens of thousands of additional "features": for example, the name "John Snow" yields 2-letter substrings like 'jo', 'oh', 'hn', and so on. The feature set then goes through DictVectorizer.
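For clarity, this is roughly what I mean by the substring features (a minimal sketch; the ngrams helper name is illustrative, not from my actual script, and lowercasing/stripping the space is an assumption):

# Every 2-letter substring of the lowercased, space-stripped name
# becomes one boolean feature after list_to_dict / DictVectorizer.
def ngrams(name, n=2):
    name = name.lower().replace(' ', '')
    return [name[i:i+n] for i in range(len(name) - n + 1)]

print(ngrams('John Snow'))  # ['jo', 'oh', 'hn', 'ns', 'sn', 'no', 'ow']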

I am trying to follow this tutorial (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html), but I am not sure I am doing it right, since the tutorial uses a small number of features while mine has tens of thousands after vectorization. Also, plt.show() displays a blank figure.

# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import re
import random
import time
from random import randint
import csv
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Assign X and y variables (df is a pandas DataFrame loaded elsewhere, not shown)
X = df.raw_name.values
X2 = df.name.values
X3 = df.gender.values
X4 = df.location.values
y = df.ethnicity_scan.values

# Feature extraction functions
def feature_full_name(nameString):
    try:
        full_name = nameString
        if len(full_name) > 1: # reject names with only 1 character
            return full_name
        else: return '?'
    except: return '?'

def feature_avg_wordLength(nameString):
    try:
        space = 0
        for i in nameString:
            if i == ' ':
                space += 1
        length = float(len(nameString) - space)
        name_entity = float(space + 1)
        avg = round(float(length/name_entity), 0)
        return avg
    except:
        return 0

def feature_name_entity(nameString2):
    space = 0
    try:
        for i in nameString2:
            if i == ' ':
                space += 1
        return space+1
    except: return 0

def feature_gender(genString):
    try:
        gender = genString
        if len(gender) >= 1:
            return gender
        else: return '?'
    except: return '?'

def feature_noNeighborLoc(locString):
    try:
        x = re.sub(r'^[^, ]*', '', locString) # strip everything up to the first ',' or ' '
        y = x[2:] # drop the ', ' separator itself
        return y
    except: return '?'

def list_to_dict(substring_list):
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring='+str(i)] = True
        return substring_dict
    except: return '?'

# Transform the X variables into feature dicts (DictVectorizer later turns these into a sparse matrix)
my_dict13 = [{'name-entity': feature_name_entity(feature_full_name(i))} for i in X2]
my_dict14 = [{'avg-length': feature_avg_wordLength(feature_full_name(i))} for i in X]
my_dict15 = [{'gender': feature_gender(i)} for i in X3]
my_dict16 = [{'location': feature_noNeighborLoc(feature_full_name(i))} for i in X4]

my_dict17 = [{'dummy1': 1} for i in X]
my_dict18 = [{'dummy2': random.randint(0,2)} for i in X]

# Merge the per-row feature dicts, then vectorize (dv must be created before use)
dv = DictVectorizer()
all_dict = []
for i in range(0, len(my_dict13)):
    temp_dict = dict(my_dict13[i].items() + my_dict14[i].items()
        + my_dict15[i].items() + my_dict16[i].items() + my_dict17[i].items() + my_dict18[i].items()
        )
    all_dict.append(temp_dict)

newX = dv.fit_transform(all_dict)  # scipy sparse matrix, not a dense numpy array

# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]

# Fit the logistic regression model on the training data
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Feature selection
plt.figure(1)
plt.clf()
X_indices = np.arange(X_train.shape[-1])
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X_train, y_train)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
    label=r'Univariate score ($-Log(p_{value})$)', color='g')
plt.show()
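Since there are tens of thousands of features, I suspect a bar per feature is unreadable at that scale anyway; as a sketch (reusing the fitted selector and dv from above), printing the top-scoring feature names may be more informative:

# List the 20 highest-scoring features by name instead of plotting them all
# (constant features can get NaN scores, so replace those with 0 first).
scores = np.nan_to_num(selector.scores_)
top = np.argsort(scores)[::-1][:20]
for i in top:
    print('%s: %.3f' % (dv.feature_names_[i], scores[i]))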

Warning:

E:\Program Files Extra\Python27\lib\site-packages\sklearn\feature_selection\univariate_selection.py:111: UserWarning: Features [[0 0 0 ..., 0 0 0]] are constant.

There is no error traceback, only the warning above; the script still produces a (blank) graph.
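For reference, the constant features the warning flags can be listed directly (a sketch, assuming the X_train and fitted dv from the code above; toarray() makes a dense copy, which may be memory-hungry with tens of thousands of features):

# Find columns that never vary across the training samples.
Xd = X_train.toarray()  # dense copy; may be large
constant_cols = np.where(Xd.min(axis=0) == Xd.max(axis=0))[0]
print([dv.feature_names_[i] for i in constant_cols])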

1 Answer


It looks like the way you split your data into training and testing sets is not working:

# Separate the training and testing data sets
X_train = newX[:half_cut]
X_test = newX[half_cut:]

If you are already using sklearn, it is much more convenient to use its built-in splitting routine:

from sklearn import cross_validation

# Split the vectorized features (newX, not the raw name array X) and labels 50/50
X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=0.5, random_state=0)
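Note that train_test_split shuffles the rows before splitting (random_state makes the shuffle reproducible), so unlike the sequential half-split above it is not affected by any ordering in the original data.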