
I am trying to perform feature selection for a logistic regression classifier. Originally there are 4 variables: name, location, gender, and label (ethnicity). The three input variables, the name in particular, give rise to tens of thousands of additional "features"; for example, the name "John Snow" gives rise to 2-letter substrings such as 'jo', 'oh', 'hn', etc. The feature set is then vectorized with DictVectorizer.

I am trying to follow this tutorial (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html), but I am not sure I am doing it right, since the tutorial uses a small number of features while mine has tens of thousands after vectorization. Also, plt.show() displays a blank figure.

# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import re
import random
import time
from random import randint
import csv
import sys

from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Assign X and y variables
X = df.raw_name.values
X2 = df.name.values
X3 = df.gender.values
X4 = df.location.values
y = df.ethnicity_scan.values

# Feature extraction functions
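def feature_full_name(nameString):
    """Return the name unchanged, or '?' if it is missing or only 1 character."""
    try:
        if len(nameString) > 1:  # reject names with only 1 character
            return nameString
        return '?'
    except Exception:  # e.g. a missing value that has no len()
        return '?'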

def feature_avg_wordLength(nameString):
    """Average number of characters per space-separated part of the name."""
    try:
        space = nameString.count(' ')
        length = float(len(nameString) - space)
        name_entity = float(space + 1)
        return round(length / name_entity, 0)
    except Exception:
        return 0

def feature_name_entity(nameString2):
    """Number of space-separated parts in the name."""
    try:
        return nameString2.count(' ') + 1
    except Exception:
        return 0

def feature_gender(genString):
    """Return the gender string, or '?' if it is empty or missing."""
    try:
        if len(genString) >= 1:
            return genString
        return '?'
    except Exception:
        return '?'

def feature_noNeighborLoc(locString):
    """Strip the leading component (up to the first ',' or ' ') from the location."""
    try:
        x = re.sub(r'^[^, ]*', '', locString)  # remove everything before the first ',' or ' '
        y = x[2:]  # remove the ', ' that follows
        return y
    except Exception:
        return '?'

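def list_to_dict(substring_list):
    """Turn a list of substrings into {'substring=<s>': True, ...}."""
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring=' + str(i)] = True
        return substring_dict
    except Exception:
        return {}  # empty dict so the result can still be merged and vectorized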

# Transform format of X variables, and spit out a numpy array for all features
my_dict13 = [{'name-entity': feature_name_entity(feature_full_name(i))} for i in X2]
my_dict14 = [{'avg-length': feature_avg_wordLength(feature_full_name(i))} for i in X]
my_dict15 = [{'gender': feature_gender(i)} for i in X3]
my_dict16 = [{'location': feature_noNeighborLoc(feature_full_name(i))} for i in X4]

my_dict17 = [{'dummy1': 1} for i in X]
my_dict18 = [{'dummy2': random.randint(0,2)} for i in X]

dv = DictVectorizer()

# Merge the per-feature dicts into one dict per sample
all_dict = []
for i in range(len(my_dict13)):
    temp_dict = {}
    for d in (my_dict13[i], my_dict14[i], my_dict15[i],
              my_dict16[i], my_dict17[i], my_dict18[i]):
        temp_dict.update(d)
    all_dict.append(temp_dict)

newX = dv.fit_transform(all_dict)

# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]

# Fitting X and y into model, using training data
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Feature selection
X_indices = np.arange(X_train.shape[-1])
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X_train, y_train)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)', color='g')
plt.show()


There is no error traceback, only the following warning, and the figure is generated but empty:

E:\Program Files Extra\Python27\lib\site-packages\sklearn\feature_selection\univariate_selection.py:111: UserWarning: Features [[0 0 0 ..., 0 0 0]] are constant.

1 Answer


It looks like the way you split your data into training and testing sets is not working:

# Separate the training and testing data sets
X_train = newX[:half_cut]
X_test = newX[half_cut:]
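
One way a positional split like this can go wrong is when the rows are ordered, for example grouped by label or location. That ordering is an assumption on my part, since the question does not say how the DataFrame is sorted. The sketch below uses a hypothetical 6-row toy matrix in place of your `newX` and checks which columns never change within the training half:

```python
import numpy as np

# Hypothetical toy data (not from the question): 6 rows, 2 one-hot columns,
# sorted so that the second feature only occurs in the last rows.
newX = np.array([[1, 0],
                 [1, 0],
                 [1, 0],
                 [0, 1],
                 [0, 1],
                 [0, 1]])

half_cut = int(len(newX) / 2.0) * -1   # same formula as in the question
X_train = newX[:half_cut]              # everything except the last half
X_test = newX[half_cut:]               # the last half

# A column is constant in the training half if its max equals its min there
constant_in_train = np.ptp(X_train, axis=0) == 0
print(constant_in_train)
```

With ordered rows, every column is constant inside the training half, which is the kind of situation the "Features ... are constant" warning complains about.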

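If you are already using sklearn, it is much more convenient to use its built-in splitting routine, which also shuffles the rows before splitting:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases

X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=0.5, random_state=0)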
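
As a rough end-to-end sketch of that suggestion: in current scikit-learn releases `train_test_split` lives in `sklearn.model_selection` (the `cross_validation` module was removed in 0.20), and it should be given the vectorized matrix rather than the raw names. The toy records, labels, and the `stratify=y` argument below are my own assumptions for illustration, not something taken from your data:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy records, deliberately sorted by label
labels = ['a'] * 20 + ['b'] * 20
all_dict = [{'substring=' + lab * 2: True, 'dummy1': 1} for lab in labels]
y = np.array(labels)

dv = DictVectorizer()
newX = dv.fit_transform(all_dict)   # sparse matrix, one column per feature

# Shuffled split; stratify=y keeps both labels in each half (optional)
X_train, X_test, y_train, y_test = train_test_split(
    newX, y, test_size=0.5, random_state=0, stratify=y)

# Value range of each column within the training half; 0 means constant there
train_range = np.ptp(X_train.toarray(), axis=0)
for name, value_range in zip(dv.feature_names_, train_range):
    print(name, 'constant' if value_range == 0 else 'varies')
```

Here the substring features vary within the training half even though the input was sorted. `dummy1` is still constant by construction, like the `dummy1` feature in your code. As far as I can tell, `f_classif` returns NaN p-values for such columns, so `scores.max()` becomes NaN and every bar ends up NaN, which may be why the figure is blank. Dropping constant columns first, or normalizing with `np.nanmax(scores)`, would be one way to check this.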