
I am doing some machine learning and need help with one aspect of my coding. In my training data I have a number of URLs of webpages and some features for these webpages. I am running TF-IDF on the webpage text to create additional features.

One of the features I have extracted is the Google Page Rank of each web address. This value can be anything at all, but the lower the rank, the "better quality" Google has deemed the page to be.

How can I normalize this figure, given that I have 7,000 web addresses and the ranks can vary enormously (www.google.com, for instance, may be ranked #1, while www.bbc.co.uk may be #1,117, and other ranks will fall well outside our 7,000 URLs)?

How can I use scikit-learn to normalize this data effectively so that it can be used in my machine learning algorithm? I am running a Logistic Regression which simply tries to predict whether a webpage is "good" or not. The only features I use at the moment are the ones created by running TF-IDF on the webpage text. Ideally I would like to combine these with my page rank feature in a way that gives me the highest cross-validation score.
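For what it is worth, this is the sort of thing I had in mind for the rank column on its own (just a toy sketch with made-up values; I am not sure whether a log transform followed by scaling is the right treatment for such extreme ranges):

    import numpy as np
    from sklearn import preprocessing

    # made-up example ranks; the real column would hold my 7,000 retrieved values
    ranks = np.array([1, 1117, 52340, 2500000], dtype=float).reshape(-1, 1)

    # compress the enormous range first, then rescale to zero mean / unit variance
    log_ranks = np.log1p(ranks)
    scaled_ranks = preprocessing.StandardScaler().fit_transform(log_ranks)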

Thanks very much!

So we can assume my data is in a TSV of the form:

URL GooglePageRank WebsiteText

An example of a row:

http://www.google.com 1 This would be the text of the google webpage.
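For concreteness, this is roughly how I pull the columns out at the moment (a minimal sketch assuming the file has a header row with exactly those three column names):

    import pandas as p

    train = p.read_table('train.tsv')              # assumes header: URL, GooglePageRank, WebsiteText
    pageranks = train['GooglePageRank'].values     # the raw rank figures I want to normalize
    traindata = list(train['WebsiteText'].values)  # the text that goes into the TfidfVectorizer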

I wish to normalize my ranking data and use it in my Logistic Regression. At the moment, I am only using the "WebsiteText" column, running TF-IDF on it, and plugging that into my Logistic Regression. I want to learn how to combine this column with my normalized GooglePageRank column and use both columns in my Logistic Regression - how can I do this?

Here is my code thus far:

  import numpy as np
  from sklearn import metrics,preprocessing,cross_validation
  from sklearn.feature_extraction.text import TfidfVectorizer
  import sklearn.linear_model as lm
  import pandas as p
  loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')

  print "loading data.."
  traindata = list(np.array(p.read_table('train.tsv'))[:,2])
  testdata = list(np.array(p.read_table('test.tsv'))[:,2])
  y = np.array(p.read_table('train.tsv'))[:,-1]

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1)

  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None)

  X_all = traindata + testdata
  lentrain = len(traindata)

  print "fitting pipeline"
  tfv.fit(X_all)
  print "transforming data"
  X_all = tfv.transform(X_all)

  X = X_all[:lentrain]
  X_test = X_all[lentrain:]

  print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

  print "training on full data"
  rd.fit(X,y)
  pred = rd.predict_proba(X_test)[:,1]
  testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
  pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
  pred_df.to_csv('benchmark.csv')
  print "submission file created.."

*Edit:*

This is what I am currently running:

from sklearn import metrics,preprocessing,cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
import sklearn.preprocessing
import sklearn.linear_model as lm
import pandas as p
import numpy as np
loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=',')
print "loading data.."

#load train/test data for TF-IDF -- I know this is bad practice, but keeping it this way for the moment!
traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

#load labels
y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

#Load Integer values and append together
AllAlexaInfo = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-1]

#make tfidf object
tfv = TfidfVectorizer(min_df=1, max_features=None, strip_accents='unicode',  
                      analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), 
                      use_idf=1,smooth_idf=1,sublinear_tf=1)
div = DictVectorizer()
X = []
X_all = traindata + testdata
lentrain = len(traindata)
# fit/transform the TfidfVectorizer on the training data
vect = tfv.fit_transform(X_all) #bad practice, but using this for the moment!

for i, alexarank in enumerate(AllAlexaInfo):
    feature_dict = {'alexarank': AllAlexaInfo}
    # get ith row of the tfidf matrix (corresponding to sample)
    row = vect.getrow(i)    

    # filter the feature names corresponding to the sample
    all_words = tfv.get_feature_names()
    words = [all_words[ind] for ind in row.indices] 

    # associate each word (feature) with its corresponding score
    word_score = dict(zip(words, row.data)) 

    # concatenate the word feature/score with the datamining feature/value
    X.append(dict(word_score.items() + feature_dict.items()))

div.fit_transform(X)  # training data based on both Tfidf features and pagerank
sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)
X_test = X_all[lentrain:]
X_test = sc.transform(X_test)

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

print "training on full data"
rd.fit(X,y)
pred = rd.predict_proba(X_test)[:,1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."

This appears to run forever. I also believe I have a problem with the "alexarank" value not being input correctly - how can I fix this?

IIRC, you would like to combine the features from your TfidfVectorizer with the pagerank value, so that your logistic regression classifier makes its choice based on both the text features and the pagerank value? - Balthazar Rouberol
@BalthazarRouberol This is correct, yes :) - Simon Kiely

1 Answer


Based on your answer to my comment, here is how I would go about it:

tfv = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents='unicode',                    
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 2), 
    use_idf=1,
    smooth_idf=1,
    sublinear_tf=1)
div = DictVectorizer()

X = []

# fit/transform the TfidfVectorizer on the training data
vectors = tfv.fit_transform(traindata)
all_words = tfv.get_feature_names()  # look the feature names up once, outside the loop

# pageranks is assumed to hold the Google PageRank value for each training sample, in order
for i, pagerank in enumerate(pageranks):
    feature_dict = {'pagerank': pagerank}
    # get ith row of the tfidf matrix (corresponding to sample)
    row = vectors.getrow(i)

    # filter the feature names corresponding to the sample
    words = [all_words[ind] for ind in row.indices]

    # associate each word (feature) with its corresponding score
    word_score = dict(zip(words, row.data)) 

    # concatenate the word feature/score with the datamining feature/value
    X.append(dict(word_score.items() + feature_dict.items()))

X = div.fit_transform(X)  # training data based on both Tfidf features and pagerank
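If you would rather stay with sparse matrices end to end, an alternative (a sketch only, assuming `pageranks` is aligned row-for-row with `traindata` and `vectors` is the tfidf matrix from above) is to scale the pagerank values yourself and stack them onto the tfidf matrix as one extra column:

    import numpy as np
    from scipy.sparse import hstack
    from sklearn import preprocessing

    # one scaled pagerank value per sample, as a single extra column
    pagerank_col = np.array(pageranks, dtype=float).reshape(-1, 1)
    pagerank_col = preprocessing.StandardScaler().fit_transform(pagerank_col)

    # tfidf features + the pagerank column, ready for LogisticRegression / cross_val_score
    X_combined = hstack([vectors, pagerank_col]).tocsr()

Scaling matters here: the tfidf values are already small, so leaving the raw pagerank unscaled would let that single column dominate the regularized model.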