I am doing some machine learning and need help with one aspect of my coding. In my training data I have a number of URLs of webpages, along with some features for those webpages. I am running TF-IDF on the text of each webpage to create more features.
One of the features I have extracted is the Google page rank of each web address. This value can be arbitrarily large, but the lower the rank, the "better quality" Google has deemed the page to be.
How can I normalize this figure, given that I have 7,000 web addresses and the ranks vary enormously (www.google.com, for instance, may be ranked #1, while www.bbc.co.uk may be #1,117, and other ranks will fall well outside our set of 7,000 URLs)?
How can I use scikit-learn to normalize this data effectively so that it can be used in my machine learning algorithm? I am running a Logistic Regression that simply tries to predict whether a webpage is "good" or not. At the moment the only features I use are the ones created by running TF-IDF on the webpage text. Ideally I would like to combine these with my page rank feature in a way that gives me the highest cross-validation score.
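To make the question concrete, here is the kind of normalization I have been experimenting with - a minimal sketch only, where the log transform and the MinMaxScaler are just my assumptions about what might tame the huge range of ranks:

import numpy as np
from sklearn import preprocessing

# hypothetical ranks for three of my 7,000 URLs
ranks = np.array([[1.0], [1117.0], [250000.0]])

# log-compress the enormous range, then rescale to [0, 1]
log_ranks = np.log10(ranks)
normalized_ranks = preprocessing.MinMaxScaler().fit_transform(log_ranks)
print normalized_ranks

Is something like this sensible, or is there a better scikit-learn idiom for it?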
Thanks very much!
So we can assume my data is in a TSV of the form:
URL    GooglePageRank    WebsiteText
An example of a row:
http://www.google.com    1    This would be the text of the google webpage.
I wish to normalize my ranking data and use it in my logistic regression. At the moment I am only using the "WebsiteText" column: I run TF-IDF on it and plug the result into my Logistic Regression. I want to learn how to combine this column with my normalized GooglePageRank column and use both in my Logistic Regression - how can I do this?
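From my reading, scipy.sparse.hstack may be the way to bolt a single numeric column onto a sparse TF-IDF matrix, but I am not certain this is idiomatic. A minimal sketch of what I have in mind (the two toy documents and ranks are made up):

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['this would be the text of the google webpage',
         'this would be the text of the bbc webpage']
ranks = np.array([[1.0], [1117.0]])  # hypothetical page ranks

tfidf_matrix = TfidfVectorizer(min_df=1).fit_transform(texts)  # sparse, shape (2, n_terms)
rank_column = sparse.csr_matrix(np.log10(ranks))               # sparse, shape (2, 1)
X_combined = sparse.hstack([tfidf_matrix, rank_column]).tocsr()
print X_combined.shape                                         # (2, n_terms + 1)

Would this combined matrix be an acceptable input to LogisticRegression?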
Here is my code thus far:
import numpy as np
from sklearn import metrics, preprocessing, cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.linear_model as lm
import pandas as p

loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')

print "loading data.."
# column 2 of each TSV holds the webpage text; the last column of train.tsv holds the label
traindata = list(np.array(p.read_table('train.tsv'))[:,2])
testdata = list(np.array(p.read_table('test.tsv'))[:,2])
y = np.array(p.read_table('train.tsv'))[:,-1]

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                      use_idf=1, smooth_idf=1, sublinear_tf=1)

rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None)

# fit the vectorizer on train + test together, then split back into the two sets
X_all = traindata + testdata
lentrain = len(traindata)

print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

print "training on full data"
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:,1]

testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."
*Edit:*
This is what I am currently running:
import numpy as np
from sklearn import metrics, preprocessing, cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
import sklearn.linear_model as lm
import pandas as p

loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=',')

print "loading data.."
# load train/test data for TF-IDF -- I know this is bad practice, but keeping it this way for the moment!
traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

# load labels
y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

# load the Alexa rank integer values
AllAlexaInfo = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-1]

# make tfidf object
tfv = TfidfVectorizer(min_df=1, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                      use_idf=1, smooth_idf=1, sublinear_tf=1)
div = DictVectorizer()

# same classifier as in the first version above
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None)

X = []
X_all = traindata + testdata
lentrain = len(traindata)

# fit/transform the TfidfVectorizer on all the data (train + test)
vect = tfv.fit_transform(X_all)  # bad practice, but using this for the moment!

for i, alexarank in enumerate(AllAlexaInfo):
    feature_dict = {'alexarank': AllAlexaInfo}
    # get ith row of the tfidf matrix (corresponding to sample)
    row = vect.getrow(i)
    # filter the feature names corresponding to the sample
    all_words = tfv.get_feature_names()
    words = [all_words[ind] for ind in row.indices]
    # associate each word (feature) with its corresponding score
    word_score = dict(zip(words, row.data))
    # concatenate the word feature/score with the datamining feature/value
    X.append(dict(word_score.items() + feature_dict.items()))

div.fit_transform(X)  # training data based on both Tfidf features and pagerank
sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)
X_test = X_all[lentrain:]
X_test = sc.transform(X_test)

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

print "training on full data"
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:,1]

testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."
This appears to run forever, and I also believe I have a problem with the "alexarank" value not being passed in correctly - how can I fix this?
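My current guess is that the bug is in the loop body: feature_dict = {'alexarank': AllAlexaInfo} puts the entire array into every sample's dict instead of that sample's own value. A self-contained sketch of what I believe the corrected loop should look like (with made-up ranks standing in for AllAlexaInfo):

AllAlexaInfo = [1, 1117, 523]  # hypothetical ranks

X = []
for i, alexarank in enumerate(AllAlexaInfo):
    # use the scalar loop variable, not the whole array
    feature_dict = {'alexarank': float(alexarank)}
    X.append(feature_dict)

print X  # [{'alexarank': 1.0}, {'alexarank': 1117.0}, {'alexarank': 523.0}]

Is that the right fix, and could the full-array dicts also explain why the run never finishes?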