Question: What is the best way to convert sparse matrices resulting from sklearn's CountVectorizer and TfidfTransformer into Pandas DataFrame columns with a separate row for each bigram and its corresponding frequency and tf-idf score?
Pipeline: Bring in text data from a SQL DB, split the text into bigrams, calculate the frequency per document and the tf-idf per bigram per document, then load the results back into the SQL DB.
Current State:
Two columns of data are brought in (number, text). text is cleaned to produce a third column, cleanText:
number text cleanText
0 123 The farmer plants grain farmer plants grain
1 234 The farmer and his son go fishing farmer son go fishing
2 345 The fisher catches tuna fisher catches tuna
This DataFrame is fed into sklearn's feature extraction:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)
tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)
Then the matrices are converted to dense arrays and fed back into the original DataFrame:
data['frequency'] = list(dt_mat.toarray())
data['tfidf_score'] = list(tfidf_mat.toarray())
Output:
number text cleanText \
0 123 The farmer plants grain farmer plants grain
1 234 The farmer and his son go fishing farmer son go fishing
2 345 The fisher catches tuna fisher catches tuna
frequency tfidf_score
0 [0, 1, 0, 0, 0, 1, 0] [0.0, 0.707106781187, 0.0, 0.0, 0.0, 0.7071067...
1 [0, 0, 1, 0, 1, 0, 1] [0.0, 0.0, 0.57735026919, 0.0, 0.57735026919, ...
2 [1, 0, 0, 1, 0, 0, 0] [0.707106781187, 0.0, 0.0, 0.707106781187, 0.0...
Problems:
- The feature names (i.e. bigrams) are not in the DataFrame
- The frequency and tfidf_score are not on separate lines for each bigram
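On the first problem, a minimal sketch of how the bigrams line up with the matrix columns, assuming the fitted vectorizer's vocabulary_ attribute (a dict mapping each term to its column index). The docs list simply reproduces the cleanText values above, and the name bigrams is only illustrative. Recent scikit-learn versions also offer cv.get_feature_names_out(), which should return the same ordering.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample data copied from the cleanText column shown above.
docs = ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"]

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))
dt_mat = cv.fit_transform(docs)

# vocabulary_ maps bigram -> column index; sorting by index gives a list
# where bigrams[j] is the feature for column j of dt_mat.
bigrams = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

for j, name in enumerate(bigrams):
    print(j, name)
```

With this ordering, the 1s in the frequency array for document 123 (positions 1 and 5) correspond to "farmer plants" and "plants grain".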
Desired Output:
number bigram frequency tfidf_score
0 123 farmer plants 1 0.70
0 123 plants grain 1 0.56
1 234 farmer son 1 0.72
1 234 son go 1 0.63
1 234 go fishing 1 0.34
2 345 fisher catches 1 0.43
2 345 catches tuna 1 0.43
I managed to get one of the numeric columns assigned to separate rows of the DataFrame with this code:
data.reset_index(inplace=True)
rows = []
_ = data.apply(lambda row: [rows.append([row['number'], nn])
                            for nn in row.tfidf_score], axis=1)
df_new = pd.DataFrame(rows, columns=['number', 'tfidf_score'])
Output:
number tfidf_score
0 123 0.000000
1 123 0.707107
2 123 0.000000
3 123 0.000000
4 123 0.000000
5 123 0.707107
6 123 0.000000
7 234 0.000000
8 234 0.000000
9 234 0.577350
10 234 0.000000
11 234 0.577350
12 234 0.000000
13 234 0.577350
14 345 0.707107
15 345 0.000000
16 345 0.000000
17 345 0.707107
18 345 0.000000
19 345 0.000000
20 345 0.000000
However, I am unsure how to do this for both numeric columns at once, and this approach doesn't bring in the bigrams (feature names) themselves. It also requires dense arrays (which is why I converted the sparse matrices to arrays in the first place). I would like to avoid that if possible, both for performance reasons and because I would then have to strip out the meaningless rows whose values are zero.
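To make the goal concrete, here is a rough sketch of one way the long format might be built without calling toarray(), assuming SciPy's COO representation (row, col and data arrays holding only the nonzero entries). The sample DataFrame reproduces the rows above; long_df and bigrams are illustrative names, not part of my existing code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Sample data copied from the table above.
data = pd.DataFrame({
    "number": [123, 234, 345],
    "cleanText": ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"],
})

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))
dt_mat = cv.fit_transform(data["cleanText"])
tfidf_mat = TfidfTransformer().fit_transform(dt_mat)

# bigrams[j] is the feature name for column j.
bigrams = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

# COO format exposes only the nonzero cells as parallel arrays.
coo = dt_mat.tocoo()

# Read the tf-idf values at exactly the same (row, col) positions so the
# two score columns stay aligned regardless of internal storage order.
tfidf_vals = np.asarray(tfidf_mat.tocsr()[coo.row, coo.col]).ravel()

long_df = pd.DataFrame({
    "number": data["number"].to_numpy()[coo.row],
    "bigram": [bigrams[j] for j in coo.col],
    "frequency": coo.data,
    "tfidf_score": tfidf_vals,
})
print(long_df)
```

Only nonzero (document, bigram) pairs appear, so there are no zero rows to strip afterwards. A frame like this could presumably be written back to the database with DataFrame.to_sql, given a suitable connection.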
Any insight is greatly appreciated! Thank you very much for taking the time to read this question - I apologize for the length. Please let me know if there's anything I can do to improve the question or clarify my process.