You can proceed as follows:
Load data into a dataframe:
import pandas as pd
df = pd.read_table("/tmp/test.csv", sep="\s+")
print(df)
Output:
col1 col2 col3 text
0 1 1 0 meaningful text
1 5 9 7 trees
2 7 8 2 text
Tokenize the text
column using: sklearn.feature_extraction.text.TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['text'])
Convert the tokenized data into a dataframe:
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names())
print(df1)
Output:
meaningful text trees
0 0.795961 0.605349 0.0
1 0.000000 0.000000 1.0
2 0.000000 1.000000 0.0
Concatenate the tokenization dataframe to the orignal one:
res = pd.concat([df, df1], axis=1)
print(res)
Output:
col1 col2 col3 text meaningful text trees
0 1 1 0 meaningful text 0.795961 0.605349 0.0
1 5 9 7 trees 0.000000 0.000000 1.0
2 7 8 2 text 0.000000 1.000000 0.0
If you want to drop the column text
, you need to do that before the concatenation:
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
print(res)
Output:
col1 col2 col3 meaningful text trees
0 1 1 0 0.795961 0.605349 0.0
1 5 9 7 0.000000 0.000000 1.0
2 7 8 2 0.000000 1.000000 0.0
Here's the full code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_table("/tmp/test.csv", sep="\s+")
v = TfidfVectorizer()
x = v.fit_transform(df['text'])
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names())
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)