Scikit-Learn's TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. Instead of raw documents, I would like to convert a list of lists of feature names to TF-IDF features.
The corpus you feed fit_transform() is supposed to be an array of raw documents, but I'd like to be able to feed it (or a comparable function) an array of arrays of features per document. For example:
corpus = [
    ['orange', 'red', 'blue'],
    ['orange', 'yellow', 'red'],
    ['orange', 'green', 'purple (if you believe in purple)'],
    ['orange', 'reddish orange', 'black and blue']
]
... as opposed to a one-dimensional array of strings.
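For reference, the standard usage looks something like this (a minimal sketch; raw_corpus and the variable names are just for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

raw_corpus = [
    'orange red blue',
    'orange yellow red',
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(raw_corpus)  # each document is one string, tokenized internally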
I know that I can define my own vocabulary for the TfidfVectorizer to use, so I could easily make a dict of unique features in my corpus and their indices in the feature vectors (see the sketch below). But the function still expects raw documents, and since my features are of varying lengths and occasionally overlap (for example, 'orange' and 'reddish orange'), I can't just concatenate my features into single strings and use ngrams.
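Building that vocabulary is easy enough; something like this rough sketch (vocabulary here is my own dict, mapping each feature name to a column index):

# Collect every unique feature across the nested corpus and assign each an index.
vocabulary = {feature: index for index, feature in
              enumerate(sorted({f for doc in corpus for f in doc}))}

vectorizer = TfidfVectorizer(vocabulary=vocabulary)
# But fit_transform() still expects each document to be a single raw string,
# and the default word-level analyzer would never produce multi-word features
# like 'reddish orange', so this alone doesn't solve the problem.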
Is there a different Scikit-Learn function I can use for this that I'm not finding? Is there a way to use the TfidfVectorizer that I'm not seeing? Or will I have to homebrew my own TF-IDF function to do this?