I am looking to gain some insight into my data. I am converting my documents into a VSM, projecting them with sklearn's PCA, and plotting them on a matplotlib graph. This involves:

  1. Casting the documents to a number matrix using pipeline

    test = pipeline.fit_transform(docs).todense()
    
  2. Fitting my PCA model to it

    pca = PCA().fit(test)
    
  3. Then I am converting it using transform

    data = pca.transform(test)
    
  4. Finally I am plotting the results using Matplotlib

    plt.scatter(data[:,0], data[:,1], c = categories)
    

My question is this: how do I take new sentences and determine where they would lie in relation to the other plotted documents, using an X to mark their relative positions?

Thanks

1 Answer

  1. Also cast the new documents to a numeric array

    new = pipeline.transform(new_docs).todense()
    

    Note that this uses the pipeline with the previously fitted parameters, hence it's pipeline.transform, not pipeline.fit_transform.

  2. Transform the new data using the previously fitted pca.

    new_data = pca.transform(new)
    

    This will transform the new data to the same PC-space as the original data.

  3. Add the new data to the plot using a second scatter.

    plt.scatter(data[:,0], data[:,1], c = categories)
    plt.scatter(new_data[:,0], new_data[:,1], marker = 'x')
    plt.show()
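
For reference, here is how the whole flow might look end to end. This is only a sketch: the question does not show how `pipeline` is built, so the `TfidfVectorizer` pipeline, the `project_and_plot` helper, the sample sentences, and `n_components=2` are all assumptions made for illustration. It also uses `.toarray()` instead of `.todense()`, since as far as I know recent scikit-learn releases no longer accept `np.matrix` input.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def project_and_plot(docs, categories, new_docs):
    """Fit on docs, then place new_docs in the same PC-space (hypothetical helper)."""
    # Assumed pipeline: the original post does not show its steps.
    pipeline = make_pipeline(TfidfVectorizer())

    # Fit the vectorizer and the PCA on the original documents only.
    test = pipeline.fit_transform(docs).toarray()
    pca = PCA(n_components=2).fit(test)
    data = pca.transform(test)

    # Reuse the already fitted pipeline and PCA: transform, not fit_transform.
    new = pipeline.transform(new_docs).toarray()
    new_data = pca.transform(new)

    # Original documents as coloured dots, new documents as red crosses.
    fig, ax = plt.subplots()
    ax.scatter(data[:, 0], data[:, 1], c=categories)
    ax.scatter(new_data[:, 0], new_data[:, 1], marker="x", c="red")
    return data, new_data, ax


# Made-up example data, purely for illustration.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stocks fell sharply today",
    "markets rallied on earnings",
]
categories = [0, 0, 1, 1]
new_docs = ["my cat chased a dog", "stocks and markets"]

data, new_data, ax = project_and_plot(docs, categories, new_docs)
```

Call `plt.show()` afterwards to display the figure. Words in the new sentences that the fitted vectorizer has never seen are simply ignored, so a sentence made up entirely of unseen words is vectorized as all zeros and is placed at the projection of the zero vector, regardless of what it says.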