
I'm running Google Cloud Platform's sentiment analysis on 17 different documents, but it gives me the same score for every one, with a different magnitude for each. This is my first time using this package, but as far as I can tell it should be practically impossible for all of them to get the exact same score.

The documents are PDF files of varying size, between 15 and 20 pages. I exclude 3 pages of each (the first two and the last) as they're not relevant.

I have tried the code with other, shorter documents, and it gives me different scores for those. I suspect there's a maximum document length it can handle, but I couldn't find anything in the documentation or via Google.

import io

from google.cloud import language
from google.cloud.language import enums, types

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def analyze(text):
    # creds is a google.auth credentials object created elsewhere
    client = language.LanguageServiceClient(credentials=creds)

    document = types.Document(content=text,
                              type=enums.Document.Type.PLAIN_TEXT)

    sentiment = client.analyze_sentiment(document=document).document_sentiment
    entities = client.analyze_entities(document=document).entities

    return sentiment, entities


def extract_text_from_pdf_pages(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        # materialise the page list once instead of iterating the PDF twice
        pages = list(PDFPage.get_pages(fh, caching=True,
                                       check_extractable=True))
        last_page = len(pages) - 1

        # skip the first two pages and the last one
        for pg_num, page in enumerate(pages):
            if pg_num not in (0, 1, last_page):
                page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text
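The page filter above can be pulled out into a small standalone helper (the name `pages_to_keep` is mine, not from the original code), which makes it easy to sanity-check which pages survive for a document of a given length:

```python
def pages_to_keep(num_pages):
    """Return the page indices that extract_text_from_pdf_pages processes:
    everything except the first two pages and the last page."""
    last = num_pages - 1
    return [p for p in range(num_pages) if p not in (0, 1, last)]


# For an 18-page document, pages 2 through 16 are kept.
print(pages_to_keep(18))
```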

Results (score - magnitude):

doc1  0.10000000149011612 - 147.5
doc2  0.10000000149011612 - 118.30000305175781
doc3  0.10000000149011612 - 144.0
doc4  0.10000000149011612 - 147.10000610351562
doc5  0.10000000149011612 - 131.39999389648438
doc6  0.10000000149011612 - 116.19999694824219
doc7  0.10000000149011612 - 121.0999984741211
doc8  0.10000000149011612 - 131.60000610351562
doc9  0.10000000149011612 - 97.69999694824219
doc10 0.10000000149011612 - 174.89999389648438
doc11 0.10000000149011612 - 138.8000030517578
doc12 0.10000000149011612 - 141.10000610351562
doc13 0.10000000149011612 - 118.5999984741211
doc14 0.10000000149011612 - 135.60000610351562
doc15 0.10000000149011612 - 127.0
doc16 0.10000000149011612 - 97.0999984741211
doc17 0.10000000149011612 - 183.5

I expected different results for each document, or at least small variations. (I also think these magnitude values are way too high compared to what I have found in the documentation and elsewhere.)


1 Answer


Yes, there are quotas on usage of the Natural Language API.

The Natural Language API processes text into a series of tokens, which roughly correspond to word boundaries. Attempting to process more tokens than the Token Quota allows (100,000 tokens per request by default) will not produce an error, but any tokens over that quota are silently ignored.
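One way to avoid losing text to that quota is to split a long document into chunks and analyze each chunk separately. A minimal sketch, assuming whitespace splitting is a close enough proxy for the API's tokenizer (it is not exact) and using the default 100,000-token limit:

```python
def chunk_by_tokens(text, max_tokens=100_000):
    """Split text into chunks of at most max_tokens whitespace-separated
    words, so each chunk stays under the per-request token quota."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]


# Each chunk could then be passed to analyze() individually and the
# per-chunk sentiments compared or averaged.
print(chunk_by_tokens("a b c d e", max_tokens=2))
```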

As for the second question, it is difficult for me to evaluate the results of the Natural Language API without access to the documents. Perhaps you are getting very similar results because the texts are too neutral; I ran some tests with long neutral texts and got similar results.

Just for clarification, as stated in the Natural Language API documentation:

  • documentSentiment contains the overall sentiment of the document, which consists of the following fields:
    • score of the sentiment ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the overall emotional leaning of the text.
    • magnitude indicates the overall strength of emotion (both positive and negative) within the given text, between 0.0 and +inf. Unlike score, magnitude is not normalized; each expression of emotion within the text (both positive and negative) contributes to the text's magnitude (so longer text blocks may have greater magnitudes).
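This behaviour can be illustrated with a toy aggregation: if each sentence contributes the absolute value of its score to the magnitude while the signed scores average out, a long document full of mixed or mildly positive sentences ends up with a near-zero score but a large magnitude. The formula below is my own approximation of the documented behaviour, not the API's actual algorithm, and the sentence scores are made up:

```python
def aggregate(sentence_scores):
    """Rough model: magnitude accumulates emotion regardless of sign,
    while the document score averages the signed sentence scores."""
    magnitude = sum(abs(s) for s in sentence_scores)
    score = sum(sentence_scores) / len(sentence_scores)
    return score, magnitude


# Strongly mixed sentences: score averages to almost nothing,
# magnitude keeps growing with document length.
print(aggregate([0.8, -0.7, 0.6, -0.6]))
```

This is consistent with the results above: 15-20 page documents produce magnitudes in the hundreds, while the overall score collapses toward a small positive value.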