
I'm running Google Cloud Platform's sentiment analysis on 17 different documents, but it gives me the same score for every one, with a different magnitude for each. This is my first time using this package, but as far as I can tell it should be practically impossible for all of them to get the exact same score.

The documents are PDF files of varying size, between 15 and 20 pages. I exclude 3 pages of each (the first two and the last) as they're not relevant.

I have tried the code with other, shorter documents, and it gives me different scores for those. I suspect there's a maximum document length it can handle, but I couldn't find anything in the documentation or via Google.

import io

from google.cloud import language
from google.cloud.language import enums, types

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def analyze(text):
    # creds is a google.auth credentials object created elsewhere
    client = language.LanguageServiceClient(credentials=creds)

    document = types.Document(content=text,
                              type=enums.Document.Type.PLAIN_TEXT)

    sentiment = client.analyze_sentiment(document=document).document_sentiment
    entities = client.analyze_entities(document=document).entities

    return sentiment, entities


def extract_text_from_pdf_pages(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        # materialise the page list once instead of iterating the PDF twice
        pages = list(PDFPage.get_pages(fh, caching=True,
                                       check_extractable=True))
        last_page = len(pages) - 1

        # skip the first two pages and the last one
        for pg_num, page in enumerate(pages):
            if pg_num not in (0, 1, last_page):
                page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text
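The page filter above can be pulled out into a small standalone helper (the name `pages_to_keep` is mine, not from the original code), which makes it easy to sanity-check which pages survive for a document of a given length:

```python
def pages_to_keep(num_pages):
    """Return the page indices that extract_text_from_pdf_pages processes:
    everything except the first two pages and the last page."""
    last = num_pages - 1
    return [p for p in range(num_pages) if p not in (0, 1, last)]


# For an 18-page document, pages 2 through 16 are kept.
print(pages_to_keep(18))
```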

Results (score - magnitude):

doc1  0.10000000149011612 - 147.5
doc2  0.10000000149011612 - 118.30000305175781
doc3  0.10000000149011612 - 144.0
doc4  0.10000000149011612 - 147.10000610351562
doc5  0.10000000149011612 - 131.39999389648438
doc6  0.10000000149011612 - 116.19999694824219
doc7  0.10000000149011612 - 121.0999984741211
doc8  0.10000000149011612 - 131.60000610351562
doc9  0.10000000149011612 - 97.69999694824219
doc10 0.10000000149011612 - 174.89999389648438
doc11 0.10000000149011612 - 138.8000030517578
doc12 0.10000000149011612 - 141.10000610351562
doc13 0.10000000149011612 - 118.5999984741211
doc14 0.10000000149011612 - 135.60000610351562
doc15 0.10000000149011612 - 127.0
doc16 0.10000000149011612 - 97.0999984741211
doc17 0.10000000149011612 - 183.5

I expected different results for each document, or at least small variations. (I also think these magnitude values are way too high compared to what I have found in the documentation and elsewhere.)


1 Answer


Yes, there are quotas on usage of the Natural Language API.

The Natural Language API processes text into a series of tokens, which roughly correspond to word boundaries. Attempting to process more tokens than the Token Quota allows (100,000 tokens per request by default) will not produce an error, but any tokens over that quota are silently ignored.
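One way to avoid losing text to that quota is to split a long document into chunks and analyze each chunk separately. A minimal sketch, assuming whitespace splitting is a close enough proxy for the API's tokenizer (it is not exact) and using the default 100,000-token limit:

```python
def chunk_by_tokens(text, max_tokens=100_000):
    """Split text into chunks of at most max_tokens whitespace-separated
    words, so each chunk stays under the per-request token quota."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]


# Each chunk could then be passed to analyze() individually and the
# per-chunk sentiments compared or averaged.
print(chunk_by_tokens("a b c d e", max_tokens=2))
```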

As for the second question, it is difficult for me to evaluate the results of the Natural Language API without access to the documents. Perhaps you are getting very similar results because the texts are too neutral; I ran some tests with long neutral texts and got similar results.

Just for clarification, as stated in the Natural Language API documentation:

  • documentSentiment contains the overall sentiment of the document, which consists of the following fields:
    • score of the sentiment ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the overall emotional leaning of the text.
    • magnitude indicates the overall strength of emotion (both positive and negative) within the given text, between 0.0 and +inf. Unlike score, magnitude is not normalized; each expression of emotion within the text (both positive and negative) contributes to the text's magnitude (so longer text blocks may have greater magnitudes).
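This behaviour can be illustrated with a toy aggregation: if each sentence contributes the absolute value of its score to the magnitude while the signed scores average out, a long document full of mixed or mildly positive sentences ends up with a near-zero score but a large magnitude. The formula below is my own approximation of the documented behaviour, not the API's actual algorithm, and the sentence scores are made up:

```python
def aggregate(sentence_scores):
    """Rough model: magnitude accumulates emotion regardless of sign,
    while the document score averages the signed sentence scores."""
    magnitude = sum(abs(s) for s in sentence_scores)
    score = sum(sentence_scores) / len(sentence_scores)
    return score, magnitude


# Strongly mixed sentences: score averages to almost nothing,
# magnitude keeps growing with document length.
print(aggregate([0.8, -0.7, 0.6, -0.6]))
```

This is consistent with the results above: 15-20 page documents produce magnitudes in the hundreds, while the overall score collapses toward a small positive value.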