I'm running Google Cloud Platform's Natural Language sentiment analysis on 17 different documents, but it returns the exact same score for every one of them; only the magnitude differs. It's my first time using this package, but as far as I can tell it should be practically impossible for all of these to get the exact same score.
The documents are PDF files of varying size, roughly 15-20 pages each; I exclude three pages from each (the first two and the last) as they're not relevant.
I have tried the code with other, shorter documents, and those do get different scores. I suspect there's a maximum document length it can handle, but I couldn't find anything about a limit in the documentation or via Google.
from google.cloud import language
from google.cloud.language import enums, types

def analyze(text):
    # creds is a service-account credentials object created elsewhere (not shown)
    client = language.LanguageServiceClient(credentials=creds)
    document = types.Document(content=text,
                              type=enums.Document.Type.PLAIN_TEXT)
    # document-level sentiment (score + magnitude) plus the detected entities
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    entities = client.analyze_entities(document=document).entities
    return sentiment, entities
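For what it's worth, when I run analyze on a short arbitrary snippet like the one below (the text is just an example, not from my documents), the score does vary from text to text, which is what makes the identical 0.1 on the long documents so strange:

sentiment, entities = analyze("I really enjoyed this. The support was excellent.")
print(sentiment.score, sentiment.magnitude)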
import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extract_text_from_pdf_pages(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        # index of the last page, so it can be skipped along with the first two
        last_page = len(list(PDFPage.get_pages(fh, caching=True,
                                               check_extractable=True))) - 1
        for pgNum, page in enumerate(PDFPage.get_pages(fh,
                                                       caching=True,
                                                       check_extractable=True)):
            if pgNum not in [0, 1, last_page]:
                page_interpreter.process_page(page)
        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text
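To test the maximum-length suspicion, my next step is to log how much text actually comes out of each PDF before it gets sent to the API. This is just a rough, untested sketch on top of the two functions above; the docs/*.pdf path is made up:

import glob

for pdf_path in sorted(glob.glob('docs/*.pdf')):
    text = extract_text_from_pdf_pages(pdf_path)
    if text:
        # characters vs. UTF-8 bytes, in case any API limit is byte-based
        print(pdf_path, len(text), 'chars', len(text.encode('utf-8')), 'bytes')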
Results (score, magnitude):
doc1 0.10000000149011612 - 147.5
doc2 0.10000000149011612 - 118.30000305175781
doc3 0.10000000149011612 - 144.0
doc4 0.10000000149011612 - 147.10000610351562
doc5 0.10000000149011612 - 131.39999389648438
doc6 0.10000000149011612 - 116.19999694824219
doc7 0.10000000149011612 - 121.0999984741211
doc8 0.10000000149011612 - 131.60000610351562
doc9 0.10000000149011612 - 97.69999694824219
doc10 0.10000000149011612 - 174.89999389648438
doc11 0.10000000149011612 - 138.8000030517578
doc12 0.10000000149011612 - 141.10000610351562
doc13 0.10000000149011612 - 118.5999984741211
doc14 0.10000000149011612 - 135.60000610351562
doc15 0.10000000149011612 - 127.0
doc16 0.10000000149011612 - 97.0999984741211
doc17 0.10000000149011612 - 183.5
I expected different results for each document, or at least small variations. (I also think these magnitude scores are way too high compared to what I have found in the documentation and elsewhere.)
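One thing I haven't tried yet, but which might show whether the flat 0.1 is an aggregation effect, is looking at the per-sentence sentiment that analyze_sentiment also returns. Untested sketch, reusing the same client/document setup as in analyze above:

def analyze_sentences(text):
    client = language.LanguageServiceClient(credentials=creds)
    document = types.Document(content=text,
                              type=enums.Document.Type.PLAIN_TEXT)
    response = client.analyze_sentiment(document=document)
    # print each sentence's own score/magnitude to see whether they actually vary
    for sentence in response.sentences:
        print(sentence.sentiment.score,
              sentence.sentiment.magnitude,
              sentence.text.content[:60])
    return response.document_sentiment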