Machine learning entity candidate scoring (not recognition)

Question

I am trying to understand the machine learning part behind Google's Smart Linkify. The article states the following about the model that generates candidate entities.

A given input text is first split into words (based on space separation), then all possible word subsequences of certain maximum length (15 words in our case) are generated, and for each candidate the scoring neural net assigns a value (between 0 and 1) based on whether it represents a valid entity:

Next, the generated entities that overlap are removed, favoring the ones with the higher score over the conflicting ones with a lower score.

If I understand correctly, the model tries every word in the sentence, plus combinations containing that word of up to 15 words in total?

How can you train such a model? I assume it's supervised learning, but I don't understand how such data could be labeled. Is it similar to NER, where the entity is specified by character position? And are there only 2 labels in the data, entity and non-entity?

And for the output of the model, the so-called "candidate score": how can a neural network return a single numerical value (the score)? Or is the output layer just a single node?

I'm looking for a more detailed explanation of the following:

Possible word subsequences of certain maximum length means it considers every word with the 7 words before and 7 after the word?
How can the neural net generate a score when it's a binary classification (entity vs. non-entity)? Or do they mean the probability score for entity?
How to train a binary NER? Like any other NER, except replace all entity types with 'entity' and then generate negative samples for non-entity?
How can this model be fast, as they claim, when it processes every word in the text plus 7 words before and after said word?

I'm voting to close this question as off-topic because it belongs on Data Science — G. Anderson
I do wonder why stackoverflow has all the tags for machine learning then — Rien
There are many valid machine learning questions that are on-topic, such as those that include a specific question with a minimal reproducible example — G. Anderson

Matt L. · Accepted Answer · 2020-03-01T15:39:57

Possible word subsequences of certain maximum length means it considers every word with the 7 words before and 7 after the word?

As I understand it from the documentation, your description is not quite right. Since every possible sequence of up to 15 words is evaluated, this would include a word with the 7 words before and after it, but also that word with 5 words before and 3 after it, and so on (i.e. every possible n-gram from length 1 to length 15). Each candidate gets an initial score, overlapping candidates are compared, and the lower-scoring candidate in each overlap is discarded, so that the final candidates are non-overlapping.
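The enumerate-score-filter procedure described above can be sketched as follows. This is only an illustration, not Google's implementation: `toy_score` is a hypothetical stand-in for the scoring network, the `threshold` parameter is my own assumption (the post does not state a cutoff), and the greedy highest-score-first rule is one straightforward reading of "favoring the ones with the higher score".

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) word indices, end exclusive


def candidate_spans(n_words: int, max_len: int = 15) -> List[Span]:
    """Every contiguous word span of length 1..max_len."""
    return [
        (start, end)
        for start in range(n_words)
        for end in range(start + 1, min(start + max_len, n_words) + 1)
    ]


def resolve_overlaps(
    scored: List[Tuple[Span, float]], threshold: float = 0.5
) -> List[Tuple[Span, float]]:
    """Greedy filter: visit spans from highest to lowest score and keep a
    span only if it does not overlap one that was already kept.
    The threshold is an assumption, not something the post specifies."""
    kept: List[Tuple[Span, float]] = []
    for span, score in sorted(scored, key=lambda item: item[1], reverse=True):
        if score < threshold:
            break  # everything after this point scores even lower
        if all(span[1] <= k[0] or span[0] >= k[1] for k, _ in kept):
            kept.append((span, score))
    return sorted(kept)  # back to left-to-right order


def toy_score(words: Sequence[str]) -> float:
    """Hypothetical scorer: favors longer runs of all-digit tokens."""
    if not all(w.isdigit() for w in words):
        return 0.0
    return len(words) / (len(words) + 1)


words = "call me at 555 1234 tomorrow".split()
scored = [((s, e), toy_score(words[s:e])) for s, e in candidate_spans(len(words))]
best = resolve_overlaps(scored)
print([" ".join(words[s:e]) for (s, e), _ in best])  # ['555 1234']
```

Here "555", "1234" and "555 1234" all overlap, and only the highest-scoring of the three survives.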

How can the neural net generate a score when it's a binary classification (entity vs. non-entity)? Or do they mean the probability score for entity?

According to the Google AI Blog, "for each candidate the scoring neural net assigns a value (between 0 and 1) based on whether it represents a valid entity." So that would be a probability.
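To make the "single value between 0 and 1" concrete: a common way to get such a score is a single output node whose weighted sum is passed through a sigmoid. The blog post does not spell out the output layer, so treat this as an assumption about the usual setup; the weights, bias, and hidden activations below are made-up numbers.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function; output is always in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def output_node(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """One output neuron: weighted sum of the last hidden layer, then sigmoid."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return sigmoid(z)


# Made-up activations and parameters, purely for illustration.
score = output_node(hidden=[0.2, -1.0, 0.5], weights=[1.5, 0.3, -0.8], bias=0.1)
print(0.0 < score < 1.0)  # True
```

Training such a node with binary cross-entropy against 0/1 labels is what lets the output be read as the probability of "entity".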

How to train a binary NER? Like any other NER, except replace all entity types with 'entity' and then generate negative samples for non-entity?

Yes, but because this is a perceptron model, many binary classifiers will be trained, and each will function as a neuron in the model. It is important to note that the classifier only decides entity vs. non-entity, not what type of entity it is. The post also discusses generating negative samples automatically: take a positive sample (marked by a start token and an end token in a string) and deliberately include the token before or after that entity. This technique greatly increases the size of the training data.
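A minimal sketch of that negative-sampling idea, assuming a labeled entity is given as a token span (the function name and the label encoding, 1 for entity and 0 for non-entity, are my own choices for illustration, not from the post):

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def make_examples(tokens: Sequence[str], entity: Span) -> List[Tuple[Span, int]]:
    """Return the positive span plus negatives made by stretching it
    one token to the left or to the right."""
    start, end = entity
    examples = [((start, end), 1)]             # the labeled entity itself
    if start > 0:
        examples.append(((start - 1, end), 0))  # includes the token before
    if end < len(tokens):
        examples.append(((start, end + 1), 0))  # includes the token after
    return examples


tokens = "meet at 350 Third Street Cambridge tomorrow".split()
for (s, e), label in make_examples(tokens, entity=(2, 6)):
    print(label, " ".join(tokens[s:e]))
```

For the address span "350 Third Street Cambridge" this yields one positive and two negatives ("at 350 Third Street Cambridge" and "350 Third Street Cambridge tomorrow"), so each labeled entity contributes several training examples.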

How can this model be fast, as they claim, when it processes every word in the text plus 7 words before and after said word?

The computational cost of taking a relatively small string (at most 15 words) and running it through a model is small. The computational cost of dividing a longer string into substrings of this length is also quite small. Even if the text is 5000 words long (which would be huge for a query of this sort), that's only about 75,000 n-grams to evaluate (at most 15 per starting word), and most of those will have very low entity scores. As I understand it, the most significant computational cost of these approaches is training the model. This is where the "hashed charactergram embedding" technique discussed in the post is utilized.
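The number of candidates can be checked directly. This quick count assumes candidates are contiguous word spans of length 1 to 15, as described earlier; `count_candidates` is a name I made up for the sketch.

```python
def count_candidates(n_words: int, max_len: int = 15) -> int:
    """Number of contiguous word spans of length 1..max_len in a text
    of n_words words: there are (n_words - k + 1) spans of length k."""
    return sum(max(n_words - k + 1, 0) for k in range(1, max_len + 1))


print(count_candidates(5000))  # 74895
print(count_candidates(20))    # 195
```

The count grows linearly with text length (roughly 15 candidates per word), which is why scoring all of them stays cheap.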
