I'm writing a screen-scraper application that monitors a text-only chat window. New text is added at the bottom of the window.

The application takes a screenshot of the chat window. If anything has changed since the last screenshot (new_screenshot != old_screenshot), the screenshot is saved.

After X time, all saved images are merged into one image, with the oldest image at the top. This large image is sent to a server for OCR, which returns a string of text.

Problem: how do I filter out the redundant text?

Example:

  • Chat window is 5 lines high and is initially empty.
  • The solution must work with empty and not-empty initial chat window.
  • More than one line can be added between screenshots. The same line can appear multiple times, but never twice in a row, so plain deduplication is not enough (using sorted(set(text.split('\n'))) would not be sufficient).
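To make the last point concrete, here is a minimal sketch. The three-screenshot `text` is a made-up sample rather than real OCR output, and `wanted` is what I would like to get back for it:

```python
# Made-up OCR output for a chat that reads "hello", "world", "hello",
# captured as three screenshots merged top to bottom.
text = "\n".join([
    "hello",                    # screenshot 1
    "hello", "world",           # screenshot 2
    "hello", "world", "hello",  # screenshot 3
])

wanted = ["hello", "world", "hello"]

# Set-based deduplication sorts the lines and drops the legitimate repeat.
naive = sorted(set(text.split("\n")))
print(naive)            # ['hello', 'world']
print(naive == wanted)  # False
```

The second "hello" is real chat content, so any solution has to tell it apart from the copies that only exist because the screenshots overlap.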

Input to algorithm:

1 Lorem ipsum dolor sit amet,
1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
7 Proin pretium, metus non porttitor lobortis, tortor sem rhoncus urna
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
7 Proin pretium, metus non porttitor lobortis, tortor sem rhoncus urna
8 quis finibus leo lorem sed lacus.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
7 Proin pretium, metus non porttitor lobortis, tortor sem rhoncus urna
8 quis finibus leo lorem sed lacus.
1 Lorem ipsum dolor sit amet,

Expected result:

1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
7 Proin pretium, metus non porttitor lobortis, tortor sem rhoncus urna
8 quis finibus leo lorem sed lacus.
1 Lorem ipsum dolor sit amet,

Why is "Lorem ipsum dolor sit amet" removed after the first occurrence but then reappears at the end? - juvian
To illustrate that the exact same line can reappear, and should only be ignored when the two identical lines appear in a row. - Vingtoft
Something like this: tpcg.io/4GcfOx? - juvian
Well, it does produce the expected output, but is the logic what you were seeking? - juvian

1 Answer

Here is code for what I understood you want: add each newly seen line, but keep a history of 5 lines so that a line already seen within the last 5 lines is not added again:

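def remove_redundant(text, history=5):
    """Return the lines of text, skipping any line seen within the last `history` lines."""
    last_seen = {}  # line -> index of its most recent occurrence
    result = []
    for idx, line in enumerate(text.split('\n')):
        # Keep a line that is new, or that was last seen more than `history` lines ago.
        if line not in last_seen or last_seen[line] + history < idx:
            result.append(line)
        # Refresh the index on every occurrence, whether the line was kept or not.
        last_seen[line] = idx
    return result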
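
As a quick check, the sketch below rebuilds input shaped like the question's by simulating a 5-line window, then runs the same filtering on it. The names `drop_recent_repeats` and `simulate_screenshots` are made up for this illustration, single letters stand in for the chat lines, and one new line per screenshot is an assumption:

```python
def drop_recent_repeats(text, history=5):
    """Skip a line if the same line occurred within the last `history` lines."""
    last_seen = {}  # line -> index of its most recent occurrence
    result = []
    for idx, line in enumerate(text.split("\n")):
        if line not in last_seen or last_seen[line] + history < idx:
            result.append(line)
        last_seen[line] = idx
    return result


def simulate_screenshots(chat, height=5):
    """Hypothetical helper: merged text of one screenshot per added chat line."""
    shots = []
    for end in range(1, len(chat) + 1):
        # Each screenshot shows at most the last `height` lines of the chat.
        shots.extend(chat[max(0, end - height):end])
    return "\n".join(shots)


# Same shape as the question's example: eight lines, then the first one again.
chat = ["A", "B", "C", "D", "E", "F", "G", "H", "A"]
merged = simulate_screenshots(chat)
recovered = drop_recent_repeats(merged)
print(recovered == chat)  # True

# A genuine repeat that falls inside the history window is lost.
print(drop_recent_repeats(simulate_screenshots(["A", "B", "A"])))  # ['A', 'B']
```

The last line shows a limit of this heuristic: a line that really is said again within `history` lines cannot be told apart from a screenshot overlap, so it is dropped. If that matters for your chat, you would probably need to keep the screenshot boundaries instead of merging everything before OCR.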