I'm writing a screen scraper application monitoring a text-only chat window. Text is added at the bottom of the window.
The application takes a screenshot of the chat window. If anything has changed since the last screenshot (new_screenshot != old_screenshot), the screenshot is saved.
After X time, all saved images are merged into one image, with the oldest image at the top. This large image is sent to a server for OCR, and a string of text is returned.
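Problem: How can I remove the redundant text from the returned string, so that each chat line appears only as often as it was actually posted?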
Example:
- The chat window is 5 lines high and is initially empty.
- The solution must work with both an empty and a non-empty initial chat window.
- More than one line can be added between two screenshots. The same line can appear multiple times, but never twice in a row, so simple deduplication is not enough (for example,
sorted(set(text.split('\n')))
would not be sufficient).
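A minimal sketch of why the set-based approach falls short, using a shortened, hypothetical input (three screenshots of a window in which the first line is legitimately posted again as the third message):

```python
# Hypothetical flattened OCR output of three screenshots:
# [line1], [line1, line2], [line1, line2, line1]
text = "\n".join([
    "1 Lorem ipsum dolor sit amet,",
    "1 Lorem ipsum dolor sit amet,",
    "2 consectetur adipiscing elit",
    "1 Lorem ipsum dolor sit amet,",
    "2 consectetur adipiscing elit",
    "1 Lorem ipsum dolor sit amet,",
])

# What the chat actually contained, in order.
wanted = [
    "1 Lorem ipsum dolor sit amet,",
    "2 consectetur adipiscing elit",
    "1 Lorem ipsum dolor sit amet,",
]

# A set keeps each distinct line once, so the genuine repeat is lost,
# and sorted() orders lines alphabetically, not chronologically.
naive = sorted(set(text.split("\n")))
print(naive == wanted)  # False
```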
Input to algorithm:
1 Lorem ipsum dolor sit amet,
1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
7 Proin pretium, metus non porttitor lobortis, tortor sem rhoncus urna
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
7 Proin pretium, metus non porttitor lobortis, tortor sem rhoncus urna
8 quis finibus leo lorem sed lacus.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
7 Proin pretium, metus non porttitor lobortis, tortor sem rhoncus urna
8 quis finibus leo lorem sed lacus.
1 Lorem ipsum dolor sit amet,
Expected result:
1 Lorem ipsum dolor sit amet,
2 consectetur adipiscing elit
3 Mauris porttitor enim sed tincidunt interdum.
4 Morbi elementum erat nec nulla auctor, eget porta odio aliquet.
5 Nam aliquet velit vel elementum tristique.
6 Donec ac tincidunt urna.
7 Proin pretium, metus non porttitor lobortis, tortor sem rhoncus urna
8 quis finibus leo lorem sed lacus.
1 Lorem ipsum dolor sit amet,
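One possible way to approach this, sketched under a strong assumption that the setup above does not provide: the lines of each screenshot are available as a separate list (for example by running OCR per image, or by placing a recognizable separator between images before merging). The function name merge_screenshots is hypothetical. The idea is to find, for each new screenshot, the longest run of lines at the end of the transcript so far that matches the start of the screenshot, and to append only the lines after that overlap. It also assumes OCR returns identical text for the same line every time.

```python
from typing import List, Sequence


def merge_screenshots(screenshots: Sequence[Sequence[str]]) -> List[str]:
    """Stitch per-screenshot line lists into one transcript.

    Assumes screenshots are ordered oldest first and that the same
    chat line is always recognized as exactly the same string.
    """
    merged: List[str] = []
    for shot in screenshots:
        lines = list(shot)
        overlap = 0
        # Try the longest possible overlap first, then shrink it.
        for k in range(min(len(merged), len(lines)), 0, -1):
            if merged[-k:] == lines[:k]:
                overlap = k
                break
        # Only the lines after the overlap are new.
        merged.extend(lines[overlap:])
    return merged


# Small usage example with placeholder lines.
shots = [["a"], ["a", "b"], ["a", "b", "c"], ["b", "c", "a"]]
print(merge_screenshots(shots))  # ['a', 'b', 'c', 'a']
```

Because the first screenshot is appended as a whole, this works the same way for an empty and a non-empty initial window. Preferring the longest overlap is a heuristic: if the chat contains a repeating pattern (for example two lines alternating), several overlaps can be consistent with the screenshots, and the true number of added lines cannot be recovered from the text alone.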