10
votes

Running the Pygments default lexer on the following C++ text, class foo{};, results in this:

(Token.Keyword, 'class')
(Token.Text, ' ')
(Token.Name.Class, 'foo')
(Token.Punctuation, '{')
(Token.Punctuation, '}')
(Token.Punctuation, ';')

Note that the token foo has the type Token.Name.Class.

If I change the class name to foobar, I want to be able to run the default lexer only on the touched tokens, in this case the original tokens foo and {.

Q: How can I save the lexer state so that tokenizing foobar{ will give tokens with the type Token.Name.Class?

Having this feature would optimize syntax highlighting for large source files that change right in the middle (the user is typing text), for example. There seems to be no documented way of doing this, and no information on how to do it with the default Pygments lexers.

Are there any other syntax highlighting systems that support this behavior?

EDIT:

Regarding performance here is an example: http://tpcg.io/ESYjiF
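A minimal sketch of the kind of measurement involved (this is illustrative, not the code behind the link above; the sample is just a small snippet repeated to get a large file):

```python
import time

import pygments.lexers

# Build a reasonably large C++ sample by repeating a small snippet.
sample = "class foo{};\n" * 5000

lexer = pygments.lexers.get_lexer_for_filename("foo.h")

start = time.perf_counter()
tokens = lexer.get_tokens(sample)  # returns a generator: almost instant
setup_time = time.perf_counter() - start

start = time.perf_counter()
tokens = list(tokens)  # consuming the generator does the actual lexing work
lex_time = time.perf_counter() - start

print(f"setup: {setup_time:.6f}s, full lex: {lex_time:.6f}s, tokens: {len(tokens)}")
```

This also shows why the initial lexer time looks tiny: get_tokens is lazy, and the real cost only appears once the token stream is consumed.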

1
Have you looked at the performance impact? Most lexers do a full parse, so even a small delta may mean a complete change or break in the rest of the tokens. For example, changing foo to foo { introduces another bracket, and the meaning of the rest of the code actually changes. So in any case it may not be a great idea. – Tarun Lalwani
@Tarun Lalwani On a decent machine with a 200 kB file (which is indeed large) I get 0.5 ms total lexer time. With the code formatter I get 0.5 seconds. While the lexer time is "acceptable", the total processing has unacceptable performance (at least by my standards). – Raxvan
@Tarun Lalwani I also added test code with a 33 kB file. The lexer result seems to be a generator, which is why the initial lexer time is very small; iterating over the tokens reveals the total time spent parsing the code. – Raxvan
The feature you want to implement is called Rename Symbol; you can find it in VS Code when you press F2. It can be done by renaming the entry in the global string table if you work with something like Flex. – obgnaw

1 Answer

6
votes

From my understanding of the source code, what you want is not possible.

I won't dig in and try to explain every single relevant line of code, but basically, here is what happens:

Finally, RegexLexer.get_tokens_unprocessed loops over the defined token types (something like (("function", ('pattern-to-find-c-function',)), ("class", ('pattern-to-find-c-class',)))) and, for each type (function, class, comment...), finds all matches within the source text, then processes the next type.

This behavior makes what you want impossible, because it loops over token types, not over the text.
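A quick way to see the practical consequence (this sketch is mine, not from the question): lexing the edited fragment in isolation loses the context the class keyword provided, so foobar is no longer tagged Name.Class:

```python
from pygments.lexers import CppLexer

lexer = CppLexer()

def types_of(code):
    # Map each non-whitespace token value to its token type.
    return {value: ttype for ttype, value in lexer.get_tokens(code)
            if value.strip()}

full = types_of("class foobar{};")
fragment = types_of("foobar{};")

print(full["foobar"])      # Token.Name.Class (the `class` keyword set up the state)
print(fragment["foobar"])  # Token.Name (no context, just a plain identifier)
```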


To make my point more obvious, I added 2 lines of code to the library, file pygments/lexer.py, line 628:

for rexmatch, action, new_state in statetokens:
    print('looking for {}'.format(action))
    m = rexmatch(text, pos)
    print('found: {}'.format(m))

And ran it with this code:

import pygments
import pygments.lexers

lexer = pygments.lexers.get_lexer_for_filename("foo.h")
sample="""
class foo{};
"""
print(list(lexer.get_tokens(sample)))

Output:

[...]
looking for Token.Keyword.Reserved
found: None
looking for Token.Name.Builtin
found: None
looking for <function bygroups.<locals>.callback at 0x7fb1f29b52f0>
found: None
looking for Token.Name
found: <_sre.SRE_Match object; span=(6, 9), match='foo'>
[...]

As you can see, the token types are what the code iterates over.


Taking that, and (as Tarun Lalwani said in the comments) the fact that a single new character can break the whole source-code structure, you cannot do better than re-lexing the whole text on each update.
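That said, if full re-lexing ever becomes the bottleneck, a common editor-style compromise (not a Pygments feature; this caching scheme is my own sketch) is to cache tokens per line and only re-lex lines whose text changed. Be aware this is approximate: constructs spanning lines (block comments, raw strings) will be mis-lexed at line boundaries.

```python
from pygments.lexers import CppLexer

lexer = CppLexer()
cache = {}  # line text -> token list

def lex_line(line):
    # Re-lex a line only if we have not seen this exact text before.
    # Caveat: ignores cross-line state (e.g. /* ... */ spanning lines).
    if line not in cache:
        cache[line] = list(lexer.get_tokens(line))
    return cache[line]

def highlight(text):
    return [lex_line(line) for line in text.splitlines()]

v1 = "class foo{};\nint x;\nint y;"
highlight(v1)
lines_lexed_for_v1 = len(cache)

v2 = "class foobar{};\nint x;\nint y;"  # only the first line changed
highlight(v2)
print(len(cache) - lines_lexed_for_v1)  # 1: only the edited line was re-lexed
```

Whether the boundary-state caveat is acceptable depends on the editor; many real editors solve it by tracking the lexer state at the start of each line, which Pygments does not expose.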