
I am using Pygments as part of Markdown, Python, and LaTeX (through minted). For these reasons, I cannot easily change to another tool and would like to find a solution in Pygments.

What I want to achieve is "special highlighting" beyond plain syntax highlighting: for example, keeping specific terminal output in its original colour, or adding extra styles to make arguments the user must fill in easier to spot. An example is definitely going to make this easier to understand.

In the Swift manual written by Apple, the arguments that are there for you to fill in are highlighted in balloons. This makes them a lot easier to identify, and a balloon can cover a range of tokens:

[Screenshot from Apple's Swift manual: fill-in arguments such as "label name" and "condition" shown inside rounded balloons]

I want to achieve something similar with Pygments. Merely changing the styling doesn't work: a token should receive its normal type-based styling when it is outside a balloon, and none of that type-specific highlighting when it is inside one. I don't think these kinds of balloons are possible in current vanilla Pygments; I suspect I need a new type of token. Since I would want to use this for a variety of languages, I would ideally like the solution with the fewest modifications. Sadly, I don't think I can achieve this with Pygments filters. My gut feeling is that I need to rewrite the lexers for all the languages I want to use, though I think I can do that by subclassing them.

To summarise, what I want is that I feed code such as

§label name§: while §condition§ {

and that, in for example HTML output, something like

<pre><span class="balloon">label name</span>: <span class="k">while</span> <span class="balloon">condition</span> {</pre>

comes out. I'm not pinning myself down on the use of the symbol § for this new environment or anything like that; it is merely an example of the intended behaviour.

Since modifying the lexer for each language I use is quite the endeavour, and since Pygments' documentation is quite spartan, I would like to ask how this can be solved and whether I am overlooking something. Am I right in assuming I need a new token and need to rewrite each and every lexer? Or is there a smarter way?

I think this question might be "too broad", so just a comment: I don't think you want/need to replace the class in the output. Simply add balloon as a second class on the span (<span class="k balloon">). Why not pass Pygments' output through an HTML filter? The filter can identify the special elements and modify them accordingly. I suppose you could identify the tokens to modify with some extra characters, but why not use the same method Pygments uses for highlighting lines and pass a list in: filter(html, ['label name', 'condition'])? – Waylan
That's interesting, thanks. As you mention, it might indeed not be necessary to add a second class. Nevertheless, Pygments will internally need to be able to distinguish between the different kinds of tokens (I don't know well enough how the LaTeX output works). Your highlighting suggestion is a good one; it may be exactly the kind of input I was looking for. I will look into that tomorrow and see if it does the trick. – kvaruni
I gave it a try and a filter did not work. Nevertheless, your suggestion did lead to a solution. I first started with subclassing the existing lexer and adding a filter, but a few iterations of the code later it seems possible to achieve this by subclassing the desired languages one by one with a code one-liner. I will post this as an answer, although I would love additional input to see if there is still a better way. – kvaruni

1 Answer


Thanks to Waylan's comment I was put on the right track and found a solution that works:

from pygments import token
from pygments.lexer import inherit
from pygments.lexers import Python3Lexer, SwiftLexer


# create a new token and associate with it a new short style name
token.STANDARD_TYPES[token.Token.Balloon] = "balloon"
Balloon = token.Token.Balloon


# create a callback which ignores the special § characters
def process_balloon(_, match):
    yield match.start(), Balloon, match.group(1)


# subclass the Python 3 lexer to identify ballooned text
# when subclassing, the tokens dictionaries will be merged
class Python3BalloonLexer(Python3Lexer):
    tokens = {
        "root": [(r'§(.*?)§', process_balloon),  # balloons are processed first
                 inherit]  # and then merge the superclass tokens here
    }

# do the same for the Swift language with a one-liner
class SwiftBalloonLexer(SwiftLexer):
    tokens = { "root": [(r'§(.*?)§', process_balloon), inherit] }


# example output below
from pygments import highlight
from pygments.formatters import HtmlFormatter
my_lexer_python = Python3BalloonLexer()
my_lexer_swift = SwiftBalloonLexer()
print(highlight('print(§"Hello World"§)', my_lexer_python, HtmlFormatter()))
print(highlight('§label name§: while §condition§ {\n  §statements§\n}', 
      my_lexer_swift, HtmlFormatter()))

In a nutshell, this solution creates a new token to ensure there is no conflict with the existing tokens. Each language in which balloons are desired is then subclassed to add support for the new token. A callback function processes each match and keeps only the contents between the § symbols, dropping the delimiters themselves. Because the balloon rule is listed before the inherited rules, it is tried first at every position; everything else falls through to the superclass's normal rules.
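To make the balloons actually stand out, the new token also needs an entry in a style, since no built-in style knows about it. A minimal sketch of how this could look, assuming the same token registration as above (the style name BalloonStyle and the colour values are my own arbitrary choices, not anything prescribed by Pygments):

```python
from pygments import token
from pygments.formatters import HtmlFormatter
from pygments.styles.default import DefaultStyle

# register the Balloon token as in the answer above
token.STANDARD_TYPES[token.Token.Balloon] = "balloon"

# extend the default style with a look for balloons;
# the colour values here are placeholders
class BalloonStyle(DefaultStyle):
    styles = dict(DefaultStyle.styles)
    styles[token.Token.Balloon] = "italic #444 bg:#e7e7f4"

# the generated CSS now contains a rule for the .balloon class
css = HtmlFormatter(style=BalloonStyle).get_style_defs('.highlight')
```

Passing style=BalloonStyle to the HtmlFormatter used for highlighting then renders the balloons with that look; for LaTeX output the same style class can be handed to the LatexFormatter.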

This solution seems to tick most boxes, except for the need to subclass each language one by one. This is a tad cumbersome, but not too much of a problem. The approach is also a bit fragile, as it assumes that § is only ever used to mark these balloons.
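Since the per-language subclass is always the same one-liner, the repetition can be reduced with a small factory function; make_balloon_lexer below is my own helper name, not a Pygments API:

```python
from pygments import token
from pygments.lexer import inherit
from pygments.lexers import Python3Lexer, SwiftLexer

# same token registration and callback as in the answer above
token.STANDARD_TYPES[token.Token.Balloon] = "balloon"

def process_balloon(_, match):
    yield match.start(), token.Token.Balloon, match.group(1)

def make_balloon_lexer(base):
    """Return a balloon-aware subclass of any RegexLexer-based lexer."""
    class BalloonLexer(base):
        tokens = {"root": [(r'§(.*?)§', process_balloon), inherit]}
    BalloonLexer.__name__ = base.__name__.replace("Lexer", "BalloonLexer")
    return BalloonLexer

# one line per language instead of one class per language
Python3BalloonLexer = make_balloon_lexer(Python3Lexer)
SwiftBalloonLexer = make_balloon_lexer(SwiftLexer)
```

The generated classes behave exactly like the hand-written subclasses, so highlight() can use them unchanged.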

A similar approach could be used to force specific styles, e.g. for text that must always appear in the same colour, simply by stacking further lexer extensions through subclassing. This does, of course, increase the risk of collisions, since every extension adds another special symbol like § that may already appear elsewhere in your code.
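A sketch of such stacking, under the assumption that a second token (I call it Verbatim here) is marked with ¤ delimiters; both names are arbitrary illustrations, and each subclass level simply prepends its own rule before inheriting the rest:

```python
from pygments import token
from pygments.lexer import inherit
from pygments.lexers import Python3Lexer

# register both custom tokens; "Verbatim" and ¤ are arbitrary choices
token.STANDARD_TYPES[token.Token.Balloon] = "balloon"
token.STANDARD_TYPES[token.Token.Verbatim] = "verbatim"

def process_balloon(_, match):
    yield match.start(), token.Token.Balloon, match.group(1)

def process_verbatim(_, match):
    yield match.start(), token.Token.Verbatim, match.group(1)

# first extension level: balloons, as in the answer
class Python3BalloonLexer(Python3Lexer):
    tokens = {"root": [(r'§(.*?)§', process_balloon), inherit]}

# second extension level stacks on top of the first
class Python3BalloonVerbatimLexer(Python3BalloonLexer):
    tokens = {"root": [(r'¤(.*?)¤', process_verbatim), inherit]}
```

The inherit marker resolves through the whole class hierarchy, so the combined lexer tries the verbatim rule first, then the balloon rule, then the ordinary Python rules.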