Python comment-preserving parsing using only builtin libraries?

Question

I wrote a library that uses only the ast and inspect modules to parse and emit internal Python constructs (it falls back to astor on Python < 3.9).

I have just realised that I need to preserve comments after all, preferably without resorting to RedBaron or LibCST, since I only need to emit the comments unaltered. Is there a clean and concise way to do comment-preserving parsing and emitting of Python source with just the stdlib?

inspect.getsource() returns the source code of an object including comments. Is this what you need? — rchome
No, because I am modifying AST nodes: docstrings, ast.Assign, ast.AnnAssign, and ast.FunctionDef/ast.AsyncFunctionDef. I infer types and add them as either a type comment or an annotation (never both), convert between docstring formats (including adding or removing types), and update the return attribute of a function definition. — Samuel Marks

VirtualScooter · Accepted Answer · 2021-12-26T01:35:39

Comments can be preserved by capturing them with the tokenizer and merging them back into the generated source code.

Given a toy program stored in a variable named program, we can demonstrate how comments get lost in the AST:

import ast

program = """
# This comment lost
p1v = 4 + 4
p1l = ['a', # Implicit line joining comment for a lost
       'b'] # Ending comment for b lost
def p1f(x):
    "p1f docstring"
    # Comment in function p1f lost
    return x
print(p1f(p1l), p1f(p1v))
"""
tree = ast.parse(program)
print('== Full program code:')
print(ast.unparse(tree))

The output shows all comments gone:

== Full program code:
p1v = 4 + 4
p1l = ['a', 'b']

def p1f(x):
    """p1f docstring"""
    return x
print(p1f(p1l), p1f(p1v))

However, if we scan the comments with the tokenizer, we can use this to merge the comments back in:

from io import StringIO
import tokenize

def scan_comments(source):
    """Return the COMMENT tokens found in a source code string."""
    comtokens = []
    with StringIO(source) as f:
        # generate_tokens() expects a readline callable that returns str lines
        for token in tokenize.generate_tokens(f.readline):
            # tokenize.COMMENT is the numeric token type for comments
            if token.type == tokenize.COMMENT:
                comtokens.append(token)
    return comtokens

comtokens = scan_comments(program)
print('== Comment after p1l[0]\n\t', comtokens[1])

Output (edited to split long line):

== Comment after p1l[0]
     TokenInfo(type=60 (COMMENT),
               string='# Implicit line joining comment for a lost',
               start=(4, 12), end=(4, 54),
               line="p1l = ['a', # Implicit line joining comment for a lost\n")
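To make the token positions easier to merge later, it may help to index the comments by line number and note whether each one stands on its own line or trails some code. The following is only a sketch; the helper name comments_by_line and the tuple layout are my own choices for illustration, not part of the tokenize module:

```python
import tokenize
from io import StringIO

def comments_by_line(source):
    """Map line number -> (column, text, is_standalone) for each comment.

    Hypothetical helper: a physical line holds at most one COMMENT token,
    so the line number is a safe dictionary key.
    """
    lines = source.splitlines()
    result = {}
    for tok in tokenize.generate_tokens(StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is 0-based
            # Standalone means only whitespace precedes the comment on its line
            standalone = lines[row - 1][:col].strip() == ""
            result[row] = (col, tok.string, standalone)
    return result

sample = "# header\nx = [1,  # first\n     2]\n"
found = comments_by_line(sample)
for row, info in sorted(found.items()):
    print(row, info)
```

Standalone comments can usually be re-emitted as lines of their own, while trailing comments need a decision about which emitted line they should attach to.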

Using a slightly modified version of ast.unparse(), with the maybe_newline() and traverse() methods replaced (in CPython these belong to the private ast._Unparser class that ast.unparse() uses), you should be able to merge all comments back in at their approximate locations. To do so, combine the location info from the comment scanner (the start field of each token) with the location info from the AST; most nodes have a lineno attribute.
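As a much coarser alternative to patching the unparser, here is a minimal sketch that merges comments back at the granularity of top-level statements only. It assumes Python 3.9+ (for ast.unparse() and end_lineno), and the function name unparse_with_comments is hypothetical. Every comment that falls inside a statement's line range, including trailing and nested ones, is hoisted to a line of its own just above that statement, so the placement is approximate:

```python
import ast
import tokenize
from io import StringIO

def unparse_with_comments(source):
    """Re-emit source via ast.unparse(), hoisting comments above statements.

    Hypothetical helper, not the modified-unparser approach: it works per
    top-level statement only.
    """
    # Line number -> comment text (at most one comment per physical line)
    comments = {
        tok.start[0]: tok.string
        for tok in tokenize.generate_tokens(StringIO(source).readline)
        if tok.type == tokenize.COMMENT
    }
    out = []
    prev_end = 0
    for stmt in ast.parse(source).body:
        # Emit comments found after the previous statement, up to and
        # including the last line of this statement
        for row in range(prev_end + 1, stmt.end_lineno + 1):
            if row in comments:
                out.append(comments[row])
        out.append(ast.unparse(stmt))
        prev_end = stmt.end_lineno
    # Comments after the final statement
    out += [text for row, text in sorted(comments.items()) if row > prev_end]
    return "\n".join(out) + "\n"

sample = "# lead\nx = 1  # trail\ndef f(y):\n    # inner\n    return y\n"
merged = unparse_with_comments(sample)
print(merged)
```

Because comments are only ever written as separate lines between statements, the result should still parse to the same AST as the input. Getting comments back inside function bodies or next to list elements would need the finer-grained unparser changes described above.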

The merged locations are only approximate, not exact. For example, the list assignment spans two lines in the source, but ast.unparse() emits it on a single line (see the output in the second code segment).

Also, after adding code you need to update the location info in the AST using ast.increment_lineno().
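A small sketch of what that could look like. The inserted import statement is just an assumed example of "adding code"; ast.increment_lineno(node, n=1) shifts lineno and end_lineno for the given node and all of its descendants:

```python
import ast

tree = ast.parse("a = 1\nb = 2\n")
before = [node.lineno for node in tree.body]

# Hypothetical edit: insert a new one-line statement at the top of the module
new_stmt = ast.parse("import os").body[0]
tree.body.insert(0, new_stmt)

# Shift every pre-existing statement (and its child nodes) down by one line
for node in tree.body[1:]:
    ast.increment_lineno(node, n=1)

after = [node.lineno for node in tree.body]
print(before, after)
```

With the line numbers kept in step like this, comment positions recorded by the tokenizer would also need the same offset applied before they are compared against the modified tree.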

It seems that some more calls to maybe_newline() may be needed in the library code (or in its replacement).
