
I've got a file with syntactically correct Lua 5.1 source code.

I've got a position (line and character offset) inside that file.

I need to get the byte offset of the closing parenthesis of the parameter list of the innermost function that contains that position (or determine that the position belongs to the main chunk of the file).

For example:

local function foo()
                    ^ result
  print("bar")
           ^ input
end

local foo = function()
                      ^ result
  print("bar")
           ^ input
end

local foo = function()
  return function()
                   ^ result
    print("bar")
             ^ input
  end
end

...And so on.

How do I do that robustly?

What sorts of libraries can you use for that? You are probably going to need a Lua parser for that. - hugomg
Whatever I need, as long as it is sane (and, preferably, not under GPL). - Alexander Gladysh
Actually, in this specific case, I think that it should be doable with regexps alone (possibly while operating on reversed source). But a library-based solution will be preferable. - Alexander Gladysh
"How do I do that robustly?" You write a parser. If you want to do serious source code manipulation "robustly", you write a parser. Lua's syntax is not exactly complex. So just get your favorite parsing tools and write one. - Nicol Bolas
Well, there are several Lua parsers out there. Metalua, luafish, Cheese, LuaParse, LuaInspect, Leg etc. - Alexander Gladysh

1 Answer


EDIT: My original answer did not take the "innermost" requirement into account. The version below does.

To make things "robust," there are a few considerations.

First of all, it's important that you skip over string and comment contents, to avoid incorrect output in situations like:

foo = function()
    print(" function() ")
    -- function()
    print("bar")
            ^ input
end

This can be somewhat difficult, considering Lua's long-bracket syntax for multiline strings and comments ([[ ... ]], [=[ ... ]=], and so on). Consider, for example, a situation where the input position lies inside a long string or comment:

foo = function()
    print([[
        bar = function()
            print("baz")
                    ^ input
        end
    ]])
end

Consequently, if you want a completely robust system, it is not enough to scan backwards until you hit the end of a function parameter list, because you may not have scanned back far enough to reach a [[ that would invalidate your match. You therefore need to scan the entire file from the beginning up to your position, unless you're okay with incorrect matches in these unusual situations. If this is an editor plugin, such "incorrect" results may even be desirable, because they would let the same plugin work on Lua code that is stored as a string literal inside other Lua code.
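To see why a purely backward scan goes wrong, here is a small Python sketch (Python is only this example's choice, and naive_backward is a hypothetical helper, not something from the question) that runs the naive approach on the long-string example above.

```python
SRC = '''foo = function()
    print([[
        bar = function()
            print("baz")
        end
    ]])
end
'''

def naive_backward(src, pos):
    """Hypothetical naive approach: find the nearest 'function' before pos,
    then the first ')' after it. Strings and comments are ignored entirely."""
    start = src.rfind("function", 0, pos)
    if start == -1:
        return None  # would be reported as "main chunk"
    return src.index(")", start)

pos = SRC.index('"baz"')            # input position, inside the long string
found = naive_backward(SRC, pos)
long_string_start = SRC.index("[[")

# The match lies after the opening [[, so it is text inside a string
# literal and not a real function at all.
print(found > long_string_start)    # prints True
```

The real enclosing function here is the one assigned to foo on the first line, which the backward scan never reaches.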

Because you only need to track where blocks open and close, not build a syntax tree, a full-blown parser isn't needed. You will need to maintain a stack, however, to keep track of scope. With that in mind, all you need to do is step through the source file character by character from the beginning, applying the following logic:

  1. Every time a " or ' is encountered, ignore the characters up to the matching closing quote. Be careful to handle escapes like \" and \\ (skipping the character after every backslash is enough).
  2. Every time a -- is encountered, ignore the characters up to the end of the line. Only do this if the comment is not a multiline comment.
  3. Every time a multiline string opener is encountered (such as [[, [=[, etc.), or a multiline comment opener is encountered (such as --[[ or --[=[, etc.), ignore the characters up to the closing square brackets with the same number of equals signs between them.
  4. When a word boundary is encountered, check whether the word that follows is a keyword that opens a block closed by end: function, if, or do. Do NOT include while or for (their bodies are opened by the do that follows, so counting both would push twice for a single end), and do NOT include repeat (it is closed by until). If it is such a keyword, push the keyword and its position on the scope stack. A "word boundary" in this case is any character which cannot be used in a Lua identifier, and the keyword needs one on both sides (this prevents matches in cases like abcfunction() or endpoint). The beginning and end of the file are also considered word boundaries.
  5. If a word boundary is encountered and it is followed by end, pop the top element of the stack. If the stack has no elements, complain about a syntax error.
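The string and comment rules are the fiddly part, so here is a minimal Python sketch of just those (rules 1 to 3). The function name skip_non_code and its return convention are assumptions of this sketch. It assumes syntactically valid Lua 5.1 input, as stated in the question.

```python
def skip_non_code(src, i):
    """Return the index just past the string or comment starting at src[i],
    or i itself if src[i] does not start one."""
    def long_bracket_end(j):
        # src[j] is '['. If it opens a long bracket ([[, [=[, ...), return
        # the index just past the matching closer, otherwise None.
        k = j + 1
        while k < len(src) and src[k] == "=":
            k += 1
        if k >= len(src) or src[k] != "[":
            return None
        closer = "]" + "=" * (k - j - 1) + "]"   # same number of '=' signs
        end = src.find(closer, k + 1)
        return len(src) if end == -1 else end + len(closer)

    c = src[i]
    if c in "\"'":                               # rule 1: quoted string
        j = i + 1
        while j < len(src) and src[j] != c:
            j += 2 if src[j] == "\\" else 1      # skip the escaped character
        return min(j + 1, len(src))
    if src.startswith("--", i):                  # rules 2 and 3: comments
        if src.startswith("[", i + 2):
            end = long_bracket_end(i + 2)
            if end is not None:
                return end                       # multiline comment
        nl = src.find("\n", i)
        return len(src) if nl == -1 else nl      # single-line comment
    if c == "[":                                 # rule 3: long string
        end = long_bracket_end(i)
        if end is not None:
            return end
    return i
```

A caller would invoke this at every position and jump to the returned index whenever it differs from i. Only the remaining characters are then examined for keywords.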

When you finally step forward and reach your "input" position, search the stack from the top down for the nearest function scope. If there is none, the position belongs to the main chunk of the file. Otherwise, step forward from that function keyword to the next ), ignoring any ) inside comments (which could theoretically be found in a parameter list if it spans multiple lines or contains inline --[[ ]] comments). That position is your result.
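Putting the whole procedure together, here is a compact Python sketch. It is one possible arrangement, not a definitive implementation: it uses a regular expression as the tokenizer instead of stepping character by character, and the names TOKEN, CLOSE_PAREN, OPENERS, to_offset and enclosing_function_paren are all this sketch's own. It pushes only on function, if and do, because while and for loops are closed through their do. It assumes valid Lua 5.1 input, a 1-based line and a 0-based column, and that the file was decoded one byte per character (for example as latin-1) so that string indexes equal byte offsets.

```python
import re

# One alternative per lexical element. Order matters: the multiline comment
# must be tried before the single-line comment.
TOKEN = re.compile(r"""
    --\[(?P<c>=*)\[.*?\](?P=c)\]      # multiline comment
  | --[^\n]*                          # single-line comment
  | \[(?P<s>=*)\[.*?\](?P=s)\]        # long string
  | "(?:\\.|[^"\\])*"                 # double-quoted string
  | '(?:\\.|[^'\\])*'                 # single-quoted string
  | (?P<num>\d\w*)                    # number, so 1e5 is never a word
  | (?P<word>[A-Za-z_]\w*)            # identifier or keyword
""", re.VERBOSE | re.DOTALL | re.ASCII)

# Finds ')' while stepping over comments in a parameter list.
CLOSE_PAREN = re.compile(
    r"--\[(?P<lvl>=*)\[.*?\](?P=lvl)\]|--[^\n]*|(?P<paren>\))", re.DOTALL)

OPENERS = {"function", "if", "do"}    # keywords whose block is closed by 'end'

def to_offset(src, line, col):
    """Convert a 1-based line and 0-based column (assumed) to an offset."""
    starts = [0] + [m.end() for m in re.finditer("\n", src)]
    return starts[line - 1] + col

def enclosing_function_paren(src, pos):
    """Offset of the ')' closing the parameter list of the innermost function
    that contains pos, or None if pos is in the main chunk."""
    stack = []                                   # (keyword, offset) pairs
    for m in TOKEN.finditer(src):
        if m.start() >= pos:
            break
        word = m.group("word")                   # None for strings/comments
        if word in OPENERS:
            stack.append((word, m.start()))
        elif word == "end":
            stack.pop()                          # IndexError: unbalanced input
    for word, start in reversed(stack):          # innermost scope first
        if word == "function":
            for t in CLOSE_PAREN.finditer(src, start):
                if t.group("paren"):
                    return t.start()
    return None

SRC = 'local function foo()\n  print("bar")\nend\n'
pos = to_offset(SRC, 2, 9)
print(enclosing_function_paren(SRC, pos))        # 19
```

The returned value is the offset of the ) itself. Add 1 if you want the position just after it, which is where the carets in the question point.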

This should handle every case in syntactically valid code, including situations where the function syntactic sugar is used, like

function foo()
    print("bar")
end

which you did not include in your example but which I imagine you still want to match.