2
votes

I'm writing a recursive descent parser in Ruby that uses regexp to match terminals. Terminals are actually regexps, and are matched against the current position in the string.

The problem is that the terminal regex can contain anything, including regular expressions that match newlines. For example, a terminal that matches anything between parenthesis /\([^\)]*\)/ will consume whitespace, including the newlines I need to count. I've come up with a several solutions, but they all have drawbacks which I don't particularly like:

  1. Whenever a terminal is matched, count all occurences of \n in the match. This would effectively mean that every string is matched twice instead of once,

  2. Instead of storing the current line, I could store the current position of the string, and obtain the line and column number only when I need it by traversing the string. Obviously problematic because the whole string is traversed every time the line number is needed.

  3. Instead of allowing regexps as terminals, I could allow a simpler form of matchers, similar to what ANTLR allows, and then manually match the string, counting newlines. However, it would require no small amount of extra work and a loss of the matching power regular expressions have.

I'm leaning towards the third solution, however I'd like to see if anyone has dealt with a similar problem and has a better solution which would save me the trouble.

1
It is too long. Most people would not want to read your question.sawa
You're right, I'll try to make it as short as possible.migimunz

1 Answers

1
votes

You could use your solution 2, but with a "line index" of your source file.

You make a first phase to obtain an array of positions of lines starts. You can then obtain the line number of a position in O(log n) by binary search (n being the count of lines). By the way, it gets you as well the position in the line as you know pos - lines_start[line] on O(1), this is precious in error reporting for lines of code that are not trivial.