As many have pointed out, Emacspeak has been the enduring cross-platform solution for many of the older hackers out there. Since it supports Linux and Mac out of the box, it has become my preferred means of developing Windows-agnostic projects.
As for actually taking in syntax through an auditory channel rather than a visual one, I have found a variety of techniques that get you close to, if not onto, the same playing field.
Auditory icons can stand in for verbal descriptions. For example, you can play a tone to indicate how far a line is indented: the longer the tone, the deeper the indent. Since tones can play in parallel with text-to-speech, the information comes through in the same timeframe and doesn't serialize the communication of something so basic.
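To make the indent-to-tone idea concrete, here is a minimal sketch in Python. It only computes how long the tone should be; actually playing it alongside speech is left to whatever audio backend you have. The function names and the 15 ms per column and 300 ms cap are my own illustrative assumptions, not anything Emacspeak defines.

```python
def indent_width(line: str, tab_size: int = 4) -> int:
    """Count leading whitespace columns, expanding tabs to tab_size."""
    expanded = line.expandtabs(tab_size)
    return len(expanded) - len(expanded.lstrip(" "))


def tone_duration_ms(line: str, ms_per_column: int = 15, max_ms: int = 300) -> int:
    """Longer tone for deeper indentation, capped so deep nesting stays short.

    ms_per_column and max_ms are hypothetical tuning values.
    """
    return min(indent_width(line) * ms_per_column, max_ms)


if __name__ == "__main__":
    for text in ["def f():", "    return 1", "\t\tx = 2"]:
        print(repr(text), tone_duration_ms(text), "ms")
```

The cap is a design choice: past a certain depth the exact length stops being distinguishable by ear anyway, so there is little point in delaying the next line.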
Braille can quickly and precisely convey the exact syntax of a line. This is more useful for people who use braille in daily life; the biggest advantage is random access to the contents of the display. Refreshable displays typically have cursor routing keys above each character cell, which move the cursor straight to that cell. No fiddling with arrow keys: an O(n) operation versus O(1) access.
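The O(n) versus O(1) point can be spelled out with a toy model of key presses. This is just arithmetic under the assumption that each arrow press moves one cell and each routing key jumps directly to its cell; the function names are made up for illustration.

```python
def arrow_key_presses(cursor: int, target: int) -> int:
    """One press per cell travelled: grows linearly with the distance."""
    return abs(target - cursor)


def routing_key_presses(cursor: int, target: int) -> int:
    """A single press on the key above the target cell, however far away."""
    return 0 if cursor == target else 1


if __name__ == "__main__":
    # Moving from cell 0 to cell 37 on one display line.
    print(arrow_key_presses(0, 37), "arrow presses")
    print(routing_key_presses(0, 37), "routing key press")
```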
Auditory dimensions (pitch, rate, volume, inflection, richness, stress, etc.) can each convey a concept (keyword, class, variable, error, etc.). For example, comments can be read in a monotone inflection...fitting, if I might say so :).
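As a rough sketch of the idea, the snippet below tags each token of a piece of Python source with a category and looks up a voice for it. It uses Python's standard tokenize and keyword modules; the Voice fields, the VOICES table, and the numbers in it are hypothetical values I picked for illustration, and feeding them to a real speech synthesizer is left out.

```python
import io
import keyword
import tokenize
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    pitch: float       # relative to the default voice (1.0)
    rate: float        # relative speaking rate
    inflection: float  # 0.0 means monotone


# Hypothetical mapping from token category to voice settings.
VOICES = {
    "comment": Voice(pitch=0.9, rate=1.1, inflection=0.0),
    "keyword": Voice(pitch=1.2, rate=1.0, inflection=1.0),
    "string": Voice(pitch=0.8, rate=1.0, inflection=0.6),
    "default": Voice(pitch=1.0, rate=1.0, inflection=1.0),
}

# Layout-only tokens that carry nothing to speak.
_SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
         tokenize.DEDENT, tokenize.ENDMARKER}


def categorize(source: str):
    """Return a list of (token_text, category) pairs for speakable tokens."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.COMMENT:
            category = "comment"
        elif tok.type == tokenize.STRING:
            category = "string"
        elif tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            category = "keyword"
        else:
            category = "default"
        result.append((tok.string, category))
    return result


if __name__ == "__main__":
    for text, category in categorize("if x:  # check\n    pass\n"):
        print(f"{text!r:12} {category:8} {VOICES[category]}")
```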
Emacs, and to a lesser extent other editors (Visual Studio), let a coder navigate a program semantically (next block, fold block, down defun, jump to def, walk up the parse tree, etc.). You can very quickly get the "big" picture of the structure of an entire project this way; with extensions like CEDET, you can get the goodness of VS/Eclipse/etc. cross-platform and in a text editor.
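To give a feel for what that kind of structural view buys you, here is a small sketch that builds an outline of classes and functions from a parse tree. It uses Python's standard ast module on Python source purely as a stand-in; it is not how Emacs or CEDET do it, and the outline function is my own illustrative helper.

```python
import ast


def outline(source: str):
    """Return (depth, kind, name, lineno) for each class and function found."""
    entries = []

    def visit(node, depth):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                entries.append((depth, "def", child.name, child.lineno))
                visit(child, depth + 1)
            elif isinstance(child, ast.ClassDef):
                entries.append((depth, "class", child.name, child.lineno))
                visit(child, depth + 1)
            else:
                # Keep descending so definitions inside if/try blocks are found.
                visit(child, depth)

    visit(ast.parse(source), 0)
    return entries


if __name__ == "__main__":
    code = "class A:\n    def f(self):\n        pass\n\ndef g():\n    pass\n"
    for depth, kind, name, lineno in outline(code):
        print("  " * depth + f"{kind} {name} (line {lineno})")
```

Reading a list like this aloud, one entry per keystroke, is a lot faster than listening to every line of the file to find where a function starts.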
I could probably go on and on, but that, in a nutshell, is why a few of us are out there hacking away in industry, academia, or in our basements :).