13
votes

I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However, we also have some plain text outputs, such as an HTML version of the documentation (I already have code to write minimal markup for that) and a non-TeX-enabled plot renderer.

For these I would like to eliminate the TeX markup that is necessary for e.g. representing physical units. This includes non-breaking (thin) spaces, \text, \mathrm etc. It would also be nice to parse down things like \frac{#1}{#2} into #1/#2 for the plain text output (and use MathJax for the HTML). Due to the system that we've got at the moment, I need to be able to do this from Python, i.e. ideally I'm looking for a Python package, but a non-Python executable which I can call from Python and catch the output string would also be fine.

I'm aware of the similar question on the TeX StackExchange site, but there weren't any really programmatic solutions there. I've looked at detex, plasTeX and pytex, which all seem a bit dead and don't really do what I need: programmatic conversion of a TeX string to a representative plain text string.

I could try writing a basic TeX parser using e.g. pyparsing, but a) that might be pitfall-laden and help would be appreciated and b) surely someone has tried that before, or knows of a way to hook into TeX itself to get a better result?

Update: Thanks for all the answers... it does indeed seem to be a bit of an awkward request! I can make do with less than general parsing of LaTeX, but the reason for considering a parser rather than a load of regexes in a loop is that I want to be able to handle nested macros and multi-arg macros nicely, and get the brace matching to work properly. Then I can e.g. reduce txt-irrelevant macros like \text and \mathrm first, and handle txt-relevant ones like \frac last... maybe even with appropriate parentheses! Well, I can dream... for now regexes are not doing such a terrible job.
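For illustration, the brace-matching approach described above might look something like the following minimal sketch. It is not a general TeX parser: the function names (`tex_to_plain`, `read_group`, `read_arg`, `parenthesize`) and the macro lists are assumptions made up for this example, and it only covers font-style wrappers such as \text and \mathrm, a few spacing commands, and \frac with parentheses added around compound arguments.

```python
import re

MACRO = re.compile(r"\\([A-Za-z]+)")
# Assumed list of one-argument macros that are irrelevant in plain text.
UNWRAP = {"text", "mathrm", "textrm", "mathbf", "textbf", "mathit", "textit", "emph"}
# Assumed mapping of spacing commands to plain-text equivalents.
SPACING = {r"\,": " ", r"\;": " ", r"\:": " ", r"\ ": " ", r"\!": ""}


def read_group(s, i):
    """Return (content, next_index) for the brace group opening at s[i] == '{'."""
    depth = 0
    j = i
    while j < len(s):
        if s[j] == "\\":  # skip escaped characters such as \{ and \}
            j += 2
            continue
        if s[j] == "{":
            depth += 1
        elif s[j] == "}":
            depth -= 1
            if depth == 0:
                return s[i + 1:j], j + 1
        j += 1
    raise ValueError("unbalanced braces")


def read_arg(s, i):
    """Read one macro argument: a brace group, or else a single character."""
    while i < len(s) and s[i].isspace():
        i += 1
    if i >= len(s):
        raise ValueError("missing macro argument")
    if s[i] == "{":
        return read_group(s, i)
    return s[i], i + 1


def parenthesize(t):
    """Wrap t in parentheses unless it is a single word-like token."""
    return t if re.fullmatch(r"\w+", t) else f"({t})"


def tex_to_plain(s):
    """Recursively reduce a small subset of TeX markup to plain text."""
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        if s[i:i + 2] in SPACING:
            out.append(SPACING[s[i:i + 2]])
            i += 2
        elif c == "\\":
            m = MACRO.match(s, i)
            if m is None:  # control symbol such as \% or \{: keep the character
                out.append(s[i + 1:i + 2])
                i += 2
                continue
            name, i = m.group(1), m.end()
            if name == "frac":
                num, i = read_arg(s, i)
                den, i = read_arg(s, i)
                out.append(parenthesize(tex_to_plain(num)) + "/"
                           + parenthesize(tex_to_plain(den)))
            elif name in UNWRAP:
                arg, i = read_arg(s, i)
                out.append(tex_to_plain(arg))
            else:  # unknown macro: keep its name, e.g. \sigma -> sigma
                out.append(name)
        elif c == "{":  # bare group: drop the braces, keep the content
            group, i = read_group(s, i)
            out.append(tex_to_plain(group))
        elif c == "~":  # non-breaking space
            out.append(" ")
            i += 1
        elif c == "$":  # math delimiters carry no text
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(tex_to_plain(r"10\,\mathrm{GeV}"))           # 10 GeV
print(tex_to_plain(r"\frac{1}{1+\frac{x}{y}}"))    # 1/(1+x/y)
```

Because the arguments are reduced recursively, nested and multi-argument macros come out in the right order. As the \def caveat elsewhere in this thread points out, this only works while the input sticks to a known, restricted set of macros.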

8
You are right, pyparsing of TeX is a brutal thing, but others have made some headway with this. matplotlib contains a pyparsing TeX parser that you can perhaps bend to your purpose. You could also try posting on the pyparsing mailing list and see if some of those who have done TeX work in the past might be able to help. - PaulMcG
Thanks: I'll look first in matplotlib... that's also a pre-existing dependency for one of my packages, so if I'm very lucky I can use it via the mpl API! Cheers :) - andybuckley

8 Answers

11
votes

I understand this is an old post, but since it comes up often in latex-python-parsing searches (as evidenced by "Extract only body text from arXiv articles formatted as .tex"), I'm leaving this here for folks down the line. Here's a LaTeX parser in Python that supports search over and modification of the parse tree: https://github.com/alvinwan/texsoup. Taken from the README, here is sample text and how you can interact with it via TexSoup.

from TexSoup import TexSoup
soup = TexSoup("""
\begin{document}

\section{Hello \textit{world}.}

\subsection{Watermelon}

(n.) A sacred fruit. Also known as:

\begin{itemize}
\item red lemon
\item life
\end{itemize}

Here is the prevalence of each synonym.

\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}

\end{document}
""")

Here's how to navigate the parse tree.

>>> soup.section  # grabs the first `section`
\section{Hello \textit{world}.}
>>> soup.section.name
'section'
>>> soup.section.string
'Hello \\textit{world}.'
>>> soup.section.parent.name
'document'
>>> soup.tabular
\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}
>>> soup.tabular.args[0]
'c c'
>>> soup.item
\item red lemon
>>> list(soup.find_all('item'))
[\item red lemon, \item life]

Disclaimer: I wrote this lib, but it was for similar reasons. Regarding the post by Little Bobby Tales (regarding def), TexSoup doesn't handle definitions.

6
votes

A word of caution: writing a complete parser for plain TeX is much harder than you might think. The TeX-level (not LaTeX) \def command actually extends TeX's syntax. For example, \def\foo #1.{{\bf #1}} will expand \foo goo. into {\bf goo}, i.e. a bold "goo". Notice that the dot became a delimiter for the \foo macro! Therefore, if you have to deal with any form of TeX, without restrictions on which packages may be used, relying on simple parsing is not recommended. You need TeX itself to do the rendering. catdvi is what I use, although it is not perfect.

2
votes

Try detex (shipped with most *TeX distributions), or the improved version: http://code.google.com/p/opendetex/

Edit: oh, I see you tried detex already. Still, opendetex might work for you.

2
votes

I would try pandoc (http://johnmacfarlane.net/pandoc/index.html). It is written in Haskell, but it is a really nice LaTeX-to-whatever converter.

1
votes

As you're considering using TeX itself for doing the rendering, I suspect that performance is not an issue. In this case you've got a couple of options: dvi2txt to fetch your text from a single dvi file (be prepared to generate one for each label), or even rendering the dvi into raster images, if that's OK for you. That's how hevea and latex2html treat formulas.

0
votes

Building on the other post by Eduardo Leoni: I was looking at pandoc and I see that it comes with a standalone executable, but also on this page it promises a way to build it as a C-callable system library. Perhaps this is something that you can live with?
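If the standalone executable is good enough, calling it from Python and catching the output string could look roughly like the sketch below. The helper name `run_text_filter` is made up for this example, and it assumes a `pandoc` binary on the PATH; pandoc reads from standard input when no input file is given, and `--from=latex --to=plain` selects the LaTeX reader and the plain text writer.

```python
import subprocess


def run_text_filter(text, cmd=("pandoc", "--from=latex", "--to=plain")):
    """Pipe text through an external command and return what it prints.

    Raises subprocess.CalledProcessError if the command exits with an error,
    and FileNotFoundError if the executable is not installed.
    """
    result = subprocess.run(
        list(cmd),
        input=text,           # sent to the command's stdin
        capture_output=True,  # collect stdout and stderr
        encoding="utf-8",     # text mode with an explicit encoding
        check=True,
    )
    return result.stdout


# Example call (needs pandoc installed, so it is left commented out):
# label = run_text_filter(r"Energy: $10\,\mathrm{GeV}$")
```

Starting one process per label is slow if there are many strings to convert, so it may be worth batching several labels into a single call.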

0
votes

Necroing this old thread, but found this nifty library called pylatexenc that seems to do almost exactly what the OP was after:

from pylatexenc.latex2text import LatexNodes2Text

# Convert the LaTeX source to plain text and print the result.
text = LatexNodes2Text().latex_to_text(r"""\
\section{Euler}
\emph{This} bit is \textbf{very} clever:
\begin{equation}
    \mathrm{e}^{i \pi} + 1 = 0  % wow!!
\end{equation}
where
\[
\mathrm{e} = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n
\]
""")
print(text)

which produces


§ EULER

This bit is very clever:

    e^i π + 1 = 0

where

    e = lim_n →∞(1 + 1/n)^n

As you can see, the result is not perfect for the equations, but it does a great job of stripping and converting all the tex commands.

-4
votes

"LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks"

This is your mistake. You shouldn't have done that.

Use RST or some other -- better -- markup language.

Use Docutils to create LaTeX and HTML from the RST source.