
I'm writing a small parser that extracts the content of every <td> tag with a specific class. For example, <td class="liste">some content</td> should give Right "some content".

I will be parsing a large HTML file, but I don't care about the surrounding noise. The idea is to consume all characters until I reach <td class="liste">, then consume all characters (the content) until </td>, and return the content string.

This works fine if the last element in the file is my td.liste tag. If any text sits between the last tag and the end of the file, my parser consumes that text and fails with "unexpected end of input". You can see this by running parseMyTest test3.

-- EDIT
See the end of test3 for the edge case.

Here is my code so far:

import Text.Parsec
import Text.Parsec.String

import Data.ByteString.Lazy (ByteString)
import Data.ByteString.Char8 (pack)

colOP :: Parser String
colOP = string "<td class=\"liste\">"

colCL :: Parser String
colCL = string "</td>"

col :: Parser String
col = do
  manyTill anyChar (try colOP)
  content <- manyTill anyChar $ try colCL
  return content

cols :: Parser [String]
cols = many col

test1 :: String
test1 = "<td class=\"liste\">Hello world!</td>"

test2 :: String
test2 = read $ show $ pack test1

test3 :: String
test3 = "\n\r<html>asdfasd\n\r<td class=\"liste\">Hello world 1!</td>\n<td class=\"liste\">Hello world 2!</td>\n\rasldjfasldjf<td class=\"liste\">Hello world 3!</td><td class=\"liste\">Hello world 4!</td>adsafasd"

parseMyTest :: String -> Either ParseError [String]
parseMyTest test = parse cols "test" test

btos :: ByteString -> String
btos = read . show
btw - for parsing HTML I find that tagsoup works well. - ErikR
I know, but I'm exploring parsec at the moment :D - Reygoch

1 Answer


I created a combinator skipTill p end which applies p until end matches and then returns what end returns.

By contrast, manyTill p end applies p until end matches and then returns what the p parsers matched.

import Text.Parsec
import Text.Parsec.String

skipTill :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m end
skipTill p end = scan
    where
      scan  = end  <|> do { p; scan }

-- To keep the example short, "(" and ")" stand in for the opening and closing tags.
td :: Parser String
td = do
  string "("
  manyTill anyChar (try (string ")"))

tds :: Parser [String]
tds = do r <- many (try (skipTill anyChar (try td)))
         many anyChar -- discard stuff at end
         return r

test1 = parse tds "" "111(abc)222(def)333" -- Right ["abc", "def"]

test2 = parse tds "" "111"                 -- Right []

test3 = parse tds "" "111(abc"             -- Right []

test4 = parse tds "" "111(abc)222(de"      -- Right ["abc"]
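
To connect this back to the question, here is a minimal sketch of the same approach with the real tags in place of the parentheses. The names tdListe, tdListes and extractListe are made up for this illustration, and the skipTill signature is specialised to Parser to keep it short. Treat it as a sketch, not a tested solution for arbitrary HTML.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Same combinator as above: run p until end succeeds, then return end's result.
skipTill :: Parser a -> Parser end -> Parser end
skipTill p end = scan
  where
    scan = end <|> (p >> scan)

-- One <td class="liste"> cell; returns the text between the tags.
-- (Hypothetical name, not from the original post.)
tdListe :: Parser String
tdListe = do
  _ <- string "<td class=\"liste\">"
  manyTill anyChar (try (string "</td>"))

-- Every such cell in the input; everything else is skipped.
tdListes :: Parser [String]
tdListes = do
  cells <- many (try (skipTill anyChar (try tdListe)))
  _ <- many anyChar -- discard whatever follows the last cell
  return cells

-- Hypothetical convenience wrapper around parse.
extractListe :: String -> Either ParseError [String]
extractListe = parse tdListes "input"
```

Running extractListe on the question's test3 should yield the four "Hello world" strings, with the trailing "adsafasd" discarded. A cell that is opened but never closed is dropped, as in test3 and test4 above.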

Update

This also appears to work:

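-- At end of input, stop. Otherwise keep a td if one starts here,
-- or else drop one character and carry on.
tds' :: Parser [String]
tds' = scan
  where scan = (eof >> return [])
               <|> do { r <- try td; rs <- scan; return (r:rs) }
               <|> do { anyChar; scan }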
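
For completeness, a note on why the parser in the question fails, plus a smaller change that may be enough. In Parsec, many p only stops cleanly when p fails without consuming input. On the trailing text, col consumes characters all the way to the end of the input before failing, so many reports the error instead of stopping. Wrapping col in try is a different technique from skipTill: it rewinds the failed attempt so that many can stop. The sketch below reuses the question's definitions and changes only cols. It leaves the trailing text unparsed, which is fine as long as the parser is not followed by eof.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

colOP :: Parser String
colOP = string "<td class=\"liste\">"

colCL :: Parser String
colCL = string "</td>"

-- As in the question: skip to the next opening tag, then read up to </td>.
col :: Parser String
col = do
  _ <- manyTill anyChar (try colOP)
  manyTill anyChar (try colCL)

-- The only change: try rewinds col when it runs off the end of the input,
-- so many stops and returns the cells collected so far.
cols :: Parser [String]
cols = many (try col)
```

With this version, parse cols "test" test3 should return the four cell contents instead of the "unexpected end of input" error.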