I'm writing a small parser that extracts the content of every <td> tag with a specific class. For example, <td class="liste">some content</td> should give Right "some content".
I will be parsing a large HTML file, but I don't care about all the noise around those tags. So the idea is to consume all characters until I reach <td class="liste">, then consume all characters (the content) until </td>, and return the content as a string.
This works fine if the last element in the file is my td.liste tag. But if there is any other text between the last tag and the end of the file, my parser consumes that text and fails with "unexpected end of input". You can see this by running parseMyTest test3.
-- EDIT
See the end of test3 (the text after the last </td>) to understand the edge case.
Here is my code so far:
import Text.Parsec
import Text.Parsec.String
import Data.ByteString.Lazy (ByteString)
import Data.ByteString.Char8 (pack)
colOP :: Parser String
colOP = string "<td class=\"liste\">"
colCL :: Parser String
colCL = string "</td>"
col :: Parser String
col = do
  -- skip everything up to and including the opening tag
  manyTill anyChar (try colOP)
  -- collect characters until the closing tag
  content <- manyTill anyChar $ try colCL
  return content
cols :: Parser [String]
cols = many col
test1 :: String
test1 = "<td class=\"liste\">Hello world!</td>"
test2 :: String
test2 = read $ show $ pack test1
test3 :: String
test3 = "\n\r<html>asdfasd\n\r<td class=\"liste\">Hello world 1!</td>\n<td class=\"liste\">Hello world 2!</td>\n\rasldjfasldjf<td class=\"liste\">Hello world 3!</td><td class=\"liste\">Hello world 4!</td>adsafasd"
parseMyTest :: String -> Either ParseError [String]
parseMyTest test = parse cols "test" test
btos :: ByteString -> String
btos = read . show
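A minimal sketch of one possible way around the edge case, assuming it is acceptable to simply ignore whatever follows the last matching cell. The idea is to wrap col in try, so that the final attempt (which scans the trailing text, hits end of input, and fails after consuming characters) backtracks instead of making many fail. The names colOpen, colClose, colsLenient, sample and parseSample are made up for this illustration, and performance on very large inputs is not assessed here.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

colOpen :: Parser String
colOpen = string "<td class=\"liste\">"

colClose :: Parser String
colClose = string "</td>"

-- Same shape as col above: skip to the opening tag, then read the content.
col :: Parser String
col = do
  _ <- manyTill anyChar (try colOpen)
  manyTill anyChar (try colClose)

-- Hypothetical variant of cols: `try col` rewinds the input when no further
-- opening tag is found, so `many` stops cleanly with the results so far.
-- `many anyChar` then discards the remaining text.
colsLenient :: Parser [String]
colsLenient = many (try col) <* many anyChar

-- Made-up input with trailing text after the last cell.
sample :: String
sample = "<html>x<td class=\"liste\">A</td>\n<td class=\"liste\">B</td>trailing"

parseSample :: Either ParseError [String]
parseSample = parse colsLenient "sample" sample
-- parseSample evaluates to Right ["A","B"]
```

Without the try, the last col consumes "trailing" before failing, and many propagates a failure that happened after input was consumed, which matches the error described above.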