7
votes

I want to parse input strings like this: "this is \"test \" message \"sample\" text"

I wrote a parser for a single quoted string that contains no escaped quotes:

parseString :: Parser String
parseString = do
  char '"'
  x <- (many $ noneOf "\"")
  char '"'
  return x

This parses simple strings like this: "test message"

Then I wrote a parser for quoted strings:

quotedString :: Parser String
quotedString = do
  initial <- string "\\\""
  x <- many $ noneOf "\\\"" 
  end <- string "\\\""
  return $ initial ++ x ++ end

This parses strings like this: \"test message\"

Is there a way to combine both parsers so that I can parse the input above? What is the idiomatic way to tackle this problem?

6
Why do you want to strip the initial and final quotation marks, but leave the escaping backslashes intact? I would think you'd want to parse the input "\"ab\\\"c\"" as either "\"ab\\\"c\"" (parsing strictly for validation) or as "ab\"c", but it seems you want "ab\\\"c", which doesn't seem so obviously useful. - dfeuer
@dfeuer No particular reason, was just playing around with Parsec. - Sibi

6 Answers

21
votes

This is what I would do:

escape :: Parser String
escape = do
    d <- char '\\'
    c <- oneOf "\\\"0nrvtbf" -- all the characters which can be escaped
    return [d, c]

nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"

character :: Parser String
character = fmap return nonEscape <|> escape

parseString :: Parser String
parseString = do
    char '"'
    strings <- many character
    char '"'
    return $ concat strings

Now all you need to do is call it:

parse parseString "test" "\"this is \\\"test \\\" message \\\"sample\\\" text\""
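
For reference, here is a self-contained sketch of the same parser with the imports it needs, assuming the parsec package and its Text.Parsec.String module. The name example is only for illustration and is not part of the answer above.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A backslash followed by one of the allowed escape characters.
escape :: Parser String
escape = do
    d <- char '\\'
    c <- oneOf "\\\"0nrvtbf"
    return [d, c]

-- Any character that is not a backslash, a quote, or a raw control character.
nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"

character :: Parser String
character = fmap return nonEscape <|> escape

parseString :: Parser String
parseString = do
    _ <- char '"'
    strings <- many character
    _ <- char '"'
    return (concat strings)

-- Hypothetical helper: runs the parser on the input from the question.
example :: Either ParseError String
example = parse parseString "test" "\"this is \\\"test \\\" message \\\"sample\\\" text\""
```

Evaluating example in GHCi gives Right "this is \\\"test \\\" message \\\"sample\\\" text". The escape sequences are validated but kept as written, so the backslashes survive in the result and only the outer quotes are stripped.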

Parser combinators are a bit difficult to understand at first, but once you get the hang of them they are easier than writing BNF grammars.

3
votes
quotedString = do
    char '"'
    -- Try the escaped quote first: noneOf "\"" would also accept the backslash.
    x <- many (try (char '\\' >> char '"') <|> noneOf "\"")
    char '"'
    return x

I believe, this should work.

3
votes

In case somebody is looking for a more out-of-the-box solution, an answer on Code Review provides just that. Here is a complete example with the right imports:

import           Data.Functor.Identity (Identity)
import           Text.Parsec
import           Text.Parsec.Language
import           Text.Parsec.String    (Parser)
import           Text.Parsec.Token

lexer :: GenTokenParser String u Identity
lexer = makeTokenParser haskellDef

strParser :: Parser String
strParser = stringLiteral lexer

parseString :: String -> Either ParseError String
parseString = parse strParser ""
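
As a quick check of how this behaves, here is a minimal sketch that can be loaded into GHCi, assuming the parsec package. Note that stringLiteral follows Haskell's string-literal syntax, so it decodes escape sequences instead of keeping the backslashes, and as a lexeme parser it also skips whitespace after the closing quote. The name sample is only for illustration.

```haskell
import Data.Functor.Identity (Identity)
import Text.Parsec (ParseError, parse)
import Text.Parsec.Language (haskellDef)
import Text.Parsec.String (Parser)
import Text.Parsec.Token (GenTokenParser, makeTokenParser, stringLiteral)

lexer :: GenTokenParser String u Identity
lexer = makeTokenParser haskellDef

strParser :: Parser String
strParser = stringLiteral lexer

parseString :: String -> Either ParseError String
parseString = parse strParser ""

-- Hypothetical sample: the input is the seven characters  "ab\"c"
sample :: Either ParseError String
sample = parseString "\"ab\\\"c\""
```

Here sample evaluates to Right "ab\"c": the outer quotes are removed and the escaped quote is decoded to a plain quote character.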
1
votes

I wanted to parse quoted strings and remove any backslashes used for escaping during the parsing step. In my simple language, the only escapable characters were double quotes and backslashes. Here is my solution:

quotedString :: Parser String
quotedString = between (char '"') (char '"') (many quotedStringChar)
  where
    quotedStringChar = escapedChar <|> normalChar
    escapedChar = char '\\' *> oneOf ['\\', '"']
    normalChar = noneOf "\""
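
To make the behaviour concrete, here is a small self-contained sketch of this approach, assuming the parsec package; the name sample is only for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

quotedString :: Parser String
quotedString = between (char '"') (char '"') (many quotedStringChar)
  where
    quotedStringChar = escapedChar <|> normalChar
    -- Drops the backslash and keeps only the escaped character.
    escapedChar = char '\\' *> oneOf "\\\""
    normalChar = noneOf "\""

-- Hypothetical sample: the input is  "a\"b\\c"
sample :: Either ParseError String
sample = parse quotedString "" "\"a\\\"b\\\\c\""
```

Here sample evaluates to Right "a\"b\\c", that is, the text a"b\c. A backslash followed by any other character makes the whole parse fail, because escapedChar has already consumed the backslash by the time oneOf rejects the next character.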
0
votes

I prefer the following because it is easier to read:

quotedString :: Parser String
quotedString = do
    a <- string "\""
    b <- concat <$> many quotedChar
    c <- string "\""
    -- return (a ++ b ++ c) -- if you want to preserve the quotes
    return b
    where quotedChar = try (string "\\\\")
                   <|> try (string "\\\"")
                   <|> ((noneOf "\"\n") >>= \x -> return [x] )

Aadit's solution may be faster because it does not use try, but it is probably harder to read.

Note that it is different from Aadit's solution. My solution ignores escaped things in the string and really only cares about \" and \\.

For example, let's assume you have a tab character in the string. My solution successfully parses "\"\t\"" to Right "\t". Aadit's solution reports unexpected "\t", expecting "\\" or "\"".

Also note that Aadit's solution only accepts 'valid' escapes. For example, it rejects "\"\\a\"" because \a is not in its list of escape sequences (although, according to man ascii, \a represents the system bell, so it is arguably valid). My solution just returns Right "\\a".

So we have two different use cases.

  • My solution: Parse quoted strings with possibly escaped quotes and escaped escapes

  • Aadit's solution: Parse quoted strings with valid escape sequences, where the valid escapes are \\, \", \0, \n, \r, \v, \t, \b and \f
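
The difference between the two use cases can be checked with a short sketch, assuming the parsec package. The names lenient, strict and accepts are hypothetical and introduced only for this comparison; lenient mirrors my parser and strict mirrors Aadit's.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Lenient variant: only \\ and \" are treated specially; everything else is kept.
lenient :: Parser String
lenient = do
    _ <- string "\""
    b <- concat <$> many quotedChar
    _ <- string "\""
    return b
  where
    quotedChar = try (string "\\\\")
             <|> try (string "\\\"")
             <|> (noneOf "\"\n" >>= \x -> return [x])

-- Strict variant: raw control characters and unknown escapes are rejected.
strict :: Parser String
strict = do
    _ <- char '"'
    strings <- many (fmap return nonEscape <|> escape)
    _ <- char '"'
    return (concat strings)
  where
    nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"
    escape = do
        d <- char '\\'
        c <- oneOf "\\\"0nrvtbf"
        return [d, c]

-- Hypothetical helper: True when the parser accepts the input.
accepts :: Parser String -> String -> Bool
accepts p = either (const False) (const True) . parse p ""
```

With these definitions, accepts lenient "\"\t\"" and accepts lenient "\"\\a\"" are both True, while the same two inputs give False with strict. Both variants accept "\"\\n\"" and return Right "\\n", since neither of them decodes the escape.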

0
votes

Elaborating on @Priyatham's response, with the delimiter passed in as a parameter:

pEscString :: Char -> Parser String
pEscString e = do
    char e
    s <- many (escaped <|> many1 (noneOf (e : "\\")))
    char e
    return (concat s)
  where
    -- A backslash followed by any character, kept verbatim.
    escaped = do
        char '\\'
        c <- anyChar
        return ['\\', c]
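
A minimal usage sketch with a single quote as the delimiter, assuming the parsec package; the name sample is only for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Parses a string delimited by e, keeping backslash escapes verbatim.
pEscString :: Char -> Parser String
pEscString e = do
    _ <- char e
    s <- many (escaped <|> many1 (noneOf (e : "\\")))
    _ <- char e
    return (concat s)
  where
    escaped = do
        _ <- char '\\'
        c <- anyChar
        return ['\\', c]

-- Hypothetical sample: the single-quoted input  'it\'s'
sample :: Either ParseError String
sample = parse (pEscString '\'') "" "'it\\'s'"
```

Here sample evaluates to Right "it\\'s": the delimiters are stripped, and the backslash stays in the result together with the character it escapes.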