38
votes

I'm trying to write a CSV parser using Scala parser combinators. The grammar is based on RFC 4180. I came up with the following code. It almost works, but it does not separate the records correctly. What did I miss?

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = ((record~((CRLF~>record)*))<~(CRLF?)) ^^ { 
    case r~rs => r::rs
  }
  def record: Parser[List[String]] = (field~((COMMA~>field)*)) ^^ {
    case f~fs => f::fs
  }
  def field: Parser[String] = escaped|nonescaped
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}


println(CSV.parse(""" "foo", "bar", 123""" + "\r\n" + 
  "hello, world, 456" + "\r\n" +
  """ spam, 789, egg"""))

// Output: List(List(foo, bar, 123hello, world, 456spam, 789, egg)) 
// Expected: List(List(foo, bar, 123), List(hello, world, 456), List(spam, 789, egg))

Update: problem solved

By default, RegexParsers skips whitespace (spaces, tabs, carriage returns, and line feeds) matched by the regular expression \s+ before every token. That is why the parser above cannot separate records: the CRLF between them is consumed as whitespace before the CRLF parser ever sees it. Redefining whiteSpace as just [ \t] does not solve the problem, because the parser then drops all spaces inside fields (so "foo bar" in the CSV becomes "foobar"), which is undesired. Instead, we need to disable skipWhitespace. The updated source of the parser is thus:

import scala.util.parsing.combinator._

// A CSV parser based on RFC4180
// http://tools.ietf.org/html/rfc4180

object CSV extends RegexParsers {
  override val skipWhitespace = false   // meaningful spaces in CSV

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }  // combine 2 dquotes into 1
  def CRLF    = "\r\n" | "\n"
  def TXT     = "[^\",\r\n]".r
  def SPACES  = "[ \t]+".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ (CRLF?)

  def record: Parser[List[String]] = repsep(field, COMMA)

  def field: Parser[String] = escaped|nonescaped


  def escaped: Parser[String] = {
    ((SPACES?)~>DQUOTE~>((TXT|COMMA|CRLF|DQUOTE2)*)<~DQUOTE<~(SPACES?)) ^^ { 
      case ls => ls.mkString("")
    }
  }

  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }



  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case e => throw new Exception(e.toString)
  }
}
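
To compare the three whitespace settings discussed in the update, here is a minimal sketch. It assumes the scala-parser-combinators library is on the classpath. MiniCsv and WhitespaceModes are hypothetical names, and the grammar handles unquoted fields only, so it is an illustration rather than a replacement for the parser above.

```scala
import scala.util.parsing.combinator._

// Hypothetical mini grammar (unquoted fields only), parameterised on
// how RegexParsers treats whitespace. Not part of the original parser.
class MiniCsv(skip: Boolean, ws: String) extends RegexParsers {
  override val skipWhitespace = skip
  override protected val whiteSpace = ws.r

  def TXT: Parser[String] = "[^\",\r\n]".r
  def EOL: Parser[String] = "\r\n" | "\n"
  def field: Parser[String] = rep(TXT) ^^ (_.mkString)
  def record: Parser[List[String]] = repsep(field, ",")
  def file: Parser[List[List[String]]] = repsep(record, EOL)

  // Returns Nil when parsing fails, to keep the demo short.
  def parse(s: String): List[List[String]] = parseAll(file, s).getOrElse(Nil)
}

object WhitespaceModes {
  def main(args: Array[String]): Unit = {
    val input = "foo bar,baz\n1,2"
    // Default \s+ : the newline is skipped, so the records merge.
    println(new MiniCsv(true, "\\s+").parse(input))
    // List(List(foobar, baz1, 2))
    // [ \t]+ : records separate, but spaces inside fields are lost.
    println(new MiniCsv(true, "[ \t]+").parse(input))
    // List(List(foobar, baz), List(1, 2))
    // skipWhitespace = false : every character is left to the grammar.
    println(new MiniCsv(false, "\\s+").parse(input))
    // List(List(foo bar, baz), List(1, 2))
  }
}
```

Only the last setting keeps both the record boundaries and the spaces inside fields, which is why the updated parser handles optional spaces around quoted fields explicitly with SPACES.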
Why are the constants defined with def and not with val? Is there a benefit to it? - Sebastian N.
Check this out. tl;dr def uses less memory, val is faster. - rancidfishbreath
For compile-time constants there's really little difference - a "val" will initialize a field with that constant in the constructor then create a method which returns its value, while a "def" will simply return the constant - and for compile-time constants this is effectively free. - Score_Under
@rancidfishbreath it's an object so there's only 1 instance so if it saves any memory, this will be negligible - herman

3 Answers

31
votes

What you missed is whitespace. I threw in a couple bonus improvements.

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override protected val whiteSpace = """[ \t]""".r

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = (escaped|nonescaped)
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}
7
votes

Since the Scala Parser Combinators library was moved out of the Scala standard library in 2.11, there is no good reason not to use the much more performant Parboiled2 library. Here is a version of the CSV parser in Parboiled2's DSL:

/*  based on comments in https://github.com/sirthias/parboiled2/issues/61 */
import org.parboiled2._
import scala.util.Try
case class Parboiled2CsvParser(input: ParserInput, delimeter: String) extends Parser {
  def DQUOTE = '"'
  def DELIMITER_TOKEN = rule(capture(delimeter))
  def DQUOTE2 = rule("\"\"" ~ push("\""))
  def CRLF = rule(capture("\r\n" | "\n"))
  def NON_CAPTURING_CRLF = rule("\r\n" | "\n")

  val delims = s"$delimeter\r\n" + DQUOTE
  def TXT = rule(capture(!anyOf(delims) ~ ANY))
  val WHITESPACE = CharPredicate(" \t")
  def SPACES: Rule0 = rule(oneOrMore(WHITESPACE))

  def escaped = rule(optional(SPACES) ~
    DQUOTE ~ (zeroOrMore(DELIMITER_TOKEN | TXT | CRLF | DQUOTE2) ~ DQUOTE ~
    optional(SPACES)) ~> (_.mkString("")))
  def nonEscaped = rule(zeroOrMore(TXT | capture(DQUOTE)) ~> (_.mkString("")))

  def field = rule(escaped | nonEscaped)
  def row: Rule1[Seq[String]] = rule(oneOrMore(field).separatedBy(delimeter))
  def file = rule(zeroOrMore(row).separatedBy(NON_CAPTURING_CRLF))

  def parsed() : Try[Seq[Seq[String]]] = file.run()
}
3
votes

The default whitespace pattern for RegexParsers is \s+, which includes newlines. So CR, LF and CRLF never get a chance to be processed, because they are skipped automatically before each token is matched.
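
A quick way to check this without the parser library is to apply the same pattern with the standard library's Regex. This is a minimal sketch: WhitespaceCheck and skippedPrefix are hypothetical names, and the prefix match only approximates the skipping that RegexParsers performs before each token.

```scala
import scala.util.matching.Regex

object WhitespaceCheck {
  // Same pattern as the RegexParsers default described above.
  val defaultWhiteSpace: Regex = """\s+""".r
  // A narrower pattern that leaves line terminators alone.
  val spacesAndTabs: Regex = """[ \t]+""".r

  // Hypothetical helper: the leading text a whitespace pattern would skip.
  def skippedPrefix(ws: Regex, input: String): String =
    ws.findPrefixOf(input).getOrElse("")

  def main(args: Array[String]): Unit = {
    val rest = "\r\nhello, world, 456"
    // \s+ swallows the CRLF, so a CRLF parser never sees it.
    println(skippedPrefix(defaultWhiteSpace, rest).length) // 2
    // [ \t]+ skips nothing here, so the CRLF is still available.
    println(skippedPrefix(spacesAndTabs, rest).length) // 0
  }
}
```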