2
votes

We have a file containing data that we want to map to a case class. I know enough to brute-force it, but I'm looking for an idiomatic way to do it in Scala.

Given File:

#record
name:John Doe
age: 34

#record
name: Smith Holy
age: 33 

# some comment

#record
# another comment
name: Martin Fowler
age: 99 

(Field values that span two lines are INVALID; e.g. name:John\n Smith should produce an error.)

And the case class

case class Record(name:String, age:Int) 

I want to return a Seq type such as Stream:

val records: Stream[Record]

A couple of ideas I'm considering but haven't implemented yet:

  1. Remove all newlines and treat the whole file as one long string, then match it against the regex "((?!name).)+((?!age).)+age:([\s\d]+)" and create an instance of my case class for each match. So far my regex-fu is too weak to match around comments.

  2. Recursive idea: iterate through the lines to find the first one that matches #record, then recursively call the function to match name, then age. Tail-recursively return Some(new Record(cumulativeMap.get(name), cumulativeMap.get(age))), or None when the next record is hit after name (i.e. age was never encountered).

  3. A better idea?
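
For concreteness, idea 2 could look roughly like the sketch below. It rests on a few assumptions: Scala 2.12 or later (for right-biased Either), results reported as Either[String, Record] instead of Option so the reason for a rejected record is kept, and any non-comment line without a key: prefix (such as a continuation line) marks the current record as invalid. The names RecordParser, Block and parse are made up for illustration.

```scala
import scala.annotation.tailrec
import scala.util.Try

case class Record(name: String, age: Int)

object RecordParser {
  // "key: value" on a single (already trimmed) line
  private val Field = """(\w+)\s*:\s*(.*)""".r

  // Fields collected for the record currently being read, plus the first error seen
  private final case class Block(fields: Map[String, String] = Map.empty,
                                 error: Option[String] = None)

  private def build(block: Block): Either[String, Record] = block.error match {
    case Some(msg) => Left(msg)
    case None =>
      for {
        name   <- block.fields.get("name").toRight("missing name")
        ageStr <- block.fields.get("age").toRight("missing age")
        age    <- Try(ageStr.toInt).toOption.toRight(s"invalid age: $ageStr")
      } yield Record(name, age)
  }

  def parse(lines: List[String]): List[Either[String, Record]] = {
    @tailrec
    def loop(rest: List[String],
             open: Option[Block],
             done: List[Either[String, Record]]): List[Either[String, Record]] = {
      // Results so far with the open block (if any) closed; newest first
      def flushed = open.map(build).toList ::: done
      rest match {
        case Nil => flushed.reverse
        case line :: tail => line.trim match {
          case "#record" =>
            loop(tail, Some(Block()), flushed)
          case l if l.isEmpty || l.startsWith("#") =>
            loop(tail, open, done) // blank line or comment: skip
          case Field(key, value) =>
            loop(tail, open.map(b => b.copy(fields = b.fields + (key -> value))), done)
          case other =>
            // e.g. a value continued on a second line: remember the first error
            val msg = s"unexpected line: $other"
            loop(tail, open.map(b => b.copy(error = b.error.orElse(Some(msg)))), done)
        }
      }
    }
    loop(lines, None, Nil)
  }
}
```

Calling RecordParser.parse(scala.io.Source.fromFile("file.txt").getLines().toList) would then give one Right or Left per #record block. Lines that appear before the first #record are ignored in this sketch.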

Thanks for reading! The real file is more complicated than the example above, but the same rules apply. For the curious: I'm trying to parse a custom M3U playlist file format.

4 Answers

2
votes

I'd use kantan.regex for a fairly trivial regex-based solution.

Without fancy shapeless derivation, you can write the following:

import kantan.regex._
import kantan.regex.implicits._

case class Record(name:String, age:Int) 
implicit val decoder: MatchDecoder[Record] = MatchDecoder.ordered(Record.apply _)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList

This yields:

List(Success(Record(John Doe,34)), Success(Record(Smith Holy,33)), Success(Record(Martin Fowler,99)))

Note that this solution requires you to hand-write the decoder, but it can often be derived automatically. If you don't mind a shapeless dependency, you could simply write:

import kantan.regex._
import kantan.regex.implicits._
import kantan.regex.generic._

case class Record(name:String, age:Int) 
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList

And get the exact same result.

Disclaimer: I'm the library's author.

1
votes

You could use Parser Combinators.

If you have the file format specification in BNF, or can write one, you can translate those rules almost one-to-one into a parser with Scala's parser combinator library. This may be more robust than a hand-made regex-based parser. It's certainly more "Scala".
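
As a rough illustration, here is a minimal sketch of such a grammar for the sample file. It assumes the scala-parser-combinators module is on the classpath (it has been a separate artifact since Scala 2.11), and the rule names (eol, comment, junk, header and so on) as well as RecordGrammar and parseRecords are my own, not part of any specification. Whitespace skipping is turned off so that line breaks are significant and a value continued on a second line is rejected.

```scala
import scala.util.parsing.combinator.RegexParsers

case class Record(name: String, age: Int)

object RecordGrammar extends RegexParsers {
  // Line breaks matter here, so do not skip whitespace automatically
  override def skipWhitespace = false

  // End of a line, allowing trailing spaces or tabs
  private val eol: Parser[String] = """[ \t]*\r?\n""".r
  // A whole comment line: starts with '#' but is not the "#record" marker
  private val comment: Parser[String] = """[ \t]*#(?!record\b)[^\n]*\r?\n""".r
  // Any run of comment lines and blank lines
  private val junk: Parser[List[String]] = rep(comment | eol)

  private val header: Parser[Any] = "#record" ~ eol
  private val name: Parser[String] =
    """name:[ \t]*""".r ~> """[^\r\n]+""".r <~ eol ^^ (_.trim)
  private val age: Parser[Int] =
    """age:[ \t]*""".r ~> """\d+""".r <~ eol ^^ (_.toInt)

  // record ::= header junk name junk age
  private val record: Parser[Record] =
    header ~> junk ~> name ~ (junk ~> age) ^^ { case n ~ a => Record(n, a) }

  // file ::= junk (record junk)*
  private val file: Parser[List[Record]] = junk ~> rep(record <~ junk)

  def parseRecords(input: String): Either[String, List[Record]] =
    // Append a newline so the last line is terminated like all the others
    parseAll(file, input + "\n") match {
      case Success(records, _) => Right(records)
      case failure: NoSuccess  => Left(failure.msg)
    }
}
```

With this sketch, the whole input either parses into a list of records or yields a single error message. Reporting errors per record would need a slightly different grammar.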

1
votes

I don't have much experience with Scala, but these regexes might work:

You could use (?<=name:).* to match the name value and (?<=age:).* to match the age value. If you use these, trim the whitespace from the matches; otherwise name: bob will yield bob with a leading space, which you probably don't want.

If name: or any other tag appears inside a comment, or a comment follows a value, that text will be matched as well. Please leave a comment if you want to avoid that.
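
In Scala, those two patterns could be applied with the standard scala.util.matching.Regex, roughly as in the sketch below. It assumes every record contains exactly one name and one age, in that order, so the two lists of matches can simply be zipped. LookbehindExtract and extract are made-up names.

```scala
case class Record(name: String, age: Int)

object LookbehindExtract {
  // '.' does not match line terminators by default, so each match stops at the end of its line
  private val NameValue = """(?<=name:).*""".r
  private val AgeValue  = """(?<=age:).*""".r

  def extract(input: String): List[Record] = {
    val names = NameValue.findAllIn(input).map(_.trim).toList
    // toInt throws NumberFormatException if an age value is not a plain integer
    val ages = AgeValue.findAllIn(input).map(_.trim.toInt).toList
    // zip silently drops unpaired values, so missing fields are not detected
    names.zip(ages).map { case (n, a) => Record(n, a) }
  }
}
```

The caveats above still apply: a name: or age: inside a comment is picked up too, and a value continued on a second line is not reported as an error.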

1
votes

You could try this:

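import java.nio.charset.Charset
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._ // use scala.jdk.CollectionConverters._ on 2.13+

val file = Paths.get("file.txt")
// readAllLines returns a java.util.List, so convert it to a Scala List first
val lines = Files.readAllLines(file, Charset.defaultCharset()).asScala.toList

// Keep only the field lines and pair them up; assumes every record has name, then age
val records = lines.filter(s => s.startsWith("age:") || s.startsWith("name:"))
                   .grouped(2).toList.map {
  case List(a, b) => Record(a.stripPrefix("name:").trim,
                            b.stripPrefix("age:").trim.toInt)
}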