0
votes

I have a text file that looks like this:

ABC gibberish
DEF gibberish
ABC text
DEF random

I only want to keep the lines that start with ABC. This is what I've tried:

val lines = sc.textFile("textfile.txt")
val reg = "^ABC".r
val abc_lines = lines.filter(x => reg.pattern.matcher(x).matches)
abc_lines.count()

The count returns 0, so nothing matches. Where did I go wrong?

3 Answers

3
votes

You don't need a regex for this; you can just use the startsWith method.

val abc_lines = lines.filter(x => x.startsWith("ABC"))
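
The same predicate can be tried on a plain Scala collection without a Spark context. This is only a minimal sketch: the sample list and the names sample and abcOnly are illustrative assumptions that stand in for the RDD and mirror the lines from the question.

```scala
// Sample data mirroring the file in the question (stands in for sc.textFile).
val sample = List("ABC gibberish", "DEF gibberish", "ABC text", "DEF random")

// Keep only the lines that begin with the literal prefix "ABC".
val abcOnly = sample.filter(_.startsWith("ABC"))

println(abcOnly.size) // prints 2
```

Since filter on an RDD takes the same kind of String => Boolean function, the predicate should carry over unchanged.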

1
votes

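Because the matches method does not do what you expect: it succeeds only if the entire string matches the pattern, not just the beginning (see the java.util.regex.Matcher documentation).

You can try this snippet to see the behavior: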

val list = List("ABC", "DEF gibberish", "ABC text", "DEF random")
val reg = "^ABC".r
// matches succeeds only when the whole string matches, so only "ABC" passes
val lines: Seq[String] = list.filter(x => reg.pattern.matcher(x).matches)
println(lines.size) // prints 1

Instead, you can use this code:

val list2 = List("ABC", "DEF gibberish", "ABC text", "DEF random")
// findFirstIn returns Some(...) if the pattern matches; ^ anchors it to the start
val lines2: Seq[String] = list2.filter(reg.findFirstIn(_).isDefined)
println(lines2.size) // prints 2

You can find more info here - Matching against a regular expression in Scala
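
To make the difference concrete, here is a small sketch that compares matches, find, and lookingAt from java.util.regex.Matcher on the sample lines from the question. The variable names are illustrative, and the list stands in for the RDD.

```scala
val sample = List("ABC gibberish", "DEF gibberish", "ABC text", "DEF random")
val reg = "^ABC".r

// matches: the whole string must match the pattern, so no line qualifies.
val full = sample.filter(x => reg.pattern.matcher(x).matches())

// find: looks for a match anywhere; the ^ anchor pins it to the start.
val found = sample.filter(x => reg.pattern.matcher(x).find())

// lookingAt: the match must begin at the start, but need not cover the whole string.
val prefix = sample.filter(x => reg.pattern.matcher(x).lookingAt())

// matches also works when the pattern itself consumes the rest of the line.
val fullWithTail = sample.filter(_.matches("^ABC.*"))

println((full.size, found.size, prefix.size, fullWithTail.size)) // prints (0,2,2,2)
```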

1
votes

You can use the findFirstIn method of Regex as follows:

val abc_lines = lines.filter(x => "^ABC".r.findFirstIn(x) == Some("ABC"))

which should give you the correct result.

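Defining the regex outside the closure, as in the following, may give a Task not serializable error in Spark when reg belongs to an enclosing class that is not serializable (scala.util.matching.Regex itself is Serializable):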

val reg = "^ABC".r
val abc_lines = lines.filter(x => reg.findFirstIn(x) == Some("ABC"))