7
votes

I am trying to modify a large PostScript file in Scala (some are as large as 1GB in size). The file is a group of batches, with each batch containing a code that represents the batch number, number of pages, etc.

I need to:

  1. Search the file for the batch codes (which always start with the same line in the file)
  2. Count the number of pages until the next batch code
  3. Modify the batch code to include how many pages are in each batch.
  4. Save the new file in a different location.

My current solution uses two iterators (iterA and iterB), created from Source.fromFile("file.ps").getLines. The first iterator (iterA) traverses in a while loop to the beginning of a batch code (with iterB.next being called each time as well). iterB then continues searching until the next batch code (or the end of the file), counting the number of pages it passes as it goes. Then, it updates the batch code at iterA's position, and the process repeats.
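For concreteness, the two-iterator flow above can be sketched on a tiny in-memory example. Here "batch" and "p" are hypothetical stand-ins for the real batch-code and page lines in the PostScript file:

```scala
object TwoIterators {
  // "batch" and "p" are hypothetical stand-ins for the real
  // batch-code and page lines in the PostScript file.
  def firstBatchPages(lines: Seq[String]): Int = {
    val iterA = lines.iterator
    val iterB = lines.iterator
    // advance both iterators in lockstep until iterA reaches a batch code
    while (iterA.hasNext && { val l = iterA.next(); iterB.next(); l != "batch" }) ()
    // iterB then counts pages until the next batch code or the end of input
    var pages = 0
    var done = false
    while (iterB.hasNext && !done) iterB.next() match {
      case "batch" => done = true
      case "p"     => pages += 1
      case _       => // ignore other lines
    }
    pages
  }

  def main(args: Array[String]): Unit =
    println(firstBatchPages(List("batch", "p", "p", "batch", "p"))) // prints 2
}
```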

This seems very non-Scala-like and I still haven't designed a good way to save these changes into a new file.

What is a good approach to this problem? Should I ditch iterators entirely? I'd prefer to do it without having to hold the entire input or output in memory at once.

Thanks!


3 Answers

3
votes

You could probably implement this with Scala's Stream class. I am assuming that you don't mind holding one "batch" in memory at a time.

import scala.annotation.tailrec
import scala.io._

def isBatchLine(line:String):Boolean = ...

def batchLine(size: Int):String = ...

val it = Source.fromFile("in.ps").getLines
// cannot use it.toStream here because of SI-4835
def inLines = Stream.continually(it).takeWhile(_.hasNext).map(_.next)

// Note: using `def` instead of `val` here means we don't hold
// the entire stream in memory
def batchedLinesFrom(stream: Stream[String]):Stream[String] = {
  val (batch, remainder) = stream span { !isBatchLine(_) }
  if (batch.isEmpty && remainder.isEmpty) { 
    Stream.empty
  } else {
    batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
  }
}

def newLines = batchedLinesFrom(inLines dropWhile isBatchLine)

val ps = new java.io.PrintStream(new java.io.File("out.ps"))

newLines foreach ps.println

ps.close()
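To see the shape of the transformation, here is a self-contained version of the same idea with the isBatchLine and batchLine placeholders filled in with hypothetical markers ("batch" starts a batch; every other line counts as a page), printing to the console instead of a file:

```scala
import scala.io.Source

object StreamBatchDemo {
  // Hypothetical stand-ins for the placeholders above: a batch code is
  // the line "batch"; the rewritten code carries the batch size.
  def isBatchLine(line: String): Boolean = line == "batch"
  def batchLine(size: Int): String = s"batch $size"

  // Prepend a size-carrying batch line to each batch, lazily.
  def batchedLinesFrom(stream: Stream[String]): Stream[String] = {
    val (batch, remainder) = stream span { !isBatchLine(_) }
    if (batch.isEmpty && remainder.isEmpty) Stream.empty
    else batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
  }

  def main(args: Array[String]): Unit = {
    val it = Source.fromString("batch\np\np\nbatch\np\n").getLines
    def inLines = Stream.continually(it).takeWhile(_.hasNext).map(_.next())
    batchedLinesFrom(inLines dropWhile isBatchLine) foreach println
    // prints: batch 2, p, p, batch 1, p (one line each)
  }
}
```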
1
votes

If you're not in pursuit of functional Scala enlightenment, I'd recommend a more imperative style using java.util.Scanner#findWithinHorizon. My example is quite naive, iterating through the input twice.

import java.util.Scanner
import java.io.BufferedWriter

val scanner = new Scanner(inFile)

val writer = new BufferedWriter(...)

def loop(): Unit = {
  // you might want to limit the horizon to prevent OutOfMemoryError;
  // (?s) lets `.` match newlines so a match can span a whole batch
  Option(scanner.findWithinHorizon("(?s).*?YOUR-BATCH-MARKER", 0)) match {
    case Some(batch) =>
      val pageCount = countPages(batch)
      writePageCount(writer, pageCount)
      writer.write(batch)        
      loop()

    case None =>
  }
}

loop()
scanner.close()
writer.close()
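A runnable sketch of that structure, with hypothetical %%Batch / %%Page marker lines and a trivial countPages standing in for the real helpers, and a String instead of a file for demonstration:

```scala
import java.util.Scanner

object ScannerSketch {
  // Hypothetical helpers: %%Batch ends a batch, %%Page marks a page.
  def countPages(batch: String): Int = "%%Page\n".r.findAllIn(batch).length

  def rewrite(in: String): String = {
    val scanner = new Scanner(in)
    val out = new StringBuilder
    def loop(): Unit =
      // (?s) makes `.` match newlines; `.*?` stops at the first marker
      Option(scanner.findWithinHorizon("(?s).*?%%Batch\n", 0)) match {
        case Some(batch) =>
          out ++= s"%%PageCount ${countPages(batch)}\n" // write count, then batch
          out ++= batch
          loop()
        case None =>
      }
    loop()
    scanner.close()
    out.toString
  }

  def main(args: Array[String]): Unit =
    print(rewrite("%%Page\n%%Page\n%%Batch\n%%Page\n%%Batch\n"))
}
```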
0
votes

Maybe you can use span and duplicate effectively. Assuming the iterator is positioned at the start of a batch, you take the span before the next batch and duplicate it so that you can count the pages, write the modified batch line, then write the pages using the duplicated iterator. Then process the next batch recursively...

def batch(i: Iterator[String]): Unit = {
  if (i.hasNext) {
    assert(i.next() == "batch")
    val (current, next) = i.span(_ != "batch")
    val (forCounting, forWriting) = current.duplicate
    val count = forCounting.filter(_ == "p").size
    println("batch " + count)
    forWriting.foreach(println)
    batch(next)
  }
}

Assuming the following input:

import scala.io.Source

val src = Source.fromString("head\nbatch\np\np\nbatch\np\nbatch\np\np\np\n")

You position the iterator at the start of batch and then you process the batches:

val (head, next) = src.getLines.span(_ != "batch")
head.foreach(println)
batch(next)

This prints:

head
batch 2
p
p
batch 1
p
batch 3
p
p
p