5
votes

I am building integration tests that read the data generated by the previous test cases and check it against the expected result. When I run the tests, the generated data is not visible in the directory during the subsequent test cases, although it resides there. When I re-run the tests, the data is picked up and read from the directory. What could be the reason for this? Could there be a problem with the sequence of test execution?

Here is what my tests look like:

class LoaderSpec extends Specification {

  sequential

  "Loader" should {
    "run job from assembled .jar" in {
      val res = "sh ./src/test/runLoader.sh".!
      res must beEqualTo(0)
    }

    "write results to the resources" in {
      val resultsPath = "/results/loader_result"
      resourcesDirectoryIsEmpty(resultsPath) must beFalse
    }

    "have actual result same as expected one" in {
      val expected: Set[String] = readFilesFromDirs("source/loader_source")
      println(expected)

      val result: Set[String] = readFilesFromDirs("/results/loader_result")
      println(result)

      expected must beEqualTo(result)
    }
  }
}

The first test succeeds and the next two tests fail because the data is not found. When I re-run the same test suite without any changes, all the tests succeed.

The runLoader.sh script:

$SPARK_HOME/bin/spark-submit \
 --class "loader.LoaderMain" \
 \
 --conf "spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem" \
 --conf "spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS" \
 --conf "spark.hadoop.fs.gs.project.id=loader-files" \
 --conf "spark.hadoop.fs.gs.implicit.dir.repair.enable=false" \
 \
 --conf "spark.loader.Config.srcPaths=;src/test/resources/source/loader" \
 --conf "spark.loader.Config.dstPath=src/test/resources/results/loader_result" \
 --conf "spark.loader.Config.filesPerPartner=10" \
 \
 --conf "spark.shuffle.memoryFraction=0.4" \
 --conf "spark.task.maxFailures=20" \
 --conf "spark.executor.extraJavaOptions=${EXTRA_JVM_FLAGS}" \
 \
 --master "local[8]" \
 --driver-memory 1500M \
 --driver-java-options "${EXTRA_JVM_FLAGS}" \
 $(find "$(pwd)"/target/scala-2.11 -name 'loader-assembly-*.jar')
Can you post the contents of runLoader.sh? - crenshaw-dev
This might be more of a Spark question. It feels as though spark-submit is exiting before the work in loader.LoaderMain is complete. But from what I've read, Spark should wait for the class to finish running before exiting. - crenshaw-dev

2 Answers

3
votes

I tried changing the way I read the files. It turns out that reading from resources can produce this error, since the resource contents are resolved before all the tests run. When I read the data directly from the directory instead, the contents are up to date and the error does not occur. This is how I changed the test:

"write results to the resources" in {
  val resultsPath = "./src/dockerise/resource/results/loader_result"
  resourcesDirectoryIsEmpty(resultsPath) must beFalse
}
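For contrast, here is a minimal sketch of reading files directly from the filesystem with the plain `java.io` API (the helper name and structure are my own, not the asker's actual `readFilesFromDirs`). Unlike classpath resources, each call reads fresh from disk, so files written by an earlier test in the same JVM are always visible:

```scala
import java.io.File
import scala.io.Source

object DirReader {
  // Hypothetical helper: collect every regular file under `dir`
  // (recursively) and return the set of all lines across them.
  // Reads from disk on every call, so newly written results are seen.
  def readFilesFromDir(dir: String): Set[String] = {
    def walk(f: File): Seq[File] =
      if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.flatMap(walk)
      else if (f.isFile) Seq(f)
      else Seq.empty

    walk(new File(dir)).flatMap { file =>
      val src = Source.fromFile(file)
      try src.getLines().toList finally src.close()
    }.toSet
  }
}
```

A nonexistent or empty directory simply yields an empty set, which keeps the first assertion (`directory is not empty`) meaningful.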
1
vote

I have two ideas that might help you solve your issue:

  1. Use the `eventually` matcher to retry the assertion multiple times. Make sure you set a reasonable timeout and number of retries; otherwise your test might become flaky.

  2. Use `!!` instead of `!` to capture the console output, which might give you insight into possible async tasks being performed after your spark-submit run.
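The first idea can be sketched in plain Scala as a retry loop (specs2's `eventually` matcher wraps essentially this pattern; the helper name here is hypothetical):

```scala
import scala.annotation.tailrec

object Retry {
  // Hypothetical retry helper: re-evaluates `condition` up to
  // `retries` additional times, sleeping `delayMs` between attempts,
  // and returns true as soon as the condition holds.
  def eventuallyTrue(retries: Int, delayMs: Long)(condition: => Boolean): Boolean = {
    @tailrec
    def loop(left: Int): Boolean =
      if (condition) true
      else if (left <= 0) false
      else { Thread.sleep(delayMs); loop(left - 1) }
    loop(retries)
  }
}
```

Bounding both the retry count and the delay is what keeps such an assertion from hanging forever while still tolerating files that appear slightly after `spark-submit` returns.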

Hope it helps!