Check equality for two Spark DataFrames in Scala

Question

I'm new to Scala and am having problems writing unit tests.

I'm trying to compare and check equality for two Spark DataFrames in Scala for unit testing, and realized that there is no easy way to check equality for two Spark DataFrames.

The C++ equivalent code would be (assuming that the DataFrames are represented as two-dimensional arrays in C++):

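    // Element-wise comparison of two fixed-size 10x2 tables.
    bool tablesEqual(const int expected[10][2], const int result[10][2]) {
        for (int row = 0; row < 10; row++) {
            for (int col = 0; col < 2; col++) {
                if (expected[row][col] != result[row][col]) return false;
            }
        }
        return true;
    }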

The actual test would involve testing for equality based on the data types of the columns of the DataFrames (testing with precision tolerance for floats, etc).

There does not seem to be an easy way to loop over all the elements of a DataFrame in Scala. The other solutions for checking equality of two DataFrames, such as df1.except(df2), do not work in my case, because I need to support testing equality with a tolerance for floats and doubles.

Of course, I could try to round all the elements beforehand and compare the results afterwards, but I would like to see if there are any other solutions that would allow me to iterate through the DataFrames to check for equality.

How big are your dataframes ? If they are not so big, you could sort/collect them and then easily compare them. — cheseaux
Since those are unit-test data frames, those should be quite small. Just collect them into a List and compare. — sarveshseri
Yeah, my test currently collects the data frames into a list and compares them, but I was hoping to create tools that could also test on bigger data frames as well. I'm guessing that there's no easy way of accomplishing this? — codeinstyle
*** Asked 3 years, 4 months ago Active 5 months ago Viewed 7k times --- YET still no Answer accepted ... — Yordan Georgiev

Yordan Georgiev · Accepted Answer · 2017-11-14T08:58:11

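    import org.scalatest.{BeforeAndAfterAll, FeatureSpec, Matchers}

    // Inside a spec that mixes in Matchers:
    // compares the collected rows, ignoring row order.
    outDf.collect() should contain theSameElementsAs (dfComparable.collect())

    // or (note: order matters!)
    // except only returns rows of outDf that are missing from dfComparable
    // outDf.except(dfComparable).toDF().count should be(0)
    outDf.except(dfComparable).count should be(0)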
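
Neither check above handles the tolerance for floats and doubles that the question asks for. A minimal sketch of one way to do that on collected rows is below. ApproxRows, cellsEqual, and rowsEqual are hypothetical names, not part of Spark or ScalaTest, and the absolute tolerance is an assumption; a relative tolerance may suit some data better.

```scala
// Hypothetical helper for illustration; not part of Spark or ScalaTest.
object ApproxRows {
  // Compare two cell values. Double and Float cells are equal when they are
  // identical, both NaN, or within the absolute tolerance `tol`.
  // Every other type (including null) falls back to plain equality.
  def cellsEqual(a: Any, b: Any, tol: Double): Boolean = (a, b) match {
    case (x: Double, y: Double) =>
      x == y || (x.isNaN && y.isNaN) || math.abs(x - y) <= tol
    case (x: Float, y: Float) =>
      x == y || (x.isNaN && y.isNaN) || math.abs(x - y) <= tol
    case _ => a == b
  }

  // Compare two row collections position by position (order-sensitive).
  def rowsEqual(expected: Seq[Seq[Any]], actual: Seq[Seq[Any]], tol: Double): Boolean =
    expected.length == actual.length &&
      expected.zip(actual).forall { case (e, r) =>
        e.length == r.length &&
          e.zip(r).forall { case (x, y) => cellsEqual(x, y, tol) }
      }
}
```

With Spark, the inputs could be built with df.collect().map(_.toSeq).toSeq. Because the comparison is positional, both DataFrames would need to be sorted on the same columns before collecting. Like the collect-based checks above, this only suits DataFrames small enough to fit on the driver.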
