0 votes

I am working on a cluster with a master and 2 slave nodes.

I run:

spark-submit --class 'PropertyTables' --master spark://172.17.67.122:7077 /etc/rdfbenchmarkingproject_2.12-0.1.jar

It fails with the error:

org.apache.spark.SparkException: Could not execute broadcast in 300 secs


In the PropertyTables class, I am loading 3 CSV files (1 GB, 1 GB, and 100 MB) and running the following JOIN query on them (a rough sketch of this setup follows the query):

SELECT DISTINCT
    D.title AS title
FROM
    Publication P
    JOIN Document D  ON D.document=P.publication
    JOIN Reference R ON P.publication=R.cited
WHERE
    P.publication NOT IN (
        SELECT cited
        FROM Reference R2
        WHERE R2.document NOT IN (
            SELECT cited FROM Reference R3
        )
    ) 
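
Roughly, the class does the following (a simplified sketch only; the file paths and CSV read options below are placeholders, not the real values):

import org.apache.spark.sql.SparkSession

object PropertyTables {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PropertyTables").getOrCreate()

    // Register each CSV file as a temporary view so the tables can be joined by name.
    Seq("Publication", "Document", "Reference").foreach { name =>
      spark.read
        .option("header", "true")          // placeholder read option
        .csv(s"/data/$name.csv")           // placeholder path
        .createOrReplaceTempView(name)
    }

    // The join query shown above; .show() forces execution.
    spark.sql("""
      SELECT DISTINCT D.title AS title
      FROM Publication P
      JOIN Document D  ON D.document = P.publication
      JOIN Reference R ON P.publication = R.cited
      WHERE P.publication NOT IN (
          SELECT cited FROM Reference R2
          WHERE R2.document NOT IN (SELECT cited FROM Reference R3)
      )
    """).show()

    spark.stop()
  }
}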

I have already tried the previously proposed solutions:

  • persist the 3 tables.

Result: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]

  • add --conf spark.sql.autoBroadcastJoinThreshold=-1

Result: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]

  • add --conf spark.sql.broadcastTimeout=7200 (both settings can also be applied in code; see the sketch after this list)

Result: java.util.concurrent.TimeoutException: Futures timed out after [7200 seconds]
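
As far as I understand, the same two settings can also be applied when the SparkSession is built, instead of via --conf; a rough sketch (assuming the session is created inside the class):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PropertyTables")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1") // disable automatic broadcast joins
  .config("spark.sql.broadcastTimeout", "7200")         // broadcast timeout in seconds
  .getOrCreate()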

Could someone help, please?


1 Answer

0 votes

You can try replacing NOT IN with NOT EXISTS in the SQL. In Spark, a NOT IN subquery is typically planned as a null-aware anti join that requires a broadcast, while NOT EXISTS can be executed as a regular left anti join, so it usually reduces the execution time.
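
For example, the query could be rewritten along these lines (a sketch based on the tables in the question; note that NOT IN and NOT EXISTS treat NULLs in cited differently, so verify that column has no NULLs):

SELECT DISTINCT
    D.title AS title
FROM
    Publication P
    JOIN Document D  ON D.document=P.publication
    JOIN Reference R ON P.publication=R.cited
WHERE
    NOT EXISTS (
        SELECT 1
        FROM Reference R2
        WHERE R2.cited = P.publication
          AND NOT EXISTS (
              SELECT 1
              FROM Reference R3
              WHERE R3.cited = R2.document
          )
    )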