I am working on a cluster with one master and two slave nodes.
I run:

spark-submit --class 'PropertyTables' --master spark://172.17.67.122:7077 /etc/rdfbenchmarkingproject_2.12-0.1.jar

and the job fails with:

org.apache.spark.SparkException: Could not execute broadcast in 300 secs

In the PropertyTables class I load three CSV files (1 GB, 1 GB, and 100 MB) and run the following JOIN query on them (a simplified sketch of the class follows the query):
SELECT DISTINCT
    D.title AS title
FROM
    Publication P
    JOIN Document D ON D.document = P.publication
    JOIN Reference R ON P.publication = R.cited
WHERE
    P.publication NOT IN (
        SELECT cited
        FROM Reference R2
        WHERE R2.document NOT IN (
            SELECT cited FROM Reference R3
        )
    )
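For context, the relevant part of PropertyTables looks roughly like this (a simplified sketch; the file paths and the CSV read options here are placeholders, not my real ones):

import org.apache.spark.sql.SparkSession

object PropertyTables {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PropertyTables")
      .getOrCreate()

    // Register each CSV as a temporary view so it can be referenced in the SQL query.
    // Paths are placeholders for the real input files (1 GB, 1 GB, 100 MB).
    Seq("Publication" -> "/data/publication.csv",
        "Document"    -> "/data/document.csv",
        "Reference"   -> "/data/reference.csv").foreach { case (name, path) =>
      spark.read
        .option("header", "true")
        .csv(path)
        .createOrReplaceTempView(name)
    }

    // Run the query shown above.
    val result = spark.sql("""
      SELECT DISTINCT D.title AS title
      FROM Publication P
      JOIN Document D ON D.document = P.publication
      JOIN Reference R ON P.publication = R.cited
      WHERE P.publication NOT IN (
        SELECT cited FROM Reference R2
        WHERE R2.document NOT IN (SELECT cited FROM Reference R3)
      )
    """)

    result.show()
    spark.stop()
  }
}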
I have already tried the solutions usually proposed for this error:
- persisting the 3 tables.
  Result: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
- adding --conf spark.sql.autoBroadcastJoinThreshold=-1
  Result: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
- adding --conf spark.sql.broadcastTimeout=7200
  Result: java.util.concurrent.TimeoutException: Futures timed out after [7200 seconds]
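In case it matters where the options are applied, this is roughly the equivalent of those two --conf flags set from inside the job (a sketch only; I actually passed them on the spark-submit command line):

// Same SQL settings applied at runtime on the SparkSession (sketch).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // disable automatic broadcast joins
spark.conf.set("spark.sql.broadcastTimeout", "7200")         // raise the broadcast timeout to 2 hours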
Could someone help, please?