
I'm trying to use Structured Streaming in Spark against a local Kafka topic.

First I start zookeeper and kafka:

write-host -foregroundcolor green "starting zookeeper..."
start "$KAFKA_ROOT\bin\windows\zookeeper-server-start.bat" "$KAFKA_ROOT\config\zookeeper.properties"

write-host -foregroundcolor green "starting kafka..."
start "$KAFKA_ROOT\bin\windows\kafka-server-start.bat" "$KAFKA_ROOT\config\server.properties"

Then I start the shell like so:

& "$SPARK_ROOT\bin\spark-shell.cmd" --packages "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1"

Then I execute this Scala command:

val ds = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "test").load()

This should just work; however, I get this error:

org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-;
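For what it's worth, the permission string in the error maps to an octal mode like this (a plain Scala sketch of the rwx-to-octal mapping, nothing Spark-specific):

```scala
// Sketch: map an rwx permission string (user/group/other triads) to the
// octal mode that chmod-style tools take. "rw-rw-rw-" is 666, i.e. the
// execute bits are missing; "rwxrwxrwx" is 777.
def toOctal(perm: String): String =
  perm.grouped(3).map { triad =>
    triad.zip(Seq(4, 2, 1)).collect { case (c, v) if c != '-' => v }.sum
  }.mkString

println(toOctal("rw-rw-rw-"))  // 666 -- the mode from the error message
println(toOctal("rwxrwxrwx"))  // 777 -- the mode after chmod 777
```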

Every search result says something about using winutils to set permissions, so I tried those answers; this is the output:

C:\>winutils chmod 777 \tmp\hive

C:\>winutils chmod 777 C:\tmp\hive

C:\>winutils ls C:\tmp\hive
drwxrwxrwx 1 DOMAIN\user DOMAIN\Domain Users 0 Jun 21 2018 C:\tmp\hive

Looks good, but the same exception still occurs.

%HADOOP_HOME% is correctly set to D:\dependencies\hadoop and D:\dependencies\hadoop\bin\winutils.exe exists.

What am I missing here? I've gone through more than a dozen posts, but their solutions aren't working for me, and I don't know how to debug it further.


2 Answers


So after pulling my hair out for two days, of course it was something simple. If you are calling C:\spark\bin\spark-shell from a working directory on another drive (e.g. D:), then the permissions you actually need to update are on that drive's \tmp\hive:

C:\Users\user>winutils ls D:\tmp\hive
d--------- 1 DOMAIN\user DOMAIN\Domain Users 0 Jun 25 2018 D:\tmp\hive

C:\Users\user>winutils chmod -R 777 D:\tmp\hive

C:\Users\user>winutils ls D:\tmp\hive
drwxrwxrwx 1 DOMAIN\user DOMAIN\Domain Users 0 Jun 25 2018 D:\tmp\hive

I could find no command, no config setting, and no page in the web UI's environment view that would show what the current Hive scratch directory is.
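The effect can be sketched with plain string handling (a hypothetical helper, not a Spark or Hadoop API): a scratch path with no drive letter, like /tmp/hive, is resolved against whatever drive the shell's working directory is on.

```scala
// Hypothetical illustration of why the scratch dir lands on the
// spark-shell's working drive: prepend the working directory's drive
// letter to the drive-less /tmp/hive path.
def resolveScratchDir(workingDir: String, scratchDir: String = "/tmp/hive"): String =
  workingDir.take(2) + scratchDir.replace('/', '\\')

println(resolveScratchDir("D:\\projects\\demo"))  // D:\tmp\hive
println(resolveScratchDir("C:\\Users\\user"))     // C:\tmp\hive
```

So if spark-shell was launched from D:, it is D:\tmp\hive that needs chmod 777, regardless of where Spark itself is installed.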


You need to set the expected access mode on the HDFS directory, not on the directory on the local file system.

You would need to use the hadoop fs -chmod ... command for that. Also, do not forget to check that the user under which your Spark application is launched has the ability to write to /tmp/hive, either explicitly or by being in a group that is allowed to write to this directory.
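As a quick local sanity check (a plain JVM call against the local file system, not an HDFS permission check), you can ask whether the current user can write to a given directory before digging into Spark itself:

```scala
import java.nio.file.{Files, Path, Paths}

// Report whether a path is a directory the current JVM user can write to.
def isWritableDir(p: Path): Boolean =
  Files.isDirectory(p) && Files.isWritable(p)

// A fresh temp dir we just created is writable by us.
val tmp = Files.createTempDirectory("hive-scratch-check")
println(isWritableDir(tmp))                      // true
// The actual scratch dir on your machine (may not exist yet):
println(isWritableDir(Paths.get("/tmp/hive")))
```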

You may refer to the official documentation on HDFS file permissions.

Update:

So, if you bumped into the same issue, you need to use winutils as mentioned in the original post or in other similar questions, but the directory in question may not be located on drive C:, and you need to adjust the path to the temporary directory with the correct drive letter.