
I am running Spark 1.6.1 with Python 2.7 on Windows 7. The root scratch dir: /tmp/hive on HDFS is writable and my current permissions are: rwxrwxrwx (using winutils tools).

I want to stream files from a directory. According to the doc, the function textFileStream(directory):

Create an input stream that monitors a Hadoop-compatible file system for new files and reads them as text files. Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored.

When I launch Spark Streaming command:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="FileStreamWordCount")
ssc = StreamingContext(sc, 1)  # 1-second batch interval

lines = ssc.textFileStream(r"C:/tmp/hive/")
counts = lines.flatMap(lambda line: line.split(" "))\
              .map(lambda x: (x, 1))\
              .reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start()
ssc.awaitTermination()

and then create the files to stream in my directory, nothing happens.
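Note that, per the documentation quoted above, files must appear in the monitored directory by being *moved* there, not created in place. A minimal sketch of that pattern (the helper name `drop_file_atomically` is my own, not part of Spark):

```python
import os
import shutil
import tempfile

def drop_file_atomically(text, monitored_dir):
    """Write text to a staging file outside the monitored directory,
    then move it in, so the stream sees it appear atomically."""
    fd, staging_path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    dest = os.path.join(monitored_dir, os.path.basename(staging_path))
    # shutil.move is a rename (atomic) when source and destination
    # are on the same file system
    shutil.move(staging_path, dest)
    return dest
```

Creating or editing a file directly inside the monitored directory may not trigger the stream, which could explain the behaviour described here.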

I also tried this:

lines = ssc.textFileStream("/tmp/hive/")

and

lines = ssc.textFileStream("hdfs://tmp/hive/")

which is HDFS path related, but nothing happens again.

Do I do something wrong?

1 Answer


Try using "file:///C:/tmp/hive" as the directory on Windows. This worked for me on Windows 8 with Spark 1.6.3, although I had to fiddle a bit with the file name and content before I made it work. I also tried other paths, so I can confirm it works the same way with paths that are not related in any way to winutils, e.g. you can use "file:///D:/someotherpath" if your data is there.
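If you want to build such a URI from a Windows path programmatically rather than hard-coding it, `pathlib` can do it (one way among several; the helper name is mine):

```python
from pathlib import PureWindowsPath

def windows_path_to_uri(path):
    """Convert a Windows path like C:/tmp/hive to a file:/// URI
    of the form Spark's textFileStream accepts."""
    return PureWindowsPath(path).as_uri()

# windows_path_to_uri("C:/tmp/hive") -> "file:///C:/tmp/hive"
```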

It is not straightforward, though: I had a file in the monitored directory and made a few content and file-name changes before it got picked up, and then at some point it stopped reacting to my changes altogether, so the results are not consistent. I guess it's a Windows thing.

I know it works, so every time I try I have to be patient and make a few name changes before the file gets picked up. That is only good for proving it works, obviously not good for anything else.

One other thing I was doing was using Unix line endings (LF) instead of Windows line endings (CRLF) in the files, but I cannot assert that this is required.
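If you want to rule that out, you can force LF-only line endings when writing the file from Python, regardless of the OS, by opening the file in binary mode (this works on Python 2.7, as used in the question, as well as Python 3; the helper name is mine):

```python
def write_with_unix_eol(path, lines):
    """Write lines joined by LF only; binary mode avoids the
    platform's automatic CRLF newline translation."""
    with open(path, "wb") as f:
        f.write("\n".join(lines).encode("utf-8") + b"\n")
```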