How to make good reproducible Apache Spark examples

Question

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE but sometimes getting them to show some sample input/output data is like pulling teeth.

Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.

So how does one go about creating a good, reproducible example?

I think this can be generalized to Spark Dataframe. What do you think? — Alper t. Turker
Yes, that makes sense. I made it python specific because that's what I know, but I like the idea of adding other language examples. What do you think is the best way? Add another answer or edit the existing one with examples for each language? — pault
API is very uniform so single answer is probably enough. Let's keep it DRY :) — Alper t. Turker
Good question ! I like the answers. I'm adding it to the tag doc but I'll change the title, because it's not just for pyspark :) — eliasah
[to be removed:] have created a feature request @ Meta for the pyspark & sparkr tags to trigger automatically syntax highlighting for the respective languages: meta.stackoverflow.com/questions/362624/… - upvotes most welcome — desertnaut

pault pault · Accepted Answer · 2018-01-24T16:24:19

###Provide small sample data, that can be easily recreated. At the very least, posters should provide a couple of rows and columns on their dataframe and code that can be used to easily create it. By easy, I mean cut and paste. Make it as small as possible to demonstrate your problem.

I have the following dataframe:

+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
|    3|  5|    A|2017-01-03|
|    4|  7|    B|2017-01-04|
+-----+---+-----+----------+

which can be created with this code:

df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)

###Show the desired output. Ask your specific question and show us your desired output.

How can I create a new column 'is_divisible' that has the value 'yes' if the day of month of the 'date' plus 7 days is divisible by the value in column'X', and 'no' otherwise?

Desired output:

+-----+---+-----+----------+------------+
|index|  X|label|      date|is_divisible|
+-----+---+-----+----------+------------+
|    1|  1|    A|2017-01-01|         yes|
|    2|  3|    B|2017-01-02|         yes|
|    3|  5|    A|2017-01-03|         yes|
|    4|  7|    B|2017-01-04|          no|
+-----+---+-----+----------+------------+

###Explain how to get your output. Explain, in great detail, how you get your desired output. It helps to show an example calculation.

For instance in row 1, the X = 1 and date = 2017-01-01. Adding 7 days to date yields 2017-01-08. The day of the month is 8 and since 8 is divisible by 1, the answer is 'yes'.

Likewise, for the last row X = 7 and the date = 2017-01-04. Adding 7 to the date yields 11 as the day of the month. Since 11 % 7 is not 0, the answer is 'no'.

###Share your existing code. Show us what you have done or tried, including all* of the code even if it does not work. Tell us where you are getting stuck and if you receive an error, please include the error message.

(*You can leave out the code to create the spark context, but you should include all imports.)

I know how to add a new column that is date plus 7 days but I'm having trouble getting the day of the month as an integer.

from pyspark.sql import functions as f
df.withColumn("next_week", f.date_add("date", 7))

###Include versions, imports, and use syntax highlighting

Full details in this answer written by desertnaut.

###For performance tuning posts, include the execution plan

Full details in this answer written by Alper t. Turker.
It helps to use standardized names for contexts.

###Parsing spark output files

MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.

###Other notes.

Be sure to read how to ask and How to create a Minimal, Complete, and Verifiable example first.
Read the other answers to this question, which are linked above.
Have a good, descriptive title.
Be polite. People on SO are volunteers, so ask nicely.

How to make good reproducible Apache Spark examples

4 Answers

Performance tuning

Execution Plan

Mode and cluster information

Timing information

Use standarized names for contexts

Provide type information (Scala)

Include your Spark version

Include all your imports

Include code highlighting