I'm using PySpark on my local machine. I have a Spark DataFrame with 4.5 million rows covering approximately 30,000 different stocks, and I need to calculate the percentage change in price for each stock over time. I've already run orderBy so that all of the rows for each stock are grouped together (as shown in the example below).
A simplified example DataFrame is below.
df = spark.read.csv("stock_price.txt", header=True, inferSchema=True)
df.show()
| Company   | Price |
|-----------|-------|
| Company_A | 100   |
| Company_A | 103   |
| Company_A | 105   |
| Company_A | 107   |
| Company_B | 23    |
| Company_B | 25    |
| Company_B | 28    |
| Company_B | 30    |
My desired output would be something like this:
| Company   | Price | %_Change |
|-----------|-------|----------|
| Company_A | 100   | 0        |
| Company_A | 103   | 3%       |
| Company_A | 105   | 2%       |
| Company_A | 107   | 2%       |
| Company_B | 23    | 0        |
| Company_B | 25    | 9%       |
| Company_B | 28    | 12%      |
| Company_B | 30    | 7%       |
The trick (in my opinion) is writing code that can do two things: 1) identify each time a new stock starts, and 2) begin calculating the percentage change at the second observation for that stock and keep calculating it through the last observation. It has to start at the second observation, since there can't be a percentage change until a second observation exists.
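Conceptually, something like the untested sketch below is the direction I'm thinking: a Window partitioned by Company, with lag to pull the previous price. The `row_id` column built from `monotonically_increasing_id()` is only a stand-in, since the window needs some column to order by within each stock and my example above doesn't have one.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Stand-in ordering column: the window below needs something to order by
# within each stock. monotonically_increasing_id() only approximates the
# current row order; a real date/sequence column would be used instead.
df = df.withColumn("row_id", F.monotonically_increasing_id())

# One window per stock, ordered within that stock
w = Window.partitionBy("Company").orderBy("row_id")

# Previous price for the same company; null on a stock's first observation
prev_price = F.lag("Price").over(w)

df_pct = df.withColumn(
    "%_Change",
    F.when(prev_price.isNull(), F.lit(0))  # first observation of each stock -> 0
     .otherwise(F.round((F.col("Price") - prev_price) / prev_price * 100))
)

df_pct.show()
```

The `F.round()` is only there to match the whole-number percentages in the desired output. I'm also not sure whether relying on `monotonically_increasing_id()` to preserve the current ordering is safe, which is part of what I'm asking.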