Plot best fit line with plotly

Question

I am using Plotly's Python library to plot a scatter graph of time series data. Example data:

2015-11-11    1
2015-11-12    2
2015-11-14    4
2015-11-15    2
2015-11-21    3
2015-11-22    2
2015-11-23    3

Code in Python:

df = pandas.read_csv('~/Data.csv', parse_dates=["date"], header=0)
df = df.sort_values(by=['date'], ascending=[True])
trace = go.Scatter(
            x=df['date'],
            y=df['score'],
            mode='markers'
)
fig.append_trace(trace, 2, 2)  # It is a subplot
iplot(fig)

Once the scatter plot is plotted, I want to plot a best fit line over this.

Does plotly provide this programmatically? It does from the webapp, but I did not find any documentation about how to do it programmatically. The line in the link is exactly what I want:

I found a description of how to add a Gaussian fit to a histogram in chapter 4 of the Python API user guide. Maybe it can help you. nbviewer.ipython.org/github/plotly/python-user-guide/blob/… — LaPriWa
I don't think plotly does but the cufflinks extension allows it with df.iplot(bestfit=True). — jmz

vestland · Accepted Answer · 2019-12-04T12:20:56

Your provided code snippet is missing a fig definition. I prefer using plotly.graph_objs, but with the setup below you can choose to show your figures using either fig.show() or iplot(fig). You won't be able to just pass an argument and get a best fit line automatically, but you can certainly get one programmatically. You just need to add a couple of lines to your original setup and you're good to go.

Plot:

Complete code with sample data:

import pandas as pd
import datetime
import statsmodels.api as sm
import plotly.graph_objs as go
from plotly.offline import iplot

# sample data
df=pd.DataFrame({'date': {0: '2015-11-11',
                      1: '2015-11-12',
                      2: '2015-11-14',
                      3: '2015-11-15',
                      4: '2015-11-21',
                      5: '2015-11-22',
                      6: '2015-11-23'},
                     'score': {0: 1, 1: 2, 2: 4, 3: 2, 4: 3, 5: 2, 6: 3}})

df = df.sort_values(by=['date'], ascending=[True])

# data for time series linear regression
df['timestamp']=pd.to_datetime(df['date'])
df['serialtime']=[(d-datetime.datetime(1970,1,1)).days for d in df['timestamp']]

x = sm.add_constant(df['serialtime'])
model = sm.OLS(df['score'], x).fit()
df['bestfit']=model.fittedvalues

# plotly setup
fig=go.Figure()

# source data
fig.add_trace(go.Scatter(x=df['date'],
                         y=df['score'],
                         mode='markers',
                         name = 'score')
             )

# regression data
fig.add_trace(go.Scatter(x=df['date'],
                         y=df['bestfit'],
                         mode='lines',
                         name='best fit',
                         line=dict(color='firebrick', width=2)
                        ))

iplot(fig)
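
If statsmodels is not available, the fitted values can also be computed with NumPy. This is a minimal sketch of an alternative, not the method used in the code above; it assumes the same sample data and swaps sm.OLS for np.polyfit with degree 1:

```python
import numpy as np
import pandas as pd

# same sample data as in the answer
df = pd.DataFrame({
    'date': ['2015-11-11', '2015-11-12', '2015-11-14', '2015-11-15',
             '2015-11-21', '2015-11-22', '2015-11-23'],
    'score': [1, 2, 4, 2, 3, 2, 3],
})

# days since 1970-01-01, so gaps between dates are preserved
df['timestamp'] = pd.to_datetime(df['date'])
df['serialtime'] = (df['timestamp'] - pd.Timestamp('1970-01-01')).dt.days

# degree-1 polynomial fit: polyfit returns the highest power first
slope, intercept = np.polyfit(df['serialtime'], df['score'], deg=1)
df['bestfit'] = intercept + slope * df['serialtime']

print(df[['date', 'score', 'bestfit']])
```

The resulting df['bestfit'] column can be passed as y to the second go.Scatter trace exactly as in the code above.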

Some details:

Time series often present certain issues for linear OLS estimation. The format of the dates themselves can be challenging, so in this case it would be tempting to use the index of your dataframe as the independent variable. But since your dates are not contiguous (there are gaps between them), simply replacing them with a continuous integer series would result in erroneous regression coefficients. I often find it best to use a serialized integer array to represent time series data, meaning that each date is represented by an integer that is the count of days from some epoch, in this case 1 January 1970.

And that's exactly what I'm doing here:

df['timestamp'] = pd.to_datetime(df['date'])
df['serialtime'] = [(d - datetime.datetime(1970, 1, 1)).days for d in df['timestamp']]

Here's a plot that illustrates the effects on your OLS estimates by using the wrong data:
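
The same effect can be checked numerically. The following is a minimal sketch that is not part of the original answer; it uses only the standard library, and the helper ols_slope is a hypothetical name introduced for illustration. It fits the sample data once against days since the epoch and once against the plain row index:

```python
from datetime import date


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x, using mean-centred sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


dates = ['2015-11-11', '2015-11-12', '2015-11-14', '2015-11-15',
         '2015-11-21', '2015-11-22', '2015-11-23']
scores = [1, 2, 4, 2, 3, 2, 3]

# days since 1970-01-01: gaps between observations are preserved
epoch = date(1970, 1, 1)
serial_days = [(date.fromisoformat(d) - epoch).days for d in dates]

# row index: every observation is treated as one step apart
row_index = list(range(len(dates)))

slope_serial = ols_slope(serial_days, scores)
slope_index = ols_slope(row_index, scores)

print(f"slope per day (serial time): {slope_serial:.4f}")
print(f"slope per row (index):       {slope_index:.4f}")
```

With this data the serial-time slope is about 0.069 per day, while the index-based slope is about 0.179 per row. The two are not comparable, because the index treats the six-day gap between 2015-11-15 and 2015-11-21 as a single step.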
