I have the following set of data (a pandas.DataFrame) that I would like to fit with scipy.interpolate.UnivariateSpline. Let's call the data data.

Date
2018-04-02 09:00:00     16249
2018-04-02 10:00:00     45473
2018-04-02 11:00:00     32050
2018-04-02 12:00:00     35898
2018-04-02 13:00:00     21577
2018-04-02 14:00:00     30545
2018-04-02 15:00:00     60925
2018-04-02 16:00:00     47124
2018-04-03 09:00:00     18534
2018-04-03 10:00:00     36064
2018-04-03 11:00:00     32387
2018-04-03 12:00:00     15903
2018-04-03 13:00:00     22291
2018-04-03 14:00:00     26367
2018-04-03 15:00:00     66269
2018-04-03 16:00:00     38478
2018-04-04 09:00:00     15803
2018-04-04 10:00:00     22511
2018-04-04 11:00:00     33123
2018-04-04 12:00:00     21000
2018-04-04 13:00:00     23132
2018-04-04 14:00:00     39270
2018-04-04 15:00:00    102544
2018-04-04 16:00:00    143421
2018-04-04 17:00:00       200
2018-04-05 09:00:00     23377
2018-04-05 10:00:00     52089
2018-04-05 11:00:00     99298
2018-04-05 12:00:00     24627
2018-04-05 13:00:00     33467
2018-04-05 14:00:00     26498
2018-04-05 15:00:00    114794
2018-04-05 16:00:00     44904
2018-04-06 09:00:00     12180
2018-04-06 10:00:00     41658
2018-04-06 11:00:00     64066
2018-04-06 12:00:00     12517
2018-04-06 13:00:00     12610
2018-04-06 14:00:00     43544
2018-04-06 15:00:00     65533
2018-04-06 16:00:00    123885
2018-04-09 09:00:00     13425
2018-04-09 10:00:00     38354
2018-04-09 11:00:00     59491
2018-04-09 12:00:00     21402
2018-04-09 13:00:00     24550
2018-04-09 14:00:00     25189
2018-04-09 15:00:00     67751
2018-04-09 16:00:00     16071
2018-04-10 09:00:00     35587
2018-04-10 10:00:00     58667
2018-04-10 11:00:00     41831
2018-04-10 12:00:00     35196
2018-04-10 13:00:00     22611
2018-04-10 14:00:00     23070
2018-04-10 15:00:00     40819
2018-04-10 16:00:00     20337
2018-04-11 09:00:00      7962
2018-04-11 10:00:00     23982
2018-04-11 11:00:00     21794
2018-04-11 12:00:00     16835
2018-04-11 13:00:00     16821
2018-04-11 14:00:00     13270
2018-04-11 15:00:00     34954
2018-04-11 16:00:00     15772
2018-04-12 09:00:00      8587
2018-04-12 10:00:00     47950
2018-04-12 11:00:00     24742
2018-04-12 12:00:00     16743
2018-04-12 13:00:00     21917
2018-04-12 14:00:00     43272
2018-04-12 15:00:00     50630
2018-04-12 16:00:00    104656
2018-04-13 09:00:00     15282
2018-04-13 10:00:00     30304
2018-04-13 11:00:00     65737
2018-04-13 12:00:00     17467
2018-04-13 13:00:00     10439
2018-04-13 14:00:00     19836
2018-04-13 15:00:00     52051
2018-04-13 16:00:00     99462

What I have done so far is:

import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate as interp

x = list(range(1, data.size + 1))  # x runs from 1 to data.size, one value per observation

spl = interp.UnivariateSpline(x, data.values, s=0.5)
xx = np.linspace(min(x), max(x), 1000)  # 1000 is an arbitrary number here.
plt.plot(x, data.values, 'bo')
plt.plot(xx, spl(xx), 'r')
plt.show()

# The plot is below. It looks very linear and not like a cubic spline at all, even though cubic is the default.

[Plot: the data points (blue dots) and the spline evaluated at xx (red line)]

When I evaluate spl at x instead, with everything else unchanged, that is:

plt.plot(x, spl(x), 'r')

I get the following:

The only difference is that the y axis tops out at 14,000, which seems to mean the previous plot showed some degree of curvature (or not?).

[Plot: the spline evaluated at x (red line)]

I am not sure what I am missing here, but I have apparently missed something. I am still very new to spline fitting in Python.

Can you tell me how to correctly fit a spline to my time series above?

EDIT

Following your comments, I wanted to add another plot to hopefully explain myself a bit better. I didn't really mean that it is linear, but I couldn't find a better word. To illustrate:

xxx = [10, 20, 40, 60, 80]
plt.plot(x, data.values, 'bo')
plt.plot(xxx, spl(xxx), 'r')  # evaluate the spline only at the five points in xxx

plt.show()

I think the plot below looks reasonably linear-ish, in my sense of the word. I am guessing my question should really be: how does scipy.interpolate.UnivariateSpline actually work?

Does it only show the plot for the values evaluated at the points we supplied (e.g. xxx for this plot)?

[Plot: the data points (blue dots) and the spline evaluated at xxx (red line)]

I was expecting a much smoother plot with clearly visible curvature. This question's answer shows the kind of plot I would expect; it looks more like what piecewise cubic functions would generate, whereas mine looks, to me and compared to that plot, linear-ish (or first order, if that is more appropriate).

Why do you say the plot looks linear? In the first plot, zoom in on the interval [20 <= x <= 25]. Does that really look linear to you? - Warren Weckesser
The spline fit looks correct and is performed correctly. I don't quite understand the problem with it. So when you say "I apparently missed something", what do you mean? What is wrong with the plot and how would you like it to look instead? - ImportanceOfBeingErnest
@WarrenWeckesser sorry for the inaccuracy. I didn't really intend to say 'linear'. I have added another plot to help explain my question. - stucash
@ImportanceOfBeingErnest thanks for your time. I was probably expecting the wrong thing from UnivariateSpline. I have added a link to a plot that is more like what I wanted. I typically use gam in R for spline fitting; the result looks correct and is what I expected as well. - stucash

1 Answer

The data set you have looks more like Rexthor, the dog-bearer than something that a smooth curve can follow. You don't have an issue with SciPy; you have an issue with data.

By increasing the parameter s you can get progressively smoother plots that deviate further and further from the data, eventually approaching the cubic polynomial that is the "best" least-squares fit for the data. But here "best" means "very bad, probably worthless". A smooth curve can be useful to display a pattern that the data already follows. If the data does not follow a smooth pattern, one should not draw a curve for the sake of drawing. The data points on the first plot should just be presented as is, without any connecting or approximating curves.
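
To make the effect of s concrete, here is a minimal sketch using only the first two days of readings from the question. The particular s values are arbitrary choices of mine, picked to span the range from "follow every point" to "ignore almost everything"; the final check compares the large-s spline with an ordinary cubic least-squares fit from np.polyfit.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# First two days of hourly readings, copied from the question
y = np.array([16249, 45473, 32050, 35898, 21577, 30545, 60925, 47124,
              18534, 36064, 32387, 15903, 22291, 26367, 66269, 38478],
             dtype=float)
x = np.arange(1, len(y) + 1, dtype=float)

# Arbitrary, increasingly large smoothing factors (illustrative only)
for s in (0.5, 1e8, 1e9, 1e12):
    spl = UnivariateSpline(x, y, s=s)
    rms = np.sqrt(np.mean((spl(x) - y) ** 2))
    print(f"s={s:g}: RMS deviation from the data = {rms:.1f}")

# With a very large s no interior knots are needed, and the result
# should coincide with the least-squares cubic polynomial.
smooth = UnivariateSpline(x, y, s=1e12)
cubic = np.polyval(np.polyfit(x, y, 3), x)
matches_cubic = np.allclose(smooth(x), cubic, rtol=1e-6)
print("large-s spline equals least-squares cubic:", matches_cubic)
```

The s=0.5 used in the question sits at the other extreme: the readings are in the tens of thousands, so a squared-deviation budget of 0.5 leaves the spline no room to smooth anything.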

The data comes from hourly readings taken from 9:00 to 16:00 (with one stray 17:00 value mixed in; throw it out). This structure matters. Do not pretend that Tuesday 9:00 is what happens one hour after Monday 16:00.

The data can be meaningfully summarized by daily totals

Day         Total
2018-04-02  289841
2018-04-03  256293
2018-04-04  401004
2018-04-05  419054
2018-04-06  375993
2018-04-09  266233
2018-04-10  278118
2018-04-11  151390
2018-04-12  318497
2018-04-13  310578

and by hourly averages (average number of events at 9:00, across all days, etc).

Hour        Average
9:00:00     16698.6
10:00:00    39705.2
11:00:00    47451.9
12:00:00    21758.8
13:00:00    20941.5
14:00:00    29086.1
15:00:00    65627
16:00:00    65411
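
Summaries like the two tables above can be computed with a pandas groupby. The sketch below assumes the readings are held in a Series with a DatetimeIndex; to stay short it rebuilds only two days from the question (2018-04-03 and 2018-04-04, the latter containing the stray 17:00 reading), so the variable names and the way data is constructed here are mine, not the asker's.

```python
import pandas as pd

# Readings for 2018-04-03 and 2018-04-04, copied from the question
index = pd.to_datetime(
    [f"2018-04-03 {h:02d}:00" for h in range(9, 17)]
    + [f"2018-04-04 {h:02d}:00" for h in range(9, 18)]
)
values = [18534, 36064, 32387, 15903, 22291, 26367, 66269, 38478,
          15803, 22511, 33123, 21000, 23132, 39270, 102544, 143421, 200]
data = pd.Series(values, index=index)

# Keep only the regular 9:00-16:00 readings (this drops the stray 17:00 value)
regular = data[(data.index.hour >= 9) & (data.index.hour <= 16)]

# One total per calendar day, one average per hour of the day
daily_totals = regular.groupby(regular.index.date).sum()
hourly_averages = regular.groupby(regular.index.hour).mean()

print(daily_totals)
print(hourly_averages)
```

With the 17:00 reading dropped, the total for 2018-04-04 comes out as 400804; the 401004 in the table above still includes that reading.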

In these summaries we may be able to observe some pattern. Here is the hourly one:

import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate as interp

hourly_averages = np.array([16698.6, 39705.2, 47451.9, 21758.8, 20941.5, 29086.1, 65627, 65411])
hours = np.arange(9, 17)
# smoothing factor tied to the largest hour-to-hour jump, squared
hourly_s = 0.1*np.diff(hourly_averages).max()**2
hourly_spline = interp.UnivariateSpline(hours, hourly_averages, s=hourly_s)
xx = np.linspace(min(hours), max(hours), 1000)  # 1000 is an arbitrary number here.
plt.plot(hours, hourly_averages, 'bo')
plt.plot(xx, hourly_spline(xx), 'r')
plt.show()

[Plot: hourly averages (blue dots) with the smoothing spline (red line)]

The curve shows the lunch break and the end-of-day rush. My choice of s as 0.1*np.diff(hourly_averages).max()**2 is not canonical, but it recognizes the fact that s scales as the square of the residuals (see the UnivariateSpline documentation). I'll use the same choice for the daily totals:

daily_totals = np.array([289841, 256293, 401004, 419054, 375993, 266233, 278118, 151390, 318497, 310578])
days = np.arange(len(daily_totals))
daily_s = 0.1*np.diff(daily_totals).max()**2
daily_spline = interp.UnivariateSpline(days, daily_totals, s=daily_s)
xx = np.linspace(min(days), max(days), 1000)  # 1000 is an arbitrary number here.
plt.plot(days, daily_totals, 'bo')
plt.plot(xx, daily_spline(xx), 'r')
plt.show()

[Plot: daily totals (blue dots) with the smoothing spline (red line)]

This is less useful. Maybe we need a longer period of observations. Maybe we should not pretend that Monday comes after Friday. Maybe averages should be taken for each day of week to uncover a weekly pattern, but with only two weeks there is not enough to play with.
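
For completeness, the per-weekday averaging mentioned above could be sketched as follows. It reuses the daily totals from the table; the use of pd.bdate_range to rebuild the ten business days is my own shortcut, and with only two values per weekday the result is an illustration of the mechanics rather than evidence of a weekly pattern.

```python
import pandas as pd

# Daily totals from the table above, indexed by the ten business days
days = pd.bdate_range("2018-04-02", "2018-04-13")
daily_totals = pd.Series(
    [289841, 256293, 401004, 419054, 375993,
     266233, 278118, 151390, 318497, 310578],
    index=days,
)

# Average per day of week; dayofweek is 0 for Monday through 4 for Friday
weekday_means = daily_totals.groupby(daily_totals.index.dayofweek).mean()
print(weekday_means)
```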


Technical details: UnivariateSpline chooses as few knots as possible so that a certain weighted sum of squared deviations from the data is at most s. With large s this means very few knots, until none remain and we get a single cubic polynomial. How large s needs to be depends on the amount of oscillation in the vertical direction, which is extremely high in this example.
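
The knot selection can be inspected directly. The sketch below fits the hourly averages with three smoothing factors: the 0.5 from the question, the heuristic used above, and an arbitrarily huge value. It relies on get_knots(), which returns the interior knots together with the two end points, and on get_residual(), which returns the weighted sum of squared residuals that is compared against s. The results dictionary is just my own bookkeeping.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

hours = np.arange(9, 17, dtype=float)
hourly_averages = np.array([16698.6, 39705.2, 47451.9, 21758.8,
                            20941.5, 29086.1, 65627, 65411])

heuristic_s = 0.1 * np.diff(hourly_averages).max() ** 2

results = {}
for s in (0.5, heuristic_s, 1e12):
    spl = UnivariateSpline(hours, hourly_averages, s=s)
    # get_knots() includes both end points, so subtract them
    interior = len(spl.get_knots()) - 2
    residual = spl.get_residual()  # sum of squared deviations from the data
    results[s] = (interior, residual)
    print(f"s={s:.3g}: {interior} interior knots, residual={residual:.3g}")
```

With s=0.5 the residual budget is negligible next to values in the tens of thousands, so the spline has to pass through essentially every point; with s=1e12 no interior knots are needed and a single cubic polynomial remains.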