Below is an example of using curve_fit from SciPy to fit a linear equation. My understanding of curve fitting in general is that it takes a set of scattered data points and finds the curve that "best fits" them. My question is about what scipy's curve_fit returns for popt, which the documentation describes as:

"Optimal values for the parameters so that the sum of the squared error of f(xdata, *popt) - ydata is minimized".

What exactly do these two values mean in simple English? Thanks!

import numpy as np
from scipy.optimize import curve_fit

# Model function: a straight line
def func(x, a, b):
    return a * x + b

# Generate clean data
x = np.linspace(0, 10, 100)
y = func(x, 1, 2)

# Add noise to the data
yn = y + 0.9 * np.random.normal(size=len(x))

# Run curve_fit on the noisy data: popt holds the best-fit values
# for the parameters of the given model (func), and pcov is the
# estimated covariance of those parameters
popt, pcov = curve_fit(func, x, yn)
print(popt)  # should be close to [1., 2.] for this data

1 Answer

You're asking SciPy to tell you the "best" line through a set of pairs of points (x, y).

Here's the equation of a straight line:

y = a*x + b

The slope of the line is a; the y-intercept is b.

You have two parameters, a and b, so you only need two equations to solve for two unknowns. Two points define a line, right?
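For example (with numbers I've made up), if the line has to pass through (0, 2) and (1, 3), substituting each point into y = a*x + b gives two equations:

2 = a*0 + b   =>   b = 2
3 = a*1 + b   =>   a = 1

so the line is y = 1*x + 2, which is exactly the func(x, 1, 2) used to generate the data in your question.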

So what happens when you have more than two points? In general, no single line can pass through all of them. How do you choose the slope and intercept that give you the "best" line?

One way to define "best" is to calculate the slope and intercept that minimize the sum of the squared differences between each observed y value and the y the line predicts at that x:

error = sum[(y(i) - (a*x(i) + b))^2]
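In code, that error function is just a couple of lines (a minimal sketch; the name sum_squared_error is mine, and the usage note below reuses x, yn, and popt from your question):

import numpy as np

def sum_squared_error(a, b, x, y):
    # sum[(y(i) - (a*x(i) + b))^2]
    return np.sum((y - (a * x + b)) ** 2)

The fitted parameters minimize this, so sum_squared_error(*popt, x, yn) will come out smaller than sum_squared_error(1.5, 0.0, x, yn) or any other choice of a and b.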

It's an easy exercise if you know calculus: take the first derivatives of error w.r.t. a and b and set them equal to zero. You'll have two equations with two unknowns, a and b. You solve them to get the coefficients for the "best" line.
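For a straight line, doing that algebra gives the well-known closed-form least-squares formulas. Here's a sketch that checks them against curve_fit (same setup as the question):

import numpy as np
from scipy.optimize import curve_fit

# Same data as in the question
x = np.linspace(0, 10, 100)
yn = 1 * x + 2 + 0.9 * np.random.normal(size=len(x))

# Setting d(error)/da = 0 and d(error)/db = 0 and solving gives:
n = len(x)
a = (n * np.sum(x * yn) - np.sum(x) * np.sum(yn)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(yn) - a * np.sum(x)) / n

# curve_fit reaches the same answer numerically
popt, _ = curve_fit(lambda x, a, b: a * x + b, x, yn)
print(a, b)   # closed-form slope and intercept
print(popt)   # should agree to within numerical tolerance

Both prints should come out close to the true values 1 and 2 used to generate the data.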