I am fitting a curve to a 1D dataset with P-splines (B-spline basis with a difference penalty). My difficulty is this: for almost every regression model, it is considered essential to split the data into training, validation, and test sets to check whether the model fits well or over-fits. But with P-splines (the mgcv package in R), GCV or AIC is used to choose the smoothing parameter lambda (and hence the effective number of knots), and the paper I read says this penalty itself prevents over-fitting. In this situation I am not sure whether I still need to split the dataset. If I do, what is the difference between splitting and not splitting?
Also, how can I test whether the fit is good at all? GCV only tells me that I have chosen the best smoothing parameter within this model.
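For example, is a hold-out check like the following still worthwhile? (A minimal sketch of what I mean by splitting; the 80/20 split, the seed, and RMSE as the error measure are my own choices, and the column indices match my code below.)

library(mgcv)

set.seed(1)                                  # reproducible split
dataset <- read.csv("bsplinesforR.csv", header = TRUE)
n     <- nrow(dataset)
idx   <- sample(n, size = round(0.8 * n))    # 80% train, 20% test
train <- data.frame(x = dataset[idx, 2],  y = dataset[idx, 4])
test  <- data.frame(x = dataset[-idx, 2], y = dataset[-idx, 4])

## fit on the training part only; GCV still picks lambda internally
fit <- gam(y ~ s(x, bs = "ps"), method = "GCV.Cp", data = train)

## out-of-sample error on the held-out part
pred <- predict(fit, newdata = test)
sqrt(mean((test$y - pred)^2))                # test RMSE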
I tried the mgcv package in R to do the fitting; the code is as follows:
dataset <- read.csv("bsplinesforR.csv", header = TRUE)
x <- dataset[, 2]
y <- dataset[, 4]
ord <- order(x)                    # sort by x so lines() draws a proper curve

## P-spline fit with gamlss (pb() = penalised B-splines)
library(gamlss)
gamlss.ps <- gamlss(y ~ pb(x), control = gamlss.control(trace = FALSE))
detach("package:gamlss")

## P-spline fit with mgcv; lambda chosen by GCV
library(mgcv)
mgcv.ps <- gam(y ~ s(x, bs = "ps"), method = "GCV.Cp")

t.gl <- predict(gamlss.ps)         # fitted values from gamlss
t.g  <- predict(mgcv.ps)          # fitted values from mgcv

plot(x[ord], y[ord], type = "l", lty = 2, col = 1,
     xlab = "x", ylab = "y", main = "p-splines: default values")
lines(x[ord], t.gl[ord], col = 2, lwd = 2)
lines(x[ord], t.g[ord], col = 3, lwd = 2)
legend("topright", c("Real", "gamlss", "mgcv"),
       col = c(1, 2, 3), lty = c(2, 1, 1), bty = "n")
The output of summary(mgcv.ps) is:

Family: gaussian
Link function: identity
Formula:
y ~ s(x, bs = "ps")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 327.182 4.945 66.17 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 5.294 6.123 214.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.961 Deviance explained = 96.4%
GCV = 1518.4 Scale est. = 1344.7 n = 55
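Beyond GCV, the only other checks I know of are residual diagnostics. A minimal sketch (using the mgcv.ps fit from above; gam.check() comes with mgcv):

library(mgcv)

par(mfrow = c(2, 2))
gam.check(mgcv.ps)   # four residual plots plus the basis-dimension (k) check in the console

## residuals against the covariate: systematic structure would suggest a poor fit
par(mfrow = c(1, 1))
plot(x, residuals(mgcv.ps), xlab = "x", ylab = "residuals")
abline(h = 0, lty = 2)

Is this kind of check enough, or is a held-out test set still needed on top of it?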