The story:
I'm facing a problem with a GAM in which only two input variables should be considered:
x = relative price (%) the customer paid for the product, relative to the entry price for the club
b = binary: whether the customer has to pay for the product (VIPs get it for free)
The output variable is
y = whether the customer took the product
and this simulates the data:
require(mgcv)
require(data.table)
set.seed(2017)
y <- sample(c(0, 1), 100, replace=TRUE)
x <- rgamma(100, 3, 3)
b <- as.factor(ifelse(x<.5, 0, 1))
dat <- as.data.table(list(y=y, x=x, b=b))
dat[b=="0",x:=0]
plot(dat$x, dat$y, col=dat$b)
As you can see in the plot, customers who didn't have to pay for the product have a relative price of 0%; the others have relative prices between 0.5% and 3.5%.
Here comes the problem:
I want to model a dummy effect for b and a smooth effect for x (of course only for those who have to pay), so I also use b as a by-variable for x:
mod <- bam(y~b+s(x, by=b), data=dat, family=binomial(link="logit"))
summary(mod)
par(mfrow=c(1,2))
plot(mod)
My questions are:
a. Why is there still a rug at 0% in the plot for s(x, b=1)? Wouldn't it make more sense for mgcv to consider only those who have to pay? Does this problem have something to do with the knots?
b. As you can see in the summary, the dummy effect is estimated as NA. Might this be because the information in b is already used up entirely as the by-variable in s(x), so the dummy b itself has no more information to give? How can I overcome this problem? In other words: is there an option to model a smooth term for only a subset of the data, so that mgcv actually uses only this subset to fit it?
[...] predict) in the data; then use that as a predictor in your next call to gam, and include the dummy too. – user3603486
[...] b %in% mysubset * s(x,by=b). – user3603486
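One possible way to get a smooth that only acts on the paying customers is sketched below. It is not taken from the comments above. It assumes a numeric 0/1 by-variable, here called pay (a name introduced for illustration), in place of the factor b. According to mgcv's documentation on by-variables, a numeric by-variable multiplies the rows of the smooth's model matrix, so rows with pay == 0 get no contribution from s(x). Such a smooth is generally not centred, so it also absorbs the level shift between VIPs and payers, which is why no separate dummy is included in this sketch. It uses gam rather than bam only because the data set is small.

```r
library(mgcv)

# Same simulation idea as in the question, but with a numeric indicator.
set.seed(2017)
n <- 100
x <- rgamma(n, 3, 3)
pay <- as.numeric(x >= 0.5)   # hypothetical name: 1 = has to pay, 0 = VIP
x[pay == 0] <- 0              # VIPs have a relative price of 0
y <- sample(c(0, 1), n, replace = TRUE)
dat2 <- data.frame(y = y, x = x, pay = pay)

# Numeric by-variable: s(x) is multiplied by pay, so it is switched off
# for VIP rows. The intercept then describes the VIPs, and the
# (uncentred) smooth describes the payers, including their level shift.
mod2 <- gam(y ~ s(x, by = pay), data = dat2,
            family = binomial(link = "logit"), method = "REML")
summary(mod2)

# Check: in the linear predictor matrix, every smooth column is zero
# for the VIP rows, so only the intercept applies to them.
X <- predict(mod2, type = "lpmatrix")
range(X[dat2$pay == 0, -1])
```

Calling plot(mod2) afterwards shows the single remaining smooth. Whether the rug still marks x = 0 depends on how the plotting method collects the raw x values, so the lpmatrix check is the more direct way to confirm that VIP rows do not enter the smooth.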