1
votes

the story:

I'm facing a problem with a GAM in which only two input variables should be considered:

x = relative price (%) the customer paid for the product, given the entry price for the club

b = binary: whether the customer has to pay for the product (VIPs get it for free)

The output variable is

y = whether the customer took the product

and this simulates the data:

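library(mgcv)
library(data.table)
set.seed(2017)
y <- sample(c(0, 1), 100, replace = TRUE)   # outcome: took the product or not
x <- rgamma(100, 3, 3)                      # relative price
b <- as.factor(ifelse(x < .5, 0, 1))        # 0 = gets it for free, 1 = has to pay
dat <- as.data.table(list(y = y, x = x, b = b))
dat[b == "0", x := 0]                       # non-payers have a relative price of 0
plot(dat$x, dat$y, col = dat$b)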

relative price

As you can see in the plot, customers who didn't have to pay for the product have a relative price of 0%; the others have relative prices between 0.5% and 3.5%.

here comes the problem:

I want to model one dummy effect for b and a smooth effect for x (of course only for those who have to pay), so I also use b as a by-variable for x:

mod <- bam(y~b+s(x, by=b), data=dat, family=binomial(link="logit"))
summary(mod)
par(mfrow=c(1,2))
plot(mod)

smooth effects

my question is:

a. Why can you still see the rug in s(x, b=1) at 0%? Wouldn't it make more sense if mgcv only considered those who have to pay? Does this problem have something to do with the knots?

b. As you can see in the summary, the dummy effect is estimated as NA. Might this be because the information in b was entirely used up as the by-variable in s(x), so the dummy b itself has no more information to give? How can I overcome this problem? In other words: is there an option to model a smooth term only for a subset of the data and make mgcv actually use only this subset for the fit?

1
So run the smoother using the subset argument to gam, then store its predicted value (from predict) in the data; then use that as a predictor in your next call to gam, and include the dummy too. - user3603486
@dash2 that's an interesting idea, thanks. I understand the subset fitting, but how do you legitimately carry it back into the whole dataset? Let's say there are also covariates other than x: do you include them in the subset fit too, and what do you do if they are estimated very differently on the whole dataset? - 97m423
@李哲源ZheyuanLi since I have fewer than 15 reputation points I cannot vote :* - 97m423
@97m423 I think that to use the prediction for the whole dataset you'd implicitly set the predicted value to 0 for observations outside the subset. For example, you could do something like b %in% mysubset * s(x,by=b). - user3603486

1 Answer

2
votes

Your question is conceptually the same as "How can I force dropping intercept or equivalent in this linear model?". You want to contrast b rather than use all of its levels.

In the GAM setting, you want:

dat$B <- as.numeric(dat$b) - 1
y ~ b + s(x, by = B)

For a factor by smooth, mgcv does not apply contrasts to the by variable if the factor is unordered. This is generally appealing, as we often want a smooth for each factor level. It is thus your responsibility to use some trick to get what you want. What I did above is coerce the two-level factor b to a numeric B, with the level you want to omit being numerically 0, and then use B as a numeric by variable. This idea cannot be extended to factors with more than two levels.
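To make the numeric by trick concrete, here is a minimal sketch that rebuilds the question's simulated data in base R and fits the model. Using gam() instead of bam(), method = "REML", and the helper names tm and smooth_part are my own choices for illustration, not something taken from the question.

```r
library(mgcv)

# Re-create the question's simulated data (base R instead of data.table)
set.seed(2017)
y <- sample(c(0, 1), 100, replace = TRUE)
x <- rgamma(100, 3, 3)
b <- factor(ifelse(x < 0.5, 0, 1))
x[b == "0"] <- 0                      # non-payers have relative price 0
dat <- data.frame(y = y, x = x, b = b)

# Numeric 0/1 copy of b: 0 switches the smooth off, 1 switches it on
dat$B <- as.numeric(dat$b) - 1

mod <- gam(y ~ b + s(x, by = B), data = dat,
           family = binomial(link = "logit"), method = "REML")
summary(mod)

# Per-term contributions on the link scale
tm <- predict(mod, type = "terms")
smooth_part <- tm[, grep("^s\\(x\\)", colnames(tm))]
range(smooth_part[dat$B == 0])        # smooth contribution for non-payers
```

Because the rows of the smooth's model matrix are multiplied by B, observations with B == 0 receive no contribution from s(x). One thing that may be worth checking in the summary: the description of by in ?s says a smooth with a numeric by variable is not usually subject to a centering constraint, so how the coefficient of b is reported can depend on that.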


If your by factor has more than two levels and you still want to enforce a contrast, you need to use an ordered factor; mgcv then generates no smooth for the first level of the ordered factor. For example, you can do

dat$B <- ordered(dat$b)
y ~ b + s(x, by = B)
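As a rough illustration of the ordered-factor route with more than two levels, the sketch below uses made-up data: the three groups "free", "member" and "guest", the variable names g and G, and the simulated response are all hypothetical and only serve to show which smooths mgcv constructs.

```r
library(mgcv)

# Hypothetical three-group data: "free" customers always have x = 0
set.seed(1)
n <- 300
g <- factor(sample(c("free", "member", "guest"), n, replace = TRUE),
            levels = c("free", "member", "guest"))
x <- ifelse(g == "free", 0, rgamma(n, 3, 3))
y <- rbinom(n, 1, 0.5)
d <- data.frame(y = y, x = x, g = g)

# Ordered copy used only as the by variable; "free" is its first level
d$G <- ordered(d$g)

mod3 <- gam(y ~ g + s(x, by = G), data = d, family = binomial)

# Labels of the smooths that were actually constructed
smooth_labels <- sapply(mod3$smooth, function(s) s$label)
smooth_labels
```

The unordered g is kept as the parametric term so that it is coded with ordinary dummy contrasts, while the ordered copy G is used only inside s(). The first level of G ("free" here) should get no smooth of its own.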

Read more on 'by' variables in ?gam.models.