
For my Bachelor's thesis I am trying to fit a linear median regression model to constant-sum data from a survey. It is an attempt to recreate the probability elicitation approach proposed by A. Blass et al. (2008), "Using Elicited Choice Probabilities to Estimate Random Utility Models: Preferences for Electricity Reliability".

My dependent variable is the log-odds transformation of the constant-sum allocations, calculated with the following code:

library(dplyr)

# Within each respondent/task, alternative 1 serves as the base category:
# LogProb = log(Response_j / Response_1), so the base alternative gets 0.
PE_raw <- PE_raw %>%
  group_by(sys_RespNum, Task) %>%
  mutate(LogProb = log(Response / Response[1])) %>%
  ungroup()
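
As a quick sanity check, here is what the transformation gives for a single hypothetical task in which 20/30/50 points were allocated to three alternatives (toy data, not from the actual survey):

toy <- data.frame(sys_RespNum = 1, Task = 1, Response = c(20, 30, 50))
toy %>%
  group_by(sys_RespNum, Task) %>%
  mutate(LogProb = log(Response / Response[1]))
# LogProb: 0.000  0.405  0.916  (log(1), log(1.5), log(2.5))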

My independent variables are delivery costs, minimum order quantity and delivery window, each a categorical variable with levels 0, 1, 2 and 3. Here, level 0 represents the none option.
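
For reference, R's default treatment coding already uses the first level (here 0) as the baseline, so each factor is expanded into three dummies; a quick check on a hypothetical factor with levels 0 to 3:

contrasts(factor(c(0, 1, 2, 3)))
#   1 2 3
# 0 0 0 0
# 1 1 0 0
# 2 0 1 0
# 3 0 0 1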

Data snapshot

I tried running the following quantile regression (using R's quantreg package):

LAD.factor <- rq(LogProb ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) + factor(`Delivery window`) + factor(NoneOpt), data=PE_raw, tau=0.5)

However, I ran into the following error indicating singularity:

Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix

I ran a linear regression and applied R's alias() function to investigate further. It reported three cases of perfect multicollinearity:

  • minimum order quantity 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - minimum order quantity 1 - minimum order quantity 2
  • delivery window 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - delivery window 1 - delivery window 2
  • NoneOpt = intercept - delivery costs 1 - delivery costs 2 - delivery costs 3

In hindsight these cases all make sense. When R dichotomized the categorical variables, these relations hold by construction: for every alternative, delivery costs 1 + delivery costs 2 + delivery costs 3 = 1 and minimum order quantity 1 + minimum order quantity 2 + minimum order quantity 3 = 1. Setting the two sums equal and rearranging gives the first formula.
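
One way to confirm this directly is to build the design matrix yourself and compare its rank to its number of columns (a quick sketch; lm() and rq() both construct this matrix via model.matrix()):

X <- model.matrix(
  ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) +
    factor(`Delivery window`) + factor(NoneOpt),
  data = PE_raw
)
ncol(X)      # number of coefficients the model tries to estimate
qr(X)$rank   # lower than ncol(X) here, i.e. the design matrix is singular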

It looks like a classic dummy variable trap. In an attempt to work around this issue I manually dichotomized the data and used the following formula:

LM.factor <- rq(
  LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
    Minimum.order.quantity_1 + Minimum.order.quantity_2 +
    Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
  data = PE_dichotomized, tau = 0.5
)

Instead of an error message, I now got the following warning:

Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique

When using the summary function:

 > summary(LM.factor)
Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose,  : 
  singular matrix in 'backsolve'. First zero in diagonal [2]
In addition: Warning message:
In summary.rq(LM.factor) : 153 non-positive fis

Is anyone familiar with this issue? I am looking for alternative solutions. Perhaps I am making a mistake in how I use rq(), or perhaps the data are misrepresented somewhere.

I am grateful for any input, thank you in advance.

Reproducible example

library(quantreg)

#### Raw dataset (PE_raw_SO) ####

# quantile regression (produces singularity error)
LAD.factor <- rq(
  LogProb ~ factor(`Delivery costs`) +
    factor(`Minimum order quantity`) + factor(`Delivery window`) +
    factor(NoneOpt),
  data = PE_raw_SO,
  tau = 0.5
) 

# linear regression to check for singularity
LM.factor <- lm(
  LogProb ~ factor(`Delivery costs`) +
    factor(`Minimum order quantity`) + factor(`Delivery window`) +
    factor(NoneOpt),
  data = PE_raw_SO
)
alias(LM.factor)

# impose assumptions on standard errors
# (note: se = ... is a summary.rq argument; summary.lm silently ignores it)
summary(LM.factor, se = "iid")
summary(LM.factor, se = "boot")


#### Manually created dummy variables to get rid of
#### collinearity (PE_dichotomized_SO) ####
LAD.di.factor <- rq(
  LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
    Minimum.order.quantity_1 + Minimum.order.quantity_2 +
    Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
  data = PE_dichotomized_SO,
  tau = 0.5
)

summary(LAD.di.factor)  #backsolve error

# impose assumptions (unusual results)
summary(LAD.di.factor, se = "iid") 
summary(LAD.di.factor, se = "boot")

# linear regression to check for singularity
LM.di.factor <- lm(
  LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
    Minimum.order.quantity_1 + Minimum.order.quantity_2 +
    Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
  data = PE_dichotomized_SO
)
alias(LM.di.factor)

summary(LM.di.factor)  #regular results, all significant

Link to sample data + code: GitHub


1 Answer


The "Solution may be nonunique" warning is not unusual when doing quantile regression with dummy explanatory variables.

See, e.g., the quantreg FAQ:

The estimation of regression quantiles is a linear programming problem. And the optimal solution may not be unique.

A more intuitive explanation for what is happening is given by Roger Koenker (the author of quantreg) on r-help back in 2006:

When computing the median from a sample with an even number of distinct values there is inherently some ambiguity about its value: any value between the middle order statistics is "a" median. Similarly, in regression settings the optimization problem solved by the "br" version of the simplex algorithm, modified to do general quantile regression, identifies cases where there may be non-uniqueness of this type. When there are "continuous" covariates this is quite rare, when covariates are discrete then it is relatively common, at least when tau is chosen from the rationals. For univariate quantiles R provides several methods of resolving this sort of ambiguity by interpolation, "br" doesn't try to do this, instead returning the first vertex solution that it comes to.
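
You can reproduce the univariate version of this in a couple of lines (a toy sketch; an intercept-only quantile regression is just a sample quantile):

library(quantreg)
y <- c(1, 2, 3, 4)
coef(rq(y ~ 1, tau = 0.5))  # "br" returns a vertex solution, i.e. one of
                            # the two middle order statistics (2 or 3)
median(y)                   # 2.5 -- R's median() interpolates instead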

Your second warning, "153 non-positive fis", relates to how rq estimates the local densities (the f_i) that enter the standard error calculation. Occasionally the estimated local densities of the quantile regression function come out non-positive (which is impossible for a true density); when that happens, rq sets them to zero. Again, quoting from the FAQ:

This is generally harmless, leading to a somewhat conservative (larger) estimate of the standard errors, however if the reported number of non-positive fis is large relative to the sample size then it is an indication of misspecification of the model.
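
In practice, if the non-positive fis worry you, one option (a sketch, not a definitive fix) is to bypass the density estimation altogether with bootstrap standard errors, which you already have in your script:

# resampling-based standard errors avoid estimating the local densities;
# R is the number of bootstrap replications passed through to boot.rq()
summary(LAD.di.factor, se = "boot", R = 1000)

That said, if 153 is large relative to your sample size, the FAQ's misspecification caveat above still applies.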