2
votes

I will preface this by saying that I am fairly new to R, have been stuck on this issue for a few weeks, and seem to be getting nowhere. I am looking to perform a multivariate logistic regression to determine whether water main material and soil type play a role in the location of water main breaks in my study area.

I have 417 positive water main break locations and created an additional 400 false locations to use in my analysis. I understand that the water main material and the soil type are both categorical variables and should be re-coded into dummy variables before being used in the GLM. That is where I am having trouble. I have not worked with dummy variables until now and can't seem to understand how they are created in R. Below is a breakdown of the data I have and the GLM call that I am currently using.

INDICATOR: 0 or 1 (Indicates if the location XY was or was not a water main break location)

MAIN MATERIAL: Material of the water main at the XY location (categorical value - about 8 unique values)

SOIL CLASSIFICATION: Type of soil at location of break (categorical value - around 20 values)

(logAnalysis <- glm(Indicator~main_material+soil_classification, data=Breaks, family=binomial(link="logit")))

I have only used Stack Exchange one other time so if more information is needed, please let me know.

After trying Arthur's suggestion of using factor(), this is the output that I get: R Output

I am a bit confused why many of the soil classifications and the PE main material have such high Std. Errors.

1
Are you getting an error message? - Arthur Yip
Arthur - I was not getting an error, but I have only recently realized that I should be looking at dummy variables. I was able to run the GLM without re-coding, but the results were not accurate (not even close, actually!) - Rmoore

1 Answer

1
votes

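factor() is how R handles "dummy variables": when a column is a factor, glm() builds the dummy (indicator) columns for you, one for each level except the reference level. Try: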

(logAnalysis <- glm(Indicator~main_material+factor(soil_classification), data=Breaks, family=binomial(link="logit")))
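
To see what factor() does behind the scenes, here is a minimal sketch on made-up data. The Breaks data frame below is a hypothetical stand-in built with random values, and the level labels ("CI", "DI", "PVC", "clay", and so on) are assumptions for illustration only, not the real categories from the question. model.matrix() shows the dummy columns that glm() would build from the same formula.

```r
# Hypothetical toy data standing in for the real Breaks data frame.
# All values and level labels here are invented for illustration.
set.seed(1)
n <- 200
Breaks <- data.frame(
  Indicator           = rbinom(n, 1, 0.5),
  main_material       = sample(c("CI", "DI", "PVC"), n, replace = TRUE),
  soil_classification = sample(c("clay", "loam", "sand", "silt"), n, replace = TRUE),
  stringsAsFactors    = FALSE
)

# Convert the categorical columns to factors once, up front.
Breaks$main_material       <- factor(Breaks$main_material)
Breaks$soil_classification <- factor(Breaks$soil_classification)

# model.matrix() exposes the design matrix: an intercept column plus
# (number of levels - 1) 0/1 dummy columns for each factor.
X <- model.matrix(~ main_material + soil_classification, data = Breaks)
print(colnames(X))

# Fit the logistic regression; the factors are expanded into dummies automatically.
logAnalysis <- glm(Indicator ~ main_material + soil_classification,
                   data = Breaks, family = binomial(link = "logit"))
print(summary(logAnalysis))

# Count rows per soil level and outcome to spot sparse levels.
print(table(Breaks$soil_classification, Breaks$Indicator))
```

The first level of each factor (alphabetical by default) is the reference category, and each coefficient is interpreted relative to it. The final table() call is a quick check worth running on real data: levels with very few rows, or levels whose rows all share the same outcome, commonly produce very large standard errors.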