1
votes

First of all, please download my data set from http://alexandervanloon.nl/survey_oss.csv and then execute the following content of a script to get a few scatter plots:

# read data and attach it
survey <- read.table("survey_oss.csv", header=TRUE)
attach(survey)

# plot for inhabitants
png("scatterINHABT.png")
plot(INHABT, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", las=1)
abline(lm(OSSADP~INHABT)) # regression line (y~x)
dev.off()

# plot for inhabitants divided by 1000
png("scatterINHABT_divided.png")
plot(INHABT/1000, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", las=1)
abline(lm(OSSADP~INHABT)) # regression line (y~x)
dev.off()

# plot for inhabitants in logarithmic scale
png("scatterINHABT_log.png")
plot(INHABT, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", las=1, log="x")
abline(lm(OSSADP~INHABT)) # regression line (y~x)
dev.off()

# plot for inhabitants in logarithmic scale and divided by 1000
png("scatterINHABT_log_divided.png")
plot(INHABT/1000, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", las=1, log="x")
abline(lm(OSSADP~INHABT)) # regression line (y~x)
dev.off()

As you can see, the problem in the first scatterplot is that R decides to use scientific notation, and the data looks odd because of outliers. That's why I'd like to show the inhabitants on the x-axis in thousands and have the x-axis use a logarithmic scale as well.

The problem is twofold. First, I can get rid of scientific notation by simply dividing the inhabitants by 1000, but this produces a flat horizontal regression line, unlike the first plot. I know there are other ways to fix this, such as the question "Do not want scientific notation on plot axis", but I couldn't adapt the code there to my situation.

Second, switching the x-axis to a logarithmic scale also makes the regression line flat. Google points to https://stat.ethz.ch/pipermail/r-help/2006-January/086500.html as a first result for a possible solution and I tried using abline(lm(OSSADP~log10(INHABT))) which is suggested there, but that produces a vertical regression line. And if I divide both by 1000 and use a logarithmic scale, the line is also horizontal.

I'm a social scientist without any background in mathematics and statistics, so I fear I might have missed something obvious, if so my apologies. Thank you all very much for any potential help.

2 Answers

0
votes

The scientific notation question was covered on the R mailing list a while ago; you can control when R switches to scientific notation with the scipen option, which is set through options().

options(scipen=10)
plot(INHABT, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS")
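
To see what scipen does without needing the survey file, here is a minimal sketch using format() on a single number. The value 100000 and the variable names are only illustrative; axis labels should follow the same penalty rule, but check this on your own plot.

```r
# Remember the current setting so it can be restored afterwards.
old <- options(scipen = 0)        # 0 is R's default penalty

default_label <- format(100000)   # scientific form is shorter, so R picks it
options(scipen = 10)              # penalise scientific notation by 10 characters
fixed_label <- format(100000)     # fixed notation now wins

print(c(default_label, fixed_label))
options(old)                      # restore the previous setting
```

A positive scipen biases R towards fixed notation and a negative one towards scientific notation, so a value like 10 is usually enough for population counts.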

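Second, the problem with dividing by 1000 is that you divided in the plot call but not in the abline call. The fitted slope is therefore still per inhabitant rather than per thousand inhabitants, so on the rescaled axis the line looks flat. This would do the trick: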

plot(INHABT/1000, OSSADP, xlab="Inhabitants (thousands)", ylab="Adoption of OSS")
abline(lm(OSSADP~I(INHABT/1000))) # Fixed regression line.

The I() is necessary because the / symbol has a different meaning in formulas; wrapping the expression in I() makes R treat it as ordinary arithmetic.
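
As a rough check of why the line went flat, the sketch below fits both models on made-up data (the vectors inhabitants and adoption are hypothetical stand-ins for INHABT and OSSADP, not the survey values). The slope per thousand inhabitants comes out 1000 times larger than the slope per inhabitant, while the intercept stays the same.

```r
set.seed(1)
# Hypothetical data standing in for INHABT and OSSADP.
inhabitants <- round(runif(50, 5000, 500000))
adoption <- 2 + 1e-5 * inhabitants + rnorm(50, sd = 0.5)

fit_raw <- lm(adoption ~ inhabitants)            # slope per inhabitant
fit_k   <- lm(adoption ~ I(inhabitants / 1000))  # slope per 1000 inhabitants

print(coef(fit_raw))
print(coef(fit_k))

# The line must come from the model fitted on the same scale as the x-axis.
plot(inhabitants / 1000, adoption,
     xlab = "Inhabitants (thousands)", ylab = "Adoption of OSS")
abline(fit_k)
```

Drawing abline(fit_raw) on this plot instead would use the per-inhabitant slope on an axis measured in thousands, which is what makes the line appear horizontal.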

Also, your las parameter is unnecessary.

0
votes

I solved the problem of the horizontal line when using log="x" like this:

plot(INHABT, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", log="x")
abline(lm(OSSADP~log10(INHABT)))

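Note that this uses log10 and not just log, because a log-scaled axis in base R graphics works in base-10 units, and abline draws its line in those transformed coordinates.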
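
The same idea should carry over to the case from the question where the x-axis is both logarithmic and in thousands. The following is only a sketch on invented data (inhabitants and adoption are hypothetical, not the survey columns): the division goes inside log10(), where / is ordinary arithmetic, so no I() is needed.

```r
set.seed(2)
# Hypothetical data spanning several orders of magnitude.
inhabitants <- round(10^runif(50, 3, 6))
adoption <- 1 + 0.8 * log10(inhabitants) + rnorm(50, sd = 0.3)

# Log-scaled x-axis, inhabitants in original units.
fit_log <- lm(adoption ~ log10(inhabitants))
plot(inhabitants, adoption, log = "x",
     xlab = "Inhabitants", ylab = "Adoption of OSS")
abline(fit_log)

# Log-scaled x-axis, inhabitants in thousands.
fit_log_k <- lm(adoption ~ log10(inhabitants / 1000))
plot(inhabitants / 1000, adoption, log = "x",
     xlab = "Inhabitants (thousands)", ylab = "Adoption of OSS")
abline(fit_log_k)

print(coef(fit_log))
print(coef(fit_log_k))
```

Because log10(x / 1000) equals log10(x) - 3, the two fits share the same slope and only the intercept shifts, so both lines pass through the points in the same way.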