R produces different result after IO on file

Question

As new to R, I have a question about writing and reading vector data.

My Example 1

n = 100
g = 6
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), 
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
plot(d)
require(vegan)
fit <- cascadeKM(scale(d, center = TRUE,  scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")

(source), it prints "Calinski criterion optimal number of clusters: 5" as expected.

Example 2: (write data frame d first, then read it)

n = 100
g = 6 
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), 
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))

write.table(d, "d.txt", sep='\t', quote=FALSE) #write data frame
d = read.table("d.txt", header=TRUE, sep = '\t') #read later
plot(d)

require(vegan)
fit <- cascadeKM(scale(d, center = TRUE,  scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")

However, example 2 prints "Calinski criterion optimal number of clusters: 1".

I think the format (or something else) has been changed after IO from file in R. But i have no knowledge about how R read and write numbers. Can anyone give me some clues, thanks.

EDIT If the file is written without col name and row name, problem solved.

write.table(d, "d.txt", sep='\t', quote=FALSE, row.name=FALSE, col.names=FALSE)

When reading, R also reads the row and col names,. Another is to escape those names when reading.

The plot is the same in both cases? The code seem exactly the same... Did you print after reading again to compare if it is the same (maybe just the head of it will do head(d)) — llrs
The plot is the same and the print of table frame d is the same. — user200340
Writing numbers to a text file is very likely going to lose some precision. Use save or saveRDS instead. — Joshua Ulrich
Thanks Joshua, as a beginner and on the experiment data, I'd like to have text file since it is much easier to read and modify. I will use saveRDS in the real data. — user200340

Jari Oksanen Jari Oksanen · Accepted Answer · 2014-01-10T11:46:07

The Calinski index cannot be calculated for one group, but it becomes either Inf or -Inf. In the first example it happened to be -Inf, and in the second it happened to be Inf and when you looked for which.max, Inf was what you got. I do not know why we actually bother to calculate the index for one-class case, but if you search for the best result, you should ignore the first case. We do so in the plot command which gives five clusters as the best result in both cases. The following modification of your code will give the same answer in both cases:

calinski.best <- as.numeric(which.max(fit$results[2,-1])) + 1
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")

We had to have +1 because we omitted one column.

Small details for the Inf/-Inf indeterminacy. As you see in ?cascadeKM, Calinski criterion is defined as (SSB/(K-1))/(SSW/(n-K)) and for one group K=1 so that SSB/0 = Inf. For one group also SSB=0, but that is computed zero and these are rarely exact in digital computers, and in my computer zero is -2.8e-14 and -2.8e-14/0 = -Inf. In the second example SSB=2.8e-14 and 2.8e-14/0 = Inf. Just ignore the first column when you look for the optimum. Occasionally SSB can be exact zero and then 0/0 = NaN (not a number).

R produces different result after IO on file

1 Answers