1
votes

I have a dataframe with more than 1 million columns (I converted a raster stack into a dataframe). Among these 1 million columns only a thousands of them have data. First two rows of the data frame have latitude and longitude information. How can i delete columns with no data however every column has data as latitude and longitude information.

Sample:

> head(data[,c(1:8)])
            [,1]      [,2]      [,3]      [,4]      [,5]    [,6]      [,7]      [,8]
 x         -961887.6 -960959.8 -960032.1 -959104.4 -958176.7 -957249 -956321.2 -955393.5
 y         2816074.2 2816074.2 2816074.2 2816074.2 2816074.2 2816074 2816074.2 2816074.2
 X2012273.        NA        NA        NA        NA        NA      NA        NA        NA
 X2012281.        NA        NA        NA        NA        NA      NA        NA        NA
 X2012289.        NA        NA        NA        NA        NA      NA        NA        NA
 X2012297.        NA        NA        NA        NA        NA      NA        NA        NA

My question is how can i exclude first two rows and delete all no data columns at once.

I tried following code: number of rows in dataframe ( data ) are 22 including latitude row and longitude row. I applied the logic:

for (i in 1:ncol(data)) {
    y = sum(is.na(data[,i]))
    if(y == (length(data[,i]) - 2)) {
        data[,-i]
    }
 }

This for loop may take a long time and eventually will not execute successfully.

1
Please provide a sample dataDJV
I updated the question with sample dataLostman
Not clear to me. Which rows you want to delete? Which columns you want to delete?Ronak Shah
I just want to delete columns which have no data values (NA). But the problem is if they don't have data but they still have lat/long information in first two rows.Lostman

1 Answers

0
votes

Based on my understanding of your question, here is a possible solution:

First, your data set for the purpose of reproducible example:

dput(data)

structure(list(V1 = structure(c(1L, 6L, 2L, 3L, 4L, 5L), .Label = c("x", "X2012273",     "X2012281", "X2012289", "X2012297", "y"), class = "factor"), V2 = c(-961887.6, 2816074.2, NA, NA, NA, NA), V3 = c(-960959.8, 2816074.2, NA, NA, NA, NA), V4 = (-960032.1, 2816074.2, NA, NA, NA, NA), V5 = c(-959104.4, 2816074.2, NA, NA, NA, NA), V6 = c(-958176.7, 2816074.2, NA, NA, NA, NA), V7 = c(-957249L, 2816074L, NA, NA, NA, NA), V8 = c(-956321.2, 2816074.2, NA, NA, NA, NA), V9 = c(-955393.5, 2816074.2, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -6L))

Next, to delete all the columns that have only NAs:

data2 = data[, colSums(is.na(data)) != nrow(data) - 2]

data2 is the final data frame that should meet your requirement. In the above code, it is checking whether the sum of NAs in a given column is equal to its number of rows minus two, if yes then delete such columns.

The final output i.e., data2 would like as below:

data2
[1] x        y        X2012273 X2012281 X2012289 X2012297

As you can see, x, y which represent longitude and latitude are a part of the output, which is what you wanted.