
I have a data frame with the following dimensions:

18549282 obs. of  3 variables:

$ road: chr  "MULTILINESTRING((30.5592664 -30.5971316,30.5597665 -30.5964615))" ...
$ n1       : int  0 0 0 0 0 0 0 0 0 0 ...
$ n2       : int  0 0 0 0 0 0 0 0 0 0 ...

There are no blank records in the road column, meaning that every record has a character.

When I use dplyr's group_by together with summarize to get the sums of n1 and n2 by road, the sums are computed, but one row of the result has a blank in the road column, e.g.:

library(dplyr)

tt %>%
  group_by(road) %>%
  summarize(sn1 = sum(n1),   # total of n1 per road
            sn2 = sum(n2))   # total of n2 per road

I get:

[image: summarized result with a blank value in the road column]

Again, I'm 100% sure that there are no blanks in the road column.

But when I create a data frame with, let's say, 1000 records as follows:

small_dataset <- head(tt, 1000)

I don't see any blank records in the results:

[image: summarized result for small_dataset, with no blank values]

It seems that dplyr struggles with the large amount of data.

Any ideas on how I can handle this issue?

"I'm 100% sure that there are no blanks": how did you test this? What is the output of sum(tt$road == "")? – zx8754
Yep, 100%, I tested it a couple of times. I found a solution in rquery, see below; it produces no blanks in the results. Very strange. As soon as this project is done I'll reinstall R and all the packages; it may be an issue on my PC. – Jacobus
Could it be that dplyr is seeing "NULL" as blank and rquery ignores it? – zx8754
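
The check suggested in the comments can be widened a little, since a key that displays as blank is not necessarily the empty string. The following is only a sketch in base R: tt_demo and its values are made up for illustration and stand in for the real tt, and the four categories are assumptions about what might print as a blank, not a confirmed diagnosis.

```r
# Hypothetical toy data standing in for `tt` (the real frame has ~18.5M rows).
tt_demo <- data.frame(
  road = c("MULTILINESTRING((30.55 -30.59,30.56 -30.60))", "", " ", NA, "NULL"),
  n1 = c(1L, 0L, 2L, 0L, 3L),
  n2 = c(0L, 1L, 0L, 2L, 0L),
  stringsAsFactors = FALSE
)

# Count each kind of value that could show up as a "blank" group key.
blank_report <- c(
  empty      = sum(tt_demo$road == "", na.rm = TRUE),        # zero-length strings
  missing    = sum(is.na(tt_demo$road)),                     # NA values
  whitespace = sum(grepl("^[[:space:]]+$", tt_demo$road)),   # spaces, tabs, newlines only
  null_text  = sum(tt_demo$road == "NULL", na.rm = TRUE)     # the literal text "NULL"
)
print(blank_report)
```

Running the same four counts against tt$road instead of tt_demo$road would show whether any of these cases is present in the full data.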

1 Answer


I found a solution in a package I wasn't familiar with, called rquery:

tt %>%
  rquery::project(.,
                  count := n(),
                  n1 := sum(n1),
                  n2 := sum(n2),
                  groupby = 'road')

This solved my issue, and it is faster than dplyr's group_by function.
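
Since the two packages gave different results here, one way to cross-check is to recompute the grouped sums with base R only and see whether a blank key shows up there too. This is a rough sketch under assumptions: tt_demo and the road names "A" and "B" are placeholders, and the empty string in the last row is planted deliberately to show what a blank key looks like in the output.

```r
# Hypothetical toy data standing in for `tt`.
tt_demo <- data.frame(
  road = c("A", "B", "A", ""),
  n1 = c(1L, 2L, 3L, 4L),
  n2 = c(10L, 20L, 30L, 40L),
  stringsAsFactors = FALSE
)

# rowsum() sums each column within the groups given by `group`;
# the group keys become the row names of the result.
sums <- rowsum(as.matrix(tt_demo[, c("n1", "n2")]), group = tt_demo$road)
print(sums)

# Group keys that are empty or contain only whitespace.
blank_keys <- rownames(sums)[!nzchar(trimws(rownames(sums)))]
print(length(blank_keys))
```

If the same check on the full tt also reports a blank key, that would point to the data rather than to dplyr; if it reports none, the package setup is the more likely place to look.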