1
votes

I am running an analysis of Citibike data and with help have created a code that works perfectly - it extracts all of the trips of bikes that started from a different station where the bike last stopped (that is, all instances where the start.station.id is different from the previous end.station.id, which means the bike was moved by a truck. The monthly datasets, however, are very large, with the summer months containing over 1 million individual trips (you can find them here: Citibike data.

Here is a snapshot of the dataset:

head(Nov2015, n = 20)
   tripduration          starttime           stoptime start.station.id                start.station.name
1          1110 11/1/2015 00:00:00 11/1/2015 00:18:31              537           Lexington Ave & E 24 St
2          1094 11/1/2015 00:00:01 11/1/2015 00:18:15              537           Lexington Ave & E 24 St
3           520 11/1/2015 00:00:05 11/1/2015 00:08:45              536                   1 Ave & E 30 St
4           753 11/1/2015 00:00:15 11/1/2015 00:12:48              229                    Great Jones St
5           353 11/1/2015 00:00:22 11/1/2015 00:06:15              285                Broadway & E 14 St
6          1285 11/1/2015 00:00:22 11/1/2015 00:21:48              268             Howard St & Centre St
7           477 11/1/2015 00:00:25 11/1/2015 00:08:23              379                   W 31 St & 7 Ave
8           362 11/1/2015 00:00:28 11/1/2015 00:06:30              407              Henry St & Poplar St
9          2316 11/1/2015 00:00:37 11/1/2015 00:39:14              147          Greenwich St & Warren St
10          627 11/1/2015 00:00:42 11/1/2015 00:11:10              521                   8 Ave & W 31 St
11         2304 11/1/2015 00:00:44 11/1/2015 00:39:08              147          Greenwich St & Warren St
12         1471 11/1/2015 00:01:04 11/1/2015 00:25:35              281 Grand Army Plaza & Central Park S
13         1484 11/1/2015 00:01:36 11/1/2015 00:26:21              281 Grand Army Plaza & Central Park S
14          284 11/1/2015 00:01:36 11/1/2015 00:06:20              247            Perry St & Bleecker St
15          886 11/1/2015 00:01:39 11/1/2015 00:16:25              492                   W 33 St & 7 Ave
16          886 11/1/2015 00:01:42 11/1/2015 00:16:28              492                   W 33 St & 7 Ave
17         1379 11/1/2015 00:01:44 11/1/2015 00:24:44              512                   W 29 St & 9 Ave
18          179 11/1/2015 00:01:47 11/1/2015 00:04:47              319              Fulton St & Broadway
19          309 11/1/2015 00:01:51 11/1/2015 00:07:00              160           E 37 St & Lexington Ave
20          616 11/1/2015 00:02:08 11/1/2015 00:12:24              479                   9 Ave & W 45 St
   start.station.latitude start.station.longitude end.station.id         end.station.name
1                40.74026               -73.98409            531   Forsyth St & Broome St
2                40.74026               -73.98409            531   Forsyth St & Broome St
3                40.74144               -73.97536            498       Broadway & W 32 St
4                40.72743               -73.99379            328  Watts St & Greenwich St
5                40.73455               -73.99074            151 Cleveland Pl & Spring St
6                40.71911               -73.99973            476          E 31 St & 3 Ave
7                40.74916               -73.99160            546     E 30 St & Park Ave S
8                40.70047               -73.99145            310      State St & Smith St
9                40.71542               -74.01122            441          E 52 St & 2 Ave
10               40.75097               -73.99444            285       Broadway & E 14 St
11               40.71542               -74.01122            441          E 52 St & 2 Ave
12               40.76440               -73.97371            367  E 53 St & Lexington Ave
13               40.76440               -73.97371            367  E 53 St & Lexington Ave
14               40.73535               -74.00483            453          W 22 St & 8 Ave
15               40.75020               -73.99093            377         6 Ave & Canal St
16               40.75020               -73.99093            377         6 Ave & Canal St
17               40.75007               -73.99839            445       E 10 St & Avenue A
18               40.71107               -74.00945            264     Maiden Ln & Pearl St
19               40.74824               -73.97831            362       Broadway & W 37 St
20               40.76019               -73.99126            440          E 45 St & 3 Ave
   end.station.latitude end.station.longitude bikeid   usertype birth.year gender
1              40.71894             -73.99266  22545 Subscriber       1981      2
2              40.71894             -73.99266  23959 Subscriber       1980      1
3              40.74855             -73.98808  22251 Subscriber       1988      1
4              40.72406             -74.00966  15869 Subscriber       1981      1
5              40.72210             -73.99725  21645 Subscriber       1987      1
6              40.74394             -73.97966  14788   Customer         NA      0
7              40.74445             -73.98304  21128 Subscriber       1962      2
8              40.68927             -73.98913  21016 Subscriber       1978      1
9              40.75601             -73.96742  24117 Subscriber       1988      2
10             40.73455             -73.99074  17048 Subscriber       1986      2
11             40.75601             -73.96742  18241 Subscriber       1984      1
12             40.75828             -73.97069  24223   Customer         NA      0
13             40.75828             -73.97069  16779   Customer         NA      0
14             40.74475             -73.99915  17272 Subscriber       1976      1
15             40.72244             -74.00566  15008 Subscriber       1981      1
16             40.72244             -74.00566  23019 Subscriber       1982      1
17             40.72741             -73.98142  23843 Subscriber       1962      2
18             40.70706             -74.00732  22538 Subscriber       1981      1
19             40.75173             -73.98754  22042 Subscriber       1988      1
20             40.75255             -73.97283  22699 Subscriber       1982      1

And of the code I used to extract the "hidden" bike movements and put them into a coherent data.frame:

raw_data = read.csv("201511-citibike-tripdata.csv")
unique_id = unique(raw_data$bikeid)


output <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0),  "stoptime" = character(),"starttime" = character(), stringsAsFactors=FALSE)

for (bikeid in unique_id)
{
onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]

if(nrow(onebike) >=2 ){
for(i in 2:nrow(onebike )) {
if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) &
onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%m/%d/%Y %H:%M:%S"),
                                strptime(onebike[i-1,"stoptime"], "%m/%d/%Y %H:%M:%S")
                                ,units = "mins"))
new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"]))
output[nrow(output) + 1,] = new_row
}
}
}
}

Because it uses a for loop, the extraction takes a very long time. Is there a way to speed up this process or rewrite the code in a way that avoids using a for loop?

The output should stay exactly the same. Namely:

head(output)
  bikeid end.station.id start.station.id        diff.time           stoptime          starttime
1  22545            520              529 24.8166666666667 11/2/2015 08:38:22 11/2/2015 09:03:11
2  22545            520              517 537.483333333333 11/2/2015 09:39:19 11/2/2015 18:36:48
3  22545           2004             3230 563.066666666667 11/2/2015 22:06:27 11/3/2015 07:29:31
4  22545            296             3236 471.783333333333 11/4/2015 23:40:29 11/5/2015 07:32:16
5  22545            520              449 43.4166666666667 11/9/2015 08:24:06 11/9/2015 09:07:31
6  22545            359              519 30.7166666666667 11/9/2015 09:14:46 11/9/2015 09:45:29
1
I know you want a loop-less solution, but look into the package foreach - cran.r-project.org/web/packages/foreach/foreach.pdfJubbles
Please explain what your code does. Parsing it is slightly more difficult than writing a vectorized solution.Roland
It is easy to understand if you download the dataset from the website - it's free. The code simply loops over all of the bike trip records - selects a unique bike ID, and finds the records where the bike started at a different station than where it ended. It does this for every unique bike id (there are over 7,000) and puts all of these trips together into a data frameiskandarblue
@iskandarblue from your description this should be very easy to vectorize. However, understanding you code is not so easy. Strip away all the fluff and only present the essential parts of your question. Often just doing this will allow you to answer your question. Also it is ideal to provide a toy data set within the question. I'm not going to go download a giant Citibank data set.Jacob H
Check out this question for making a reproducible example. Giving us the output of dput(head(raw_data)) would be very helpful, for a start. stackoverflow.com/questions/5963269/…C_Z_

1 Answers

0
votes

This solution using data.table takes a few minutes for me.

Uses the shift function and adds previous row's data to the current row by bikeid to the existing data.table.

Then we filter by !is.na(end.station.id)&(end.station.id!=start.station.id) followed by removing columns not needed and then setting the column order.

library(data.table)
bikedata<-fread('201511-citibike-tripdata.csv')
colnames(bikedata)<-make.names(colnames(bikedata))


bikedata[,c("end.station.id",
            "diff.time",
            "stoptime",
            "starttime") :=
           list(shift(end.station.id,1L,type="lag"),
                as.double(difftime(strptime(starttime, 
                                            "%m/%d/%Y %H:%M:%S"),
                                   strptime(shift(stoptime,1L,type="lag"), 
                                            "%m/%d/%Y %H:%M:%S")
                                   ,units = "mins")),
                as.character(shift(stoptime,1L,type="lag")),
                as.character(starttime)
           ),
         by=bikeid]

bikedatamoved<-bikedata[!is.na(end.station.id)&(end.station.id!=start.station.id)]

bikedatamoved[,
              setdiff(colnames(bikedatamoved),c("bikeid","end.station.id",
                                                "start.station.id",
                                                "diff.time",
                                                "stoptime",
                                                "starttime")):=NULL]

setcolorder(bikedatamoved, c("bikeid", 
                             "end.station.id", 
                             "start.station.id",        
                             "diff.time",           
                             "stoptime",          
                             "starttime"))