6
votes

Does anybody know the best R alternative to SAS's first. and last. operators? I couldn't find one.

SAS has the FIRST. and LAST. automatic variables, which identify the first and last record within a group of records that share the same value of a particular variable. For the following dataset, FIRST.model and LAST.model are defined as:

Model,SaleID,First.Model,Last.Model
Explorer,1,1,0
Explorer,2,0,0
Explorer,3,0,0
Explorer,4,0,1
Civic,5,1,0
Civic,6,0,0
Civic,7,0,1
5
I have no access to SAS - what is .first or .last doing? Can you add an example? - EDi
FIRST. and LAST. are not operators; they are automatic SAS data step variables defined to indicate column value changes during BY statement processing. - BellevueBob
I don't think so, but this link seems to have the answer: stat.ethz.ch/pipermail/r-help/2010-November/260997.html - agstudy
Since not many of us know SAS, if you can explain what you would like to do, it might get an answer more quickly. - Ricardo Saporta
There's probably a simple solution with diff(), too ... - Ben Bolker

5 Answers

9
votes

It sounds like you're looking for !duplicated, with the fromLast argument set to FALSE (for first) or TRUE (for last).

d <- datasets::Puromycin

d$state
# [1] treated   treated   treated   treated   treated   treated   treated  
# [8] treated   treated   treated   treated   treated   untreated untreated
#[15] untreated untreated untreated untreated untreated untreated untreated
#[22] untreated untreated
#Levels: treated untreated
!duplicated(d$state)
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[13]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
!duplicated(d$state,fromLast=TRUE)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
#[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

There are some caveats and edge-case behaviors to this function, which you can find out through the help files (?duplicated).
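
Applied to the data in the question, a minimal sketch might look like the following. The data frame name `sales` is made up for illustration. The sketch assumes the rows of each group are contiguous (as they would be after sorting for SAS BY processing), because duplicated() looks at the whole vector rather than at runs of equal values.

```r
# Hypothetical data frame rebuilt from the question's example
sales <- data.frame(
  Model  = c("Explorer", "Explorer", "Explorer", "Explorer",
             "Civic", "Civic", "Civic"),
  SaleID = 1:7
)

# 1/0 flags mimicking FIRST.Model and LAST.Model
sales$First.Model <- as.integer(!duplicated(sales$Model))
sales$Last.Model  <- as.integer(!duplicated(sales$Model, fromLast = TRUE))

sales
```

Dropping as.integer() leaves logical TRUE/FALSE flags, which are usually more convenient for subsetting in R.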

4
votes

Update (to read first)

If you really are interested only in the row indexes, perhaps some straightforward use of split and range would be of use. The following assumes that the rownames in your dataset are sequentially numbered, but adaptations would probably also be possible.

irisFirstLast <- sapply(split(iris, iris$Species), 
                        function(x) range(as.numeric(rownames(x))))
irisFirstLast              ## Just the indices
#      setosa versicolor virginica
# [1,]      1         51       101
# [2,]     50        100       150
iris[irisFirstLast[1, ], ] ## `1` would represent "first"
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1            5.1         3.5          1.4         0.2     setosa
# 51           7.0         3.2          4.7         1.4 versicolor
# 101          6.3         3.3          6.0         2.5  virginica
iris[irisFirstLast, ]      ## no row index on irisFirstLast: both first and last
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1            5.1         3.5          1.4         0.2     setosa
# 50           5.0         3.3          1.4         0.2     setosa
# 51           7.0         3.2          4.7         1.4 versicolor
# 100          5.7         2.8          4.1         1.3 versicolor
# 101          6.3         3.3          6.0         2.5  virginica
# 150          5.9         3.0          5.1         1.8  virginica

d <- datasets::Puromycin   
dFirstLast <- sapply(split(d, d$state), 
                     function(x) range(as.numeric(rownames(x))))
dFirstLast
#      treated untreated
# [1,]       1        13
# [2,]      12        23
d[dFirstLast[2, ], ]       ## `2` would represent `last`
#    conc rate     state
# 12  1.1  200   treated
# 23  1.1  160 untreated

If working with named rows, the general approach is the same, but you have to specify the range yourself. Here's the general pattern:

datasetFirstLast <- sapply(split(dataset, dataset$groupingvariable), 
                           function(x) c(rownames(x)[1], 
                                         rownames(x)[length(rownames(x))]))
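
As a concrete instance of that pattern, here is a sketch using mtcars, whose rows are named by car model. The object name `carsFirstLast` is made up for illustration, and nrow(x) stands in for length(rownames(x)).

```r
# mtcars has named rows (car models); group by number of cylinders
carsFirstLast <- sapply(split(mtcars, mtcars$cyl),
                        function(x) c(rownames(x)[1],
                                      rownames(x)[nrow(x)]))

carsFirstLast                  # row 1 = first, row 2 = last; one column per group
mtcars[carsFirstLast[1, ], ]   # first row of each cyl group
mtcars[carsFirstLast[2, ], ]   # last row of each cyl group
```

Because the result holds row names rather than positions, it keeps working after the data frame is reordered, as long as the row names stay unique.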

Initial answer (edited)

If you're interested in extracting the rows rather than needing the row number for other purposes, you can also explore data.table. Here are some examples:

library(data.table)
DT <- data.table(iris, key="Species")
DT[J(unique(Species)), mult = "first"]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa          5.1         3.5          1.4         0.2
# 2: versicolor          7.0         3.2          4.7         1.4
# 3:  virginica          6.3         3.3          6.0         2.5
DT[J(unique(Species)), mult = "last"]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa          5.0         3.3          1.4         0.2
# 2: versicolor          5.7         2.8          4.1         1.3
# 3:  virginica          5.9         3.0          5.1         1.8
DT[, .SD[c(1,.N)], by=Species]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa          5.1         3.5          1.4         0.2
# 2:     setosa          5.0         3.3          1.4         0.2
# 3: versicolor          7.0         3.2          4.7         1.4
# 4: versicolor          5.7         2.8          4.1         1.3
# 5:  virginica          6.3         3.3          6.0         2.5
# 6:  virginica          5.9         3.0          5.1         1.8

This last approach is pretty convenient. For instance, if you wanted the first three rows and last three rows of each group, you can use: DT[, .SD[c(1:3, (.N-2):.N)], by=Species] (Just for reference: .N represents the number of cases per group.)

Other useful approaches include:

DT[, tail(.SD, 2), by = Species] ## last two rows of each group
DT[, head(.SD, 4), by = Species] ## first four rows of each group
4
votes

The head and tail functions with n = 1, combined with by, are a good way to go. See R for SAS and SPSS Users (Robert Muenchen). Make a data frame of the by variables of interest; for example, for last:

dfby <- data.frame(df$var1, df$var2)
mylastList <- by(df, dfby, tail, n = 1)
# turn the list into a data frame
mylastDF <- do.call(rbind, mylastList)
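
For a runnable version of the same idea, here is a sketch on the data from the question, with a single by variable. The names `sales`, `lastDF` and `firstDF` are made up for illustration. Note that by() returns the groups in sorted order of the grouping values, not in order of appearance.

```r
# Hypothetical data frame rebuilt from the question's example
sales <- data.frame(
  Model  = c(rep("Explorer", 4), rep("Civic", 3)),
  SaleID = 1:7
)

# Last row of each Model group, collected into one data frame
lastList <- by(sales, sales$Model, tail, n = 1)
lastDF   <- do.call(rbind, lastList)

# Same idea with head() for the first row of each group
firstDF <- do.call(rbind, by(sales, sales$Model, head, n = 1))

lastDF
firstDF
```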
3
votes

Here is a dplyr solution:

# input
dataset <- structure(list(Model = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L
), .Label = c("Civic", "Explorer"), class = "factor"), SaleID = 1:7), .Names = c("Model", 
"SaleID"), class = "data.frame", row.names = c(NA, -7L))


# code 
library(dplyr)

dataset %>%
  group_by(Model) %>%
  mutate(
    First = row_number() == min(row_number()),
    Last  = row_number() == max(row_number())
  )

# output:

     Model SaleID First  Last
    <fctr>  <int> <lgl> <lgl>
1 Explorer      1  TRUE FALSE
2 Explorer      2 FALSE FALSE
3 Explorer      3 FALSE FALSE
4 Explorer      4 FALSE  TRUE
5    Civic      5  TRUE FALSE
6    Civic      6 FALSE FALSE
7    Civic      7 FALSE  TRUE

PS: If you don't have dplyr installed, run:

install.packages("dplyr")
1
votes

The function below is based on @Joe's description of First / Last.
The function returns a list of vectors.

Each list entry corresponds to one column of the data frame (i.e., a feature or variable of the data set).
Within a given list entry, there is one index per distinct value in that column, pointing to the first (or last) row where that value occurs.

EXAMPLE USAGE:

# Pass in your data frame, and indicate whether or not you want to find Last or find First. 
# Assign to the appropriate variable
first <- findFirstLast(myDF)
last  <- findFirstLast(myDF, findFirst=FALSE)

Example using data(iris)

data(iris)
first <- findFirstLast(iris)
last  <- findFirstLast(iris, findFirst=FALSE)

Which observation is first (or last) for each Species:

 first$Species
 #    setosa versicolor  virginica 
 #        1         51        101 

 last$Species
 #    setosa versicolor  virginica 
 #        50        100        150 

Grab the whole row for the first observation of each species:

iris[first$Species, ]
#      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#  1            5.1         3.5          1.4         0.2     setosa
#  51           7.0         3.2          4.7         1.4 versicolor
#  101          6.3         3.3          6.0         2.5  virginica




CODE FOR FUNCTION findFirstLast():

findFirstLast <- function(myDF, findFirst=TRUE) {
  # myDF should be a data frame or matrix

  # By default, this function finds the first occurrence of each unique value in a column.
  # To find the last occurrence instead, set findFirst to FALSE. This gives `maxOrMin` a value of -1;
  #   finding the min of the negated indices is the same as finding the max of the positive indices.
  maxOrMin <- if (findFirst) 1 else -1

  # For each column in myDF, collect its unique values (`levs`) and iterate over them,
  #   finding the min (or max) of the indices where each value appears within the column.
  # Note: apply() coerces myDF to a matrix first, and it returns a list only when the
  #   columns differ in their number of unique values (otherwise it simplifies to a matrix).
  apply(myDF, 2, function(colm) {
    levs <- unique(colm)
    sapply(levs, function(lev) {
      inds <- which(colm == lev)
      if (length(inds) == 0) NA else maxOrMin * min(inds * maxOrMin)
    })
  })
}
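
If only one grouping column matters, a shorter route to the same indices is tapply() over the row positions. This is a different technique from the function above, shown only as a sketch; the names `firstIdx` and `lastIdx` are made up for illustration.

```r
# Row positions 1..n, grouped by Species, reduced with min / max
firstIdx <- tapply(seq_len(nrow(iris)), iris$Species, min)
lastIdx  <- tapply(seq_len(nrow(iris)), iris$Species, max)

firstIdx           # first row index per species
lastIdx            # last row index per species
iris[firstIdx, ]   # whole rows for the first observation of each species
```

Unlike the apply() version, this works on the column as stored, so nothing is coerced to character first.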