0
votes

I have a (directed) dyadic dataset that looks something like this (see below). What I want to do now is to keep just one observation per year. So in this case only one observation for 1992 (AFG 1992) and one in 1993 (AFG 1993), while deleting other observations. It doesn't matter which observation from the same year I keep in the data (not interested in country2).

 country1   country2    year    X   X1
Afghanistan Colombia    1992    1   0.44
Afghanistan Venezuela   1992    1   0.45
Afghanistan Peru        1992    1   0.46
Afghanistan Brazil      1992    1   0.47
Afghanistan Bolivia     1992    1   0.48
Afghanistan Chile       1992    1   0.49
Afghanistan Argentina   1992    1   0.50
Afghanistan Uruguay     1993    0   0.51
Afghanistan USA         1993    0   0.52
Afghanistan Canada      1993    0   0.53
Afghanistan UK          1993    0   0.54
Afghanistan Netherlands 1993    0   0.55
Afghanistan Belgium     1993    0   0.56
Afghanistan Luxembourg  1993    0   0.57
Afghanistan France      1993    0   0.58

My try:

newdata<- data %>% 
  group_by(country1,year) %>%
  summarise() %>%
  select(unique.x=country1, unique.y=year)

This works BUT how do I keep all other variables from "data" in the "newdata"? I can't think of any way of doing this (which I find more practical). Any help?

Desired outcome

    country1     year   X
    Afghanistan 1991   1
    Afghanistan 1992   0

dput(data) structure(list(country1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Afghanistan", class = "factor"), country2 = structure(c(8L, 33L, 24L, 5L, 4L, 7L, 1L, 32L, 31L, 6L, 30L, 21L, 3L, 19L, 14L, 29L, 27L, 26L, 15L, 25L, 2L, 17L, 10L, 18L, 13L, 28L, 23L, 11L, 9L, 16L, 12L, 20L, 22L), .Label = c("Argentina", "Austria", "Belgium", "Bolivia, Plurinational State of", "Brazil", "Canada", "Chile", "Colombia", "Cuba", "Czech Republic", "Denmark", "Dominican Republic", "Finland", "France", "Germany", "Guinea-Bissau", "Hungary", "Italy", "Luxembourg", "Mauritania", "Netherlands", "Niger", "Norway", "Peru", "Poland", "Portugal", "Spain", "Sweden", "Switzerland", "United Kingdom", "United States", "Uruguay", "Venezuela, Bolivarian Republic of"), class = "factor"), year = c(1992L, 1992L, 1992L, 1992L, 1992L, 1992L, 1992L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L), X = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), X1 = c(0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76)), .Names = c("country1", "country2", "year", "X", "X1"), class = "data.frame", row.names = c(NA, -33L))

3
I think you can just do unique(data) in your case - David Arenburg
it doesn't work, I've tried it - FKG
It works on your example data. - David Arenburg
and I'm not sure why. Perhaps because I have country1 and country2 variables in my (dyadic) data. - FKG
If you want one observation per year you could do library(data.table) ; unique(setDT(data), by = "year"). Or just use duplicated from base R Or distinct(data, year, .keep_all = TRUE) from tibbliverse - David Arenburg

3 Answers

1
votes

newdata <- olddata[!duplicated(olddata$year),]

Answers the question

newdata <- olddata[!duplicated(paste(olddata$country1, olddata$year)),]

Gives you what you want

0
votes

I don't truly understand your question, but to get your desired output you can use:

data %>% 
  group_by(country1, year) %>%
  summarise(X = mean(X))

When you apply this to your entire data.frame, bear in mind this code will return the mean of all values in X for unique combinations of country1 and year.

0
votes

you can try:

data %>%
    group_by(year) %>%
    top_n(1) %>%
    select(country1, X)