Making Sense of Time Series Data with > 43,000 observations

Question

Updated Post

After a lot of work, I have finally merged three different datasets. The result is a time series data frame with 43,396 observations of seven variables. Below, I have included a few rows of what my data looks like.

     Dyad  year  cyberattack  cybersev  MID  MIDsev  peace score
     2360  2005  NA           NA        0    1       0
     2360  2006  NA           NA        NA   NA      0
     2360  2007  1            3.0       0    1       0
     2360  2008  1            4.0       0    1       0
     2360  2009  3            3.33      1    4       0
     2360  2010  1            3.0       NA   NA      0
     2360  2011  3            2.0       NA   NA      0
     2360  2012  1            2.0       NA   NA      0
     2360  2013  4            2.0       NA   NA      0

If I am interested in comparing how different country pairs (dyads) differ in how often they launch attacks (either in cyberspace, physically with MIDs, or neither)...how should I go about doing this?

Since I am working with country/year data, how can I get descriptive statistics for the different countries (Dyads) in my Dyad variable? For example, I would like to know how the behavior of Dyad 2360 (USA and Iran) compares with other countries.

I tried this code, but it just gave me a list of my unique dyad pairs:

    table(final$Dyadpair) 
    names(sort(-table(final$Dyadpair)))

You mentioned using aggregate or dplyr, but I don't see how those will allow me to get descriptive statistics for all of my unique dyads. Would you mind elaborating on this?

Is it possible for code to return something like this: for Dyad 2360 during the years 2005-2013, 80% were NA, 10% were cyber attacks, 10% were MID attacks, etc.?

Update to clarify:

Ok, yes - the above example was just hypothetical. Based on the nine rows of data that I have provided - here is what I am hoping R can provide when it comes to descriptive statistics.

Dyad: 2360

No attacks: 22.22% (2/9) ….in 2005 and 2006

Cyber attacks: 77.78% (7/9) ….in the years 2007-2013

MID attacks: 11.11% (1/9) ….in 2009

Both cyber and MID: 11.11% (1/9) ….in 2009

Essentially, during a given time range (2005-2013 for the example I gave above), how many of those years result in NO attacks, how many of those years result in a cyber attack, how many of those years result in a MID attack, and how many of those years result in both a cyber and MID attack.

I do not know if this is possible with how my data is set up, since I aggregated cyber attacks and MID attacks per year. And yes, I would also like to take into consideration the severity of the attacks (both cyber attacks and MID attacks), but I don't know how to do that.

Does this help clarify what I am looking for?

When you merge, you can set the argument all = TRUE to keep all records. For the rest, "how to make sense of my data so that it comes across in a paper and presentation" is far too broad. Stack Overflow is for specific, answerable, programming questions---that is a general, open-ended question about data analysis and communication. — Gregor Thomas
Some general advice - you've identified some weaknesses, like inconsistent use of NA, different rating scales, etc. Whether and how much those will cause problems will depend on how you analyze them, but consistency is good and will generally make things better. I would advise (a) using NA consistently for missing values, rather than for 0s, (b) using consistent scales--1 makes sense to me as a non-severe attack, 0 as no attack, and NA as "we don't know". Transforming your data to do (a) and (b) is probably a good idea. And do so before you aggregate and take averages. — Gregor Thomas
As to getting descriptive statistics for unique dyads, in base R aggregate, which you're already using, is a good tool for that. You'll have to define what you mean exactly by "the percentage of the time they launch cyber attacks" - maybe you mean the percentage of all attacks that are cyber attacks, or maybe you mean the percentage of years with attacks that include cyber attacks, or maybe you mean something else. While aggregate is good in base R, you may find dplyr more powerful, here's a nice introduction. — Gregor Thomas
Hello, @Gregor. Thank you for your feedback. I have updated my post with a more specific question. — newtoR
@Gregor Also, adding the "all = TRUE" to my merge code, worked. Thank you. I would most appreciate if you could take a look at my updated post. — newtoR
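
As a minimal sketch of the `all = TRUE` suggestion from the comments above: the two toy data frames below are hypothetical stand-ins for the datasets being merged (their names, `cyber` and `mid`, are made up, and the values are borrowed from the sample rows in the question). The sketch only shows how the default merge drops unmatched Dyad/year pairs while `all = TRUE` keeps them and fills the gaps with NA.

```r
# Hypothetical stand-ins for two of the source datasets
cyber <- data.frame(Dyad = c(2360, 2360), year = c(2007, 2008),
                    cyberattack = c(1, 1))
mid <- data.frame(Dyad = c(2360, 2360), year = c(2005, 2007),
                  MID = c(0, 0))

# Default merge: keeps only Dyad/year pairs present in both inputs
inner <- merge(cyber, mid, by = c("Dyad", "year"))

# all = TRUE: keeps every Dyad/year pair from either input and
# fills the columns that have no match with NA
full <- merge(cyber, mid, by = c("Dyad", "year"), all = TRUE)

nrow(inner)
nrow(full)
```

Here only 2007 appears in both inputs, so the default merge keeps one row, while the full merge keeps 2005, 2007, and 2008.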

Gregor Thomas · Accepted Answer · 2019-10-08T01:50:49

Here's a dplyr approach with my best guess for what you want. It will output a data frame with one row per dyad and the same summary statistics for each dyad.

    library(dplyr)

    # One row per dyad; mean() of a logical vector gives the share of years
    your_data %>%
      group_by(Dyad) %>%
      summarize(
        year_range = paste(min(year), max(year), sep = "-"),
        # years with no recorded cyber attack and no MID (MID is NA or 0)
        no_attacks = mean(is.na(cyberattack) & (is.na(MID) | MID == 0)),
        cyber_attacks = mean(!is.na(cyberattack)),
        MID_attacks = mean(!is.na(MID) & MID > 0),
        cyber_and_MID = mean(!is.na(cyberattack) & (!is.na(MID) & MID > 0)),
        # yearly cyber attack count, weighted by that year's severity
        cyber_sev_weighted = weighted.mean(cyberattack, w = cybersev, na.rm = TRUE)
      )

    # # A tibble: 1 x 7
    #    Dyad year_range no_attacks cyber_attacks MID_attacks cyber_and_MID cyber_sev_weighted
    #   <int> <chr>           <dbl>         <dbl>       <dbl>         <dbl>              <dbl>
    # 1  2360 2005-2013       0.222         0.778       0.111         0.111               1.86

Using this data:

    your_data = read.table(text = 'Dyad  year  cyberattack  cybersev MID   MIDsev   peace_score
         2360  2005    NA          NA       0      1          0
         2360  2006    NA          NA       NA     NA         0
         2360  2007    1           3.0      0      1          0
         2360  2008    1           4.0      0      1          0
         2360  2009    3           3.33     1      4          0
         2360  2010    1           3.0      NA     NA         0
         2360  2011    3           2.0      NA     NA         0
         2360  2012    1           2.0      NA     NA         0
         2360  2013    4           2.0      NA     NA         0', header = TRUE)
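
For anyone who would rather stay in base R, the same shares can be sketched with `aggregate`, which the comments also mention. This is only a sketch under the same definitions of an attack year as the dplyr code above (a cyber attack year is a non-NA `cyberattack`, a MID year is `MID > 0`). The names `flags`, `fmt`, and `report` are made up for illustration, and the output mimics the "22.22% (2/9)" layout asked for in the question.

```r
your_data <- read.table(text = "Dyad year cyberattack cybersev MID MIDsev peace_score
2360 2005 NA NA   0  1  0
2360 2006 NA NA   NA NA 0
2360 2007 1  3.0  0  1  0
2360 2008 1  4.0  0  1  0
2360 2009 3  3.33 1  4  0
2360 2010 1  3.0  NA NA 0
2360 2011 3  2.0  NA NA 0
2360 2012 1  2.0  NA NA 0
2360 2013 4  2.0  NA NA 0", header = TRUE)

# Yearly TRUE/FALSE indicators (assumed definitions, see lead-in)
has_cyber <- !is.na(your_data$cyberattack)
has_mid <- !is.na(your_data$MID) & your_data$MID > 0

flags <- data.frame(
  Dyad = your_data$Dyad,
  no_attacks = !has_cyber & !has_mid,
  cyber_attacks = has_cyber,
  MID_attacks = has_mid,
  cyber_and_MID = has_cyber & has_mid
)

# Format a logical vector as "share% (count/total)"
fmt <- function(x) sprintf("%.2f%% (%d/%d)", 100 * mean(x), sum(x), length(x))

# Apply the formatter to every indicator column, one row per dyad
report <- aggregate(. ~ Dyad, data = flags, FUN = fmt)
report
```

Because the indicator columns contain no NA values, the default `na.action` of the formula method does not drop any years here. With the full data, `report` would have one row per dyad.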
