I am trying to do some analysis on a dataset (homicide rates in Brazil). The data is simple, but I am still learning, so it is not so simple for me. After creating subsets grouping the data by year, state and region, I still can't work out how to combine these subsets into a bigger one (states by region). I would like to put all the regions into one larger 'subset' so that I can plot the information by region instead of by state. It's probably simple and silly, but I have spent a couple of hours googling and trying different code, and nothing has worked so far.

See plot example here

North <- subset(Homicides, State %in% c('AM', 'RR', 'AP', 'PA', 'TO', 'RO', 'AC'))
Northeast <- subset(Homicides, State %in% c('MA', 'PI', 'CE', 'RN', 'PE', 'PB', 'SE', 'AL', 'BA'))
Midwest <- subset(Homicides, State %in% c('MT', 'MS', 'GO', 'DF'))
Southeast <- subset(Homicides, State %in% c('SP', 'RJ', 'ES', 'MG'))
South <- subset(Homicides, State %in% c('PR', 'RS', 'SC'))
AllRegions <- # How to group them so I can plot correctly?

And for the plot code:

ggplot(Homicides, aes(x = Year, y = TotalRate, group = State, color = State)) + # Where state should be the regions instead
  geom_line() +
  geom_point(size = 1) + 
  ggtitle("Total Homicides") +
  theme_hc() +
  scale_colour_hc()

What the dataset looks like (for reference):

    State Year TotalRate FirearmsRate
1      AC 1979        34           13
2      AC 1980        26           12
3      AC 1981        28            8
4      AC 1982        41           18
5      AC 1983        33           12
6      AC 1984        36           13
Welcome to SO. Help us help you by posting the output of dput(Homicides) or, if the output is really long, dput(head(Homicides)). I think you can solve this by creating a new column with some combination of mutate and case_when. - Ben G
So you want x plots, where x is the number of regions, and each plot includes only the states in that region? If so, I think ggplot2::facet_wrap might help here. If instead you only want to plot the regions in one graph, I think you should first summarise the data by region; then you can plot some summary statistic by region. - Giovanni Colitti
@GiovanniColitti Basically I want one single plot with info by region instead of by state. For example, the South region consists of the states RS, SC and PR. If I plot all the states, it's hard to read: too much information... - Xamineh
@BenG Result of dput(head(Homicides)) below:
> dput(head(Homicides))
structure(list(State = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("AC", "AL", "AM", "AP", "BA", "CE", "DF", "ES", "GO", "MA", "MG", "MS", "MT", "PA", "PB", "PE", "PI", "PR", "RJ", "RN", "RO", "RR", "RS", "SC", "SE", "SP", "TO"), class = "factor"),
    Year = 1979:1984, TotalRate = c(34L, 26L, 28L, 41L, 33L, 36L),
    FirearmsRate = c(13L, 12L, 8L, 18L, 12L, 13L)), row.names = c(NA, 6L), class = "data.frame")
- Xamineh

1 Answer

As @Giovanni noted, summarizing the data by region is key here. This is what I would consider a quintessential problem to solve with the dplyr package, part of the tidyverse.

The sample data you provided didn't have all the components needed to answer the question, so I changed it to include more than one state and to cover the same years in each region:

df <- read.table(header = TRUE, text ="
State Year TotalRate FirearmsRate
AC    1979        34           13
AM    1980        26           12
MA    1979        28            8
PI    1980        41           18
MT    1979        33           12
MS    1980        36           13")

I then applied your criteria on a new column:

library(dplyr)

df_summary <- df %>%
  # Label each state with its region
  mutate(region = case_when(
    State %in% c("AM", "AC") ~ "North",
    State %in% c("MA", "PI") ~ "Northeast",
    State %in% c("MT", "MS") ~ "Midwest")) %>%
  # Average the state rates within each region and year
  group_by(region, Year) %>%
  summarize(average_rate = mean(TotalRate))

# which looks like this

# A tibble: 6 x 3
# Groups:   region [3]
  region     Year average_rate
  <chr>     <int>        <dbl>
1 Midwest    1979           33
2 Midwest    1980           36
3 North      1979           34
4 North      1980           26
5 Northeast  1979           28
6 Northeast  1980           41

You can then use this in your ggplot (I didn't use your theme elements because I'm not sure where they came from):

library(ggplot2)

ggplot(df_summary, aes(x = Year, y = average_rate, color = region)) +
  geom_line() +
  geom_point(size = 1) +
  ggtitle("Total Homicides")

graph
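
To extend the same idea to all 27 states without a long case_when, one option is a lookup vector that maps each state to its region. The following is only a minimal base-R sketch: the groupings are copied from the subset() calls in the question, the names regions, state_to_region, homicides_demo and by_region are hypothetical, and the demo values are made up for illustration rather than taken from the real Homicides data.

```r
# Assumption: state-to-region groupings copied from the question's subset() calls.
regions <- list(
  North     = c("AM", "RR", "AP", "PA", "TO", "RO", "AC"),
  Northeast = c("MA", "PI", "CE", "RN", "PE", "PB", "SE", "AL", "BA"),
  Midwest   = c("MT", "MS", "GO", "DF"),
  Southeast = c("SP", "RJ", "ES", "MG"),
  South     = c("PR", "RS", "SC")
)

# Flatten into a named character vector: names are states, values are regions.
state_to_region <- setNames(
  rep(names(regions), lengths(regions)),
  unlist(regions, use.names = FALSE)
)

# Toy stand-in for Homicides (illustrative values, not real data).
homicides_demo <- data.frame(
  State     = c("AC", "AM", "AC", "AM", "RS", "SC", "RS", "SC"),
  Year      = c(1979, 1979, 1980, 1980, 1979, 1979, 1980, 1980),
  TotalRate = c(34, 30, 26, 22, 10, 14, 12, 16)
)

# as.character() guards against State being a factor, as in the dput() output.
homicides_demo$Region <-
  unname(state_to_region[as.character(homicides_demo$State)])

# Unweighted mean of the state rates for each region and year.
by_region <- aggregate(TotalRate ~ Region + Year,
                       data = homicides_demo, FUN = mean)
```

The resulting by_region data frame has one row per region and year, so it can be passed to ggplot with color = Region in place of State. Note that a plain mean treats every state equally regardless of population; a population-weighted regional rate would need population counts, which the dataset shown does not include.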