
I have several datasets for which I calculate binned, normalized differences. I want to plot the results in a single line plot using ggplot. The lines, each representing a different pairing of datasets, should be distinguished by color and line type.

I am stuck on taking the computed values from the bins (would be y-axis values now), and plotting these onto an x-axis.

Below is the code I use for importing the data and calculating the normalized differences.

library(data.table)  # provides fread()

# Read column 3 of each file as a data frame; the files have different numbers of rows
# For testing, you could generate stand-in data with replicate() instead:
# dat1 <- data.frame(replicate(1,sample(25:50,10000,rep=TRUE)))
# dat2 <- data.frame(replicate(1,sample(25:50,9500,rep=TRUE)))
dat1 <- fread("/dir01/a/dat01.txt", header = FALSE, data.table=FALSE, select=c(3))
dat2 <- fread("/dir02/c/dat02.txt", header = FALSE, data.table=FALSE, select=c(3))

# Change column names
colnames(dat1) <- c("Dat1")
colnames(dat2) <- c("Dat2")

# Perhaps there is a better way to compute the following as all-in-one? I have broken these down step by step.
# 1) Sum for each bin
bin1 = cut(dat1$Dat1, breaks = seq(25, 50, by = 2))
sum1 = tapply(dat1$Dat1, bin1, sum)

bin2 = cut(dat2$Dat2, breaks = seq(25, 50, by = 2))
sum2 = tapply(dat2$Dat2, bin2, sum)

# 2) Total sum of all bins
sumt1 = sum(sum1)
sumt2 = sum(sum2)

# 3) Divide each bin by total sum of all bins
sumn1 = lapply(sum1, `/`, sumt1)
sumn2 = lapply(sum2, `/`, sumt2)

# 4) Convert to data frame as I'm not sure how to difference otherwise
df_sumn1 = data.frame(sumn1)
df_sumn2 = data.frame(sumn2)

# 5) Difference between the two as percentage
dbin = (df_sumn1 - df_sumn2)*100

How can I plot those results using ggplot() and geom_line()? I want

  • dbin values on the x-axis ranging from 25-50
  • different colors and line types for the lines

Here is what I tried:

p1 <- ggplot(dbin, aes(x = ?, color=Data, linetype=Data)) +
            geom_line() +
            scale_linetype_manual(values=c("solid")) +
            scale_x_continuous(limits = c(25, 50)) +
            scale_color_manual(values = c("#000000"))

dput(dbin) outputs:

structure(list(X.25.27. = -0.0729132928804117, X.27.29. = -0.119044772581772,
    X.29.31. = 0.316016473225017, X.31.33. = -0.292812782147632,
    X.33.35. = 0.0776336591308158, X.35.37. = 0.0205584754637611,
    X.37.39. = -0.300768421159599, X.39.41. = -0.403235174844081,
    X.41.43. = 0.392510458816457, X.43.45. = 0.686758883448307,
    X.45.47. = -0.25387105113263, X.47.49. = -0.0508324553382303), class = "data.frame", row.names = c(NA,
-1L))

Edit

The final code that works. It uses only the dbin values and plots multiple dbins:

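library(tidyverse)
library(reshape2)  # for melt(); assumed, the post does not say which package supplied it

dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100)))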
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100)))
dat3 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 12:37/100)))
dat4 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 37:12/100)))

calc_bin_props <- function(data) {
  as_tibble(data) %>%
    mutate(bin = cut(a, breaks = seq(25, 50, by = 2))) %>%
    group_by(bin) %>%
    summarise(sum = sum(a), .groups = "drop") %>%
    filter(!is.na(bin)) %>%
    ungroup() %>%
    mutate(sum = sum / sum(sum))
}

diff_data <-
  full_join(
    calc_bin_props(data = dat1),
    calc_bin_props(dat2),
    by = "bin") %>%
  separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
  mutate(dbinA = (sum.x - sum.y) * 100) %>%
  select(-starts_with("trsh"))

diff_data2 <-
  full_join(
    calc_bin_props(data = dat3),
    calc_bin_props(dat4),
    by = "bin") %>%
  separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
  mutate(dbinB = (sum.x - sum.y) * 100) %>%
  select(-starts_with("trsh"))

# Combine two differences, and remove sum.x and sum.y
full_data <- cbind(diff_data, diff_data2[,4])
full_data <- full_data[,-c(2:3)]

# Melt the data to plot more than 1 variable on a plot
m <- melt(full_data, id.vars="bin")

theme_update(plot.title = element_text(hjust = 0.5))
ggplot(m, aes(as.numeric(bin), value, col=variable, linetype = variable)) +
  geom_line() +
  scale_linetype_manual(values=c("solid", "longdash")) +
  scale_color_manual(values = c("black", "black"))
dev.off()
Can you please provide the data for either dat1 and dat2, or dbin? Run dput(dbin) or, if that object is very big, dput(head(dbin)), and paste the output into an additional code snippet in your question. That will allow readers of your question to provide a tested solution for your issue. - Till
@Till I added the dbin output for a couple of replicated datasets. - user2030765

1 Answer

library(tidyverse)

Creating example data as shown in the question, but passing different probabilities to the two sample() calls, to create a visible difference between the two sets of randomized data.

dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100))) %>% as_tibble()
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100))) %>% as_tibble()

Using dplyr we can handle this within data.frames (tibbles) without the need to switch to other datatypes.

Let’s define a function that can be applied to both datasets to get the preprocessing done.

We use base::cut() to create a new column that pairs each value with its bin. We then group the data by bin, calculate the sum for each bin and finally divide the bin sums by the total sum.

calc_bin_props <- function(data) {
  as_tibble(data) %>%
    mutate(bin = cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))) %>%
    group_by(bin) %>%
    summarise(sum = sum(a), .groups = "drop") %>%
    filter(!is.na(bin)) %>% 
    ungroup() %>%
    mutate(sum = sum / sum(sum))
}
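The filter(!is.na(bin)) step matters because of how cut() treats the interval edges. By default the intervals are open on the left and closed on the right, so with these breaks a value of exactly 25 gets no bin, and neither does 50, because the last break is 49. A minimal sketch with made-up toy values (not taken from the question's data):

```r
# Toy values chosen for illustration only
x <- c(25, 26, 27, 28, 49, 50)
breaks <- seq(25, 50, by = 2)   # 25, 27, ..., 49

# Default intervals: (25,27], (27,29], ..., (47,49]
bins_default <- cut(x, breaks = breaks)
print(data.frame(x, bin = bins_default))
# 25 and 50 fall outside every interval and become NA

# include.lowest = TRUE closes the first interval, [25,27], so 25 is kept
bins_closed <- cut(x, breaks = breaks, include.lowest = TRUE)
print(data.frame(x, bin = bins_closed))
```

Whether 25 should be counted is a choice about the data; if it should, include.lowest = TRUE could be added to the cut() call in calc_bin_props().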

Now we call calc_bin_props() on both datasets and join the results by bin. This gives us a data frame with the columns bin, sum.x and sum.y. The latter two correspond to the normalized bin sums derived from dat1 and dat2. In the mutate() call we calculate the difference between the two columns and convert bin from a factor to a number.

diff_data <- 
  full_join(
    calc_bin_props(data = dat1),
    calc_bin_props(dat2),
    by = "bin") %>% 
  mutate(dbin = (sum.x - sum.y),
         bin = as.numeric(as.character(bin)))

Before we feed the data into ggplot() we convert it to the long format using pivot_longer(). This allows us to instruct ggplot() to plot the results for sum.x, sum.y and dbin as separate lines.

diff_data %>% 
  pivot_longer(-bin) %>% 
  ggplot(aes(as.numeric(bin), value, color = name, linetype = name)) +
  geom_line() +
  scale_linetype_manual(values=c("longdash", "solid", "solid")) +
  scale_color_manual(values = c("black", "purple", "green"))
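If several pairs of datasets need to be compared, as in the edit to the question, one option is to build each difference in long format and stack them with bind_rows(), which avoids the cbind() and melt() steps. The following is only a sketch under some assumptions: calc_diff() is a hypothetical helper, the labels dbinA and dbinB and the sample data follow the question's edit, and set.seed() is there only to make the illustration reproducible.

```r
library(tidyverse)

set.seed(1)  # only for a reproducible illustration
dat1 <- tibble(a = sample(25:50, 10000, replace = TRUE, prob = 25:0 / 100))
dat2 <- tibble(a = sample(25:50, 9500, replace = TRUE, prob = 0:25 / 100))
dat3 <- tibble(a = sample(25:50, 9500, replace = TRUE, prob = 12:37 / 100))
dat4 <- tibble(a = sample(25:50, 9500, replace = TRUE, prob = 37:12 / 100))

# Same preprocessing as above: bin, sum per bin, divide by the total sum
calc_bin_props <- function(data) {
  data %>%
    mutate(bin = cut(a, breaks = seq(25, 50, by = 2),
                     labels = seq(25, 48, by = 2))) %>%
    filter(!is.na(bin)) %>%
    group_by(bin) %>%
    summarise(sum = sum(a), .groups = "drop") %>%
    mutate(sum = sum / sum(sum))
}

# Hypothetical helper: percentage difference of the bin proportions of x and y
calc_diff <- function(x, y) {
  full_join(calc_bin_props(x), calc_bin_props(y), by = "bin") %>%
    mutate(bin = as.numeric(as.character(bin)),
           dbin = (sum.x - sum.y) * 100) %>%
    select(bin, dbin)
}

# Stack the differences; the argument names end up in the "pair" column
all_diffs <- bind_rows(
  dbinA = calc_diff(dat1, dat2),
  dbinB = calc_diff(dat3, dat4),
  .id = "pair"
)

p <- ggplot(all_diffs, aes(bin, dbin, color = pair, linetype = pair)) +
  geom_line() +
  scale_linetype_manual(values = c("solid", "longdash")) +
  scale_color_manual(values = c("black", "black"))
print(p)
```

Adding a further pair then only needs one more named argument in bind_rows() and one more value in each manual scale.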