This is an interesting question, especially when differentiating between the technical question of how to create a new dataframe from weighted data for visualization and the more general issue of how weighted data should be represented graphically in the first place.
I understand that you have already settled the latter question by deciding to produce a new dataframe from your weighted data. In the weighted data, each row represents an actual observation, for example a participant who actually took part in the survey. A new "unweighted" dataframe would need to contain additional rows that duplicate each participant's answers in proportion to their weight.
Before generating some data and looking at some examples, it should be noted that the graphical representation of weighted survey data is not an issue as long as it is based on aggregated measures / items. For example, a bar chart showing the average (mean) of some measure X can be drawn in pretty much the same way as for unweighted data. The only difference is that the aggregated values are calculated differently (in R we would use weighted.mean() rather than the plain mean() function).
So the problem of generating a new dataframe for visualizing weighted data only arises for certain types of plots. A scatterplot comes quickly to mind, since it shows every individual data point.
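As a minimal sketch of that difference, using a small made-up data frame (the name `toy` and its values are purely illustrative), the two aggregates can be compared with base R:

```r
# Toy survey data: three respondents; values are made up for illustration
toy <- data.frame(
  score  = c(20, 50, 80),
  weight = c(0.5, 1.0, 2.5)
)

# Unweighted mean: every respondent counts equally
unweighted <- mean(toy$score)

# Weighted mean: sum(score * weight) / sum(weight)
weighted <- weighted.mean(toy$score, w = toy$weight)

unweighted  # 50
weighted    # 65
```

Either value could then be passed to a bar chart in exactly the same way; only the aggregation step changes.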
So let’s think about how a scatterplot could be drawn for weighted data.
First let’s generate some data.
    library(dplyr)
    library(tidyr)
    library(ggplot2)

    set.seed(7)

    # Draw n values from a normal distribution truncated to [a, b]
    # (inverse-CDF sampling); used here to randomly generate weights
    rtnorm <- function(n, mean, sd, a = -Inf, b = Inf) {
      qnorm(runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd)), mean, sd)
    }

    n <- 1000
    data <- data.frame(
      age    = round(runif(n, min = 18, max = 80), 0),
      score  = round(runif(n, min = 0, max = 100), 0),
      weight = round(rtnorm(n = n, mean = 1, sd = 1, a = 0.05, b = 10), 1)
    )

The data describe participants aged 18 to 80 who answered a measurement item on a 0 to 100 scale (which I will just call 'score'). Each participant is more or less representative of the underlying population, as indicated by their 'weight'.
Now let’s look at the data without considering the weights:
    data %>%
      ggplot(aes(x = age, y = score)) +
      geom_point()

The scatterplot shows all data points, where each point represents a participant who took part in the survey. However, the scatterplot does not reveal what the data would look like in the underlying population.
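We could now mutate the dataframe by generating more rows for participants who have a larger weight. Dividing by the smallest weight turns the weights into copy counts; they are rounded because uncount() expects whole numbers and floating-point division can leave values such as 2.9999999999999996.

    data %>%
      # scale weights so the smallest becomes 1, then round to whole counts
      mutate(weight2 = round(weight / min(weight))) %>%
      uncount(weight2) %>%
      ggplot(aes(x = age, y = score)) +
      geom_point()
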
However, this scatterplot looks exactly the same as the first. Thinking about it, this makes sense: although more observations have been generated, they sit at exactly the same x and y coordinates and are simply drawn on top of each other.
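This can be checked without plotting. The following is a small sketch on a hypothetical three-row data frame (the name `toy` and its values are made up): expanding the rows increases the row count, but the number of distinct (age, score) positions stays the same.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three respondents with different weights
toy <- data.frame(
  age    = c(25, 40, 63),
  score  = c(10, 55, 90),
  weight = c(0.1, 0.3, 1.2)
)

# Expand each row in proportion to its weight (whole-number counts)
expanded <- toy %>%
  mutate(n_copies = as.integer(round(weight / min(weight)))) %>%
  uncount(n_copies)

nrow(toy)                             # 3 rows before expansion
nrow(expanded)                        # 1 + 3 + 12 = 16 rows after expansion
nrow(distinct(expanded, age, score))  # still 3 distinct plot positions
```

Since a fully opaque point hides anything beneath it, the extra rows have no visible effect unless transparency, jitter, or a similar device is used.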
To get a better feeling for which points are more representative of the underlying population, it is advisable to treat the weight as another dimension of our plot. We could simply add the variable to the aes() call of our ggplot, without creating a new dataframe based on weight.

    data %>%
      ggplot(aes(x = age, y = score, alpha = weight)) +
      geom_point()

Instead of treating weight as another dimension of our plot, we could actually transform our dataframe according to weight and use a fixed ggplot alpha to make the overlapping points visible. As before, the scaled weights are rounded because uncount() expects whole-number counts.

    data %>%
      mutate(weight2 = round(weight / min(weight))) %>%
      uncount(weight2) %>%
      ggplot(aes(x = age, y = score)) +
      geom_point(alpha = 0.035)

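As you can see, the result is essentially the same as the plot with alpha mapped to weight: participants with larger weights appear as darker points.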
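Why the two plots agree closely, but not pixel for pixel, can be sketched with a little arithmetic. Assuming standard alpha compositing (an assumption about how the graphics device blends points, not something taken from the original example), k identical points drawn on top of each other with the same alpha produce a combined opacity of 1 - (1 - alpha)^k. The helper name `stacked_opacity` is made up for this illustration.

```r
# Combined opacity of k overlapping points, each drawn with the same alpha
# (assumes standard "over" compositing of identically coloured points)
stacked_opacity <- function(k, alpha = 0.035) {
  1 - (1 - alpha)^k
}

# With a smallest weight of 0.1, weights of 0.1, 1, 5 and 10
# become 1, 10, 50 and 100 copies after expansion
copies <- c(1, 10, 50, 100)
round(stacked_opacity(copies), 3)
```

The stacked opacity rises with weight but levels off for large weights, whereas mapping alpha to weight rescales it linearly (ggplot2's default alpha range is 0.1 to 1). Both plots therefore rank the points identically, but the shading is not numerically the same.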
I don’t know which plot type your visualization team is trying to make, but I would recommend treating weight as a further dimension of your plot rather than creating new data.
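Alpha is not the only aesthetic that can carry the weight. As a hedged sketch of the same idea, weight could also be mapped to point size; the toy data frame and the choice of scale_size_area() with max_size = 4 are assumptions for illustration, not part of the example above.

```r
library(ggplot2)

set.seed(7)

# Hypothetical toy data standing in for the survey data frame
toy <- data.frame(
  age    = round(runif(200, min = 18, max = 80)),
  score  = round(runif(200, min = 0, max = 100)),
  weight = round(runif(200, min = 0.1, max = 5), 1)
)

# Map weight to point area; a fixed alpha keeps overlapping points readable
p <- ggplot(toy, aes(x = age, y = score, size = weight)) +
  geom_point(alpha = 0.4) +
  scale_size_area(max_size = 4)

p
```

Whether size or alpha reads better depends on how dense the plot is; with many overlapping points, alpha tends to stay more legible.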