Create dataframe from weighted data for data visualization

Question

I have created a data frame with a post-stratification survey weight variable. My visualization team needs a data frame where each row is an observation. They want to visualize data with tools other than R.

Of course, my unweighted data has one observation per row. But for any given variable, the unweighted data has a different answer distribution than the weighted data.

I looked into the 'synthpop' package, but it doesn't look like it does what I need. I can't figure out if there's another package that could help or what other code to use. I can manually calculate some of these, but it's a pain and not easily reproducible.

Don't currently have any

The expected data frame would have one observation per row, but each variable would have the distribution I'd get from running a weighted univariate analysis using the survey weight.

I'd like to try to help, but this is not enough information. Do you have any code? Any example data or desired results? — Jon Spring
Hi Jon, I can't share data, but it'd look pretty much like any weighted survey data. A bunch of columns/variables one of which is a survey weight that tells the function how much weight to give each observation. Desired results is a dataframe where I don't have to rely on the survey weight. So, if I ran a table function on the dataframe I'd get the same results as a weighted table from the weighted data frame. — Brian Head
You're more likely to get help sooner if you can put something down that others can run and manipulate. Prose alone is imprecise and often leads to misunderstandings. Doesn't need to be real data, just functionally similar. See here: stackoverflow.com/questions/5963269/… — Jon Spring

TimTeaFan · Accepted Answer · 2019-09-24T20:07:25

This is an interesting question, especially when differentiating between the technical question of how to create a new dataframe from weighted data for visualization and the rather general issue of how weighted data should be graphically represented (in the first place).

I understand that you have already settled the latter question by deciding to produce a new dataframe from your weighted data. In the weighted data, each row represents an actual observation, for example a participant who took part in the survey. A new "unweighted" dataframe would need to repeat rows, so that each participant's answers appear a number of times proportional to that participant's weight.
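As a quick aside, here is a minimal sketch of that idea in base R with made-up values (the `survey` data frame and its columns are hypothetical, not your data). It assumes every weight is a whole multiple of the smallest weight; if that is not the case, the rounding step only approximates the weighted distribution.

```r
# Toy survey data (hypothetical values): one row per respondent plus a weight.
survey <- data.frame(
  answer = c("yes", "no", "yes", "yes"),
  weight = c(0.5, 1.5, 2.0, 1.0)
)

# Weighted distribution of `answer`, computed directly from the weights.
weighted_share <- prop.table(xtabs(weight ~ answer, data = survey))

# Scale the weights so the smallest becomes 1, round to whole numbers,
# and repeat each row that many times.
times <- round(survey$weight / min(survey$weight))
expanded <- survey[rep(seq_len(nrow(survey)), times = times), "answer", drop = FALSE]

# A plain, unweighted table on the expanded rows gives the same shares.
unweighted_share <- prop.table(table(expanded$answer))

weighted_share
unweighted_share
```

In this toy case both tables show 30% "no" and 70% "yes", whereas an unweighted table of the original four rows would show 25% and 75%.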

Before generating some data and looking at some examples, it should be noted that the graphical representation of weighted survey data is not an issue as long as it is based on aggregated measures / items. For example a bar chart showing an average (mean) of some measure X can be graphically represented pretty much the same way as unweighted data would be represented. The only difference is that the aggregated values are calculated differently (we would use a weighted mean and not the normal mean function in R).
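To illustrate the difference between the two aggregates, here is a small sketch with made-up numbers (the `score` and `weight` vectors are hypothetical). It uses `weighted.mean()` from base R's stats package.

```r
# Hypothetical scores and survey weights for five respondents.
score  <- c(10, 20, 30, 40, 50)
weight <- c(2.0, 1.0, 1.0, 0.5, 0.5)

plain_mean    <- mean(score)                   # every respondent counts equally: 30
weighted_mean <- weighted.mean(score, weight)  # respondents count by weight: 23
```

A bar chart would simply plot `weighted_mean` instead of `plain_mean`; no new rows are needed.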

So the problem of generating a new dataframe for visualizing weighted data only arises for certain types of plots. A scatterplot quickly comes to mind, since it shows every individual data point.

So let’s think about how a scatterplot could be drawn for weighted data.

First let’s generate some data.

library(dplyr)
library(tidyr)
library(ggplot2)

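set.seed(7)  # make the example reproducible

# Draw n values from a normal distribution truncated to [a, b]
# (inverse-CDF sampling); used below to generate the weights
rtnorm <- function(n, mean, sd, a = -Inf, b = Inf){
  qnorm(runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd)), mean, sd)
}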

n <- 1000
data <- data.frame(  age = round(runif(n, min = 18, max = 80), 0),
                   score = round(runif(n, min = 0, max = 100), 0),
                  weight = round(rtnorm(n = n, mean = 1, sd = 1, a = 0.05, b = 10), 1))

The data describe participants aged 18 to 80 who answered a measurement item on a 0 to 100 scale (which I will just call 'score'). Each participant's 'weight' indicates how strongly they are under- or over-represented relative to the underlying population.

Now let’s look at the data without considering the weights:

data %>% 
  ggplot(., aes(x = age, y = score)) +
  geom_point()

The scatterplot shows all data points, where each point represents a participant who participated in the survey. However, the scatterplot does not reveal what the data would look like in the underlying population.

We could now mutate the dataframe by generating more rows of those participants who have a larger weight.

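data %>% 
  # Scale weights so the smallest becomes 1; round() guards against
  # floating-point results such as 7.000000000000001, which uncount() may reject
  mutate(weight2 = round(weight / min(weight))) %>% 
  uncount(weight2) %>% 
  ggplot(., aes(x = age, y = score)) +
  geom_point()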

However, this scatterplot looks exactly the same as the first one. On reflection this makes sense: although more observations have been generated, they sit at exactly the same x and y coordinates as the originals.
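A quick way to check this is to count distinct coordinates before and after the expansion. The sketch below uses a tiny made-up data frame (`toy` and its `times` column are hypothetical stand-ins for the data and the scaled weight).

```r
library(dplyr)
library(tidyr)

# Small hypothetical data set; `times` plays the role of the scaled weight.
toy <- data.frame(age = c(20, 35, 50), score = c(40, 70, 55), times = c(1, 3, 6))

# Repeat each row `times` times (the `times` column is dropped by default).
expanded <- toy %>% uncount(times)

n_rows   <- nrow(expanded)                        # 10 rows after expansion
n_points <- nrow(distinct(expanded, age, score))  # still only 3 distinct points
```

The expanded data has more rows, but a scatterplot can only draw the three distinct positions, each overplotted several times.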

To get a better feeling for which points are more representative of the underlying population it is advisable to consider the weight to be another dimension of our plot. We could just add the variable in the aes() statement of our ggplot call without creating a new dataframe based on weight.

data %>% 
  ggplot(., aes(x = age, y = score, alpha = weight)) +
  geom_point()

Instead of considering weight to be another dimension of our plot, we could actually transform our dataframe according to weight and use ggplot alpha to make the overlapping points visible.

data %>% 
  # round() guards against floating-point results that are not whole numbers
  mutate(weight2 = round(weight / min(weight))) %>% 
  uncount(weight2) %>% 
  ggplot(., aes(x = age, y = score)) +
  geom_point(alpha = 0.035)

As you can see, the result is the same.
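One possible way to see why the two plots look alike: assuming the graphics device uses standard alpha compositing, k semi-transparent points drawn on top of each other with transparency a have a combined opacity of 1 - (1 - a)^k. The sketch below evaluates this for a few copy counts (the values in `copies` are chosen for illustration only).

```r
# Opacity of k overlapping points, each drawn with the same alpha,
# under the assumption of standard "over" alpha compositing.
alpha_per_point <- 0.035
copies <- c(1, 10, 50, 100)

stacked_opacity <- 1 - (1 - alpha_per_point)^copies
round(stacked_opacity, 3)
```

Opacity grows with the number of copies, so heavily weighted participants appear darker in both versions. The growth is not strictly linear, so the match with `aes(alpha = weight)` is visual rather than exact.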

I don’t know which plot type your visualization team is trying to make, but I would recommend treating weight as a further dimension of your plot and refraining from creating new data.
