
Any tips GREATLY appreciated. I started off with the following dataframe, named years69_19:

[screenshot: years69_19]

Then, I created a bunch of different dataframes, each containing the data from years69_19 for a single "Field" name. Here's an example of how I did this (some departments had multiple labels for the same department, so I used the | operator to find all of them):

[screenshot: separating by field]

Then, I put the new dataframes into a list called listofdeps. I also made a list of strings, depstrings, that corresponds with listofdeps, just for the sake of titling the dataframes correctly:

[screenshot: the list of dataframes and string labels]

Finally, I iterated over listofdeps and pivoted each dataframe. Here's my code for that:

import pandas as pd
import numpy as np

newlistofdeps = []

for dataframes, deptname in zip(listofdeps, depstrings):
    newlabel = deptname + ' Department at [REDACTED]'
    dataframes[newlabel] = 1  # column of 1s so the pivot's sum counts rows
    deptable = pd.pivot_table(dataframes[['Year', 'Gender', 'Ethnicity', newlabel]],
                              index=['Gender', 'Ethnicity'], columns=['Year'],
                              aggfunc=np.sum, fill_value=0)
    newlistofdeps.append(deptable)

Now I have a list newlistofdeps, which has a dataframe for each department (field), and it looks like this:

[screenshot: example of the first dataframe in newlistofdeps]

Stack Overflow community, I need help with the following:

  1. I need to reorder the Ethnicity index like so: 'Asian', 'Black', 'Chicano/Mexican-American', 'Other Hispanic/Latino', 'White', 'Other', 'Interational'. I've tried so many different approaches like df.reindex and using "level", but I just haven't been able to figure out how to do this.

  2. I need to make it so that for every dataframe in newlistofdeps, every ethnicity listed above appears, even if there were no rows with that ethnicity in that department. Here's an example of what I mean: in one department, there weren't any Chicano/Mexican-American females or Black males, but I still need rows for those groups; they would just all be filled with 0s. I actually have no clue how to go about this. I was thinking maybe creating a dataframe in this format with all the ethnicities filled with 0s, and then left merging each dataframe with that dataframe so the missing ethnicities still have rows (rough sketch of the idea below). Any ideas?
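
Something along these lines is what I'm imagining, though I haven't gotten it working and I'm not sure it's the right approach (the Gender labels here are a guess):

import pandas as pd

# template index with every (Gender, Ethnicity) combination I want as rows
ethnicity_order = ['Asian', 'Black', 'Chicano/Mexican-American',
                   'Other Hispanic/Latino', 'White', 'Other', 'Interational']
template_index = pd.MultiIndex.from_product([['Female', 'Male'], ethnicity_order],
                                            names=['Gender', 'Ethnicity'])

# line one of the pivoted dataframes up against the template, so groups that
# are missing from a department become rows of 0s (and the ethnicities end up
# in the order I want)
dep = newlistofdeps[0]
dep_full = dep.reindex(template_index, fill_value=0)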

Thank you!!!


1 Answer


It seems like you're going the extremely long way around to perform a cross tabulation. You can simply use pd.crosstab to do all of the heavy lifting you're currently doing manually.
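
Roughly speaking, your whole loop collapses into a single call. Here's a sketch against your own data, assuming the department column in years69_19 is named "Field" (adjust the name if it differs):

counts = pd.crosstab(
    index=[years69_19["Field"], years69_19["Gender"], years69_19["Ethnicity"]],
    columns=years69_19["Year"],
    dropna=False,
)

Below I'll demonstrate the idea on some synthetic data.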

Data creation

import pandas as pd
import numpy as np
import itertools

ethnicities = ['Asian', 'Black', 'Chicano/Mexican-American', 'Other Hispanic/Latino', 'White', 'Other', 'Interational']
fields = ["economics", "physics", "political sciences", "chemistry", "english"]
sexes = ["M", "F"]
years = [2000, 2001, 2002, 2003]

records = itertools.product(ethnicities, fields, sexes, years)
base_df = pd.DataFrame(records, columns=["ethnicity", "field", "sex", "year"])

print(base_df.head(10))

  ethnicity      field sex  year
0     Asian  economics   M  2000
1     Asian  economics   M  2001
2     Asian  economics   M  2002
3     Asian  economics   M  2003
4     Asian  economics   F  2000
5     Asian  economics   F  2001
6     Asian  economics   F  2002
7     Asian  economics   F  2003
8     Asian    physics   M  2000
9     Asian    physics   M  2001

The base_df is just that: the cartesian product of all of our categories. So in this variable, we have a row for every unique combination of ethnicity, field, sex, and year. Now that we have this, we can sample the dataframe to make our data more realistic. I'm going to undersample, to ensure that some combinations are entirely missing from the data and it more closely resembles what you're working with.

df = base_df.sample(50, replace=True)

print(df.head())
                 ethnicity               field sex  year
183                  White  political sciences   F  2003
228                  Other           chemistry   F  2000
38                   Asian             english   F  2002
166                  White           economics   F  2002
146  Other Hispanic/Latino           chemistry   M  2002

Now that we have a good example dataset, we can use pd.crosstab to get the tally you're calculating in your problem. I'm going to set the argument dropna=False, which tells pandas not to discard combinations that are entirely missing, and instead to fill in the missing observations with 0.

xtab = pd.crosstab(index=[df["field"], df["sex"], df["ethnicity"]], columns=df["year"], dropna=False)

print(xtab.head(10))
year                                    2000  2001  2002  2003
field     sex ethnicity                                       
chemistry F   Asian                        0     0     0     0
              Black                        0     0     0     0
              Chicano/Mexican-American     0     0     0     0
              Interational                 0     0     0     1
              Other                        1     0     0     0
              Other Hispanic/Latino        0     0     1     0
              White                        1     0     0     0
          M   Asian                        0     1     0     0
              Black                        0     0     0     0
              Chicano/Mexican-American     0     1     0     0

And there you have a cross tabulation of all of our categories, while also representing the missing categorical combinations.

For comparison, here's what happens when you set dropna=True: category combinations with 0 observations are dropped entirely, which is exactly the issue you're posting about.

xtab = pd.crosstab(index=[df["field"], df["sex"], df["ethnicity"]], columns=df["year"], dropna=True)

print(xtab.head(10))
year                                    2000  2001  2002  2003
field     sex ethnicity                                       
chemistry F   Interational                 0     0     0     1
              Other                        1     0     0     0
              Other Hispanic/Latino        0     0     1     0
              White                        1     0     0     0
          M   Asian                        0     1     0     0
              Chicano/Mexican-American     0     1     0     0
              Other Hispanic/Latino        1     2     1     0
              White                        0     1     0     1
economics F   Asian                        0     0     0     1
              Black                        0     1     0     0

Note that with dropna=True we're now missing certain categorical combinations since they were not observed in our sample.

To change the order of your rows, it'll be easiest to explicitly construct a new MultiIndex in the order that you expect and go from there.

# define the order of categories for each level
new_index = pd.MultiIndex.from_product([
    ["economics", "physics", "political sciences", "chemistry", "english"],
    ["M", "F"],
    ['Asian', 'Black', 'Chicano/Mexican-American', 'Other Hispanic/Latino', 'White', 'Other', 'Interational']],
    names=["field", "sex", "ethnicity"]
)

# use the new index to reorder the data; fill_value=0 keeps any combination
# that is missing from xtab as a count of 0 rather than NaN
reordered_xtab = xtab.reindex(new_index, fill_value=0)

print(reordered_xtab.head(10))
year                                    2000  2001  2002  2003
field     sex ethnicity                                       
economics M   Asian                        0     0     0     0
              Black                        0     0     0     0
              Chicano/Mexican-American     0     0     1     1
              Other Hispanic/Latino        0     0     0     0
              White                        0     1     1     0
              Other                        0     1     0     0
              Interational                 0     0     0     0
          F   Asian                        0     0     0     0
              Black                        0     0     0     0
              Chicano/Mexican-American     0     0     0     1

Now everything respects the order I defined within new_index, as opposed to alphabetical order which is what pandas defaulted to when calculating the crosstab.
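
As a final note, if you still want one table per department like your newlistofdeps, you can slice the reordered crosstab on the field level. A minimal sketch (per_field_tables is just an illustrative name):

# one (sex, ethnicity) x year table per department, pulled out of the
# reordered crosstab
per_field_tables = {
    field: reordered_xtab.xs(field, level="field")
    for field in reordered_xtab.index.get_level_values("field").unique()
}

print(per_field_tables["economics"])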