2
votes

I know there has been many a question on this error message. However I haven't found one with this exact problem.

I'm trying to group a pandas DataFrame and count the values:

allfactor = dataframe.groupby(factor)[reference_area].value_counts()

where factor and reference_area are column names in the dataframe. This works for some columns, such as DGD015, but not for others, including factor.
It gives me the error:

ValueError: operands could not be broadcast together with shape (421,) (419,)

I'll put the complete error message at the end of this question.
Grouping itself works:

In: grouped = data.groupby(factor)[reference_area]
    grouped
Out: <pandas.core.groupby.generic.SeriesGroupBy object at 0x0000000B39D0F5F8>

I can see that it's a numpy broadcasting error that occurs because the arrays involved don't have compatible shapes. There are workarounds for that, such as using [:, np.newaxis] (Research Computing for Earth Sciences) or [:,None] (How to think like a Computer Scientist: Learning with Python 3) when combining arrays whose dimensions don't "fit" and none of which can be stretched.
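As a minimal sketch of what that workaround looks like in plain numpy (the arrays a and b are made up for illustration and are not taken from the dataframe):

```python
import numpy as np

a = np.arange(3)  # shape (3,)
b = np.arange(4)  # shape (4,)

# Multiplying the two directly fails: (3,) and (4,) cannot be broadcast.
try:
    a * b
except ValueError as exc:
    print("ValueError:", exc)

# Adding a length-1 axis turns `a` into shape (3, 1), which broadcasts
# against (4,) to give a (3, 4) result (an outer product).
outer = a[:, np.newaxis] * b
print(outer.shape)
```

This only helps when the arrays are under your control, which is not the case when the failing call sits inside pandas.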

However, I don't know how to do this when the error occurs in numpy, which is called by pandas, which is called by calling value_counts().

Does anyone have an idea for a workaround here?

How can I access numpy here to tell it to just add new axes containing NaNs to make the dimensions fit?

Here's the complete error message:

ValueError          Traceback (most recent call last)
<ipython-input-5-013b5262b34f> in <module>()
----> 1 trial = get_positives_threshold(data, 'SHB23D', 'HV001', threshold=90)
      2 print(trial)
<ipython-input-3-80d69965e883> in get_positives_threshold(dataframe, factor, reference_area, threshold)
---> 33         allfactor = dataframe.groupby(factor)[reference_area].value_counts()
~\Documents\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in value_counts(self, normalize, sort, ascending, bins, dropna)
-> 1139         labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
~\Documents\anaconda3\lib\site-packages\numpy\core\fromnumeric.py in repeat(a, repeats, axis)
    421     repeated_array : ndarray
    422         Output array which has the same shape as a, except along
--> 423         the given axis.
~\Documents\anaconda3\lib\site-packages\numpy\core\fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     50     try:
     51         return getattr(obj, method)(*args, **kwds)
---> 52 
     53     # An AttributeError occurs if the object does not have
     54     # such a method in its class.


ValueError: operands could not be broadcast together with shape (421,) (419,)

Here's some info on the dataframe:

The data was originally an SPSS .sav file that was converted to a feather file and then read in with pandas.read_feather(path_to_file). All columns are of dtype categorical. In most columns the original values include NaNs, integers stored as strings, and plain strings, but all of them are stored as categorical.

  reference_area   HV002   HV003  [...] DGD015    [...] factor    [...]
1 '10001'          'NaN'   'Yes'  [...] 'Refused' [...] '90'      [...]
2 '10001'          'No'    'NaN'  [...] '140'     [...] '80'      [...]
3 '24736'          'Yes'   'No'   [...] '78'      [...] 'Nan'     [...]
4 '24736'          'Yes'   'No'   [...] 'Other'   [...] 'Technical Problem'

The values are representative but mixed, and the column names have been changed to mask the original data.

Pandas version 0.24.1
Numpy version 1.15.4
Python version 3.6.5
Working with Anaconda 3 in jupyter notebook with said versions in my environment.

Expected output:

In: dataframe.groupby(factor)[reference_area].value_counts()
Out: factor  reference_area 
0                  121640.0     1
1                  52675.0      1
                   181826.0     1
10                 40812.0      1
                   340804.0     2
                   360756.0     1
100                70679.0     18
                   70251.0     14
                   70019.0     13
                   70728.0     11
                   120070.0    11
                               ..
Refused            90008.0      1
I think it's rather a groupby problem and not value_counts. You need to check the dimensions of the input. - user8408080
Hi, first: Thank you for the quick response! The error message tells me it is in groupby, but the error occurs within the function value_counts: pandas\core\groupby\generic.py in value_counts(self, normalize, sort, ascending, bins, dropna) ...? How do I check the dimensions? That's something else I haven't figured out yet. The DataFrame is very large, with over 2 million rows and hundreds of columns. - Sally
You've clearly put a lot of work into writing your question and providing the complete stack trace, so thank you for that. However, the key here (as in most Pandas questions) is a representative sample of your data. Could you share some? You can obviously replace the values with made-up ones. - Josh Friedlander
Every numpy array has a .shape attribute. Just try factors.shape, for example. - user8408080
@Mortz: The other way around: the number of various values for factor per reference area. I'll make an edit to show you the expected output (as it works with other columns). - Sally

1 Answer

4
votes

The problem seems to be with unobserved categories. From the pandas documentation on groupby:

When using a Categorical grouper (as a single grouper, or as part of multiple groupers), the observed keyword controls whether to return a cartesian product of all possible groupers values (observed=False) or only those that are observed groupers (observed=True).

Calculating the cartesian product in my specific case ultimately results in the broadcasting error. That is why some columns work while others don't: Those columns that work do not have any unobserved categories, while those that do not work have unobserved categories.
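To see whether a given column is affected, one option is to list the categories that never occur in the data. This is a minimal sketch on made-up values; the helper name unobserved_categories is hypothetical and not part of pandas:

```python
import pandas as pd


def unobserved_categories(series):
    """Return the categories of a categorical Series that have no entries."""
    observed = set(series.dropna().unique())
    return [cat for cat in series.cat.categories if cat not in observed]


# Made-up column: "Refused" is a declared category but never appears.
factor_col = pd.Series(
    pd.Categorical(["90", "80", "90"], categories=["80", "90", "Refused"])
)

print(unobserved_categories(factor_col))
```

An empty list means the column has no unobserved categories, which matches the columns that worked for me.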

To avoid trouble with this, set observed=True when grouping. This means groupby will only use observed categories (i.e. those categories for which entries exist). In my case that would be:
allfactor = dataframe.groupby(factor, observed=True)[reference_area].value_counts()

As far as my testing shows, this does not lead to losing entries of the dataframe for further analysis. There are no entries for unobserved categories (not even with NaN values), so we do not lose any entries by using only observed categories. Be warned, though, that if you do want to analyse these unobserved categories, this is not the solution you are looking for.
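A minimal sketch of that kind of check on made-up data (the frame df and its values are invented for illustration, and reference_area is kept as plain strings here for simplicity, unlike my real data where every column is categorical):

```python
import pandas as pd

# Made-up data: "Refused" is a declared category of factor with no rows.
df = pd.DataFrame(
    {
        "factor": pd.Categorical(
            ["90", "80", "90", "80"], categories=["80", "90", "Refused"]
        ),
        "reference_area": ["10001", "10001", "24736", "24736"],
    }
)

# Group on observed categories only, then count reference areas per factor.
counts = df.groupby("factor", observed=True)["reference_area"].value_counts()
print(counts)

# Every row with a non-missing factor and reference_area should be counted.
n_rows = len(df.dropna(subset=["factor", "reference_area"]))
print(int(counts.sum()) == n_rows)
```

If the summed counts equal the number of rows without missing values in either column, no entries were dropped by restricting the grouping to observed categories.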