I know there have been many questions about this error message, but I haven't found one with this exact problem.
I'm trying to group a pandas DataFrame and count the values:
allfactor = dataframe.groupby(factor)[reference_area].value_counts()
where factor and reference_area are column names in the dataframe. This works for some columns, such as DGD015, but not for others, including factor.
It gives me the error:
ValueError: operands could not be broadcast together with shape (421,) (419,)
I'll put the complete error message at the end of this question.
Grouping itself works:
In: grouped = data.groupby(factor)[reference_area]
Out: <pandas.core.groupby.generic.SeriesGroupBy object at 0x0000000B39D0F5F8>
I can see that it's a numpy broadcasting error that occurs because the arrays involved don't have compatible shapes. There are workarounds for that, such as using [:, np.newaxis] (Research Computing for Earth Sciences) or [:, None] (How to think like a Computer Scientist: Learning with Python 3) when trying to combine arrays whose dimensions don't "fit" and none of which can be stretched.
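For reference, a minimal sketch of what that workaround does in plain numpy. The lengths 421 and 419 are taken from the error message; the arrays themselves are made up for illustration.

```python
import numpy as np

# Hypothetical 1-D arrays with the two lengths from the error message.
a = np.arange(421)
b = np.arange(419)

# Element-wise arithmetic fails: neither length is 1, so neither
# array can be stretched to match the other.
error_message = None
try:
    a + b
except ValueError as exc:
    error_message = str(exc)

# Adding an axis turns `a` into a column of shape (421, 1), which
# broadcasts against shape (419,) and yields a (421, 419) result.
outer = a[:, np.newaxis] + b
same_with_none = a[:, None] + b  # np.newaxis is an alias for None

print(error_message)
print(outer.shape)
```

This only shows the general mechanism. It does not reach the arrays that pandas builds internally inside value_counts().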
However, I don't know how to do this when the error occurs in numpy, which is called by pandas, which is called by calling value_counts().
Does anyone have an idea for a workaround here?
How can I access numpy here to tell it to just add new axes containing NaNs to make the dimensions fit?
Here's the complete error message:
ValueError Traceback (most recent call last)
<ipython-input-5-013b5262b34f> in <module>()
----> 1 trial = get_positives_threshold(data, 'SHB23D', 'HV001', threshold=90)
2 print(trial)
<ipython-input-3-80d69965e883> in get_positives_threshold(dataframe, factor, reference_area, threshold)
---> 33 allfactor = dataframe.groupby(factor)[reference_area].value_counts()
~\Documents\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in value_counts(self, normalize, sort, ascending, bins, dropna)
-> 1139 labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
~\Documents\anaconda3\lib\site-packages\numpy\core\fromnumeric.py in repeat(a, repeats, axis)
421 repeated_array : ndarray
422 Output array which has the same shape as a, except along
--> 423 the given axis.
~\Documents\anaconda3\lib\site-packages\numpy\core\fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
50 try:
51 return getattr(obj, method)(*args, **kwds)
---> 52
53 # An AttributeError occurs if the object does not have
54 # such a method in its class.
ValueError: operands could not be broadcast together with shape (421,) (419,)
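The traceback ends in numpy's repeat, so the failure may be a length mismatch between the array being repeated and its per-element repeat counts, rather than an arithmetic broadcast. A small sketch of that situation, assuming made-up arrays with the two lengths from the error message:

```python
import numpy as np

# Hypothetical stand-ins for the arrays pandas passes to np.repeat.
values = np.arange(421)
counts_ok = np.ones(421, dtype=int)      # one repeat count per element
counts_short = np.ones(419, dtype=int)   # two counts missing

# With matching lengths, each element is repeated counts_ok[i] times.
repeated = np.repeat(values, counts_ok)

# With mismatched lengths, np.repeat raises a ValueError.
raised = False
message = ""
try:
    np.repeat(values, counts_short)
except ValueError as exc:
    raised = True
    message = str(exc)

print(repeated.shape)
print(raised, message)
```

If this is what happens inside value_counts(), adding an axis would not help, because the two arrays are expected to have the same length in the first place.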
Here's some info on the dataframe:
The data was originally a .sav SPSS file that was converted to a feather file, which was then read in using pandas.read_feather(path_to_file). All columns are of dtype categorical. Most columns' original values contain NaNs, integers as strings, and strings, but all of those are stored as categorical.
   reference_area  HV002  HV003  [...]  DGD015     [...]  factor
1  '10001'         'NaN'  'Yes'  [...]  'Refused'  [...]  '90'                 [...]
2  '10001'         'No'   'NaN'  [...]  '140'      [...]  '80'                 [...]
3  '24736'         'Yes'  'No'   [...]  '78'       [...]  'Nan'                [...]
4  '24736'         'Yes'  'No'   [...]  'Other'    [...]  'Technical Problem'
Values are representative but mixed and column names changed to mask the original data.
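A minimal sketch of data with this structure, together with one possible way to sidestep the categorical dtype. The values are invented, and it is an assumption (not verified against pandas 0.24.1) that casting to plain strings before grouping avoids the error.

```python
import numpy as np
import pandas as pd

# Invented values mimicking the table above: strings, numbers stored
# as strings, and missing values, all held as categorical columns.
df = pd.DataFrame({
    "factor": ["90", "80", np.nan, "Technical Problem", "90", "80"],
    "reference_area": ["10001", "10001", "24736", "24736", "10001", "24736"],
}).astype("category")

# Assumed workaround: drop rows with a missing key, cast both columns
# to plain strings, and count each (factor, reference_area) pair.
subset = df[["factor", "reference_area"]].dropna().astype(str)
counts = subset.groupby(["factor", "reference_area"]).size()

print(counts)
```

Note that size() orders the result by the group keys, whereas value_counts() orders by count within each group, so the output layout differs slightly from the expected output.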
Pandas version 0.24.1
Numpy version 1.15.4
Python version 3.6.5
Working with Anaconda 3 in jupyter notebook with said versions in my environment.
Expected output:
In: dataframe.groupby(factor)[reference_area].value_counts()
Out: factor reference_area
0 121640.0 1
1 52675.0 1
181826.0 1
10 40812.0 1
340804.0 2
360756.0 1
100 70679.0 18
70251.0 14
70019.0 13
70728.0 11
120070.0 11
Refused 90008.0 1
[...] problem and not value_count. You need to check the dimensions of the input – user8408080
[...] pandas\core\groupby\generic.py in value_counts(self, normalize, sort, ascending, bins, dropna) ...? How do I check the dimensions? That's something else I haven't figured out yet. The DataFrame is very large with over 2 million rows and hundreds of columns. – Sally
[...] .shape attribute. Just try factors.shape for example – user8408080
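A short sketch of checking dimensions on a categorical column. The column contents are hypothetical, and the idea that the 421 versus 419 mismatch reflects declared versus observed categories is only a guess, not something confirmed above.

```python
import pandas as pd

# Hypothetical categorical column with two declared but unused categories.
dtype = pd.CategoricalDtype(["80", "90", "100", "Refused"])
col = pd.Series(["90", "80", "90"], dtype=dtype)

shape = col.shape                      # number of rows, as a tuple
n_declared = len(col.cat.categories)   # categories stored in the dtype
n_observed = col.nunique()             # categories that actually occur

# Categories that are declared but never appear in the data.
used = col.cat.remove_unused_categories().cat.categories
unused = set(col.cat.categories) - set(used)

print(shape, n_declared, n_observed, unused)
```

Comparing n_declared with n_observed for the columns that fail and for those that work (such as DGD015) might show whether unused categories are involved.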