I need to make a histogram, and each of my data points carries a statistical weight. The standard hist function isn't equipped to handle this. I could of course use the numpy.histogram function, which handles weighted data just fine, but I thought it would be a good exercise in learning Julia to try to augment the hist() function to accept weights as an optional (named) argument.
I started by looking at the Julia source for hist() and was able to modify it slightly (if amateurishly; suggestions for improvements welcome) to get it sort of working:
    function sturges(n) # Sturges' formula
        n==0 && return one(n)
        iceil(log2(n))+1
    end

    function weightedhist!{HT}(h::AbstractArray{HT}, v::AbstractVector, edg::AbstractVector;
                               init::Bool=true,
                               weights::AbstractVector = ones(HT,length(v)))
        n = length(edg) - 1
        length(weights) == length(v) || error("length(weights) must equal length(v)")
        length(h) == n || error("length(h) must equal length(edg) - 1.")
        if init
            fill!(h, zero(HT))
        end
        for j=1:length(v)
            i = searchsortedfirst(edg, v[j])-1
            if 1 <= i <= n
                h[i] += weights[j]
            end
        end
        edg, h
    end

    weightedhist(v::AbstractVector, edg::AbstractVector; weights::AbstractVector = ones(Int,length(v))) =
        weightedhist!(Array(Float64, length(edg)-1), v, edg; weights=weights)
    weightedhist(v::AbstractVector, n::Integer; weights::AbstractVector = ones(Int,length(v))) =
        weightedhist(v, histrange(v,n); weights=weights)
    weightedhist(v::AbstractVector; weights::AbstractVector = ones(Int,length(v))) =
        weightedhist(v, sturges(length(v)); weights=weights)
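To make the binning step in the loop concrete, here is a tiny standalone sketch. The edges and values are made-up toy inputs (not my real data), and the names xs, bins and kept are only for illustration. It just shows how searchsortedfirst(edg, x) - 1 picks a bin: as far as I can tell, each bin is open on the left and closed on the right, and anything that falls outside 1:n is dropped by the 1 <= i <= n check.

```julia
# Toy inputs, invented purely for illustration.
edges = [0.0, 1.0, 2.0, 3.0]
xs = [0.0, 0.5, 1.0, 2.5, 3.5]
n = length(edges) - 1   # number of bins (3 here)

# searchsortedfirst returns the index of the first edge >= x, so
# subtracting 1 gives the bin i with edges[i] < x <= edges[i+1].
bins = [searchsortedfirst(edges, x) - 1 for x in xs]
# bins == [0, 1, 1, 3, 4]

# Same range check as in weightedhist!: only 1 <= i <= n is counted,
# so 0.0 (on the first edge) and 3.5 (past the last edge) are dropped.
kept = filter(i -> 1 <= i <= n, bins)
# kept == [1, 1, 3]
```

So a value sitting exactly on the lowest edge is not counted, while a value exactly on the highest edge is. That shouldn't matter for continuous random data like mine, but it might for integer-valued data.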
If I generate some random data with
    v = randn(10^5);
    w = rand(length(v));
    edges = floor(minimum(v)):0.1:ceil(maximum(v));
then weightedhist(v, edges; weights=w) agrees with numpy.histogram(v, edges, weights=w). If I leave out the optional keyword argument for weights, then weightedhist(v, edges) agrees with the built-in hist(v, edges), and weightedhist(v) agrees with the built-in hist(v), except that my function outputs floats rather than ints when no weights are provided.
I don't understand why this is the case (is h getting created as a float array? promoted?), and I'd like my function to fall back on the behavior of the built-in one as closely as possible when no weights are provided.
Can anyone suggest why my function is outputting floats, and how I might change that behavior to output ints when no weights are provided? I'd like to do this without first creating the h array and then converting it from one type to another, since I'd like the code to be as fast as possible.