2
votes

I need to make a histogram, and my data points each carry a statistical weight. The standard hist function isn't equipped to handle this. I could of course import the numpy.histogram function, which handles weighted data just fine, but I thought it would be a good exercise in learning julia to try and augment the hist() function to accept weights as an optional (named) argument.

I started by looking at the julia source for hist(), and was able to modify it slightly (if amateurishly -- suggestions for improvements welcome), to get it sort of working:

function sturges(n)  # Sturges' formula
    n==0 && return one(n)
    iceil(log2(n))+1
end

function weightedhist!{HT}(h::AbstractArray{HT}, v::AbstractVector, edg::AbstractVector; init::Bool=true, weights::AbstractVector = ones(HT,length(v)))
    n = length(edg) - 1
    length(weights) == length(v) || error("length(weights) must equal length(v)")
    length(h) == n || error("length(h) must equal length(edg) - 1.")
    if init
        fill!(h, zero(HT))
    end
    for j=1:length(v)
        i = searchsortedfirst(edg, v[j])-1
        if 1 <= i <= n
            h[i] += weights[j]
        end
    end
    edg, h
end

weightedhist(v::AbstractVector, edg::AbstractVector; weights::AbstractVector = ones(Int,length(v))) = weightedhist!(Array(Float64, length(edg)-1), v, edg; weights=weights)
weightedhist(v::AbstractVector, n::Integer; weights::AbstractVector = ones(Int,length(v))) = weightedhist(v, histrange(v,n); weights=weights)
weightedhist(v::AbstractVector; weights::AbstractVector = ones(Int,length(v))) = weightedhist(v, sturges(length(v)); weights=weights)

If I generate some random data with

v = randn(10^5);
w = rand(length(v));
edges = floor(minimum(v)):0.1:ceil(maximum(v));

then weightedhist(v, edges; weights=w) agrees with numpy.histogram(v, edges, weights=w). If I leave out the optional keyword argument for weights, then weightedhist(v, edges) agrees with the built in hist(v, edges), and weightedhist(v) agrees with the built in hist(v), except for the fact that my function outputs floats rather than ints when no weights are provided.

I don't understand why this is the case (is h getting created as a float array? promoted?), and I'd like for the my function to fall back on the behavior of the built in one as closely as possible when no weights are provided.

Can anyone suggest why my function is outputting floats, and how I might change that behavior to output ints when no weights are provided? I'd like to do this without first creating the h array and then converting it from one type to another, since I'd like the code to be as fast as possible.

1

1 Answers

2
votes

If I understand correctly, when you call

weightedhist(v, edges)

you are using the first of your three "extra" definitions at the bottom.

This calls

weightedhist!(Array(Float64, length(edg)-1), v, edg; weights=weights)

so in your "main" weightedhist! the HT parameterization will be Float64, so h will be filled with HT == Float64, hence the Float64 output. So changing it to Array(eltype(weights), length(edg)-1) would be sufficient, I believe.