1
votes

[EDIT: The fmin() method is a good choice for my problem. The real issue was that one of the axes was a sum of the other axes, and I wasn't recalculating the y axis after applying the multiplier. As a result, my optimize function always returned the same value. This gave fmin no direction, so its chosen multipliers were very close together. Once the calculations in my optimize function were corrected, fmin chose a larger range.]
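To illustrate the kind of bug described in the EDIT, here is a minimal sketch. It is only one reading of the setup: the arrays `a` and `b`, the helper names, and the assumption that y is the sum of the scaled component and another component are all hypothetical, not the original code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.uniform(-10, 10, size=50)  # hypothetical component that gets scaled
b = rng.uniform(-10, 10, size=50)  # hypothetical second component
stale_y = a + b                    # y computed once, before any multiplier

def rho_stale(multiplier):
    # Bug pattern: x is scaled but y still holds the old sum.
    return stats.spearmanr(multiplier * a, stale_y)[0]

def rho_recomputed(multiplier):
    # Fix: rebuild y from the scaled component before correlating.
    x = multiplier * a
    return stats.spearmanr(x, x + b)[0]
```

With the stale y, every positive multiplier gives the same coefficient, so an optimizer sees a flat objective. With y recomputed, the coefficient depends on the multiplier and the optimizer has something to follow.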

I have two datasets that I want to apply multipliers to, to see what values could 'improve' their correlation coefficients.

For example, say data set 1 has a correlation coefficient of -.6 and data set 2 has .5.

I can apply different multipliers to each of these data sets that might improve the coefficient. I would like to find a set of multipliers for these two data sets that optimizes the correlation coefficients of each set.

I have written an objective function that takes a list of multipliers, applies them to the data sets, calculates the correlation coefficient (scipy.stats.spearmanr()), and sums these coefficients. So I need to use something from scipy.optimize to pass a set of multipliers to this function and find the set that optimizes this sum.

I have tried using optimize.fmin and several others. However, I want the optimization technique to use a much larger range of multipliers. For example, my data sets might have values in the millions, but fmin will only choose multipliers around 1.0, 1.05, etc. Multipliers this close to 1.0 are not large enough to change these correlation coefficients in any meaningful way.
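As far as I can tell from the SciPy source, fmin builds its default starting simplex by nudging each coordinate of the starting point by about 5%, which would explain values like 1.05. Recent SciPy versions accept an `initial_simplex` argument to fmin that replaces this default. The sketch below assumes such a version; `toy_objective` is a hypothetical smooth stand-in for the real objective, not the Spearman-based one.

```python
import numpy as np
from scipy import optimize

def toy_objective(m):
    # Hypothetical smooth stand-in whose minimum sits at multipliers (500, 2000).
    return (m[0] - 500.0) ** 2 + (m[1] - 2000.0) ** 2

x0 = np.array([1.0, 1.0])
# One vertex per row: N + 1 points for N multipliers, spread over a wide range.
wide_simplex = np.array([[1.0, 1.0], [1000.0, 1.0], [1.0, 1000.0]])
best = optimize.fmin(toy_objective, x0, initial_simplex=wide_simplex, disp=False)
```

A wider starting simplex only helps if the objective actually changes with the multipliers; on a flat objective fmin still has nothing to follow.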

Here is some sample code of my objective function:

def objective_func(multipliers):
    coeffs = []
    for multiplier in multipliers:
        for data_set in data_sets():
            x_vals = getDataSetXValues()
            y_vals = getDataSetYValues()
            x_vals *= multiplier
            # spearmanr returns a (coefficient, p-value) pair
            coeffs.append(scipy.stats.spearmanr(x_vals, y_vals))

    # negate the sum because fmin minimizes
    return -1 * sum(coeffs)

I'm using -1 because I actually want the biggest value, but fmin is for minimization.

Here is a sample of how I'm trying to use fmin:

print(optimize.fmin(objective_func, [1.0, 1.0]))

The multipliers start at 1.0 and only step through nearby values such as 1.05 and 1.0625. I can see in the actual fmin code where these values are chosen. Ultimately I need another method that lets me give the minimization a range of values to check, rather than values that are all so close together.
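One way to hand the optimizer an explicit range is `scipy.optimize.brute`, which evaluates the objective on a grid. The sketch below searches over base-10 exponents so the multipliers span 1e-6 to 1e6. The function `toy_objective` and its target values are hypothetical stand-ins, used only so the example runs on its own.

```python
import numpy as np
from scipy import optimize

def toy_objective(log_multipliers):
    # Hypothetical stand-in for the real objective: smallest when the
    # multipliers are 1e3 and 1e-2 (exponents 3 and -2).
    target = np.array([3.0, -2.0])
    return float(np.sum((np.asarray(log_multipliers) - target) ** 2))

# Exponents -6, -5, ..., 6 for each of the two multipliers.
ranges = (slice(-6, 7, 1), slice(-6, 7, 1))
# finish=None returns the best grid point without a local polishing step.
best_log = optimize.brute(toy_objective, ranges, finish=None)
best_multipliers = 10.0 ** np.asarray(best_log, dtype=float)
```

Searching in log space is a design choice for data whose scale is unknown; a grid search costs one objective evaluation per grid point, so it only suits a small number of multipliers.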

2 Answers

1
votes

Multiplying the x data by some factor won't really change the Spearman rank-order correlation coefficient, though.

>>> x = numpy.random.uniform(-10,10,size=(20))
>>> y = numpy.random.uniform(-10,10,size=(20))
>>> scipy.stats.spearmanr(x,y)
    (-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*10,y)
    (-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*1e6,y)
    (-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*1e-16,y)
    (-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*(-2),y)
    (0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*(-2e6),y)
    (0.24661654135338346, 0.29455199407204263)

(The second term in the tuple is the p value.)

You can change its sign by flipping the sign of the multiplier, but multiplying by any positive factor leaves the ranks of the data, and therefore the coefficient, unchanged. The whole point of Spearman correlation is that it tells you the degree to which any monotonic relationship would capture the association. That probably explains why fmin isn't changing the multiplier much: it's not getting any feedback on direction, because the returned value is constant.
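The invariance can be checked directly on the ranks. This is a small sketch using `scipy.stats.rankdata`; the seed and sample size are arbitrary choices of mine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=20)
y = rng.uniform(-10, 10, size=20)

ranks = stats.rankdata(x)
scaled_ranks = stats.rankdata(x * 1e6)   # positive factor: same ordering
flipped_ranks = stats.rankdata(x * -2)   # negative factor: reversed ordering

rho = stats.spearmanr(x, y)[0]
rho_scaled = stats.spearmanr(x * 1e6, y)[0]
rho_flipped = stats.spearmanr(x * -2, y)[0]
```

The ranks are identical under a positive multiplier and exactly reversed under a negative one, which is why the coefficient can only keep its value or change sign.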

So I don't see how what you're trying to do can work.

I'm also not sure why you've chosen the sum of all the Spearman coefficients and the p values as what you're trying to maximize: the Spearman coefficients can be negative, so you probably want to square them, and you haven't mentioned the p values, so I'm not sure why you're throwing them in.
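Here is a sketch of what I mean by squaring the coefficients and leaving the p values out. The function name and the toy data are hypothetical; the data are chosen so that one pair has a coefficient of +1 and the other -1.

```python
import numpy as np
from scipy import stats

def neg_sum_squared_rho(pairs):
    # Negated so that a minimizer such as fmin maximizes the squared sum.
    total = 0.0
    for x, y in pairs:
        rho, _p_value = stats.spearmanr(x, y)  # keep rho, ignore the p-value
        total += rho ** 2
    return -total

x = np.arange(10.0)
pairs = [(x, x), (x, -x)]  # perfectly increasing and perfectly decreasing
plain_sum = sum(stats.spearmanr(a, b)[0] for a, b in pairs)
score = neg_sum_squared_rho(pairs)
```

With a plain sum the two perfect correlations cancel to zero, while the squared version rewards both of them.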

[It's possible I guess that we're working with different scipy versions and our spearmanr functions return different things. I've got 0.9.0.]

0
votes

You probably don't want to minimize the sum of coefficients but the sum of squares. Also, if the multipliers can be chosen independently, why are you trying to optimize them all at the same time? Can you post your current code and some sample data?