0 votes

First, a little background: I'm a psychology student, so my background in coding isn't on par with you guys :-)

My problem is as follows, and the most important observation is that curve fitting with two different programs gives completely different results for my parameters, although my graphs stay the same. The main program we have used to fit my longitudinal data is KaleidaGraph, and this should be seen as the 'gold standard'; the program I'm trying to adjust is MATLAB.

I was trying to be smart and wrote some code (a lot, at least for me) with the following goals:

1. take an individual longitudinal data file;
2. curve fit this data to a non-parametric model using lsqcurvefit;
3. obtain figures and the points where f' and f'' are zero (a simplified sketch of this step is shown after the model definition below).

This all worked well (woohoo :-)), but when I started comparing the function parameters the two programs generate, there is a huge difference. KaleidaGraph stays close to its original starting values; MATLAB wanders off, sometimes by a factor of 1000. The graphs nevertheless stay more or less the same in both cases, and both fit the data well. Still, it would be lovely to know how to make the MATLAB curve fitting more 'conservative', i.e. keep it closer to its original starting values.

validFitPersons = true(nbValidPersons,1);
% Use the same algorithm KaleidaGraph reports using
opts = optimoptions(@lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
for i = 1:nbValidPersons
    personalData = data{validPersons(i),3};
    personalData = personalData(personalData(:,1) >= minAge, :); % keep ages >= minAge
    % Fit the chosen model for this person
    try
        [personalParams, personalResnorm, personalResidual] = ...
            lsqcurvefit(heightModel, initialValues, ...
                        personalData(:,1), personalData(:,2), [], [], opts);
    catch
        validFitPersons(i) = false; % flag the failed fit instead of silently ignoring it
    end
end

Above is the part of the code I've written to fit the data files to a specific model. Below is an example of a non-parametric model I use, with its function parameters.

elseif strcmpi(model,'jpa2')
    % y = a*(1 - 1/(1 + (b1*(t+e))^c1 + (b2*(t+e))^c2 + (b3*(t+e))^c3))
    heightModel = @(params,ages) abs(params(1).*(1 - 1./(1 ...
        + (params(2).*(ages+params(8))).^params(5) ...
        + (params(3).*(ages+params(8))).^params(6) ...
        + (params(4).*(ages+params(8))).^params(7))));
    modelStrings = {'a','b1','b2','b3','c1','c2','c3','e'};

    % Define initial values (male vs. female reference values)
    if strcmpi('male',gender)
        initialValues = [176.76 0.339 0.1199 0.0764 0.42287 2.818 18.52 0.4363];
    else
        initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
    end
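
For completeness, step 3 boils down to something like the following (a simplified sketch, not my exact code; the dense age grid and the names ageF1zero/ageF2zero are just for illustration):

% Simplified sketch of step 3: find where f' and f'' cross zero on a dense grid,
% assuming heightModel and personalParams from the fit above.
ages = linspace(minAge, 20, 2000);            % arbitrary dense age grid
y    = heightModel(personalParams, ages);
d1   = gradient(y, ages);                     % numerical first derivative f'
d2   = gradient(d1, ages);                    % numerical second derivative f''
ageF1zero = ages(find(diff(sign(d1)) ~= 0));  % ages where f' changes sign
ageF2zero = ages(find(diff(sign(d2)) ~= 0));  % ages where f'' changes sign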

I've tried to mimic the curve-fitting process in KaleidaGraph as closely as possible. I found that it uses the Levenberg-Marquardt algorithm, which is why I selected it in MATLAB. However, the results still vary, and I don't have any more clues about what I can change.


Some extra adjustments:

The idea for this code was the following:

I'm trying to compare different fitting models (they are designed for this purpose). So I have 5 models with different parameters and different starting values (the second part of my code), and next to that I have the general curve-fitting file. Since there are different models, it would be interesting if I could restrict how far the parameters can wander from their starting values.

Does anyone have any idea how this could be done?


Anybody willing to help a psychology student?

Cheers

Since you already posted the model and starting parameters, it may actually suffice if you would provide: (1) some measure of the uncertainty in y (the personal data) and (2) the range of t. — Buck Thorn
Have you looked into the covariance matrix of your output fitting parameters? Maybe the parameters are strongly correlated. — Buck Thorn
The range is about 20 years; the uncertainty in y is determined by the sum of squares (if that's what you mean). The data are longitudinal and give the height of an individual at each birthday. — user2694285
You mean age = 20 - xxx years? I plotted the equation you provided, and above ~16 years it hardly changes compared to <16. By uncertainty I mean: can you estimate the uncertainty in the values of y which you measured? — Buck Thorn
t is the age and mostly stops around 20 years (growth is mostly complete and measurements are stopped), so the data after age 16 usually aren't going to change much. — user2694285

2 Answers

1 vote

This is a common issue when dealing with non-linear models.

If I were you, I would check whether you can remove some parameters from the model in order to simplify it.

If you really want to keep your solution from straying too far from the initial point, you can use lower and upper bounds for each variable:

x = lsqcurvefit(fun,x0,xdata,ydata,lb,ub)

defines a set of lower and upper bounds on the design variables in x so that the solution is always in the range lb ≤ x ≤ ub.
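
For example, a minimal sketch using the variables from your question (the factor of 10 is an arbitrary band you would tune; note that, at least in older MATLAB releases, the Levenberg-Marquardt algorithm does not accept bounds, so this relies on the default trust-region-reflective algorithm):

% Keep every parameter within a factor of 10 of its starting value
% (arbitrary choice; assumes all entries of initialValues are positive).
lb = initialValues / 10;
ub = initialValues * 10;
opts = optimoptions(@lsqcurvefit, 'Algorithm', 'trust-region-reflective');
personalParams = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2), lb, ub, opts);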

Cheers

0 votes

You state:

I'm trying to compare different fitting models (they are designed for this purpose). So I have 5 models with different parameters and different starting values (the second part of my code), and next to that I have the general curve-fitting file.

You will presumably compare the statistics from fits with different models to see whether reductions in the fitting error are unlikely to be due to chance. You may want to rely on that comparison to pick the model that not only fits your data suitably but is also the simplest (often referred to as the principle of parsimony).
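
One common way to make that comparison concrete is an information criterion computed from each model's residual sum of squares. A minimal sketch (resnorm_jpa2, resnorm_other and the parameter counts are placeholders for your own fits):

% AIC under Gaussian errors (up to an additive constant shared by all models);
% lower is better: it trades goodness of fit against the number of parameters.
n   = numel(ydata);                          % number of data points
aic = @(resnorm, k) n*log(resnorm/n) + 2*k;  % resnorm = sum of squared residuals
aic_jpa2  = aic(resnorm_jpa2, 8);            % e.g. the 8-parameter JPA-2 model
aic_other = aic(resnorm_other, 5);           % e.g. a hypothetical 5-parameter model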

The problem really lies with the model you have shown: it produces correlated parameters and therefore overfits, as mentioned by @David. Again, this should be resolved when you compare different models and find that some do just as well (statistically speaking) even though they involve fewer parameters.

Edit:

To drive the point home regarding the choice of model, here are (1) the results of a trial fit using simulated data and (2) the correlation matrix of the parameters in graphical form:

[Figure 1: trial fit to simulated data]

[Figure 2: correlation matrix of the fitted parameters]

Note that absolute values of the correlation close to 1 indicate strongly correlated parameters, which is highly undesirable. Note also that the trend in the data is practically linear over a long portion of the dataset, which implies that two parameters might suffice over that stretch; using 8 parameters to describe it seems like overkill.
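
If you want to check this on your own fits, you can estimate the correlation matrix from the Jacobian that lsqcurvefit returns. A sketch, assuming the variables from the question (the covariance formula is the usual linearised approximation at the solution):

% Approximate parameter covariance from the Jacobian at the solution,
% then normalise it into a correlation matrix.
[p, resnorm, ~, ~, ~, ~, J] = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2));
dof    = numel(personalData(:,2)) - numel(p);
sigma2 = resnorm / dof;                   % residual variance estimate
C      = sigma2 * inv(full(J.' * J));     % covariance (J comes back sparse)
R      = C ./ sqrt(diag(C) * diag(C).');  % correlation matrix; |R(i,j)| near 1 is bad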