0
votes

Suppose I have a set of x,y points to plot for an image with gnuplot. It works as expected and I get a nice curve. I want to repeat the experiment for a large dataset of images (say 1000). At that point I would get 1000 curves on one plot, one curve per image. How do I tell gnuplot to draw a best fit of the curves?

I would like gnuplot to give me the x,y points of the best-fit curve in a CSV file, as I plan to have a single plot of best fits later.

The data can be found here

2
What's the mathematical model for the fitting? If you have gnuplot draw a curve through a set of points, then my recollection is that it isn't finding any mathematical relationship between the points, but just using a cubic spline or other simple function to fit the curves on a point-by-point basis. You could somehow "average" these curves, but it wouldn't be a "best fit" in any mathematically rigorous sense. To have a best fit you have to have a mathematical model that you take your data to be a measurement of. Is your data related in some mathematical way? - Kevin Boone
@KevinBoone With fit you can fit any function to your data. - Christoph
@KevinBoone How do I do a good average of the curves, then? I don't have the mathematical function. - user2650277
Hard to say without looking at the data. A general approach is to divide the x range into small blocks (the number needed will depend on the number of data points), then take an average of the y values of all the data points that fall into each x region. Then plot a single line using the centres of the x regions and the corresponding y averages as the data points. To be honest, I can't remember whether gnuplot can do this calculation automatically. This approach gives the same kind of line you would draw if you just "eyeballed" the data; however, it has no mathematical validity at all. - Kevin Boone
@KevinBoone The data and plots can be found at encode.ru/threads/… - user2650277

2 Answers

2
votes

If I understand you correctly, you want to draw an average line through the data rather than fit a function to the data. You can do this using the smooth option of the plot command.

Depending on your needs you could draw an interpolation function through your data. For example:

plot \
"libjpeg-2000-bench.png.csv" u 3:5 w p, \
"libjpeg-2000-mural.png.csv" u 3:5 w p, \
"libjpeg-2000-red-room.png.csv" u 3:5 w p, \
"libjpeg-bench.png.csv" u 3:5 w p, \
"libjpeg-mural.png.csv" u 3:5 w p, \
"libjpeg-red-room.png.csv" u 3:5 w p, \
 "< tail -q -n +4  libjpeg*csv" u 3:5 smooth acsplines   w l lw 2

gives

[Image: the six data sets plotted as points, with the acsplines curve drawn through the pooled data]

You might want to experiment with the various smoothing functions, see help smooth. Some of those functions also take additional parameters. For example, you can specify a weight for the acsplines interpolation:

plot \
"libjpeg-2000-bench.png.csv" u 3:5 w p, \
"libjpeg-2000-mural.png.csv" u 3:5 w p, \
"libjpeg-2000-red-room.png.csv" u 3:5 w p, \
"libjpeg-bench.png.csv" u 3:5 w p, \
"libjpeg-mural.png.csv" u 3:5 w p, \
"libjpeg-red-room.png.csv" u 3:5 w p, \
"< tail -q -n +4  libjpeg*csv" u 3:5:(100) smooth acsplines title "acsplines, weight = 100" w l lw 2,  \
"< tail -q -n +4  libjpeg*csv" u 3:5:(0.1) smooth acsplines title "acsplines, weight = 0.1" w l lw 2

[Image: the same data points with two acsplines curves, for weight = 100 and weight = 0.1]

The choice of the weight involves a trade-off: if the weight is large then the curve will follow the data points more closely, but will likely exhibit oscillations.

Alternatively you can bin the data points in the x direction, and average those data points that fall within the same bin. Luckily you can do all this from within gnuplot:

round(x) = floor(x+0.5)
bin(x,binwidth) = binwidth*round(x/binwidth)
binwidth = 1.
plot \
"libjpeg-2000-bench.png.csv" u 3:5 w p, \
"libjpeg-2000-mural.png.csv" u 3:5 w p, \
"libjpeg-2000-red-room.png.csv" u 3:5 w p, \
"libjpeg-bench.png.csv" u 3:5 w p, \
"libjpeg-mural.png.csv" u 3:5 w p, \
"libjpeg-red-room.png.csv" u 3:5 w p, \
 "< tail -q -n +4  libjpeg*csv"  u (bin($3,binwidth)):5 smooth uniq  w l lw 2

gives

[Image: the same data points with the bin-averaged line]

Here you can adjust the bin size, binwidth, to your needs.
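Since the question also asks for the averaged x,y points as CSV, here is a minimal Python sketch of the same rounding-based binning, written with the standard library only. It is an illustration rather than part of the gnuplot solution: the names bin_center, binned_means and to_csv are made up for this example, and the sample points are hypothetical, not the real data.

```python
import csv
import io
import math
from collections import defaultdict


def bin_center(x, binwidth):
    """Mirror of the gnuplot helper: binwidth * floor(x/binwidth + 0.5)."""
    return binwidth * math.floor(x / binwidth + 0.5)


def binned_means(points, binwidth):
    """Average the y values of all points whose x maps to the same bin centre.

    Returns (bin centre, mean y) pairs sorted by x, which is the kind of
    data `smooth uniq` draws once x has been replaced by the bin centre.
    """
    groups = defaultdict(list)
    for x, y in points:
        groups[bin_center(x, binwidth)].append(y)
    return [(c, sum(ys) / len(ys)) for c, ys in sorted(groups.items())]


def to_csv(rows):
    """Format (x, y) rows as CSV text, one pair per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical sample points pooled from several curves (not the real data).
points = [(0.2, 10.0), (0.4, 12.0), (1.1, 20.0), (0.9, 22.0), (2.6, 30.0)]
averaged = binned_means(points, binwidth=1.0)
print(to_csv(averaged), end="")
```

Redirecting the printed output to a file gives one averaged curve per dataset, which can then be plotted together later. Bins that receive no points simply do not appear in the output.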

1
votes

I have to admit that it's not completely clear to me what exactly you want to achieve. Nevertheless, I also have the feeling that, as mentioned by @KevinBoone in the comments, you are trying to compute some kind of binned statistic on the data. If this is the case, then Gnuplot is unfortunately not the proper tool for the task. In my opinion, it would be much more practical to delegate this processing to something more appropriate.

As an example, let's say that the strategy would indeed be:

  1. load all the csv files in the current directory
  2. divide the x-range into M bins and calculate the average of the y-values that fall into each of the bins
  3. plot this "averaged" data

To this end, one might prepare a short Python script (which implements the steps outlined above) based on the binned_statistic function provided by the scipy toolkit. The required number of bins is passed as the first argument, while the remaining arguments are interpreted as csv files to process:

#!/usr/bin/env python
import sys

import numpy as np
from scipy.stats import binned_statistic

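num_of_bins = int(sys.argv[1])

# collect the (x, y) pairs from all files given on the command line
data = []
for fname in sys.argv[2:]:
    with open(fname, 'r') as F:
        for line_id, line in enumerate(F):
            # skip the first three lines of each file
            if line_id < 3:
                continue

            cols = line.strip().split(',')
            # x and y are the 3rd and 4th columns (same as "u 3:4" in gnuplot)
            x, y = map(float, [cols[i] for i in [2, 3]])
            data.append((x, y))

data = np.array(data)
# mean of y in num_of_bins equal-width bins spanning the full x range
stat, bin_edges, _ = binned_statistic(data[:, 0], data[:, 1], 'mean', bins=num_of_bins, range=None)

# print the centre of each bin together with the corresponding average
for val, (lb, ub) in zip(stat, zip(bin_edges, bin_edges[1:])):
    print('%E,%E' % ((lb + ub) / 2, val))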

Now, in Gnuplot, we can invoke this script externally (let's say that it is stored in the current working directory as stat.py) and plot its output together with the individual files:

set terminal pngcairo enhanced
set output 'fig.png'

#get all csv files in current directory as a space-delimited string
files = system("ls *.csv | xargs")

#construct a "pretty" label from the file name
getLabel(fname)=system(sprintf('echo "%s" | gawk -F"-" "BEGIN{OFS=\"-\"} {NF=NF-2;print}"', fname))

set datafile separator ","
set key spacing 1.5

LINE_WIDTH = 1.25
plot \
    for [filename in files] filename u 3:4 w l lw LINE_WIDTH t getLabel(filename), \
    sprintf('<python ./stat.py 20 %s', files) w l lw 3*LINE_WIDTH lc rgb 'red' t 'average'

With some of the sample data you provided in the comments, this produces the following plot:

[Image: the individual curves, with the average drawn as a thick red line]

However, as pointed out by @KevinBoone, whether this "average" has a justifiable mathematical meaning in your specific setting is another question on its own...
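To make the binning step concrete, the following standard-library sketch spells out what a mean over equal-width bins computes. It is meant as an illustration of the idea, not as a replacement for binned_statistic: the function name equal_width_bin_means and the toy data are invented for this example, and it assumes the x values are not all identical. Bins that receive no points get NaN, so it can be worth dropping them before plotting.

```python
import math


def equal_width_bin_means(xs, ys, num_bins):
    """Mean of ys in num_bins equal-width bins spanning [min(xs), max(xs)].

    Returns (bin centre, mean) pairs; the mean is NaN for an empty bin.
    Assumes max(xs) > min(xs).
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / num_bins
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for x, y in zip(xs, ys):
        # The right edge of the range is counted in the last bin.
        i = min(int((x - lo) / width), num_bins - 1)
        sums[i] += y
        counts[i] += 1
    return [
        (lo + (i + 0.5) * width, sums[i] / counts[i] if counts[i] else math.nan)
        for i in range(num_bins)
    ]


# Toy data (hypothetical): two clusters with an empty stretch in between.
xs = [0.0, 1.0, 2.0, 9.0, 10.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
means = equal_width_bin_means(xs, ys, num_bins=5)

# Keep only the bins that actually contain data.
kept = [(c, m) for c, m in means if not math.isnan(m)]
for c, m in kept:
    print("%E,%E" % (c, m))
```

With this toy data the two middle bins are empty, which shows why the number of bins has to be chosen with the density of the data in mind.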