0
votes

I have to calculate the distance of all UK postcodes against each other, then sum the population of all postcodes within 1 mile. The postcode & population list are stored in a text file. I'm most familiar with matlab, but I also have Stata & PSPP available. The program is currently scheduled to take about 2 weeks. Is there anything I can do to accelerate this process??? Here is my code. Matlab generated the script to import the text data. The distance function is from the mapping toolbox, and performs the great circle formula.

Any help is greatly appreciated.

function pcdistance(postcode, pop, lat, lon)
%Finds total population for UK postcode within 1 mile radius


fid = fopen('PPC.txt','a');

n = length(postcode);

%Calculates distance of 1 postcode at a time, against all others
%All data that doesn't meet rules is deleted


for i = 1:n;

    dist = [];
    dist(:,1)= pop;

    for j = 1:n;
        dist(j,2) = distance(lat(i),lon(i),lat(j),lon(j),3963.17);
        good = dist(1:j,2)<= 1;
    end

    dist = dist(good,:);
    tot = sum(dist(:,1));


    fprintf(1,'%s,%d;',postcode{i},tot)

end

%Find sum of population within 1 mile

fclose(fid);

end

Here's a small sample of the inputs, from the txt file. The columns are "postcode, pop, lat, long" respectively.

"BD7 1DB",749,53.79,-1.76
"M15 6AA",748,53.46,-2.24
"WR2 6AJ",748,52.19,-2.24
"M15 6PF",745,53.46,-2.23
"IP7 7RA",741,52.12,0.96
"CF62 4WA",740,51.41,-3.41
"M1 2AR",738,53.47,-2.22
"NG1 4BR",737,52.95,-1.14
"ST16 3AW",735,52.81,-2.11
"AB25 1LE",733,57.15,-2.10
"WF2 9AG",730,53.68,-1.50
"DT11 8RH",730,50.86,-2.12
"CW1 5NP",729,53.09,-2.41
"TR12 7RH",724,50.08,-5.25
"ST5 5DY",723,53.00,-2.27
"HA1 3HP",723,51.57,-0.33
"DL10 7NP",722,54.37,-1.62
"M1 7HR",719,53.47,-2.23
"B18 4AS",719,52.49,-1.93
"OX13 6JB",716,51.68,-1.30

Here's the corrected code.

function pcdistance4(postcode, pop, lat, lon)
%Finds total population for UK postcode within 1 mile radius


fid = fopen('PPC.txt','A');

n = length(postcode);


% Pre-allocation
dist = zeros(n,2);
tot = zeros(n,1);

tic

for i = 1:n;


    dist(:,1)= pop;

    dist(:,2) = distance(lat(i),lon(i),lat(:),lon(:),3963.17);

      good = dist(:, 2) <= 1 & dist(:,2) ~=0;

    tot(i) = sum(dist(good, 1));
    tot(i) = tot(i) + pop(i);

end

toc
tic

for j = 900001:n;
    fprintf(fid,'%s,%d;\n',postcode{j},tot(j));
end

toc

fclose(fid);

end
4
Could you provide sample inputs for the function? - Dev-iL
I added sample inputs to the end of the post - user3897165
Thanks everyone. You guys shortened the time to under 4 days. I took suggestions from everyone, but I gave gire the check because he gave the most help. My corrected code is listed in the answer... - user3897165

4 Answers

3
votes

Just a few general tips to get you started:

  1. For your personal education: run your code using profiler to see where the majority of the computational time is spent. This will be the first clue on where to begin your optimization. (link to doc)

  2. You shouldn't be writing to the hard drive on every step of the loop, since I\O are very costly in computation time. Instead you should save a bunch of strings to memory and write these "chunks" every once in a while. link1 link2

  3. You can try using parfor instead of for(link to doc). Or possibly even CUDA if available to you. (link to doc)

  4. Consider using the Geodetic Toolbox. It could be easier to convert the lat\lng coordinates to UTM (i.e. cartesian) and then use some standard functions to find distances.

Also:

  • My suggestion: don't use i and j as indices for your loops - this is often considered bad practice (due to possible confusion with imaginary numbers).
1
votes

You should consider memory-complexity as well. Consider pre-allocating the variable dist outside the for-loops and overwrite it element by element.

For example (see comments in the modified function):

function pcdistance(postcode, pop, lat, lon)

fid = fopen('PPC.txt','a');

n = length(postcode);

% Pre-allocation
dist = zeros(n,n);

for i = 1:n;

    % Avoid "deleting" the variable, you can overwrite it as the number of
    % elements is always the same
    % dist = [];
    dist(:,1)= pop;

    for j = 1:n;

        % Unfortunately I do tno have the mentioned toolbox, but there is a
        % high chance that you can avoid the for-loop. Probabily something
        % like:
        %    dist(:, 2) = distance(lat(i), lon(i), lat, lon, ...)
        % Try to vectorize it.
        dist(j,2) = distance(lat(i),lon(i),lat(j),lon(j),3963.17);

        % There is no need for this operation, is highly redundant and
        % computationally expensive:
        %   - in the first loop you will check 1
        %   - in the second loop you will check two elements (1 redundant)
        %   - in the jth loop you will check j elements (j-1 redundant)
        % The total redundant operations are 1+2+3+...+n-1.
        %good = dist(1:j,2)<= 1;
    end

    % better do this
    good = dist(:, 2) <= 1;

    % also memory expensive.
    % dist = dist(good,:);

    % Better do the indexing directly
    tot = sum(dist(good, 1));

end

% Write outside as recommended by Dev-iL

%Find sum of population within 1 mile

fclose(fid);

end
1
votes

I cannot believe that distance() can only compare two points at a time when there should be no problem in vectorizing some sinus and cosine functions. So here is a trimmed vectorized version which i wrote for my own purposes some time ago. Maybe because i did not have that toolbox or i did not know about it. Frankly it does not give the exact same result as distance(), which i just tested. If you need the exact result of distance() you better not use this vectorized version.

function dist = distance_on_earth(lat0, lon0, lats, lons, radius)
degree2radians = pi/180;

% phi = 90 - latitude
phi0 = (90-lat0)*degree2radians;
phis = (90-lats)*degree2radians;

% theta = longitude
theta0 = lon0*degree2radians;
thetas = lons*degree2radians;

% sperical distance:
cosine = sin(phi0)*sin(phis)*cos(theta0-thetas)+cos(phi0)*cos(phis);
arc = acos(cosine);
dist = arc*radius;

Also in addition to what Dev-iL suggested, you can at least take the following line out of the inner loop:

good = dist(1:j,2)<= 1;

Good luck! Nras

0
votes

The uk is so small that you can still get reasonable results without worrying about the curving of the earth. You can simply estimate the distance by using the difference in latitude and longtitude.

This example is a bit oversimplified, but would suggest that once the data is read in, you could do the actual calculations in under an hour.

x=rand(1.7e6,1);                %Fake x data
y=x;                            %Fake y data
tic
for t=1:1.7e3                   % One thousandst part of the work to be done
    (x-0.5).^2+(x-0.2).^2>0.01; %Simple distance calculation from a point (0.5,0.2), then comparing to treshold
end
toc                             %Runs for about 2 seconds

Using the real distance may take a bit longer, but it should still not take more than 1 or 2 hours to complete.