Vectorized Approach
You can perform elementwise multiplication between A and the transpose of B, sum along the second dimension and finally perform elementwise division with L -
result = sum(A.*B.',2)./L
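To see why this works, note that the loop computes the dot product of each row of A with the corresponding column of B, which is exactly the i-th diagonal element of A*B. A quick sanity check on a small random input (the sizes below are arbitrary, chosen just for verification) confirms both give the same result -

```matlab
%// Small random inputs (arbitrary sizes, for verification only)
A = rand(5,4); B = rand(4,5); L = rand(5,1);

%// Loopy reference result
ref = zeros(5,1);
for i = 1:5
    ref(i) = A(i,:)*B(:,i)/L(i);
end

%// Proposed vectorized result
out = sum(A.*B.',2)./L;

%// Maximum absolute difference; should be at floating-point
%// round-off level (on the order of eps)
max(abs(ref - out))
```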
Benchmarking
This section covers runtime and speedup tests for the proposed approach against the loop-based approach listed in the question. Please note that the diag based approach suggested in the second part of the question wasn't covered in these tests, as its runtimes were comparatively much higher.
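For reference, that diag based approach would presumably be the one-liner below (this is an assumed reconstruction of the version mentioned in the question). It computes the entire N x N product A*B only to keep its N diagonal elements, which explains the comparatively huge runtimes -

```matlab
%// Assumed diag based version: forms the full N x N product A*B
%// and then extracts just the N diagonal elements
result = diag(A*B)./L;
```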
Benchmarking Code
N_arr = 4000:4000:100000;          %// Datasizes
timeall = zeros(2,numel(N_arr));   %// Array to store runtimes
for iter = 1:numel(N_arr)
    %// Create random inputs
    N = N_arr(iter);
    A = rand(N,4); B = rand(4,N); L = rand(N,1);

    %// Time the approaches
    f = @() mult_twomat_loopy(A,B,L);
    timeall(1,iter) = timeit(f); clear f

    f = @() mult_twomat_vect(A,B,L);
    timeall(2,iter) = timeit(f); clear f
end

%// Plot speedups
figure, hold on, grid on
plot(N_arr,timeall(1,:)./timeall(2,:),'-bo')
xlabel('Datasize, N (Rows in A)'), ylabel('Speedup Factor (x)')
title('Speedup Plot')
Associated function codes
mult_twomat_loopy.m:
function result = mult_twomat_loopy(A,B,L)
N = size(A,1);
result = zeros(N,1);   %// Preallocate output
for i = 1:N
    result(i) = A(i,:)*B(:,i)/L(i);
end
mult_twomat_vect.m:
function result = mult_twomat_vect(A,B,L)
result = sum(A.*B.',2)./L;
Speedups for the proposed approach over the loopy one

Conclusions
As can be seen from the speedup plot, the proposed approach looks like a very good bet to solve the problem. An interesting observation from the plot is the sudden dip in performance for the proposed approach against the loopy approach at datasizes beyond roughly 32K. The most likely reason behind this dip is the system memory bandwidth becoming the bottleneck at such large datasizes, but thankfully even then the speedup is still maintained above the 20x mark.