I think the appropriate solution is to abstract this functionality into a function. That function may well be templated; and it may well use a loop - the loop will be really short, after all. Many matrix operations are implemented using loops - that's not a problem.
For example, given your example of...
MatrixXd p0(2, 4);
p0 <<
1, 23, 6, 9,
3, 11, 7, 2;
MatrixXd p1(2, 2);
p1 <<
2, 20,
3, 10;
then we can construct a matrix D such that D(i,j) = |p0(i) - p1(j)|2
MatrixXd D(p0.cols(), p0.rows());
for (int i = 0; i < p1.cols(); i++)
D.col(i) = (p0.colwise() - p1.col(i)).colwise().squaredNorm().transpose();
I think this is fine - we can use some broadcasting to avoid 2 levels of nesting: we iterate over p1's points, but not over p0's points, nor over their dimensions.
However, you can make a oneliner if you observe that |p0(i) - p1(j)|2 = |p0(i)|2 + |p1(j)|2 - 2 p0(i)Tp1(j). In particular, the last component is just matrix multiplication, so D = -2 p0Tp1 + ...
The blank left to be filled is composed of a component that only depends on the row; and a component that only depends on the column: these can be expressed using rowwise and columnwise operations.
The final "oneliner" is then:
D = ( (p0.transpose() * p1 * -2
).colwise() + p0.colwise().squaredNorm().transpose()
).rowwise() + p1.colwise().squaredNorm();
You could also replace the rowwise/colwise trickery with an (outer) product with a 1 vector.
Both methods result in the following (squared) distances:
1 410
505 10
32 205
50 185
You'd have to benchmark which is fastest, but I wouldn't be surprised to see the loop win, and I expect that's more readable too.