In practice, this is not difficult (based on having coded and trained dozens of MLPs).
In a textbook sense, getting the architecture "right" is hard--i.e., to tune your network architecture such that performance (resolution) cannot be improved by further optimization of the architecture is hard, i agree. But only in rare cases is that degree of optimization required.
In practice, to meet or exceed the prediction accuracy from a neural network required by your spec, you almost never need to spend a lot of time with the network architecture--three reasons why this is true:
most of the parameters required to specify the network architecture
are fixed once you have decided on your data model (number of
features in the input vector, whether the desired response variable
is numerical or categorical, and if the latter, how many unique class
labels you've chosen);
the few remaining architecture parameters that are in fact tunable,
are nearly always (100% of the time in my experience) highly constrained by those fixed architecture
parameters--i.e., the values of those parameters are tightly bounded by a max and min value; and
the optimal architecture does not have to be determined before
training begins, indeed, it is very common for neural network code to
include a small module to programmatically tune the network
architecture during training (by removing nodes whose weight values
are approaching zero--usually called "pruning.")
According to the Table above, the architecture of a neural network is completely specified by six parameters (the six cells in the interior grid). Two of those (number of layer type for the input and output layers) are always one and one--neural networks have a single input layer and a single output layer. Your NN must have at least one input layer and one output layer--no more, no less. Second, the number of nodes comprising each of those two layers is fixed--the input layer, by the size of the input vector--i.e., the number of nodes in the input layer is equal to the length of the input vector (actually one more neuron is nearly always added to the input layer as a bias node).
Similarly, the output layer size is fixed by the response variable (single node for numerical response variable, and (assuming softmax is used, if response variable is a class label, the the number of nodes in the output layer simply equals the number of unique class labels).
That leaves just two parameters for which there is any discretion at all--the number of hidden layers and the number of nodes comprising each of those layers.
The Number of Hidden Layers
if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all. (If that's in fact the case, i would not use a NN for this problem--choose a simpler linear classifier).
The first of these--the number of hidden layers--is nearly always one. There is a lot of empirical weight behind this presumption--in practice very few problems that cannot be solved with a single hidden layer become soluble by adding another hidden layer. Likewise, there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very small. One hidden layer is sufficient for the large majority of problems.
In your question, you mentioned that for whatever reason, you cannot find the optimum network architecture by trial-and-error. Another way to tune your NN configuration (without using trial-and-error) is 'pruning'. The gist of this technique is removing nodes from the network during training by identifying those nodes which, if removed from the network, would not noticeably affect network performance (i.e., resolution of the data). (Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training; look for weights very close to zero--it's the nodes on either end of those weights that are often removed during pruning.) Obviously, if you use a pruning algorithm during training then begin with a network configuration that is more likely to have excess (i.e., 'prunable') nodes--in other words, when deciding on a network architecture, err on the side of more neurons, if you add a pruning step.
Put another way, by applying a pruning algorithm to your network during training, you can much closer to an optimized network configuration than any a priori theory is ever likely to give you.
The Number of Nodes Comprising the Hidden Layer
but what about the number of nodes comprising the hidden layer? Granted this value is more or less unconstrained--i.e., it can be smaller or larger than the size of the input layer. Beyond that, as you probably know, there's a mountain of commentary on the question of hidden layer configuration in NNs (see the famous NN FAQ for an excellent summary of that commentary). There are many empirically-derived rules-of-thumb, but of these, the most commonly relied on is the size of the hidden layer is between the input and output layers. Jeff Heaton, author of "Introduction to Neural Networks in Java" offers a few more, which are recited on the page i just linked to. Likewise, a scan of the application-oriented neural network literature, will almost certainly reveal that the hidden layer size is usually between the input and output layer sizes. But between doesn't mean in the middle; in fact, it is usually better to set the hidden layer size closer to the size of the input vector. The reason is that if the hidden layer is too small, the network might have difficultly converging. For the initial configuration, err on the larger size--a larger hidden layer gives the network more capacity which helps it converge, compared with a smaller hidden layer. Indeed, this justification is often used to recommend a hidden layer size larger than (more nodes) the input layer--ie, begin with an initial architecture that will encourage quick convergence, after which you can prune the 'excess' nodes (identify the nodes in the hidden layer with very low weight values and eliminate them from your re-factored network).