I will try to answer your last question first; hopefully it will make things clearer.
Whenever a graphic artist creates a building, a monster or a landscape, she needs to define a coordinate system. It consists of an origin (where is 0,0?) and some axes (where are the x and y directions?). The choice is completely arbitrary and has no real importance, but for the sake of simplicity (and, to an extent, to reduce errors due to lack of precision), the origin is usually either close to the center of the object or at a reference point (the root of a tree, for instance).
Now, if the artist puts the monster and the building together in the landscape, their coordinates will most likely not match. She could have created the building together with the landscape, although this is not always feasible, let alone handy. But for a monster that has to run after some pitiful MMO player, that is just not possible.
So we need a way to know where the arms, legs, teeth, tentacles and whatever else you would prefer not to see will end up in the world while the monster is running. Their positions are very well known relative to the monster. This is what we might call, say, the monster coordinates. More generally, we call these local coordinates, meaning local to the monster.
So what are the world coordinates? Usually they refer to whatever makes the most sense as a fixed reference: the element that everything else is considered to move relative to. Here, that is the landscape.
This is where matrices come into play. What is the matrix? The matrix is an operator that lets you express coordinates in a different coordinate system. It maps coordinates from one system to another: from monster to scene, from scene to camera, from camera to screen...
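To make this concrete, here is a minimal 2D sketch of going from monster coordinates to world coordinates. It assumes 3x3 homogeneous matrices and column vectors, which is one common convention and not necessarily the one your engine uses. The names `local_to_world` and `apply`, and the monster's position, are made up for the illustration.

```python
import math

def local_to_world(angle, tx, ty):
    """3x3 homogeneous matrix: rotate by `angle` (radians), then move to (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c,  -s,  tx],
            [s,   c,  ty],
            [0.0, 0.0, 1.0]]

def apply(m, point):
    """Apply matrix m to a 2D point (x, y), treated as the column vector (x, y, 1)."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

# Hypothetical monster: standing at (10, 5) in the landscape, turned by 90 degrees.
monster_to_world = local_to_world(math.pi / 2, 10.0, 5.0)

# A tentacle tip 2 units along the monster's own x axis (monster coordinates).
tentacle_local = (2.0, 0.0)

# The same tentacle tip expressed in world coordinates: approximately (10, 7).
tentacle_world = apply(monster_to_world, tentacle_local)
print(tentacle_world)
```

The tentacle never changes in monster coordinates; only the monster-to-world matrix changes as the monster runs around.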
A matrix can express any of these transformations from one system to another: translation, rotation, scaling, shearing, flattening... Or all of them at the same time. The identity matrix is the matrix that does not change anything. Matrices can also be combined: by multiplying a translation matrix and a rotation matrix, we get a single matrix that does both, translating then rotating. Do this a couple of times and you get the position of the tip of the articulated arm of a robot in a car factory, just by combining the matrices of each joint.
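The robot arm can be sketched in a few lines. This is only an illustration under assumptions of mine: a flat 2D arm with two joints, 3x3 homogeneous matrices, and column vectors (so in a product the rightmost matrix is applied first). The helper names and the joint angles and link lengths are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices; with column vectors, b is applied first."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def rotation(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply(m, point):
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

# The identity matrix leaves every point where it is.
IDENTITY = translation(0.0, 0.0)

def joint(angle, length):
    """Matrix of one joint: go `length` along the link, then turn by `angle`."""
    return matmul(rotation(angle), translation(length, 0.0))

# Hypothetical arm: first link of length 2 turned by +90 degrees,
# second link of length 1 turned by -90 degrees relative to the first.
shoulder = joint(math.pi / 2, 2.0)
elbow = joint(-math.pi / 2, 1.0)

# One combined matrix from "tip of the arm" coordinates to "base" coordinates.
arm = matmul(shoulder, elbow)

# The tip is the origin of the last frame; in base coordinates: approximately (1, 2).
tip = apply(arm, (0.0, 0.0))
print(tip)
```

Adding a third joint is just one more `matmul`, which is the whole appeal of the approach.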
Then we run into where your problem may lie: translating then rotating is not the same as rotating then translating. If you are not convinced, try it yourself: walk then turn, or turn then walk, and see how you do not end up in the same location. So in the end it means matrices have to be applied in a specific order, which depends only on what you want to do.
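Here is the same experiment with numbers, as a small sketch. I am assuming 2D homogeneous matrices, column vectors (the rightmost matrix in a product is applied first) and rotation about the origin; the helper names are mine.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices; with column vectors, b is applied first."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def rotation(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply(m, point):
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

quarter_turn = rotation(math.pi / 2)   # 90 degrees about the origin
step = translation(1.0, 0.0)           # one unit along x
start = (1.0, 0.0)

# Translate first, then rotate: (1, 0) -> (2, 0) -> approximately (0, 2).
move_then_turn = apply(matmul(quarter_turn, step), start)

# Rotate first, then translate: (1, 0) -> (0, 1) -> approximately (1, 1).
turn_then_move = apply(matmul(step, quarter_turn), start)

print(move_then_turn, turn_then_move)
```

Same two matrices, same starting point, two different destinations. Only the order of the product changed.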
From the explanations you give, I suspect this is where things are going wrong for you, since translating gives a different result if you scale first.
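For the scale and translate case specifically, a sketch like the following shows the difference. I do not know which convention your code uses, so this assumes column vectors (the rightmost matrix is applied first) and hypothetical values: a scale of 2 and a translation of 3 along x.

```python
def matmul(a, b):
    """Product of two 3x3 matrices; with column vectors, b is applied first."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def scaling(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

def apply(m, point):
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

double = scaling(2.0, 2.0)
shift = translation(3.0, 0.0)
corner = (1.0, 1.0)

# Scale first, then translate: (1, 1) -> (2, 2) -> (5, 2).
scale_then_translate = apply(matmul(shift, double), corner)

# Translate first, then scale: (1, 1) -> (4, 1) -> (8, 2).
# The translation itself gets scaled, so the object moves 6 units instead of 3.
translate_then_scale = apply(matmul(double, shift), corner)

print(scale_then_translate)   # (5.0, 2.0)
print(translate_then_scale)   # (8.0, 2.0)
```

If your object ends up further away (or closer) than the translation you asked for, a scale applied after the translation is a likely culprit.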