So, if I were trying to just get the thing implemented, I'd probably start with a global search of all triangles - compute the barycentric coordinates of that 2d point for each triangle, find the triangle where the barycentric coordinates are all positive, and then use those to map to 3d (multiply the stu position by the 3d points). I'd do this first, and only if it's not fast enough would I try something more complex.
If it's possible to iterate by triangle rather than by 2d points, then the barycentric method would probably be fast enough. But it seems like you've got a bunch of 2d points at arbitrary positions that need to be mapped, and the points change position from frame to frame?
If you've got this kind of situation, you could probably get a big speedup by implementing a local update per frame. Each 2d point would remember which triangle it was within. Set that as the current triangle. Test if the new position is within the current triangle. If not, then you want to walk the mesh to the adjacent triangle which is closest to the target 2d point. Each edge-adjacent triangle is composed of the two common points on the edge, plus another point. Find which edge-adjacent triangle's other point is closest to the target, and set that as current. Then iterate - seems like it should find it pretty quickly? You could also cache a max size for each triangle, so if the point has moved a lot you can just iterate to the next neighbor without doing the barycentric computation (the max size would need to be the distance such that if you are farther than that distance from any triangle point there is no chance you're inside the triangle. This is the length of the largest edge).
But as you mention in your comments, you can run into problems with meshes that have concavities, holes, or separate connected components, where you may fall into a local minimum. There are a couple of ways to deal with this. I think the simplest is to keep a list of all visited triangles (maybe as a flag on the triangle, vector< bool > or set< triangle index >) and refuse to revisit a triangle. If you find that you've visited all the neighbors of your current triangle, then fall back to a global search. Such failures are likely to be uncommon, so it shouldn't hurt your performance too much.
This kind of per-frame updating can be very fast, and might even be a decent approach for computing the initial containing triangles - just choose a random triangle and walk from there (changes from checking all n triangles to only those that are in roughly a straight line to the target). If it's not fast enough, what you could do is keep a k-d tree (or something similar) of the 2d mesh points as well as a single touching triangle index for each mesh point. To seed the iteration, find the closest point to the target 2d point in the k-d tree, set the adjacent triangle to be current, and then iterate.