Closely evaluate what you mean by "profiling". You are indeed operating very close to bare metal, and it's likely that you will be required to take on some of the work performed by a tool like gprof.
Do you want to time a function call? or an ISR? How about toggling a GPIO line upon entering and exiting the code under inspection. A data logger or oscilloscope can be set to trigger on these events. (In my experience, a data logger is more convenient since mine could be configured to capture a sequence of these events - allowing me to compute average timings.)
Do you want to count the number of executions? The Cortex A8 comes equipped with a number of features (like configurable event counters) that can assist: link. Your ARM chip may be equipped with other peripherals that could be used, as well (depending on the vendor). Regardless, take a look at the above link - the new ARMs have lots of cool features that I don't get to play with as much as I would like! ;-)