Audio and video timestamps are calculated in the same way.
Audio RTP payload formats typically use an 8 kHz clock. Take the first audio sample, containing e.g. 20 ms of audio, and assign it the timestamp t = 0. 20 ms is 1/50 of a second, so the timestamp of the following sample is incremented by 8000/50 = 160. For audio, you can calculate the duration of a sample from its size together with the sample rate, bits per sample, and number of channels. With a live source, the sample might already be timestamped, in which case you simply have to translate that timestamp into an RTP timestamp.
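To make the arithmetic concrete, here is a minimal sketch of the calculations described above. The function names are made up for illustration, and it assumes uncompressed PCM audio and source timestamps expressed in milliseconds.

```python
def ticks_per_packet(clock_rate_hz: int, packet_ms: int) -> int:
    """RTP timestamp increment for a packet carrying packet_ms of audio."""
    return clock_rate_hz * packet_ms // 1000


def pcm_duration_ms(num_bytes: int, sample_rate_hz: int,
                    bits_per_sample: int, channels: int) -> int:
    """Duration of a raw PCM buffer, derived from its size and format."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return num_bytes * 1000 // bytes_per_second


def media_time_to_rtp_ticks(media_time_ms: int, clock_rate_hz: int) -> int:
    """Translate an existing media timestamp (ms) into RTP clock ticks."""
    return media_time_ms * clock_rate_hz // 1000


# 20 ms of audio on an 8 kHz clock
increment = ticks_per_packet(8000, 20)

# 320 bytes of 16-bit mono PCM at 8 kHz
duration = pcm_duration_ms(320, 8000, 16, 1)
```

With these inputs the increment is the 160 from the example above, and the 320-byte buffer corresponds to 20 ms.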
Also note that the RTP timestamp should start at a random number and not at zero; the t = 0 above is relative to that random initial value.
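A rough sketch of how the random start could be applied: pick a random 32-bit initial value once per stream and add the elapsed ticks to it, wrapping at 2^32 because the RTP timestamp field is 32 bits wide. The helper names are hypothetical.

```python
import secrets

RTP_TS_MODULUS = 1 << 32  # the RTP timestamp field is 32 bits wide


def random_initial_timestamp() -> int:
    """Pick the random starting value once, when the stream is created."""
    return secrets.randbits(32)


def rtp_timestamp(initial: int, elapsed_ticks: int) -> int:
    """Timestamp for a sample elapsed_ticks after the start of the stream."""
    return (initial + elapsed_ticks) % RTP_TS_MODULUS


initial = random_initial_timestamp()
first = rtp_timestamp(initial, 0)     # first 20 ms sample
second = rtp_timestamp(initial, 160)  # next sample on an 8 kHz clock
```

Receivers have to cope with the wraparound, so compare timestamps by their difference modulo 2^32 rather than by absolute value.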
A good general reference for RTP is Colin Perkins' book, RTP: Audio and Video for the Internet. While some of the information might be outdated, it will give you a good understanding of RTP.
Update:
The wall clock time is determined using the NTP timestamp in the RTCP SR (sender report), which tells you what time an RTP timestamp maps to, i.e. RTP timestamp 160 = some date/time. This is required e.g. for synchronisation with video, since the audio and video RTP timestamps use different random offsets. Of course, the validity of the date/time depends on how the NTP timestamp is calculated on the sender, and there is no guarantee that it reflects the actual date/time. You can sniff the traffic with Wireshark, which will tell you what date/time an NTP timestamp represents.
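As an illustration of that mapping, the sketch below takes the NTP/RTP timestamp pair from one sender report and converts any other RTP timestamp of the same stream into a Unix time. It assumes the SR fields have already been parsed into integers; the function and parameter names are made up for the example.

```python
from datetime import datetime, timezone

NTP_UNIX_OFFSET = 2_208_988_800  # seconds between 1900-01-01 and 1970-01-01


def ntp_to_unix_seconds(ntp_msw: int, ntp_lsw: int) -> float:
    """Convert a 64-bit NTP timestamp (32-bit seconds, 32-bit fraction)."""
    return (ntp_msw - NTP_UNIX_OFFSET) + ntp_lsw / 2**32


def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_ntp_msw: int,
                     sr_ntp_lsw: int, clock_rate_hz: int) -> float:
    """Map an RTP timestamp to Unix time using the pair from one RTCP SR."""
    # Signed 32-bit difference, so wraparound and packets sent
    # slightly before the SR are both handled.
    diff = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 1 << 32
    return ntp_to_unix_seconds(sr_ntp_msw, sr_ntp_lsw) + diff / clock_rate_hz


# Hypothetical SR: RTP timestamp 160 corresponds to this NTP time.
sr_msw = NTP_UNIX_OFFSET + 1_700_000_000
sr_lsw = 0x80000000  # half a second
t = rtp_to_wallclock(320, 160, sr_msw, sr_lsw, 8000)
print(datetime.fromtimestamp(t, tz=timezone.utc))
```

Doing the same for the video stream, with its own SR and clock rate, puts both streams on the sender's wall clock, which is what makes synchronisation possible despite the different random offsets.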
Update 2:
It depends on what the client is expecting and how it is implemented. If you were writing your own RTSP client, it might suffice. E.g. if you wrote a DirectShow source filter, you could translate the RTP timestamp directly into a media timestamp and that would work. However, since you're using an existing client, it could be that the client only uses synchronised RTP timestamps, in which case it would not suffice. In summary, it depends on the client implementation. I'm not sure what VLC is expecting.