First and foremost, you need to understand how this works.
The sender, i.e. the creator of the RTP stream, is probably doing the following:
- Uses a source for the data: in the case of audio, this could be a microphone, raw audio samples, or a file
- Encodes the audio using an audio codec such as AAC or Opus.
- Uses an RTP packetizer to create RTP packets from the encoded audio frames (a minimal sketch of this step follows the list)
- Uses a transport layer such as UDP to send these packets
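To make the packetization step concrete, here is a minimal sketch in Swift that packs the 12-byte RTP fixed header (RFC 3550) in front of one encoded frame. The payload type and SSRC below are placeholder values; a real packetizer would take them from the SDP/RTSP negotiation and follow the payload-format RFC of the codec in use.

```swift
import Foundation

/// Packs the 12-byte RTP fixed header (RFC 3550, no CSRCs or extensions)
/// in front of a single encoded audio frame. payloadType and ssrc are
/// hypothetical values; real ones come from SDP/RTSP negotiation.
struct RTPPacketizer {
    private var sequenceNumber: UInt16 = 0
    private var timestamp: UInt32 = 0
    let payloadType: UInt8 = 97          // dynamic payload type (assumed)
    let ssrc: UInt32 = 0x1234_5678       // hypothetical stream identifier

    mutating func packetize(frame: Data, samplesPerFrame: UInt32) -> Data {
        var packet = Data(capacity: 12 + frame.count)
        packet.append(0x80)                                   // V=2, P=0, X=0, CC=0
        packet.append(payloadType & 0x7F)                     // M=0 + payload type
        packet.append(withUnsafeBytes(of: sequenceNumber.bigEndian) { Data($0) })
        packet.append(withUnsafeBytes(of: timestamp.bigEndian) { Data($0) })
        packet.append(withUnsafeBytes(of: ssrc.bigEndian) { Data($0) })
        packet.append(frame)                                  // encoded frame as payload
        sequenceNumber &+= 1                                  // wraps around on overflow
        timestamp &+= samplesPerFrame                         // RTP clock advances per frame
        return packet
    }
}
```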
Protocols such as RTSP provide the necessary signaling and describe the stream to the receiver. RTP by itself usually isn't enough, since things such as congestion control, feedback, and dynamic bit-rate adaptation are handled with the help of RTCP.
Anyway, in order to store the incoming stream, you need to do the following:
Use an RTP depacketizer to get the encoded audio frames out of the stream. You can write your own or use a third-party implementation; ffmpeg, for instance, is a large framework that contains the necessary code for most codecs and protocols, but for your case a simple RTP depacketizer is enough. Note that the payload may carry codec-specific headers, so make sure you refer to the correct payload-format RFC (e.g. RFC 3640 for AAC, RFC 7587 for Opus).
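As a starting point, the sketch below parses the RFC 3550 fixed header in plain Swift and returns the codec payload. It only skips CSRCs and header extensions, and does no reordering, jitter buffering, or loss handling, all of which a production depacketizer needs.

```swift
import Foundation

struct RTPPacket {
    let payloadType: UInt8
    let sequenceNumber: UInt16
    let timestamp: UInt32
    let payload: Data            // the encoded audio frame(s)
}

/// Parses one UDP datagram as an RTP packet (RFC 3550 fixed header).
/// Reordering, jitter buffering and codec-specific payload headers are
/// deliberately out of scope for this sketch.
func depacketize(_ datagram: Data) -> RTPPacket? {
    let b = [UInt8](datagram)
    guard b.count >= 12, b[0] >> 6 == 2 else { return nil }   // version must be 2
    var offset = 12 + Int(b[0] & 0x0F) * 4                    // skip the CSRC list
    if b[0] & 0x10 != 0 {                                     // skip a header extension
        guard b.count >= offset + 4 else { return nil }
        offset += 4 + (Int(b[offset + 2]) << 8 | Int(b[offset + 3])) * 4
    }
    guard b.count > offset else { return nil }
    return RTPPacket(
        payloadType: b[1] & 0x7F,
        sequenceNumber: UInt16(b[2]) << 8 | UInt16(b[3]),
        timestamp: UInt32(b[4]) << 24 | UInt32(b[5]) << 16
                 | UInt32(b[6]) << 8  | UInt32(b[7]),
        payload: Data(b[offset...]))
}
```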
Once you have access to the encoded frames, you can write them into a media container such as M4A or Ogg, depending on the audio codec used in the stream.
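If the stream happens to be AAC-LC, one simple storage variant (a sketch, not a full muxer) is to prepend a 7-byte ADTS header to each frame and append the result to a file; most players accept the resulting .aac file. A proper M4A would need a real MP4 muxer. The sample-rate index and channel count below are assumptions you would normally read from the SDP.

```swift
import Foundation

/// Wraps one raw AAC-LC frame in a 7-byte ADTS header so that frames
/// can be concatenated into a playable .aac file.
func adtsFrame(for aacFrame: Data,
               sampleRateIndex: UInt8 = 4,     // 4 == 44100 Hz (assumed)
               channels: UInt8 = 2) -> Data {
    let length = aacFrame.count + 7             // frame length includes the header
    var header = [UInt8](repeating: 0, count: 7)
    header[0] = 0xFF                            // syncword (high 8 bits)
    header[1] = 0xF1                            // syncword, MPEG-4, no CRC
    header[2] = (1 << 6)                        // profile: AAC-LC (object type 2 - 1)
              | (sampleRateIndex << 2)
              | (channels >> 2)
    header[3] = ((channels & 0x3) << 6) | UInt8((length >> 11) & 0x3)
    header[4] = UInt8((length >> 3) & 0xFF)
    header[5] = UInt8((length & 0x7) << 5) | 0x1F   // buffer fullness = 0x7FF
    header[6] = 0xFC
    return Data(header) + aacFrame
}

// Usage: append each depacketized AAC frame to an open FileHandle, e.g.
// fileHandle.write(adtsFrame(for: frame))
```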
In order to play the stream, you need to do the following:
Use an RTP depacketizer to get the encoded audio frames, exactly as in the storage case above.
Once you have access to the encoded frames, use an audio decoder (available as a library) to decode them, or check whether your platform supports that codec directly for playback.
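On iOS, one decoding option is AVFoundation's AVAudioConverter, which wraps the system codec. The sketch below assumes 44.1 kHz AAC-LC; the format parameters are assumptions to be replaced with the real stream parameters from the SDP.

```swift
import AVFoundation

/// Decodes raw AAC-LC frames to PCM with the system codec via
/// AVAudioConverter. Sample rate and channel count are assumptions.
final class AACDecoder {
    private let converter: AVAudioConverter
    private let inputFormat: AVAudioFormat
    let outputFormat: AVAudioFormat

    init?(sampleRate: Double = 44100, channels: AVAudioChannelCount = 2) {
        var asbd = AudioStreamBasicDescription()
        asbd.mSampleRate = sampleRate
        asbd.mFormatID = kAudioFormatMPEG4AAC
        asbd.mFramesPerPacket = 1024                 // one AAC-LC frame
        asbd.mChannelsPerFrame = channels
        guard let inFmt = AVAudioFormat(streamDescription: &asbd),
              let outFmt = AVAudioFormat(standardFormatWithSampleRate: sampleRate,
                                         channels: channels),
              let conv = AVAudioConverter(from: inFmt, to: outFmt) else { return nil }
        inputFormat = inFmt
        outputFormat = outFmt
        converter = conv
    }

    func decode(_ frame: Data) -> AVAudioPCMBuffer? {
        // Copy the raw frame into a compressed buffer describing one packet.
        let inBuf = AVAudioCompressedBuffer(format: inputFormat,
                                            packetCapacity: 1,
                                            maximumPacketSize: frame.count)
        frame.withUnsafeBytes {
            inBuf.data.copyMemory(from: $0.baseAddress!, byteCount: frame.count)
        }
        inBuf.byteLength = UInt32(frame.count)
        inBuf.packetCount = 1
        inBuf.packetDescriptions?[0] = AudioStreamPacketDescription(
            mStartOffset: 0, mVariableFramesInPacket: 0,
            mDataByteSize: UInt32(frame.count))

        guard let outBuf = AVAudioPCMBuffer(pcmFormat: outputFormat,
                                            frameCapacity: 1024) else { return nil }
        var fed = false
        var error: NSError?
        _ = converter.convert(to: outBuf, error: &error) { _, status in
            if fed { status.pointee = .noDataNow; return nil }
            fed = true
            status.pointee = .haveData
            return inBuf                             // feed the frame exactly once
        }
        return error == nil ? outBuf : nil
    }
}
```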
Once you have access to the decoded frames, on iOS you can use AVFoundation to play them.
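For the playback step itself, a minimal AVFoundation route is AVAudioEngine with an AVAudioPlayerNode: attach the player, connect it to the mixer, and schedule each decoded PCM buffer as it arrives (the hypothetical `AACDecoder` above would produce such buffers).

```swift
import AVFoundation

/// Plays decoded PCM buffers as they arrive from the decoder.
final class StreamPlayer {
    private let engine = AVAudioEngine()
    private let player = AVAudioPlayerNode()

    init(format: AVAudioFormat) throws {
        engine.attach(player)
        engine.connect(player, to: engine.mainMixerNode, format: format)
        try engine.start()
        player.play()
    }

    /// Call with each buffer produced by the decoder.
    func enqueue(_ buffer: AVAudioPCMBuffer) {
        player.scheduleBuffer(buffer, completionHandler: nil)
    }
}
```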
If you are looking for an easy way to do all of this, consider a third-party framework such as http://audiokit.io/.