It’s convenient to have a standardized packet structure that includes fields for the multimedia data, a sequence number, a timestamp and other potentially useful information. This is what the Real-time Transport Protocol (RTP) provides.
RTP typically runs on top of UDP: each chunk of media is encapsulated in an RTP packet, which in turn is placed in a UDP segment. Each chunk is preceded by an RTP header containing:
- Payload type: encoding used for the media
- Sequence number: 16-bit integer, incremented by one for each RTP packet sent, so the receiver can detect packet loss and restore packet order
- Timestamp: 32-bit integer representing the sampling instant of the first byte of media in the RTP packet
- Synchronization source identifier (SSRC): 32-bit integer, chosen at random, that identifies the source of each separate RTP stream
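The header fields listed above can be sketched in Python as follows. This is a minimal illustration, not a full RTP implementation: it assumes the 12-byte fixed header layout of RFC 3550 (version 2, no padding, no extension, no contributing sources), and the names `RtpHeader` and `make_packet` are hypothetical.

```python
import struct
from dataclasses import dataclass

RTP_VERSION = 2  # version number defined by RFC 3550 (not mentioned in the notes)


@dataclass
class RtpHeader:
    payload_type: int      # 7 bits: encoding used for the media
    sequence_number: int   # 16 bits: incremented for each packet sent
    timestamp: int         # 32 bits: sampling instant of the first byte
    ssrc: int              # 32 bits: random identifier of the stream source
    marker: bool = False   # 1 bit: assumption, part of the RFC 3550 layout

    def pack(self) -> bytes:
        # First byte: version in the two top bits; padding, extension
        # and CSRC count are all left at zero in this sketch.
        byte0 = RTP_VERSION << 6
        byte1 = (int(self.marker) << 7) | (self.payload_type & 0x7F)
        # "!" selects network byte order: B = 1 byte, H = 2 bytes, I = 4 bytes.
        return struct.pack(
            "!BBHII",
            byte0,
            byte1,
            self.sequence_number & 0xFFFF,
            self.timestamp & 0xFFFFFFFF,
            self.ssrc & 0xFFFFFFFF,
        )

    @classmethod
    def unpack(cls, data: bytes) -> "RtpHeader":
        byte0, byte1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
        if byte0 >> 6 != RTP_VERSION:
            raise ValueError("not an RTP version 2 packet")
        return cls(
            payload_type=byte1 & 0x7F,
            sequence_number=seq,
            timestamp=ts,
            ssrc=ssrc,
            marker=bool(byte1 >> 7),
        )


def make_packet(header: RtpHeader, chunk: bytes) -> bytes:
    """Prepend the RTP header to one chunk of media."""
    return header.pack() + chunk
```

The resulting bytes would then be handed to a UDP socket (for example with `socket.sendto`), which adds the UDP header. The receiver can call `RtpHeader.unpack` on each datagram and compare consecutive sequence numbers to notice missing packets.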
RTP allows each source to be assigned an independent RTP stream, so a two-party video conference generally uses four RTP streams: two for audio (one in each direction) and two for video (again one in each direction).
Although RTP provides a common interface between two end systems, it doesn’t provide all the functionality that might be required. In particular, it has no way to set up a call, so RTP is used together with the Session Initiation Protocol (SIP).
SIP makes it possible to establish a call between two peers over an IP network. More generally it handles call management, including letting the caller find the callee’s current IP address.
SIP is an out-of-band protocol: SIP messages are sent on different sockets from the ones that carry the media. The caller sends a SIP INVITE message to the callee containing the preferred encoding format for the media, as well as the caller’s IP address and the port on which it wants to receive RTP packets.
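To make the content of an INVITE more concrete, here is a rough sketch that builds a simplified message and reads back the media endpoint it advertises. It is deliberately incomplete: a real INVITE needs further mandatory headers (such as Call-ID, CSeq and Max-Forwards) and a complete SDP body, and the function names and example addresses are hypothetical. Payload type 0 is the static RTP code for PCM µ-law audio.

```python
def build_invite(caller: str, callee: str, caller_ip: str,
                 rtp_port: int, payload_type: int = 0) -> str:
    """Build a simplified SIP INVITE (illustrative only, not RFC-complete)."""
    # SDP body: c= gives the address, m= gives the media type, the port on
    # which the caller wants to receive RTP, and the preferred payload type.
    body = (
        f"c=IN IP4 {caller_ip}\r\n"
        f"m=audio {rtp_port} RTP/AVP {payload_type}\r\n"
    )
    headers = [
        f"INVITE sip:{callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {caller_ip}",
        f"From: sip:{caller}",
        f"To: sip:{callee}",
        "Content-Type: application/sdp",
        f"Content-Length: {len(body.encode())}",
    ]
    # A blank line separates the headers from the body.
    return "\r\n".join(headers) + "\r\n\r\n" + body


def media_endpoint(message: str):
    """Return (ip, port, payload_type) advertised in the message body."""
    ip = port = payload_type = None
    _, _, body = message.partition("\r\n\r\n")
    for line in body.splitlines():
        if line.startswith("c="):
            ip = line.split()[-1]
        elif line.startswith("m="):
            parts = line.split()
            port, payload_type = int(parts[1]), int(parts[3])
    return ip, port, payload_type
```

The callee would use the values returned by `media_endpoint` to decide where to send its RTP packets, and would reply with its own address, port and encoding so that media can flow in both directions.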
For the caller to find the callee’s address, SIP uses a SIP proxy server, which determines the callee’s IP address by looking it up in a SIP registrar and then forwards the INVITE message to the callee. Each client updates its entry in the registrar every time it goes online. The lookup works much like DNS, mapping a fixed identifier to a current IP address.
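The registrar lookup can be pictured as a simple table from SIP addresses to current IP addresses. The sketch below is a toy in-memory model under that assumption; the `Registrar` and `Proxy` classes are hypothetical and no network traffic is involved.

```python
class Registrar:
    """Toy registrar: maps a SIP address to the client's current IP."""

    def __init__(self):
        self._bindings = {}

    def register(self, address: str, ip: str) -> None:
        # Called by a client each time it comes online; a newer
        # registration overwrites the previous binding.
        self._bindings[address] = ip

    def lookup(self, address: str):
        return self._bindings.get(address)


class Proxy:
    """Toy proxy: resolves the callee and decides where to forward."""

    def __init__(self, registrar: Registrar):
        self.registrar = registrar

    def route_invite(self, callee: str, invite: str):
        ip = self.registrar.lookup(callee)
        if ip is None:
            # A real proxy would answer the caller with an error response.
            return None
        # Return the destination and the message that would be forwarded.
        return ip, invite


registrar = Registrar()
proxy = Proxy(registrar)
registrar.register("bob@example.com", "203.0.113.5")
destination = proxy.route_invite("bob@example.com", "INVITE ...")
```

If Bob later comes online from another network and registers again, the same lookup returns his new address, which is why the caller only needs to know his SIP address and not his IP.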