Federico Mengozzi



Since some packets may be lost or arrive late at the destination, it’s important to decide when to play back a chunk and what to do about a missing packet.

Packet loss and delay

VoIP usually cannot take advantage of TCP's retransmission-based loss recovery, since retransmissions increase the end-to-end delay, which VoIP cannot afford. For this reason VoIP usually runs over UDP.

Usually, an end-to-end delay of up to 150 ms is not perceived by humans, a delay between 150 and 400 ms is acceptable, and a delay of more than 400 ms should be avoided.

There are several techniques for preserving acceptable audio quality despite packet loss; these are called loss recovery schemes.

Forward Error Correction

The first approach is to add redundant information to the original packet stream. One way is to send a redundant chunk after every $n$ chunks, computed as the XOR of the $n$ original chunks. In this way, if any one of these $n+1$ chunks is lost, it can be reconstructed from the others. The larger $n$ is, the longer the playout delay, because the receiver must wait for the whole group before it can repair a loss.
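The XOR scheme can be sketched as follows. This is a minimal illustration, not a real FEC implementation: the function names are hypothetical, and chunks are assumed to be byte strings of equal length.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(chunks: List[bytes]) -> bytes:
    """Redundant chunk: the XOR of the n original chunks of a group."""
    return reduce(xor_bytes, chunks)


def recover_group(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild a group in which at most one original chunk (None) was lost."""
    missing = [i for i, chunk in enumerate(received) if chunk is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can repair only one lost chunk per group")
    repaired = list(received)
    if missing:
        present = [chunk for chunk in received if chunk is not None]
        # XOR-ing the parity with every surviving chunk leaves the lost one.
        repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired


group = [b"abcd", b"efgh", b"ijkl"]   # n = 3 original chunks
parity = make_parity(group)           # the (n+1)-th, redundant chunk
print(recover_group([group[0], None, group[2]], parity))
```

The receiver can only repair a loss once the rest of the group and the parity chunk have arrived, which is where the extra playout delay for large $n$ comes from.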

Another way is to send a lower-resolution copy of the stream as the redundant information: the low-resolution version of chunk $n$ is appended to the original chunk $n+1$. With this approach the receiver needs only two packets before it can start playing the media, so the playout delay stays low. The scheme is easy to extend by also including the low-resolution chunks $n-1$ and $n-2$ in each packet, as best suits the situation.


Interleaving

Interleaving consists of dividing a portion of the audio stream into several units, for example $u_1, \dots, u_k$, and creating two new chunks, one with the units $u_{2i}$ and one with the units $u_{2i+1}$. In this way, if only one of the two chunks is received, the reconstructed stream has many small holes rather than missing half of a speech spurt. Interleaving can drastically improve the perceived quality of the audio, but it increases latency.
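A minimal sketch of interleaving into two chunks, assuming units are items of a Python list and a lost chunk is represented by `None`. The function names are illustrative only, and the code uses 0-based positions, so the "even" chunk holds the first, third, fifth unit and so on.

```python
def interleave(units):
    """Split the units into two chunks: alternate units go to alternate chunks."""
    return units[0::2], units[1::2]


def deinterleave(first, second, total):
    """Restore the original order; units from a lost chunk stay None (small holes)."""
    out = [None] * total
    if first is not None:
        out[0::2] = first
    if second is not None:
        out[1::2] = second
    return out


units = ["u1", "u2", "u3", "u4", "u5", "u6"]
first, second = interleave(units)
# The second chunk is lost: every other unit is missing, not one long stretch.
print(deinterleave(first, None, len(units)))
```

The receiver has to wait for both chunks before it can restore the original order, which is the source of the added latency.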

Error Concealment

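Speech spurts show a large amount of short-term self-similarity. One way to take advantage of this is packet repetition, which replaces a lost packet with a copy of the packet received just before it (low overhead, not great quality). Another is interpolation, which estimates the lost packet from the audio before and after it (high overhead, but great quality).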
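Both ideas can be sketched on chunks represented as lists of samples. This is a deliberately simplistic illustration: the function names are hypothetical, and the straight-line interpolation between neighbouring samples is an assumption chosen for brevity, far cruder than what a real codec would do.

```python
def conceal_by_repetition(chunks):
    """Replace each lost chunk (None) with a copy of the latest chunk before it."""
    out, last = [], None
    for chunk in chunks:
        if chunk is None and last is not None:
            chunk = list(last)
        out.append(chunk)
        if chunk is not None:
            last = chunk
    return out


def conceal_by_interpolation(prev_chunk, next_chunk, length):
    """Fill a lost chunk with samples on the line from prev_chunk[-1] to next_chunk[0]."""
    start, end = prev_chunk[-1], next_chunk[0]
    step = (end - start) / (length + 1)
    return [start + step * (k + 1) for k in range(length)]


print(conceal_by_repetition([[1, 2], None, [5, 6]]))
print(conceal_by_interpolation([0.0, 2.0], [8.0, 9.0], 2))
```

Note that interpolation needs the chunk after the gap, so the receiver must wait for it, while repetition only needs what has already arrived.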

Packet jitter

Jitter is the variation in the queuing delays that packets experience. It can be removed using sequence numbers, timestamps and a playout delay. Sequence numbers and timestamps are assigned on the sender side when the chunk is generated, while the playout delay is handled on the receiver side.

Fixed playout delay

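A chunk is played exactly $q$ ms after it has been generated. The right value of $q$ depends on the end-to-end delay of the network, and choosing it is not trivial: a large $q$ lets more late packets make their playout time but adds delay, while a small $q$ keeps the delay low but causes more packets to miss their playout time.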
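A minimal sketch of a fixed playout schedule, assuming each packet is described by a hypothetical pair `(t, r)` of generation and arrival times in milliseconds on a common clock, and assuming that a packet arriving after its playout time is treated as lost.

```python
def fixed_playout(packets, q):
    """Return the playout time t + q for each packet, or None if it arrives too late.

    packets: list of (t, r) pairs, generation time and arrival time in ms.
    q: fixed playout delay in ms.
    """
    schedule = []
    for t, r in packets:
        playout = t + q
        schedule.append(playout if r <= playout else None)
    return schedule


# With q = 100 ms the second packet (delay 110 ms) misses its slot.
print(fixed_playout([(0, 80), (20, 130), (40, 135)], q=100))
```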

Adaptive playout delay

In contrast to fixed playout, the playout delay is not constant: it is estimated from the observed network delay and adjusted at the beginning of each speech spurt. As with fixed playout, there is still a value that determines when a packet should be played.

The value $d_i = (1-u) \cdot d_{i-1} + u (r_i - t_i)$, where $t_i$ is the timestamp at which packet $i$ was generated, $r_i$ is the time at which it is received and $u$ is a fixed constant between 0 and 1, is an estimate of the variable end-to-end delay, smoothed over several network delays. More emphasis is placed on recent packets than on packets received in the distant past.

In addition to the average delay, an estimate of how much the delay deviates from that average, $v_i = (1-u) \cdot v_{i-1} + u \mid r_i - t_i - d_i \mid$, is useful to plan the playout of different bursts. Finally, a packet $i$ will be played at $p_i = t_i + d_i + K v_i$. The positive constant $K$ sets the playout far enough in the future that only a small fraction of packets arrive too late.
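The estimates above can be sketched as a running update, assuming the usual exponentially weighted form $d_i = (1-u) d_{i-1} + u (r_i - t_i)$ and $v_i = (1-u) v_{i-1} + u \mid r_i - t_i - d_i \mid$. The function name, the default values of `u` and `K`, and the choice to initialise the estimates from the first packet are assumptions made for illustration.

```python
def adaptive_estimates(packets, u=0.01, K=4.0):
    """Return (d_i, v_i, p_i) for each packet.

    packets: list of (t_i, r_i) pairs, generation and arrival time in ms.
    u: weight of the newest sample (assumed default).
    K: safety factor applied to the deviation (assumed default).
    """
    results = []
    d = v = None
    for t, r in packets:
        delay = r - t
        if d is None:
            # Assumption: start from the first observed delay, zero deviation.
            d, v = float(delay), 0.0
        else:
            d = (1 - u) * d + u * delay           # smoothed delay d_i
            v = (1 - u) * v + u * abs(delay - d)  # smoothed deviation v_i
        results.append((d, v, t + d + K * v))     # playout time p_i
    return results


# u = 0.5 keeps the arithmetic easy to follow by hand.
print(adaptive_estimates([(0, 100), (20, 140)], u=0.5))
```

For the second packet the observed delay is 120 ms, so $d$ moves from 100 to 110, $v$ becomes 5, and the playout time is $20 + 110 + 4 \cdot 5 = 150$ ms.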

The playout point at which a packet starts playing depends on the first packet of its speech spurt. If packet $i$ is the first packet of a spurt and $q_i = p_i - t_i$ is the delay from the moment it was generated until it is played, then a packet $j$ belonging to the same spurt will be played at $p_j = t_j + q_i$.
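Scheduling one spurt can be sketched as below, assuming the delay and deviation estimates `d` and `v` at the start of the spurt are already available. The function name and the default `K` are hypothetical.

```python
def schedule_spurt(timestamps, d, v, K=4.0):
    """Playout times for the packets of one speech spurt.

    timestamps: generation times t_j in ms, first packet of the spurt first.
    d, v: delay and deviation estimates for the first packet of the spurt.
    """
    t_first = timestamps[0]
    p_first = t_first + d + K * v       # p_i = t_i + d_i + K v_i
    q = p_first - t_first               # q_i = p_i - t_i
    return [t + q for t in timestamps]  # p_j = t_j + q_i


print(schedule_spurt([0, 20, 40], d=110.0, v=5.0))
```

Because every packet of the spurt is shifted by the same $q_i$, the spacing between packets at playout matches the spacing at generation, and the delay can only change in the silence between spurts.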
