Recently I was trying to stream live video captured by a webcam over UDP. The approach I took was to read one frame, send it over UDP, then read the data on the receiver side and display it.
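For reference, this is a simplified sketch of what I mean by the sender side (assuming OpenCV for capture; the receiver address and chunk size are placeholders, not my real values):

```python
# Simplified sketch of the sender: grab one frame and push it out as UDP datagrams.
# Assumes OpenCV; the address and chunk size below are placeholders.
import socket
import cv2

HOST, PORT = "127.0.0.1", 5000   # placeholder receiver address
CHUNK = 1400                     # payload per datagram, kept under a 1500-byte MTU

cap = cv2.VideoCapture(0)        # default webcam
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

ok, frame = cap.read()           # one raw BGR frame
if ok:
    data = frame.tobytes()       # raw pixel bytes, roughly 1 MB per frame in my case
    # split the frame into MTU-sized pieces and send each as its own datagram
    for i in range(0, len(data), CHUNK):
        sock.sendto(data[i:i + CHUNK], (HOST, PORT))

cap.release()
sock.close()
```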
Now, I understand that sending data over UDP/TCP results in fragmentation, which happens in a more or less arbitrary fashion depending on the MTU of the underlying link layer, and that UDP/IP does not guarantee that every fragment will actually be delivered, or in what order. The typical Ethernet MTU is 1500 bytes.
However, each of my frames is about 1 MB (~1,048,576 bytes). So, with fragmentation at 1500 bytes, a single frame gets split up and the receiver ends up with roughly 700 packets (1048576/1500). The receiver then has to accumulate all ~700 packets just to rebuild one frame, which is additional processing. Is that normal, 700 packets for a single frame?! Even at just 24 fps, the receiver has to process 700 * 24 = 16800 packets per second, which does not seem feasible.
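This is roughly the accumulation I mean on the receiver side (again a sketch: the 640x480 resolution is a placeholder, and it assumes no packets are lost or reordered, which UDP does not guarantee):

```python
# Sketch of the receiver: collect datagrams until one full raw frame has arrived.
# Assumes the chunked sender above; resolution is a placeholder, and loss/reordering
# are ignored for simplicity.
import socket
import numpy as np
import cv2

PORT = 5000
WIDTH, HEIGHT = 640, 480
FRAME_BYTES = WIDTH * HEIGHT * 3     # expected raw frame size for the placeholder resolution

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", PORT))

buf = bytearray()
while len(buf) < FRAME_BYTES:        # hundreds of datagrams just to rebuild one frame
    chunk, _ = sock.recvfrom(2048)
    buf.extend(chunk)

frame = np.frombuffer(bytes(buf[:FRAME_BYTES]), dtype=np.uint8).reshape(HEIGHT, WIDTH, 3)
cv2.imshow("frame", frame)
cv2.waitKey(0)
```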
I want to understand how streaming websites work; they surely don't process 16800 packets per second per stream. They presumably use streaming protocols like RTSP, but those are built on top of UDP/TCP, so they must also deal with fragmentation. These days streaming sites can deliver 4K video, where each frame is much bigger than 1 MB, yet the MTU is still 1500 bytes. They must also be compressing the data, but to what extent? Even if they somehow cut the frame size by 50% (which then has to be decompressed on the receiver side, i.e. more processing), that still leaves ~8000 packets per second for a low-quality 24 fps video. How do they handle it? How do they manage data fragmentation at these scales?
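To make the question concrete, here is a small sketch that recomputes the packet math with simple per-frame JPEG compression (assuming OpenCV; the quality setting and chunk size are arbitrary, and I'm not claiming this is what streaming services actually do):

```python
# Rough way to put numbers on the compression question: compare raw vs JPEG frame
# sizes and the resulting packets/frame and packets/second at 24 fps.
# Per-frame JPEG is only a stand-in here; quality and chunk size are arbitrary.
import math
import cv2

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()
if not ok:
    raise RuntimeError("could not read a frame from the webcam")

raw_bytes = frame.nbytes                                        # raw frame size
ok, jpg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
jpg_bytes = len(jpg)                                            # compressed frame size

CHUNK = 1400
for name, size in (("raw", raw_bytes), ("jpeg", jpg_bytes)):
    pkts = math.ceil(size / CHUNK)
    print(f"{name}: {size} bytes -> {pkts} packets/frame, {pkts * 24} packets/s at 24 fps")
```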