3
votes

I am generating MP4 files (with h.264 video and AAC audio) by transmuxing from MPEG-TS in JavaScript to be played in the browser via blob URLs. Everything works fine in Chrome, and if I grab the blob URLs out of the developer console and download them, the generated files play fine on Windows Media Player as well. Firefox, however, claims that they are corrupted.

I've narrowed the issue down to a problem with the ESDS box in the audio metadata. If I repackage the source MPEG-TS files by some other means (like ffmpeg), and hand-edit my generated files in a hex editor to paste in the ESDS box from the equivalent file generated by other software, then Firefox is happy.

My code that builds the ESDS box. (And I'm tracking the issue)

I attempted to write it by a pretty straightforward transcribe-stuff-from-the-MPEG-specs process, but that is no guarantee that I did not screw it up. Since Chrome and Windows Media play my files just fine, I'm not sure if it's actually an error in my file that they are somehow capable of ignoring, or if it's a problem with Firefox. I suspect the former, but I'm just not sure.

Anyone got any insight, or perhaps a straightforward, easy-to-understand reference for how to build a proper ESDS box?

EDIT: Here are some different ESDS sections produced for the same input file (as hex bytes, copied out of my hex editor):

Mine:

00 00 00 27 65 73 64 73 00 00 00 00 03 22 00 00
02 04 14 40 15 00 00 00 00 00 3a f1 00 00 2d e6
05 02 12 10 06 01 02

mpegts:

00 00 00 33 65 73 64 73 00 00 00 00 03 80 80 80
22 00 02 00 04 80 80 80 14 40 15 00 00 00 00 00
00 00 00 00 00 00 05 80 80 80 02 12 10 06 80 80
80 01 02

ffmpeg:

00 00 00 2c 65 73 64 73 00 00 00 00 03 80 80 80
1b 00 02 00 04 80 80 80 0d 40 15 00 00 00 00 01
5f 42 00 00 00 00 06 80 80 80 01 02

Oddly, and I did not notice this before, Firefox will play the video with ffmpeg's output, but neither Firefox nor Windows Media will actually play the sound (Chrome does). Firefox and Windows Media are both happy to play the video with sound using the output from mpegts, though. With mine, Chrome and Windows Media will play the video with sound, but Firefox doesn't play at all, and claims the video is corrupted.

3
Good stuff. Can you paste two sample copies of ESDS bytes (yours, then FFMpeg's) into your question so we can compare for you? The code looks okay, though. - VC.One
@VC.One Good idea. Samples duly added. - Logan R. Kearsley
PS: I forgot to say that your FFMpeg bytes show a size of 0x2C (44), and after the three padding bytes there is a value of 0x1B. FFMpeg only does these things for MP3 audio, so is it possible that you are (accidentally) creating an MP4 with MP3 audio? Your own JS bytes correctly have 0x22, but no padding. Just be aware when comparing these bytes that you're dealing with two separate audio codecs between you and FFMpeg. - VC.One

3 Answers

2
votes

Well, I found an answer to my own question. Upon pondering the differences between my ESDS boxes and those produced by other software, it became apparent that the biggest difference was the presence of 0x80 padding bytes: three of them after every ES Descriptor tag number. Add those in, and almost everything else lines up and looks pretty much the same.

I can find no mention in the MPEG specs for MP4 files or ISOBMFF of why those bytes should be there, but adding them in makes it work: Firefox no longer thinks the files are corrupted.
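One way to see why the padded form lines up is to walk the descriptor tree and check that each declared length fits inside its container. The sketch below does that for the "Mine" and "mpegts" dumps from the question. The names `checkDescriptors`, `hexToBytes`, and `FIXED_FIELDS` are made up for this illustration, and it assumes that no optional ES_Descriptor fields are flagged. In the unpadded dump the first descriptor declares 0x22 (34) payload bytes starting at offset 14, which would end at offset 48 in a 39-byte box, so a strict parser may reasonably reject it.

```javascript
// Hypothetical helper: turn a hex dump string into an array of byte values.
function hexToBytes(hex) {
  return hex.trim().split(/\s+/).map((h) => parseInt(h, 16));
}

// Fixed-size fields that precede child descriptors, keyed by parent tag.
// Assumption: 0x03 (ES_Descriptor) has ES_ID + flags only, with no optional fields;
// 0x04 (DecoderConfigDescriptor) has 13 bytes of fixed fields.
const FIXED_FIELDS = { 0x03: 3, 0x04: 13 };

// Walk descriptors in bytes[start, end) and collect any that do not fit.
function checkDescriptors(bytes, start, end, problems = []) {
  let pos = start;
  while (pos < end) {
    const tag = bytes[pos++];
    let length = 0;
    let lengthBytes = 0;
    let more = true;
    // Length field: up to 4 bytes, 7 bits each, high bit means "another byte follows".
    while (more && lengthBytes < 4 && pos < end) {
      const b = bytes[pos++];
      length = (length << 7) | (b & 0x7f);
      lengthBytes++;
      more = (b & 0x80) !== 0;
    }
    const payloadEnd = pos + length;
    if (lengthBytes === 0 || more || payloadEnd > end) {
      problems.push(
        `tag 0x${tag.toString(16)}: length field or payload runs past the end of its container`
      );
      return problems;
    }
    const skip = FIXED_FIELDS[tag];
    if (skip !== undefined) {
      checkDescriptors(bytes, pos + skip, payloadEnd, problems);
    }
    pos = payloadEnd;
  }
  return problems;
}

// The two dumps from the question.
const mine = hexToBytes(
  "00 00 00 27 65 73 64 73 00 00 00 00 03 22 00 00 " +
  "02 04 14 40 15 00 00 00 00 00 3a f1 00 00 2d e6 " +
  "05 02 12 10 06 01 02"
);
const mpegts = hexToBytes(
  "00 00 00 33 65 73 64 73 00 00 00 00 03 80 80 80 " +
  "22 00 02 00 04 80 80 80 14 40 15 00 00 00 00 00 " +
  "00 00 00 00 00 00 05 80 80 80 02 12 10 06 80 80 " +
  "80 01 02"
);

// Descriptors start after the 8-byte box header and 4 bytes of version/flags.
const mineProblems = checkDescriptors(mine, 12, mine.length);
const mpegtsProblems = checkDescriptors(mpegts, 12, mpegts.length);
console.log(mineProblems);
console.log(mpegtsProblems);
```

The declared lengths 0x22 and 0x14 are only consistent with the byte counts when every length field occupies four bytes, which is what the 0x80 bytes provide.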

2
votes

You have now found your solution by adding three bytes of 0x80 each after the ES Descriptor Tag number. Glad that worked out for all browsers.

Let me share one insight that may help you or future users of your code:

"..I can find no mention in the MPEG specs for MP4 files or ISOBMFF of why those bytes should be there, but adding them in makes it work.."

Looking at this link for mp4ESDSbox.java, we see that the ESDS atom is broken into five sections, and each section is padded by the bytes 80 80 80. These three bytes are described as an "optional extended descriptor type tag string", with possible type values being 80, 81, or FE.

You're on the right path, but you have only padded the first section.

MP4Muxer.js : (A) What you currently have...

00 00 00 27 65 73 64 73 00 00 00 00 03 80 80 80
22 00 00 02 04 14 40 15 00 00 00 00 00 3A F1 00
00 2D E6 05 02 12 10 06 01 02

MP4Muxer.js: (B) What it should be...

00 00 00 33 65 73 64 73 00 00 00 00 03 80 80 80
22 00 00 02 04 80 80 80 14 40 15 00 00 00 00 00
3A F1 00 00 2D E6 05 80 80 80 02 12 10 06 80 80
80 01 02

FFMpeg ESDS for random AAC track : Compare against new (B) version

00 00 00 33 65 73 64 73 00 00 00 00 03 80 80 80
22 00 01 00 04 80 80 80 14 40 15 00 00 00 00 01
F4 74 00 01 F4 74 05 80 80 80 02 12 10 06 80 80
80 01 02

Comparing the byte structure of version (B) against the bytes made by FFMpeg, we now see perfect alignment. Some values are slightly different because they are not made from the same audio data.

Notice that we have changed the first four bytes (the size integer) to 0x33 (decimal 51, the length in bytes) from the original 0x27 (decimal 39).
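Rather than hard-coding the sizes, the box can be assembled from the inside out so that every length, including the outer box size, is computed from the actual payload. The following is a minimal sketch under that assumption, not the code from MP4Muxer.js. The names `descriptor`, `u32`, `buildEsds`, and `toHex` are hypothetical, and the parameter values were chosen only to reproduce the bytes of version (B).

```javascript
// Wrap a payload in a descriptor: tag, 4-byte length (three 0x80-flagged bytes
// plus a final byte), then the payload itself.
function descriptor(tag, payload) {
  const len = payload.length;
  if (len > 0x0fffffff) throw new RangeError("payload too large for a 4-byte length");
  return [
    tag,
    0x80 | ((len >>> 21) & 0x7f),
    0x80 | ((len >>> 14) & 0x7f),
    0x80 | ((len >>> 7) & 0x7f),
    len & 0x7f,
    ...payload,
  ];
}

// Big-endian 32-bit unsigned integer as four bytes.
function u32(v) {
  return [(v >>> 24) & 0xff, (v >>> 16) & 0xff, (v >>> 8) & 0xff, v & 0xff];
}

function buildEsds({ esId, flags, maxBitrate, avgBitrate, audioSpecificConfig }) {
  const decSpecificInfo = descriptor(0x05, audioSpecificConfig);
  const decoderConfig = descriptor(0x04, [
    0x40,             // objectTypeIndication: MPEG-4 audio
    0x15,             // streamType 5 (audio) << 2, upStream 0, reserved bit 1
    0x00, 0x00, 0x00, // bufferSizeDB (24 bits)
    ...u32(maxBitrate),
    ...u32(avgBitrate),
    ...decSpecificInfo,
  ]);
  const slConfig = descriptor(0x06, [0x02]);
  const esDescriptor = descriptor(0x03, [
    (esId >>> 8) & 0xff, esId & 0xff, // ES_ID (16 bits)
    flags & 0xff,                     // flag bits and stream priority
    ...decoderConfig,
    ...slConfig,
  ]);
  const body = [0x00, 0x00, 0x00, 0x00, ...esDescriptor]; // version + flags
  const size = 8 + body.length;                           // 8-byte box header
  return Uint8Array.from([...u32(size), 0x65, 0x73, 0x64, 0x73, ...body]);
}

function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// Values taken from the version (B) dump.
const esds = buildEsds({
  esId: 0x0000,
  flags: 0x02,
  maxBitrate: 0x3af1,
  avgBitrate: 0x2de6,
  audioSpecificConfig: [0x12, 0x10],
});
console.log(toHex(esds));
```

With sizes derived this way, adding or removing a field changes every enclosing length automatically, so the 0x22, 0x14, and 0x33 values never need to be edited by hand.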

2
votes

The 0x80 bytes do not belong to the tag before them, but to the length value after it. Version 2 of the ISO spec changed the interpretation of the length value so that it can represent more than 255 bytes, by making it a 'VarInt32' type. The high bit in each byte signals that another length byte follows; the lower 7 bits encode the value.

You could use this to encode arbitrarily large values, but the ISO spec limits the length field to at most 4 bytes, i.e. 0...2^(4*7)-1.

I.e.:

0x80,0x80,0x80,0x0E = 0x80,0x0E = 0x0E => 14
0x80,0x80,0x84,0x7f = 0x84,0x7f => (0x04 << 7) + 0x7f = 0x27f = 639

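A similar encoding is used by Google's protobuf, where it is called Base 128 Varint (protobuf stores the least significant 7-bit group first, whereas the descriptor length here puts the most significant group first).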
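The length encoding described above can be sketched in a few lines. The function names `encodeDescriptorLength` and `decodeDescriptorLength` are invented for this example, and the optional `byteCount` argument is an assumption added so that the padded four-byte form seen in the dumps can be produced on request.

```javascript
// Encode a length as 1 to 4 bytes, 7 bits per byte, most significant group first.
// Every byte except the last has its high bit set.
function encodeDescriptorLength(value, byteCount) {
  if (!Number.isInteger(value) || value < 0 || value > 0x0fffffff) {
    throw new RangeError("length must be an integer that fits in 28 bits");
  }
  // Smallest number of bytes that can hold the value.
  let needed = 1;
  while (needed < 4 && value >= (1 << (7 * needed))) needed++;
  const n = byteCount === undefined ? needed : byteCount;
  if (n < needed || n > 4) {
    throw new RangeError("byteCount is too small for the value or exceeds 4");
  }
  const out = [];
  for (let i = n - 1; i >= 0; i--) {
    const group = (value >>> (7 * i)) & 0x7f;
    out.push(i > 0 ? group | 0x80 : group);
  }
  return out;
}

// Decode a length starting at bytes[offset]; returns the value and bytes consumed.
function decodeDescriptorLength(bytes, offset = 0) {
  let value = 0;
  for (let i = 0; i < 4; i++) {
    const b = bytes[offset + i];
    if (b === undefined) throw new RangeError("truncated length field");
    value = (value << 7) | (b & 0x7f);
    if ((b & 0x80) === 0) return { value, bytesRead: i + 1 };
  }
  throw new RangeError("length field is longer than 4 bytes");
}

console.log(decodeDescriptorLength([0x80, 0x80, 0x80, 0x0e])); // value 14, 4 bytes read
console.log(decodeDescriptorLength([0x80, 0x80, 0x84, 0x7f])); // value 639, 4 bytes read
console.log(encodeDescriptorLength(14, 4));                    // [0x80, 0x80, 0x80, 0x0e]
```

Leading 0x80 bytes contribute zero bits, which is why the one-byte form 0x0E and the four-byte form 0x80,0x80,0x80,0x0E decode to the same value.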