MPEG
(Taken from the Compression FAQ and written by Mark Adler)

Q: What is MPEG, exactly?
A: MPEG is the "Moving Picture Experts Group", working under the joint direction of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). This group works on standards for the coding of moving pictures and associated audio.

Q: What is the status of MPEG's work, then? What about MPEG-1, -2, and so on?
A: MPEG approaches the growing need for multimedia standards step by step. Today, three "phases" are defined:

MPEG-1: "Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 MBit/s"
        Status: International Standard IS-11172, completed in 10.92
MPEG-2: "Generic Coding of Moving Pictures and Associated Audio"
        Status: Committee Draft CD 13818 as found in documents MPEG93 / N601, N602, N603 (11.93)
MPEG-3: no longer exists (has been merged into MPEG-2)
MPEG-4: "Very Low Bitrate Audio-Visual Coding"
        Status: Call for Proposals 11.94, Working Draft in 11.96

Q: MPEG-1 is ready for use. What does the standard look like?
A: MPEG-1 consists of 4 parts:

IS 11172-1: System
        describes synchronization and multiplexing of video and audio
IS 11172-2: Video
        describes compression of non-interlaced video signals
IS 11172-3: Audio
        describes compression of audio signals
CD 11172-4: Compliance Testing
        describes procedures for determining the characteristics of coded bitstreams and the decoding process and for testing compliance with the requirements stated in the other parts

Q. Does MPEG have anything to do with JPEG?
A. Well, it sounds the same, and they are part of the same subcommittee of ISO along with JBIG and MHEG, and they usually meet at the same place at the same time. However, they are different sets of people with few or no common individual members, and they have different charters and requirements. JPEG is for still image compression.

Q. Then what are JBIG and MHEG?
A. Sorry I mentioned them.
Ok, I'll simply say that JBIG is for binary image compression (like faxes), and MHEG is for multi-media data standards (like integrating stills, video, audio, text, etc.). For an introduction to JBIG, see question 74 below.

Q. So how does MPEG-1 work? Tell me about video coding!
A. First off, it starts with a relatively low resolution video sequence (possibly decimated from the original) of about 352 by 240 pixels by 30 frames/s (US--different numbers for Europe), but original high (CD) quality audio. The images are in color, but converted to YUV space, and the two chrominance channels (U and V) are decimated further to 176 by 120 pixels. It turns out that you can get away with a lot less resolution in those channels and not notice it, at least in "natural" (not computer generated) images.

The basic scheme is to predict motion from frame to frame in the temporal direction, and then to use DCT's (discrete cosine transforms) to organize the redundancy in the spatial directions. The DCT's are done on 8x8 blocks, and the motion prediction is done in the luminance (Y) channel on 16x16 blocks. In other words, given the 16x16 block in the current frame that you are trying to code, you look for a close match to that block in a previous or future frame (there are backward prediction modes where later frames are sent first to allow interpolating between frames).

The DCT coefficients (of either the actual data, or the difference between this block and the close match) are "quantized", which means that you divide them by some value to drop bits off the bottom end. Hopefully, many of the coefficients will then end up being zero. The quantization can change for every "macroblock" (a macroblock is 16x16 of Y and the corresponding 8x8's in both U and V). The results of all of this, which include the DCT coefficients, the motion vectors, and the quantization parameters (and other stuff), are Huffman coded using fixed tables.
The DCT coefficients have a special Huffman table that is "two-dimensional" in that one code specifies a run-length of zeros and the non-zero value that ended the run. Also, the motion vectors and the DC DCT components are DPCM (subtracted from the last one) coded.

Q. So is each frame predicted from the last frame?
A. No. The scheme is a little more complicated than that. There are three types of coded frames.

There are "I" or intra frames. They are simply a frame coded as a still image, not using any past history. You have to start somewhere.

Then there are "P" or predicted frames. They are predicted from the most recently reconstructed I or P frame. (I'm describing this from the point of view of the decompressor.) Each macroblock in a P frame can either come with a vector and difference DCT coefficients for a close match in the last I or P, or it can just be "intra" coded (like in the I frames) if there was no good match.

Lastly, there are "B" or bidirectional frames. They are predicted from the closest two I or P frames, one in the past and one in the future. You search for matching blocks in those frames, and try three different things to see which works best. (Now I have the point of view of the compressor, just to confuse you.) You try using the forward vector, the backward vector, and you try averaging the two blocks from the future and past frames, and subtracting that from the block being coded. If none of those work well, you can intracode the block.

The sequence of decoded frames usually goes like:

IBBPBBPBBPBBIBBPBBPB...

where there are 12 frames from I to I (for US and Japan anyway.) This is based on a random access requirement that you need a starting point at least once every 0.4 seconds or so. The ratio of P's to B's is based on experience.

Of course, for the decoder to work, you have to send that first P *before* the first two B's, so the compressed data stream ends up looking like:

0xx312645...

where those are frame numbers.
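The reordering just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything taken from the standard: the function name coded_order is made up for this example, the stream is assumed to begin at a true starting point, and any B frames left over at the end (with no later I or P to predict from) are simply appended.

```python
def coded_order(display_types):
    """Rearrange display-order frame numbers into transmission order.

    display_types is a string such as "IBBPBBP", one letter per frame.
    Each B frame needs the following I or P frame, so that reference
    frame is sent first and the held-back B frames follow it.
    """
    order = []
    pending_b = []  # B frames waiting for their future reference frame
    for number, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(number)
        else:
            # I or P: send it, then the B frames that depend on it
            order.append(number)
            order.extend(pending_b)
            pending_b = []
    # Assumption for this sketch: trailing B frames are sent as they are
    order.extend(pending_b)
    return order


print(coded_order("IBBPBBP"))  # [0, 3, 1, 2, 6, 4, 5]
```

With the display sequence IBBPBBP this gives 0, 3, 1, 2, 6, 4, 5, which is the 0 312 645 pattern above with nothing sent in the xx position.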
xx might be nothing (if this is the true starting point), or it might be the B's of frames -2 and -1 if we're in the middle of the stream somewhere.

You have to decode the I, then decode the P, keep both of those in memory, and then decode the two B's. You probably display the I while you're decoding the P, and display the B's as you're decoding them, and then display the P as you're decoding the next P, and so on.

Q. You've got to be kidding.
A. No, really!

Q. Hmm. Where did they get 352x240?
A. That derives from the CCIR-601 digital television standard which is used by professional digital video equipment. It is (in the US) 720 by 243 by 60 fields (not frames) per second, where the fields are interlaced when displayed. (It is important to note though that fields are actually acquired and displayed a 60th of a second apart.) The chrominance channels are 360 by 243 by 60 fields a second, again interlaced. This degree of chrominance decimation (2:1 in the horizontal direction) is called 4:2:2. The source input format for MPEG-1, called SIF, is CCIR-601 decimated by 2:1 in the horizontal direction, 2:1 in the time direction, and an additional 2:1 in the chrominance vertical direction. And some lines are cut off to make sure things divide by 8 or 16 where needed.

Q. What if I'm in Europe?
A. For 50 Hz display standards (PAL, SECAM) change the number of lines in a field from 243 or 240 to 288, and change the display rate to 50 fields/s or 25 frames/s. Similarly, change the 120 lines in the decimated chrominance channels to 144 lines. Since 288*50 is exactly equal to 240*60, the two formats have the same source data rate.

Q. What will MPEG-2 do for video coding?
A. As I said, there is a considerable loss of quality in going from CCIR-601 to SIF resolution. For entertainment video, it's simply not acceptable. You want to use more bits and code all or almost all the CCIR-601 data.
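To put rough numbers on how much data is thrown away in going from CCIR-601 to SIF, here is a small back-of-the-envelope calculation in Python using the US figures quoted above. The 8 bits per sample is an assumption made for this sketch (it is not stated in the text), and the function name raw_bit_rate is invented for the example.

```python
BITS_PER_SAMPLE = 8  # assumption for this sketch, not stated in the text


def raw_bit_rate(width, height, rate, chroma_width, chroma_height):
    """Uncompressed bits per second for one Y channel plus U and V."""
    luma = width * height * rate
    chroma = 2 * chroma_width * chroma_height * rate  # U and V
    return (luma + chroma) * BITS_PER_SAMPLE


# CCIR-601 (US): 720x243 luminance, 360x243 chrominance, 60 fields/s
ccir601 = raw_bit_rate(720, 243, 60, 360, 243)
# SIF (US): 352x240 luminance, 176x120 chrominance, 30 frames/s
sif = raw_bit_rate(352, 240, 30, 176, 120)

print(ccir601)                  # 167961600
print(sif)                      # 30412800
print(round(ccir601 / sif, 2))  # 5.52

# The 50 Hz and 60 Hz formats carry the same number of lines per second
print(288 * 50 == 240 * 60)     # True
```

Under that assumption the uncompressed CCIR-601 source is about 168 Mbit/s against about 30 Mbit/s for SIF, so SIF keeps less than a fifth of the original samples before any compression is applied.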
From subjective testing at the Japan meeting in November 1991, it seems that 4 MBits/s can give very good quality compared to the original CCIR-601 material. The objective of MPEG-2 is to define a bit stream optimized for these resolutions and bit rates.

Q. Why not just scale up what you're doing with MPEG-1?
A. The main difficulty is the interlacing. The simplest way to extend MPEG-1 to interlaced material is to put the fields together into frames (720x486x30/s). This results in bad motion artifacts that stem from the fact that moving objects are in different places in the two fields, and so don't line up in the frames. Compressing and decompressing without taking that into account somehow tends to muddle the objects in the two different fields.

The other thing you might try is to code the even and odd field streams separately. This avoids the motion artifacts, but as you might imagine, doesn't get very good compression since you are not using the redundancy between the even and odd fields where there is not much motion (which is typically most of the image).

Or you can code it as a single stream of fields. Or you can interpolate lines. Or, etc. etc. There are many things you can try, and the point of MPEG-2 is to figure out what works well. MPEG-2 is not limited to considering only derivations of MPEG-1. There were several non-MPEG-1-like schemes in the competition in November, and some aspects of those algorithms may or may not make it into the final standard for entertainment video compression.

Q. So what works?
A. Basically, derivations of MPEG-1 worked quite well, and one scheme that used wavelet subband coding instead of DCT's also worked very well. Also among the worked-very-well's was a scheme that did not use B frames at all, just I and P's. All of them, except maybe one, did some sort of adaptive frame/field coding, where a decision is made on a macroblock basis as to whether to code that one as one frame macroblock or as two field macroblocks.
Some other aspects are how to code I-frames--some suggest predicting the even field from the odd field. Or you can predict evens from evens and odds or odds from evens and odds or any field from any other field, etc.

Q. So what works?
A. Ok, we're not really sure what works best yet. The next step is to define a "test model" to start from, that incorporates most of the salient features of the worked-very-well proposals in a simple way. Then experiments will be done on that test model, making one modification at a time, and seeing what makes it better and what makes it worse. Example experiments are B's or no B's, DCT vs. wavelets, various field prediction modes, etc. The requirements, such as implementation cost, quality, random access, etc. will all feed into this process as well.

Q. When will all this be finished?
A. I don't know. I'd have to hope in about a year or less.

Q: Talking about MPEG audio coding, I heard a lot about "Layer 1, 2 and 3". What does it mean, exactly?
A: MPEG-1, IS 11172-3, describes the compression of audio signals using high performance perceptual coding schemes. It specifies a family of three audio coding schemes, simply called Layer-1, -2, -3, with increasing encoder complexity and performance (sound quality per bitrate). The three codecs are compatible in a hierarchical way, i.e. a Layer-N decoder is able to decode bitstream data encoded in Layer-N and all Layers below N (e.g., a Layer-3 decoder may accept Layer-1, -2 and -3, whereas a Layer-2 decoder may accept only Layer-1 and -2.)

Q: So we have a family of three audio coding schemes. What does the MPEG standard define, exactly?
A: For each Layer, the standard specifies the bitstream format and the decoder. To allow for future improvements, it does *not* specify the encoder, but an informative chapter gives an example of an encoder for each Layer.

Q: What do the three audio Layers have in common?
A: All Layers use the same basic structure.
The coding scheme can be described as "perceptual noise shaping" or "perceptual subband / transform coding". The encoder analyzes the spectral components of the audio signal by calculating a filterbank or transform and applies a psychoacoustic model to estimate the just noticeable noise level. In its quantization and coding stage, the encoder tries to allocate the available number of data bits in a way that meets both the bitrate and masking requirements. The decoder is much less complex. Its only task is to synthesize an audio signal out of the coded spectral components.

All Layers use the same analysis filterbank (polyphase with 32 subbands). Layer-3 adds an MDCT transform to increase the frequency resolution.

All Layers use the same "header information" in their bitstream, to support the hierarchical structure of the standard.

All Layers use a bitstream structure that contains parts that are more sensitive to bit errors ("header", "bit allocation", "scalefactors", "side information") and parts that are less sensitive ("data of spectral components").

All Layers may use 32, 44.1 or 48 kHz sampling frequency.

All Layers are allowed to work with similar bitrates:
Layer-1: from 32 kbps to 448 kbps
Layer-2: from 32 kbps to 384 kbps
Layer-3: from 32 kbps to 320 kbps

Q: What are the main differences between the three Layers, from a global view?
A: From Layer-1 to Layer-3, complexity increases (mainly true for the encoder), overall codec delay increases, and performance increases (sound quality per bitrate).

Q: Which Layer should I use for my application?
A: Good question. Of course, it depends on all your requirements. But as a first approach, you should consider the available bitrate of your application, as the Layers have been designed to support certain ranges of bitrates most efficiently, i.e. with a minimum drop in sound quality.

Let us look a little closer at the strong domains of each Layer.

Layer-1: Its ISO target bitrate is 192 kbps per audio channel.
Layer-1 is a simplified version of Layer-2. It is most useful at "high" bitrates around or above 192 kbps. A version of Layer-1 is used as "PASC" with the DCC recorder.

Layer-2: Its ISO target bitrate is 128 kbps per audio channel. Layer-2 is identical with MUSICAM. It has been designed as a trade-off between sound quality per bitrate and encoder complexity. It is most useful at "medium" bitrates of 128 or even 96 kbps per audio channel. The DAB (EU 147) proponents have decided to use Layer-2 in the future Digital Audio Broadcasting network.

Layer-3: Its ISO target bitrate is 64 kbps per audio channel. Layer-3 merges the best ideas of MUSICAM and ASPEC. It has been designed for best performance at "low" bitrates around 64 kbps or even below. The Layer-3 format specifies a set of advanced features that all address one goal: to preserve as much sound quality as possible even at rather low bitrates. Today, Layer-3 is already in use in various telecommunication networks (ISDN, satellite links, and so on) and speech announcement systems.

Q: Tell me more about sound quality. How do you assess that?
A: Today, there is no alternative to expensive listening tests. During the ISO-MPEG-1 process, 3 international listening tests were performed, with a lot of trained listeners, supervised by Swedish Radio. They took place in 7.90, 3.91 and 11.91. Another international listening test was performed by CCIR, now ITU-R, in 92.

All these tests used the "triple stimulus, hidden reference" method and the CCIR impairment scale to assess the audio quality. The listening sequence is "ABC", with A = original, BC = pair of original / coded signal in random order, and the listener has to rate both B and C with a number between 1.0 and 5.0.
The meaning of these values is:

5.0 = transparent (this should be the original signal)
4.0 = perceptible, but not annoying (first differences noticeable)
3.0 = slightly annoying
2.0 = annoying
1.0 = very annoying

With perceptual codecs (like MPEG audio), all traditional parameters (like SNR, THD+N, bandwidth) are essentially useless. Fraunhofer-IIS is also working on objective quality assessment tools, such as the NMR (Noise-to-Mask Ratio) meter. BTW: If you need more information about NMR, please contact nmr@iis.fhg.de.

Q: Now that I know how to assess quality, come on, tell me the results of these tests.
A: Well, for low bitrates, the main result is that at 60 or 64 kbps per channel, Layer-2 always scored between 2.1 and 2.6, whereas Layer-3 scored between 3.6 and 3.8. This is a significant increase in sound quality, indeed! Furthermore, the selection process for critical sound material showed that it was rather difficult to find worst-case material for Layer-3 whereas it was not so hard to find such items for Layer-2.

Q: OK, a Layer-2 codec at low bitrates may sound poor today, but couldn't that be improved in the future? I guess you just told me before that the encoder is not fixed in the standard.
A: Good thinking! As the sound quality mainly depends on the encoder implementation, it is true that there is no such thing as a "Layer-N" quality. All we know for certain is the performance of the reference codecs during the international tests. Who knows what will happen in the future? What we do know now is this:

Today, Layer-3 already provides a sound quality that comes very near to CD quality at 64 kbps per channel. Layer-2 is far away from that.

Tomorrow, both Layers may improve. Layer-2 has been designed as a trade-off between quality and complexity, so the bitstream format allows only limited innovations. In contrast, even the current reference Layer-3 codec exploits only a small part of the powerful mechanisms inside the Layer-3 bitstream format.

Q: All in all, you sound as if anybody should use Layer-3 for low bitrates. Why on earth do some vendors still offer only Layer-2 equipment for these applications?
A: Well, maybe because they started to design and develop their system rather early, e.g. in 1990. As Layer-2 is identical with MUSICAM, it has been available since the summer of 90 at the latest. In that year, Layer-3 development started, and it was successfully finished in spring 92. So, for a certain time, vendors could only exploit the existing part of the new MPEG standard.

Now the situation has changed. All Layers are available, the standard is completed, and new systems need not limit themselves, but may capitalize on the full features of MPEG audio.

Q: How do I get the MPEG documents?
A: You may order them from your national standards body.

E.g., in Germany, please contact:
DIN-Beuth Verlag, Auslandsnormen
Mrs. Niehoff, Burggrafenstr. 6, D-10772 Berlin, Germany
Phone: 030-2601-2757, Fax: 030-2601-1231

E.g., in USA, you may order them from ANSI [phone (212) 642-4900] or buy them from companies like
OMNICOM
phone +44 438 742424
FAX +44 438 740154

Q. How do I join MPEG?
A. You don't join MPEG. You have to participate in ISO as part of a national delegation. How you get to be part of the national delegation is up to each nation. I only know the U.S., where you have to attend the corresponding ANSI meetings to be able to attend the ISO meetings. Your company or institution has to be willing to sink some bucks into travel since, naturally, these meetings are held all over the world. (For example, Paris, Santa Clara, Kurihama Japan, Singapore, Haifa Israel, Rio de Janeiro, London, etc.)