4.4 MP3 and AAC: MDCT processing

In MP3 and AAC coders, the frequency resolution of the polyphase quadrature filterbank (PQF) is increased by cascading an MDCT stage onto its output. This section describes that cascade and gives the details of the MDCT stage.

MDCT filterbanks

• Hybrid Filter Banks: In more advanced audio coders such as MPEG “Layer-3” or MPEG “Advanced Audio Coding” (the details of which will be discussed later), the 32-band polyphase quadrature filterbank (PQF) is considered not to give adequate frequency resolution, and so an additional stage of frequency division is cascaded onto the output of the PQF. This additional frequency division is accomplished using the so-called “Modified DCT” (MDCT) filterbank. (See [link].)
• Lapped Transforms: The MDCT is a so-called “lapped transform.” At the encoder, blocks of length $2Q$ which overlap by $Q$ samples are windowed and transformed, generating $Q$ subband samples each. At the decoder, the $Q$ subband samples are inverse-transformed into $2Q$ samples and windowed. The windowed output samples are overlapped with and added to the previous $Q$ windowed outputs to form the output stream. [link] gives an intuitive view of the coding/decoding operation, while [link] and [link] give the specific coder/decoder implementations used in the MPEG schemes.
• Perfect Reconstruction: Based on the cancellation of time-domain aliasing components, Princen, Johnson, and Bradley show (in ICASSP 87 and TASSP 86 papers) that the MDCT achieves perfect reconstruction when the window $\left\{{w}_{n}\right\}$ is chosen so that overlapped squared copies sum to one, i.e.,
$1 = w_{n+Q}^{2} + w_{n}^{2} \quad \text{for} \quad 0 \le n \le Q-1.$
The “sine” window
$w_{n} = \sin\left(\frac{\pi}{2Q}\left(n+\frac{1}{2}\right)\right) \quad \text{for} \quad 0 \le n \le 2Q-1$
is one example of a window satisfying this requirement, and it turns out to be the one used in MPEG Layer-3.
• Frequency Resolution: With a window length that is only twice the number of transform outputs, we cannot expect very good frequency selectivity. This turns out not to be a problem. In MPEG Layer-3, sine-window MDCTs appear at the outputs of a 32-band PQF, where frequency selectivity is not a critical issue due to the limited frequency resolution of the human ear. In MPEG AAC, a 4-band PQF in conjunction with an optimized MDCT window function gives frequency selectivity just above that which current psychoacoustic models deem necessary (see M. Bosi et al., "ISO/IEC MPEG-2 Advanced Audio Coding" in JAES Oct 1997).
• Window Switching: Larger values of $Q$ lead to increased frequency resolution but decreased time resolution. The loss of time resolution arises because the error due to the quantization of one MDCT output is spread out over $\approx 2QN$ time-domain output samples, where $N$ is taken here to be the number of PQF bands (each subband sample spans $N$ time-domain samples). For signals of a transient nature, choosing $QN$ too high leads to audible “pre-echoes.” For less transient signals, on the other hand, the error spread caused by the same value of $QN$ might not be perceptible (and the increased frequency resolution might be very beneficial). Hence, most advanced coding schemes have a provision to switch between different time/frequency resolutions depending on local signal behavior. In MPEG Layer-3, for example, $Q$ switches between 6 and 18. This is accomplished using a sine window of length 36, a sine window of length 12, and intermediate windows which are used to switch between the long and short windows while retaining the perfect reconstruction property. [link] shows an example window sequence.
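The lapped-transform and perfect-reconstruction ideas above can be checked numerically with a minimal sketch. It assumes the common textbook form of the MDCT, with phase offset $\frac{1}{2}+\frac{Q}{2}$ and inverse scaling $2/Q$, which is not taken from the MPEG coder/decoder structures referenced above; it uses a symmetric sine window with a half-sample offset, and the function names (`sine_window`, `mdct`, `imdct`, `analyze_synthesize`, `error_footprint`) are hypothetical. It is meant only as an illustration, not as an MPEG-conformant implementation.

```python
import math

def sine_window(Q):
    # Symmetric sine window of length 2Q (half-sample offset assumed).
    return [math.sin(math.pi / (2 * Q) * (n + 0.5)) for n in range(2 * Q)]

def mdct(block, w):
    # Window a length-2Q block and transform it to Q coefficients.
    Q = len(block) // 2
    return [
        sum(w[n] * block[n]
            * math.cos(math.pi / Q * (n + 0.5 + Q / 2) * (k + 0.5))
            for n in range(2 * Q))
        for k in range(Q)
    ]

def imdct(X, w):
    # Inverse-transform Q coefficients to 2Q samples, then window them.
    # The 2/Q scaling is an assumed normalization that pairs with mdct().
    Q = len(X)
    return [
        w[n] * (2.0 / Q)
        * sum(X[k] * math.cos(math.pi / Q * (n + 0.5 + Q / 2) * (k + 0.5))
              for k in range(Q))
        for n in range(2 * Q)
    ]

def analyze_synthesize(x, Q):
    # Encode blocks of length 2Q that overlap by Q, decode, overlap-add.
    w = sine_window(Q)
    y = [0.0] * len(x)
    for start in range(0, len(x) - 2 * Q + 1, Q):
        out = imdct(mdct(x[start:start + 2 * Q], w), w)
        for n in range(2 * Q):
            y[start + n] += out[n]
    return y

def error_footprint(Q, k=0, tol=1e-12):
    # Number of decoder output samples touched by an error in coefficient k.
    X = [0.0] * Q
    X[k] = 1.0
    return sum(1 for v in imdct(X, sine_window(Q)) if abs(v) > tol)

if __name__ == "__main__":
    Q = 6
    x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(8 * Q)]
    y = analyze_synthesize(x, Q)
    # Only samples covered by two overlapping blocks are alias-free.
    inner = range(Q, len(x) - Q)
    print("max interior error:", max(abs(y[n] - x[n]) for n in inner))
    print("error footprint, Q=6 vs Q=18:", error_footprint(6), error_footprint(18))
```

The first and last $Q$ output samples are covered by only one block, so their aliasing is not cancelled; the comparison is therefore restricted to the interior. `error_footprint` counts the MDCT-stage output samples affected by a single coefficient error, which is the $2Q$ factor in the $\approx 2QN$ spread discussed under window switching (the factor $N$ would come from the PQF synthesis stage, which is not modeled here).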

