EE5359 MULTIMEDIA PROCESSING FINAL REPORT
Study and implementation of G.719 audio codec and performance analysis of G.719 with AAC (advanced audio codec) and HE-AAC (high efficiency advanced audio codec).
Student: Yashas Prakash
Student ID: 1000803680
Instructor: Dr. K. R. Rao
E-mail:
Date: 05-9-2012.
Project Proposal:
Title: Study and implementation of G.719 audio codec and performance analysis of G.719 with AAC (advanced audio codec) and HE-AAC (high efficiency-advanced audio codec) audio codecs.
Abstract:
This project describes a low-complexity full-band (20 kHz) audio coding algorithm which has recently been standardized by the ITU-T (International Telecommunication Union - Telecommunication Standardization Sector) as Recommendation G.719 [1]. The algorithm is designed to provide 20 Hz - 20 kHz audio bandwidth using a 48 kHz sampling rate, operating at 32 - 128 kbps. The codec features very high audio quality and low computational complexity and is suitable for use in applications such as videoconferencing, teleconferencing, and streaming audio over the Internet [1]. Its exceptionally low complexity and small memory footprint, combined with high full-band audio quality, make the codec a strong choice for communication devices of all kinds, from large telepresence systems to small low-power devices for mobile communication [2]. A comparison with the widely used AAC and HE-AAC [9] audio codecs is carried out in terms of performance, reliability, memory requirements and applications. A Windows Media Audio file is encoded to 3GP, AAC and HE-AAC formats using the SUPER © [13] software, and the different coding schemes are tested for performance, encoding and decoding durations, memory requirements and compression ratios.
List of acronyms
AAC - Advanced audio coding
ATSC - Advanced television systems committee
AES - Audio Engineering Society
DMOS - Degradation mean opinion score
EBU - European broadcasting union
FLVQ - Fast lattice vector quantization
HE-AAC - High efficiency advanced audio coding
HRQ - Higher rate lattice vector quantization
IMDCT - Inverse modified discrete cosine transform
ISO - International organization for standardization
ITU - International telecommunication union
JAES - Journal of the Audio Engineering Society
LC - Low complexity
LRQ - Lower rate lattice vector quantization
LFE - Low frequencies enhancement
LTP - Long term prediction
MDCT - Modified discrete cosine transform
MPEG - Moving picture experts group
QMF - Quadrature mirror filter
SBR - Spectral band replication
SMR - Symbolic music representation
SRS - Sample rate scalable
TDA - Time domain aliased
WMOPS - Weighted millions of operations per second
An Overview of G.719 Audio Codec:
In the hands-free videoconferencing and teleconferencing markets, there is a strong and increasing demand for audio coding providing the full human auditory bandwidth of 20 Hz to 20 kHz [1]. This is because:
Conferencing systems are increasingly used for more elaborate presentations, often including music and sound effects (e.g. animal sounds, musical instruments, vehicles or nature sounds) which occupy a wider audio band than speech. Presentations may involve remote music education, playback of audio and video from DVDs and VCRs, audio/video clips from PCs, and elaborate audio-visual presentations from, for example, PowerPoint [1].
Users perceive the bandwidth of 20 Hz to 20 kHz as representing the ultimate goal for audio bandwidth. The resulting market pressures are causing a shift in this direction, now that sufficient IP (internet protocol) bitrate and audio coding technology are available to deliver this. As with any audio codec for hands-free videoconferencing use, the requirements include [1]:
Low latency (support natural conversation)
Low complexity (freeing cycles for video and other processing reduces cost)
High quality on all signal types [1].
Block diagram of the G.719 encoder:
Figure 1: Block diagram of the G.719 encoder [1].
In Figure 1, the input signal, sampled at 48 kHz, is processed through a transient detector. Depending on the detection of a transient, indicated by a flag IsTransient, a high frequency resolution or a low frequency resolution transform is applied to the input signal frame. The adaptive transform is based on a modified discrete cosine transform (MDCT) in the case of stationary frames [1]. For transient frames, the MDCT is modified to obtain a higher temporal resolution without the need for additional delay and with very little overhead in complexity. Transient frames have a temporal resolution equivalent to 5 ms frames [1].
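The role of the transient detector can be sketched with a simple energy-ratio check. This is only an illustrative stand-in, not the actual G.719 detector; the subframe split and the threshold value are invented for the example:

```python
import numpy as np

def is_transient(frame, num_subframes=4, threshold=8.0):
    """Flag a frame as transient when one subframe's energy greatly
    exceeds the average energy of the preceding subframes.

    Illustrative energy-ratio detector only; `num_subframes` and
    `threshold` are made-up tuning values, not G.719 constants."""
    sub = np.array_split(np.asarray(frame, dtype=float), num_subframes)
    energies = [np.sum(s * s) + 1e-12 for s in sub]  # tiny offset avoids /0
    for i in range(1, num_subframes):
        past_avg = np.mean(energies[:i])
        if energies[i] / past_avg > threshold:
            return True
    return False
```

A steady tone keeps roughly constant subframe energy and is classified stationary, while a sudden attack late in the frame trips the ratio test.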
The MDCT is defined as follows:

y(k) = sqrt(2/N) * SUM_{n=0}^{N-1} x~(n) * cos[ (pi/N) (n + 1/2) (k + 1/2) ],  k = 0, 1, ..., N-1

where
y(k) = transform coefficients of the input frame
x~(n) = time domain aliased signal of the input signal
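As a concrete reference, the MDCT can be computed directly from its cosine basis. The sketch below uses the common textbook form operating on a full 2N-sample block; scaling conventions differ between texts, and G.719 applies its own normalization, so treat this as illustrative:

```python
import numpy as np

def mdct(x):
    """Naive O(N^2) MDCT: maps a block of 2N time samples to N
    transform coefficients y(k), using the common definition
        y(k) = sum_n x(n) cos[(pi/N)(n + 1/2 + N/2)(k + 1/2)].
    Illustrative only; a production codec uses an FFT-based fast
    algorithm and its own scaling."""
    x = np.asarray(x, dtype=float)
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * np.outer(n + 0.5 + N / 2, k + 0.5))
    return x @ basis
```

The basis columns are mutually orthogonal over the 2N samples, so transforming one basis column yields a single nonzero coefficient of value N.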
Block diagram of the G.719 decoder:
Figure 2: Block diagram of the G.719 decoder [1].
A block diagram of the G.719 decoder is shown in Figure 2. The transient flag is first decoded, which indicates the frame configuration, i.e. stationary or transient. The spectral envelope is then decoded, and the same bit-exact norm adjustment and bit-allocation algorithms are used at the decoder to re-compute the bit allocation, which is essential for decoding the quantization indices of the normalized transform coefficients [1]. After decoding the transform coefficients, the non-coded transform coefficients (allocated zero bits) in the low frequencies are regenerated using a spectral-fill codebook built from the decoded transform coefficients [2].
TRANSFORM COEFFICIENT QUANTIZATION:
Each band consists of one or more 8-dimensional vectors of transform coefficients, and the coefficients are normalized by the quantized norm. All 8-dimensional vectors belonging to one band are assigned the same number of bits for quantization. A fast lattice vector quantization (FLVQ) scheme is used to quantize the normalized coefficients in 8 dimensions. The FLVQ quantizer comprises two sub-quantizers: a D8-based higher rate lattice vector quantizer (HRQ) and an RE8-based lower rate lattice vector quantizer (LRQ). HRQ is a multi-rate quantizer designed to quantize the transform coefficients at rates from 2 up to 9 bits/coefficient, and its codebook is based on the so-called Voronoi code for the D8 lattice [4]. D8 is a well-known lattice, defined as:

D8 = { (x1, ..., x8) in Z^8 : x1 + x2 + ... + x8 is even }

where Z^8 is the lattice which consists of all points with integer coordinates. It can be seen that D8 consists of the points having integer coordinates with an even sum. The codebook of HRQ is constructed from a finite region of the D8 lattice and is not stored in memory; the codewords are generated by a simple algebraic method, and a fast quantization algorithm is used.
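Finding the nearest D8 point to an arbitrary vector is cheap precisely because of this structure. The sketch below uses the classic Conway-Sloane rounding method (round to the nearest integer vector, then fix the parity by re-rounding the worst coordinate); it illustrates the lattice search, not G.719's full codebook indexing:

```python
import numpy as np

def nearest_d8(x):
    """Return the nearest D8 lattice point (integer vector with an even
    coordinate sum) to x, via the Conway-Sloane rounding trick."""
    x = np.asarray(x, dtype=float)
    f = np.round(x)                      # nearest point in Z^8
    if int(np.sum(f)) % 2 == 0:
        return f.astype(int)             # already in D8
    # Parity is odd: re-round the coordinate with the largest rounding
    # error in the opposite direction; this costs the least extra distance.
    err = x - f
    i = np.argmax(np.abs(err))
    f[i] += np.sign(err[i]) if err[i] != 0 else 1.0
    return f.astype(int)
```

For example, [0.6, 0, ..., 0] rounds to [1, 0, ..., 0], which has odd sum, so the first coordinate is re-rounded down and the nearest D8 point is the origin.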
Figure 3: Observed spectrum of different sounds (voiced speech, unvoiced speech and pop music) over different audio bandwidths [3].
Figure 3 illustrates how, for some signals, a large portion of the energy lies beyond the wideband frequency range. While the use of wideband speech codecs primarily addresses the requirement of intelligibility, the perceived naturalness and experienced quality of speech can be further enhanced by providing a larger acoustic bandwidth [3]. This is especially true in applications such as teleconferencing, where a high-fidelity representation of both speech and natural sounds enables a much higher degree of naturalness and spontaneity. The logical step toward the sense of being there is the coding and rendering of super-wideband signals with an acoustic bandwidth of 14 kHz. The response of ITU-T to this increased need for naturalness was the standardization of the G.722.1 Annex C extension in 2005 [2]. More recently, this has also led ITU-T to start work on extensions of the G.718, G.729.1, G.722, and G.711.1 codecs to provide super-wideband telephony as extension layers to these wideband core codecs [3].
An overview of MPEG – Advanced Audio Coding
The advanced audio coding (AAC) scheme was a joint development by Dolby, Fraunhofer, AT&T, Sony and Nokia [9]. It is a digital audio compression scheme for medium to high bit rates which is not backward compatible with the previous MPEG audio standards. AAC encoding follows a modular approach, and the standard defines four profiles which can be chosen based on factors like the complexity of the bitstream to be encoded and the desired performance and output:
- Low complexity (LC)
- Main profile (MAIN)
- Sample-rate scalable (SRS)
- Long term prediction (LTP)
AAC provides excellent audio quality and is suitable for low bit rate, high quality audio applications. The MPEG-AAC audio coder uses the AAC scheme [9].
HE-AAC [8], also known as aacPlus, is a low bit rate audio coder. It is an AAC LC audio coder enhanced with spectral band replication (SBR) technology.
AAC is a second-generation coding scheme used for stereo and multichannel signals. Compared to earlier perceptual coders, AAC provides more flexibility and uses more coding tools [12].
Coding efficiency is enhanced by the following tools, which help attain higher quality at lower bit rates [12]:
- Higher frequency resolution, with the number of spectral lines increased from 576 to 1024.
- Improved joint stereo coding. The bit rate can frequently be reduced owing to the flexibility of the mid/side coding and intensity coding.
- Huffman coding [12] is applied to the coder partitions.
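The idea behind the entropy-coding stage can be seen in a small Huffman-code construction. Note that AAC ships fixed, pre-designed Huffman tables rather than deriving them from each stream, so this is only an illustration of the principle:

```python
import heapq

def huffman_code(freqs):
    """Build a prefix-free Huffman code table from {symbol: count}.
    Frequent symbols receive short codewords, rare ones long codewords."""
    # Heap entries: [total_count, tie_breaker, [symbol, code], ...]
    heap = [[cnt, i, [sym, ""]] for i, (sym, cnt) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)   # least frequent subtree -> gets prefix "0"
        hi = heapq.heappop(heap)   # next least frequent    -> gets prefix "1"
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], tie] + lo[2:] + hi[2:])
        tie += 1
    return {sym: code for sym, code in heap[0][2:]}
```

For the skewed distribution {a: 50, b: 25, c: 15, d: 10} this yields codeword lengths 1, 2, 3 and 3 bits, shorter on average than the 2 bits/symbol of a fixed-length code.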
An overview of spectral band replication technology in the aacPlus audio codec
Spectral band replication (SBR) is a new audio coding tool that significantly improves the coding gain of perceptual coders and speech coders. Currently, there are three different audio coders that have shown a vast improvement through combination with SBR: MPEG-AAC, MPEG-Layer II and MPEG-Layer III (mp3), all three being parts of the open ISO-MPEG standard. The combination of AAC and SBR will be used in the standardized Digital Radio Mondiale system, and SBR is also being standardized within MPEG-4 [15].
Block diagram of SBR encoder:
Figure 4: Block diagram of the SBR encoder [15]
The basic layout of the SBR encoder is shown in figure 4. The input signal is initially fed to a down-sampler, which supplies the core encoder with a time domain signal having half the sampling frequency of the input signal. The input signal is in parallel fed to a 64-channel analysis QMF bank. The outputs from the filter bank are complex-valued sub-band signals. The sub-band signals are fed to an envelope estimator and various detectors. The outputs from the detectors and the envelope estimator are assembled into the SBR data stream. The data is subsequently coded using entropy coding and, in the case of multichannel signals, also channel-redundancy coding. The coded SBR data and a bitrate control signal are then supplied to the core encoder. The SBR encoder interacts closely with the core encoder. Information is exchanged between the systems in order to, for example, determine the optimal cutoff frequency between the core coder and the SBR band. The core coder finally multiplexes the SBR data stream into the combined bitstream [15].
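The overall data path can be caricatured in a few lines. This toy sketch replaces the 64-channel complex QMF bank and the detectors with a plain FFT and a crude decimator, and the band count is an arbitrary choice; it only illustrates the split into a half-rate core signal plus a coarse high-band envelope:

```python
import numpy as np

def sbr_encoder_sketch(x, num_bands=8):
    """Toy illustration of the SBR encoder data path: hand the core
    coder a half-rate signal, and keep only a coarse spectral envelope
    of the high band as side information. Not the real SBR algorithm."""
    core_input = x[::2]                    # naive 2:1 down-sampling (no anti-alias filter)
    spectrum = np.abs(np.fft.rfft(x))
    high = spectrum[len(spectrum) // 2:]   # the band the core coder will not carry
    bands = np.array_split(high, num_bands)
    envelope = np.array([np.sqrt(np.mean(b ** 2)) for b in bands])
    return core_input, envelope
```

The decoder-side counterpart would regenerate the high band from the transmitted low band and shape it with this envelope, which is why the side information can stay so small.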
Block diagram of the SBR decoder:
Figure 5: Block diagram of the SBR decoder [15]
Figure 5 illustrates the layout of the SBR-enhanced decoder. The received bitstream is divided into two parts: the core coder bitstream and the SBR data stream. The core bitstream is decoded by the core decoder, and the output audio signal, typically of lowpass character, is forwarded to the SBR decoder together with the SBR data stream. The core audio signal, sampled at half the frequency of the original signal, is first filtered in the analysis QMF bank. The filter bank splits the time domain signal into 32 sub-band signals. The outputs from the filter bank, i.e. the sub-band signals, are complex-valued and thus oversampled by a factor of two compared to a regular QMF bank [15].
SUBJECTIVE PERFORMANCE OF G.719
Subjective tests for the ITU-T G.719 Optimization/Characterization phase were performed from mid-February through early April 2008 by independent listening laboratories in American English. According to a test plan designed by ITU-T Q7/SG12 [23] experts, two experiments were conducted on the joint candidate codec:
Experiment 1: Speech (clean, reverberant, and noisy)
Experiment 2: Mixed content and music
Mixed content items are representative of advertisements, film trailers, news with jingles, and music with announcements, and contain speech, music, and noise. Each experiment used the "triple stimulus/hidden reference/double blind" test method described in ITU-R Recommendation BS.1116-1 [23]. A standard MPEG audio codec, LAME MP3 version 3.97 as found on the LAME website, was used as the reference codec in the subjective tests. The ITU-T requirement was that the G.719 candidate codec at 32, 48, and 64 kbps be proven "Not Worse Than" the reference codec at 40, 56, and 64 kbps, respectively, with a 95% statistical confidence level. In addition, the G.719 candidate codec at 64 kbps was also tested against the G.722.1C codec at 48 kbps for Experiment 2. The subjective test results for the G.719 codec are shown in Figures 6-8. Statistical analysis of the results showed that the G.719 codec met all performance requirements specified for the subjective Optimization/Characterization test. In Experiment 1 the G.719 codec was better than the reference codec at all bit rates. In Experiment 2 the G.719 codec was better than the reference codec at the lowest bit rate for all items, and at the two other bit rates for most of the items. An additional subjective listening test for the G.719 codec was conducted later to evaluate the quality of the codec at rates higher than those described in the ITU-T test plan. Because the quality expectation of the codec at these high rates is high, a pre-selection of critical items, for which the quality at the lower bit-rate range was most degraded, was conducted prior to testing. The test results are shown in Figure 8. Transparency was shown to be reached for critical material at 128 kbps.
Figure 6: Subjective test results, Experiment 1 [7]
Figure 7: Subjective test results, Experiment 2 [7]
Figure 8: Additional subjective tests [7]
Algorithmic efficiency
The G.719 codec has a low complexity and a low algorithmic delay [1]. The delay depends on the frame size of 20 milliseconds and the look-ahead of one frame used to form the transform blocks. Hence, the algorithmic delay of the G.719 codec is 40 milliseconds. The algorithmic delays of comparable codecs such as 3GPP eAAC+ [14] and 3GPP AMR-WB+ are significantly higher. For AMR-WB+ the algorithmic delay for mono coding is between 77.5 and 227.6 ms depending on the internal sampling frequency. For eAAC+ the algorithmic delay is 323 ms for mono coding at 32 kbps and a 48 kHz sampling rate. In Table 1 the average and worst-case complexity of G.719 is expressed in weighted millions of operations per second (WMOPS). The figures are based on complexity reports using the basic operators of the ITU-T STL2005 Software Tool Library v2.2 [7]. For comparison, the complexity of the three comparable audio codecs eAAC+, AMR-WB+ and ITU-T G.722.1C [8], the low-complexity super-wideband codec (14 kHz) that G.719 was developed from, is shown in Table 2. The memory requirements of G.719 are presented in Table 3. The delay and complexity measures show that the G.719 codec is very efficient in terms of complexity and algorithmic delay, especially when compared to eAAC+ and AMR-WB+.
Frame buffering and windowing with overlap
A time-limited block of the input audio signal can be seen as windowed with a rectangular window. The windowing, a multiplication in the time domain, becomes a convolution in the frequency domain and results in a large frequency spread for this window. In addition, the sampling theorem states that the maximum frequency that can be correctly represented in discrete time is the Nyquist frequency, i.e. half of the sampling rate; otherwise aliasing occurs. For example, in a signal sampled at 48 kHz, a frequency of 25 kHz, i.e. 1 kHz above the Nyquist frequency of 24 kHz, will be analyzed as 23 kHz due to the aliasing. Because of the large frequency spread of the rectangular window, the frequency analysis can be contaminated by this aliasing. In order to reduce the frequency spread and suppress the aliasing effect, windows without sharp discontinuities can be used. Two examples are the sine and the Hann windows, defined in [17], which compared to the rectangular window indeed have a larger attenuation of the side lobes but also a wider main lobe. This is illustrated in Figure 9, where the shape of the windows and the corresponding frequency spectra can be observed. Consequently, there has to be a trade-off between the possible aliasing and the frequency resolution.
Figure 9: Three window functions and their corresponding frequency spectra. The windows are 1920 samples long at a sampling rate of 48 kHz [17]
In the synthesis of the analysed and encoded blocks of a processed audio signal, the window effects have to be cancelled. For example, the inverse window function could be applied to the coded time-domain blocks, but artefacts are then likely to be audible near the block edges due to discontinuities and amplification of the coding errors. In order to reduce these block artefacts, overlap-add techniques are commonly used [17].
In ITU-T G.719 the blocks of two consecutive frames are windowed with a sine window of length 2N = 1920 samples that is defined by:

w(n) = sin[ (pi/(2N)) (n + 1/2) ],  n = 0, 1, ..., 2N-1
The signals are processed with an overlap in the data of 50% between consecutive blocks.
The windowed signal of each block is given by:

x_w(n) = w(n) * x(n),  n = 0, 1, ..., 2N-1

where x(n) denotes the 2N time-domain samples of the block.
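Putting the sine window and the 50% overlap together, perfect reconstruction can be verified numerically. The sketch below pairs a naive unnormalized MDCT with its inverse; the 2/N scaling matches this particular forward transform, not necessarily G.719's internal scaling:

```python
import numpy as np

def sine_window(two_n):
    # w(n) = sin[(pi/(2N)) (n + 1/2)], the window from the text
    n = np.arange(two_n)
    return np.sin(np.pi / two_n * (n + 0.5))

def _basis(N):
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(n + 0.5 + N / 2, k + 0.5))

def mdct(block, window):
    # Unnormalized forward MDCT of a windowed 2N-sample block -> N coefficients.
    N = len(block) // 2
    return (window * block) @ _basis(N)

def imdct(coeffs, window):
    # Inverse with 2/N scaling, followed by the synthesis window.
    N = len(coeffs)
    return window * ((2.0 / N) * (_basis(N) @ coeffs))

def analyse_synthesise(signal, N):
    """Round-trip a signal through windowed MDCT blocks hopped by N
    (50% overlap) and overlap-add the inverse transforms. Interior
    samples, covered by two blocks, come back exactly: the sine window
    satisfies w(n)^2 + w(n+N)^2 = 1 and the time-domain aliasing cancels."""
    w = sine_window(2 * N)
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - 2 * N + 1, N):
        block = signal[start:start + 2 * N]
        out[start:start + 2 * N] += imdct(mdct(block, w), w)
    return out
```

Only the first and last N samples lack an overlapping neighbour block, which is why a codec needs one frame of look-ahead, as noted in the delay discussion above.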