[Repost] next generation audio: CELT update 20101223

http://people.xiph.org/~xiphmont/demo/celt/demo.html

Overview

The Vorbis codec is more than halfway through its approximate intended lifetime of 20 years or so, and the state of the art in audio coding has improved considerably since Vorbis's introduction. Xiph has been developing two new, next-generation codecs (Ghost and CELT) as successors to Vorbis. Ghost research was postponed until recently to devote more resources to improving video (see the 'Thusnelda' and 'Ptalarbvorm' encoders), but Jean-Marc Valin of Xiph's Speex project has been able to continue work on CELT since late 2007. As of December 2010, CELT is nearing bitstream freeze and has been submitted to the IETF codec working group as an input codec.

The latest version of the CELT reference library implementation can always be downloaded from http://celt-codec.org/downloads.

What is CELT?

CELT is a general-purpose, low-delay codec intended for similar uses and performance targets as Vorbis, but with the additional features of very low delay and low CPU/memory requirements. CELT supports stereo, can achieve a total algorithmic delay as low as 5ms, scales well to lower bitrates than Vorbis, and currently provides superior audio fidelity to Vorbis on many if not most natural audio inputs. From 24kbps through 64kbps 48kHz stereo, it is comparable in quality to HE-AAC v1 and provides considerably higher audio fidelity than AAC-LD with equal or much lower delay.

By way of general feature summary:

Headerless
Arbitrary sampling rate
Mono/stereo encoding
Fixed-point encode and decode
VBR/constrained VBR/true CBR
Very low delay (arbitrary delay, 5ms minimum total latency at 48kHz)
Low-bitrate performance ('sweet spot' >= 32kbps for 48kHz stereo)
Flexible streaming with the ability to change most codec parameters mid-stream (so long as changes are signaled; this information is not in-band in CELT, though it is in-band for OPUS. More about OPUS later)
Royalty-free, no licensing required, BSD reference code

While primarily targeted at packet oriented networks, CELT includes a number of design features that increase robustness to bit errors, also making it suitable for non-IP wireless applications.

Why low delay?

The minimum algorithmic delay for a typical Vorbis encoding mode is over 100ms; in current encoders it is actually considerably higher (between 200 and 400ms). This delay is typical of other general-purpose audio codecs (mp3, AAC, etc) not intended for realtime telepresence applications. Low delays, typically 20ms or ideally much less, are a hard requirement for applications such as collaborative music. Although speech applications typically tolerate latencies of around 100ms, even here lower latencies can make interaction more natural and less stressful for the speaker and listener.

Sample 1: 250ms delay between performers

Sample 2: 15ms delay between performers

These samples illustrate the difficulties caused by latency in realtime communication. Two participants attempt to sing 'Row, Row, Row Your Boat' together over telepresence links with latencies of 250ms (above) and 15ms (below). High latencies make the task fairly difficult.

This is similar to the 'cell phone collisions' many cell phone users experience, when the lower but still significant (~100ms) delays over a cell phone link result in both speakers repeatedly beginning to speak at the same time, both stopping to let the other speaker continue, and then both beginning to speak again, resulting in another collision. Lather, rinse, repeat.

Since low delay is clearly a desirable trait in a codec, the obvious question would be, "why not design every codec to be low latency?" Unfortunately, low latency impacts codec efficiency. High latencies allow considerably higher energy compaction (and thus coding efficiency) as well as deeper analysis of input signals. Low latency design is more difficult for a comparable level of bitrate performance.

CELT Design Overview

Simplified CELT block diagram

"CELT" stands for "Constrained Energy Lapped Transform", an accurate and remarkably unforced acronym. It is exactly that: a lapped transform codec with a psychoacoustic design philosophy based on band-energy preservation.

Lapped transform

CELT is a lapped transform-domain codec like Vorbis and AAC; however, it uses quite short windows with low overlap to achieve low latency.

Despite the small windows and short overlap, transient pre-echo suppression occasionally demands yet shorter windows. In this case, the frame is split and smaller MDCTs are done on each piece. The results are then interleaved and coded as normal.
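The lapped-transform idea can be sketched in a few lines of Python. This is a minimal illustration, not CELT's transform: it assumes a plain sine window with 50% overlap (CELT itself uses a low-overlap window), and the function names (`mdct`, `imdct`, `lapped_roundtrip`, `interleave`) are hypothetical. It shows that windowed MDCT blocks overlap-add back to the original signal, and what interleaving the coefficients of several short MDCTs might look like.

```python
import math

def sine_window(length):
    """Sine window; satisfies w[t]^2 + w[t + length/2]^2 = 1 (Princen-Bradley)."""
    return [math.sin(math.pi * (t + 0.5) / length) for t in range(length)]

def mdct(block):
    """Forward MDCT: 2N time samples -> N coefficients."""
    n = len(block) // 2
    return [sum(block[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(2 * n))
            for k in range(n)]

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples."""
    n = len(coeffs)
    return [2.0 / n * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                          for k in range(n))
            for t in range(2 * n)]

def lapped_roundtrip(signal, n):
    """Window, transform, inverse-transform and overlap-add with hop size n."""
    win = sine_window(2 * n)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = [signal[start + t] * win[t] for t in range(2 * n)]
        rec = imdct(mdct(block))
        for t in range(2 * n):
            # The synthesis window plus overlap-add cancels the time-domain aliasing.
            out[start + t] += rec[t] * win[t]
    return out

def interleave(short_spectra):
    """Interleave coefficients of several short MDCTs: a0, b0, a1, b1, ..."""
    return [c for group in zip(*short_spectra) for c in group]
```

Only samples covered by two overlapping blocks are reconstructed exactly; the first and last `n` samples of the signal are not, since they are covered by a single block.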

Critical bands

The spectral lines produced by each transform are grouped and coded by critical band. This both holds coding noise within critical bands and also provides approximately correct band energy resolution.

Constrained Energy

The single most important new discovery in Vorbis was that preserving narrowband energy produces far superior results to earlier techniques that attempted to globally minimize quantization noise. This was a relatively late discovery in the Vorbis project, and although it was easy enough to add energy preservation to the Vorbis encoder ('Noise Normalization'), Vorbis did not incorporate energy preservation as an inherent design concept.

CELT's design assumes unity narrowband energy gain throughout. The absolute energy of each band is explicitly coded, and every entropy-backend codeword also encodes unity energy. Critical band spectral energy and the coarse shape of the spectral envelope are thus preserved no matter what.
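As a rough sketch of this gain/shape split, the Python below groups spectral lines into bands, codes each band as an explicit gain plus a unit-energy shape, and shows that the decoded band energy matches the original even when the shape is quantized very coarsely. The band edges and all function names are hypothetical, chosen only for illustration; they are not CELT's actual band layout or API.

```python
import math

def split_bands(spectrum, edges):
    """Group spectral lines into bands given (hypothetical) band edge indices."""
    return [spectrum[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]

def band_gain(band):
    """Square root of the band energy."""
    return math.sqrt(sum(c * c for c in band))

def normalize(band):
    """Scale a band to unit energy; an all-zero band is returned unchanged."""
    g = band_gain(band)
    return [c / g for c in band] if g > 0 else list(band)

def encode_band(band):
    """Split a band into an explicitly coded gain and a unit-energy shape."""
    return band_gain(band), normalize(band)

def decode_band(gain, coarse_shape):
    """Renormalize whatever shape was received, then restore the coded gain."""
    return [gain * c for c in normalize(coarse_shape)]
```

For example, rounding a shape to steps of 0.5 before decoding changes the fine detail of the band but not its energy, because `decode_band` always renormalizes the shape before applying the gain.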

Variable Time-Frequency Resolution

In addition to increasing time resolution via frame splitting, CELT can also further adjust time/frequency resolution by performing Hadamard transforms in one band. A forward Hadamard transform over several blocks increases frequency resolution and an inverse transform in one block increases time resolution (though with more temporal leakage than via frame splitting). TF adjustment is signaled per band and used to further bias a frame toward more accurately encoding tonal or transient content.
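To illustrate the kind of operation involved, here is a minimal orthonormal Hadamard transform in Python. It is a generic sketch rather than CELT's routine, and the function name is hypothetical. Applied to the values of one frequency bin across several short blocks, it trades time resolution for frequency resolution; because the orthonormal transform is its own inverse, applying it again undoes the change.

```python
import math

def hadamard(values):
    """Orthonormal Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b  # butterfly: sum and difference
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in out]
```

A steady tone gives the same value in every short block, for example `[1, 1, 1, 1]`, and the transform compacts that into a single coefficient, which is the sense in which frequency resolution increases.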

norm-energy PVQ range encoder

CELT encodes energy in each band explicitly. The spectral residue of each band is quantized as a whole band using a fixed number of spectral energy 'pulses' (K). These pulses are amplitude (not energy) quanta that total an amplitude of 1; each pulse represents an amplitude of 1/K. Each codeword is thus an N-dimensional vector of integers whose magnitudes sum to K. The codeword space is obviously countable, representing points on the surface of an N-1 orthoplex (the dual of a hypercube; a 3-dimensional orthoplex is an octahedron).

The astute reader will notice that in the above explanation, each codeword represents a fixed summed amplitude, not a fixed energy. The orthoplex is warped such that the energy of each vector is normalized to an energy of 1, inflating the vectors to points on an N-1 sphere. The direction of each vector is not altered, resulting in higher resolution at the 'poles'. In this form, the codewords also turn out to have approximately flat probability, eliminating the need for entropy encoding of residual data.

The specific implementation of this coding technique used by CELT is known as Pyramid Vector Quantization (Fischer, 1986). The design neatly sidesteps any need for Vorbis-like codebooks in residue coding.
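The codeword space described above can be sketched in Python as follows. The counting recurrence is the standard one for signed integer vectors of dimension N whose magnitudes sum to K. The quantizer is a deliberately simple rounding search assumed for illustration, not the search the CELT encoder uses, and all function names are hypothetical.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def pvq_count(n, k):
    """Number of n-dimensional integer vectors whose magnitudes sum to k."""
    if k == 0:
        return 1
    if n == 0:
        return 0
    return pvq_count(n - 1, k) + pvq_count(n, k - 1) + pvq_count(n - 1, k - 1)

def pvq_quantize(x, k):
    """Place k pulses so the integer vector roughly follows the direction of x."""
    total = sum(abs(v) for v in x)
    if total == 0:
        return [k] + [0] * (len(x) - 1)
    scaled = [abs(v) * k / total for v in x]
    pulses = [int(s) for s in scaled]  # floor of each scaled magnitude
    remaining = k - sum(pulses)
    # Hand the leftover pulses to the positions with the largest rounding error.
    order = sorted(range(len(x)), key=lambda i: scaled[i] - pulses[i], reverse=True)
    for i in order[:remaining]:
        pulses[i] += 1
    return [p if v >= 0 else -p for p, v in zip(pulses, x)]

def pvq_decode(pulses):
    """Warp the orthoplex point onto the unit sphere (unit energy)."""
    norm = math.sqrt(sum(p * p for p in pulses))
    return [p / norm for p in pulses]
```

With N = 3 and K = 1 the count is 6, the vertices of an octahedron. Since the codewords are treated as roughly equiprobable, a band costs about `math.log2(pvq_count(n, k))` bits.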

Pulse Spreading

If too much diffuse energy in a band collapses into just a few pulses due to very low bitrate coding, this causes the classic swirling/metallic artifacts typical of transform codecs. These artifacts are mostly associated with mp3, which has the least ability to mitigate the problem.

Following the equation above, spectral collapse happens primarily when K is small, causing an audibly sparse spectrum. Spreading essentially jitters pulses around as a kind of spectral dither; if and when low-bitrate encoding collapses a noisy spectrum into just a few pulses after the forward spreading filter, the inverse filter in the decoder 'unjitters' the collapsed energy, spreading it back out across the narrowband spectrum.

It might not be obvious from the description above, but spreading is purely a forward/inverse filtering operation; there is no additional side information transmitted. The only additional signaling is whether folding is enabled or disabled.
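A matched forward/inverse filter pair of this kind can be sketched as a chain of small rotations between neighbouring coefficients. This is an illustrative stand-in, not CELT's actual spreading filter: the rotation structure, the angle `theta`, and the function names are all assumptions. It shows the two properties the text relies on: the inverse exactly undoes the forward filter with no side information beyond the shared parameter, and a spectrum that has collapsed to a single pulse is spread back across the band by the decoder-side filter without changing its energy.

```python
import math

def spread(x, theta):
    """Forward filter (encoder side): chain of rotations over adjacent pairs."""
    c, s = math.cos(theta), math.sin(theta)
    y = list(x)
    for i in range(len(y) - 1):
        y[i], y[i + 1] = c * y[i] - s * y[i + 1], s * y[i] + c * y[i + 1]
    return y

def unspread(y, theta):
    """Inverse filter (decoder side): undo the rotations in reverse order."""
    c, s = math.cos(theta), math.sin(theta)
    x = list(y)
    for i in reversed(range(len(x) - 1)):
        x[i], x[i + 1] = c * x[i] + s * x[i + 1], -s * x[i] + c * x[i + 1]
    return x
```

For instance, `unspread([0.0, 0.0, 1.0, 0.0], 0.4)` returns a vector with energy in all four positions and the same unit norm as the single pulse it started from.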

Band Folding

Band folding has a similar effect to Spectral Band Replication (SBR), except that we don't replicate bands; we just reuse residue codewords from lower bands, reconstituted in the context of the encoded energy of the higher band. It is much simpler in concept and execution than SBR, a lucky break of the PVQ design.
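In gain/shape terms, folding amounts to borrowing a lower band's unit-energy shape and scaling it by the higher band's coded energy. The Python below is a minimal sketch under that reading; the function name and the choice of which lower band to borrow from are hypothetical.

```python
import math

def fold_band(lower_shape, upper_gain):
    """Reuse a lower band's shape, rescaled to the higher band's coded gain."""
    norm = math.sqrt(sum(c * c for c in lower_shape))
    if norm == 0:
        return [0.0] * len(lower_shape)
    return [upper_gain * c / norm for c in lower_shape]
```

The folded band carries none of its own residue bits, yet its energy still matches the explicitly coded value, so the spectral envelope stays intact.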

Left: an illustration of constrained energy, pulse spreading, and band folding in action. The spectrogram shows the original mono sample, MP3 at 32kbps, Vorbis at 32kbps and CELT at 32kbps. Vorbis also preserves band energy but does not have the additional spreading and folding mechanisms.
