From XiphWiki
Revision as of 16:09, 5 March 2009 by Gmaxwell (talk | contribs) (voice of the confused copy)

If you have ever sat down in front of a good stereo system playing a good recording, you've probably had the "They are here" experience: a moment when, if you closed your eyes, you would be convinced that the performer was in the room with you. Yet users of teleconferencing systems, even expensive systems professionally installed in dedicated rooms, never suffer that kind of pleasant confusion.

There are three primary reasons why teleconferencing and VoIP systems don't match the realism of a good stereo system:

Audio Quality

A compact disc recording captures audio frequencies up to 22 kHz with over 90 dB of instantaneous signal-to-noise ratio. If we ignore the 'spatial' component of listening, the quality provided by a CD is very close to the limits of human perception. It's hard to do better, which is one reason why better-than-CD-quality technology has not seen much adoption outside of the studio, where higher quality is important to avoid the accumulation of error as audio is processed.
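As a rough check on those figures, the sketch below computes the highest representable frequency and the textbook signal-to-noise ratio of ideal PCM. It assumes the standard CD parameters (44.1 kHz sampling, 16 bits per sample), which the text above does not state, and the function names are illustrative only.

```python
import math

def nyquist_khz(sample_rate_hz: float) -> float:
    """Highest frequency a sampled signal can represent, in kHz."""
    return sample_rate_hz / 2.0 / 1000.0

def ideal_pcm_snr_db(bits: int) -> float:
    """SNR of an ideal uniform quantizer for a full-scale sine wave.

    This is the textbook 6.02 * bits + 1.76 dB rule, not a measurement
    of any real recording chain.
    """
    return 20.0 * math.log10(2 ** bits) + 10.0 * math.log10(1.5)

# Assumed CD parameters: 44.1 kHz sampling, 16-bit samples.
print(f"bandwidth limit: {nyquist_khz(44100):.2f} kHz")
print(f"ideal 16-bit SNR: {ideal_pcm_snr_db(16):.1f} dB")
```

The theoretical figure comes out a little under 100 dB, which is consistent with the "over 90 dB" quoted above once real-world losses are allowed for.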

On the flip side, a telephone passes only frequencies up to 4 kHz, and that with substantially lower SNR. A 4 kHz bandpass provides acceptable intelligibility for speech, although the "s" and "f" sounds can become confused, but it is clearly not high fidelity. Some newer voice over IP and teleconferencing systems allow for "wideband audio", which captures an 8 kHz bandpass. This is a major improvement, but energy above 8 kHz, while not important for intelligibility, is critical for providing a sense of presence and immediacy: higher frequencies are attenuated faster in air, so a voice without high-frequency energy sounds like someone far away. Frequencies above 8 kHz are also important for music, since common instruments have harmonics extending all the way past the limits of the human ear.

Making matters worse, most modern communication systems apply compression algorithms to keep the network traffic requirements reasonable. Done well, this compression is hardly noticeable, but the compression algorithms used for VoIP were designed for speech and focused on providing good intelligibility rather than the best total quality. Background noise and music cause significant problems for these codecs.

CELT solves the audio quality problem for teleconferencing by supporting the full range of human hearing. When enough network capacity is available, it can offer quality indistinguishable from a CD, and even when network capacity is tight, CELT can give near-CD quality with less network usage than a traditional telephone call carried as uncompressed 4 kHz bandpass audio.
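To put the comparison in numbers, here is a small sketch of uncompressed PCM bitrates. It assumes the usual parameters for each format (8 kHz sampling at 8 bits per sample for a traditional telephone channel, 44.1 kHz stereo at 16 bits for CD), none of which are stated in the text above. It says nothing about the bitrate CELT itself uses, only what the telephone baseline is.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0

# Assumed telephone channel: 8 kHz sampling (4 kHz bandpass), 8-bit samples, mono.
telephone_kbps = pcm_bitrate_kbps(8000, 8)

# Assumed CD audio: 44.1 kHz sampling, 16-bit samples, stereo.
cd_kbps = pcm_bitrate_kbps(44100, 16, channels=2)

print(f"telephone: {telephone_kbps:.1f} kbit/s")
print(f"CD:        {cd_kbps:.1f} kbit/s")
print(f"ratio:     {cd_kbps / telephone_kbps:.1f}x")
```

Under these assumptions the claim above amounts to delivering near-CD quality in less than the 64 kbit/s of a plain telephone channel, roughly a twentieth of the raw CD rate.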


We live in a three-dimensional world, and sound is constantly bombarding us from all directions. Even when you're listening to a person directly in front of you, echoes and reverberation are coming from all sides. These effects are subtle but important to perception. Stereo systems take advantage of the fact that humans have only two ears to give a passable impression of spaciousness with only two speakers, but most VoIP systems send only a single 'monaural' channel.

CELT solves the spatiality problem for VoIP by supporting multichannel audio at network traffic levels which make it an easy decision. CELT can be used to build three-dimensional surround sound conferencing using technology like Ambisonics, which can provide realism beyond what is possible with a stereo system: systems that go beyond "they are here" and reach the level of "you are there".


Sound takes time to travel through the air: it takes about 3 ms for sound to travel a meter. Like sound, data takes time to travel through networks. Light in optical cables travels much faster (around 200 km per ms), but it can still take considerable time to travel from city to city, and in network audio many other components such as audio hardware, the computer operating system, routers, and jitter buffers add their own delay. Most significantly, typical audio compression algorithms add non-trivial delay. The popular algorithms for music such as MP3, AAC, and Vorbis have delays of hundreds of milliseconds, which make them completely unsuitable for teleconferencing and VoIP. Even the speech-focused algorithms designed for VoIP typically have delays on the order of 25-40 ms.
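The propagation figures above are easy to reproduce. The following sketch assumes a speed of sound of 343 m/s (dry air at about 20 °C, a value not given in the text) and uses the 200 km per ms fiber figure quoted above; the distances in the example are arbitrary.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumption: dry air at roughly 20 degrees C
FIBER_KM_PER_MS = 200.0         # figure quoted in the text for optical cable

def sound_delay_ms(distance_m: float) -> float:
    """Time for sound to cover distance_m of air, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

def fiber_delay_ms(distance_km: float) -> float:
    """One-way propagation time over distance_km of optical fiber, in milliseconds."""
    return distance_km / FIBER_KM_PER_MS

print(f"sound over 1 m:     {sound_delay_ms(1.0):.1f} ms")
print(f"fiber over 4000 km: {fiber_delay_ms(4000.0):.1f} ms")
```

Real network paths are rarely straight lines, so the fiber figure is a lower bound on the delay between two cities rather than an estimate of it.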

When the delay of a communication system is too high, the participants will talk over each other, and annoying echo and feedback can occur, necessitating computationally expensive and potentially quality-degrading echo cancellation. Recent studies have shown that musicians performing together remotely can only tolerate a total one-way delay on the order of 20-25 ms before their performance begins to degrade.

CELT helps with the latency problem for VoIP by imposing a delay of typically only 8 ms (as low as 2 ms is supported). CELT is able to reach the same kinds of latencies you would experience sitting in a room with someone. By using CELT, a VoIP system can keep the total delay down to an acceptable level, even though other sources of delay, like operating system related delays or the speed of light, may be non-negotiable.

CELT makes applications like musicians jamming remotely reasonable when they weren't reasonable before.
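A rough delay budget shows why the codec delay matters so much. The sketch below takes the 25 ms tolerance, the 8 ms and 2 ms CELT delays, and the 200 km per ms fiber speed from the text; the 10 ms allowance for audio hardware, operating system, and jitter buffers is a made-up placeholder, and the 30 ms entry simply stands in for a codec in the 25-40 ms range mentioned above.

```python
def max_fiber_distance_km(budget_ms: float, codec_ms: float, other_ms: float,
                          km_per_ms: float = 200.0) -> float:
    """Longest fiber run that still fits a one-way delay budget.

    Whatever is left of the budget after codec and other fixed delays
    is spent on propagation. Returns 0 when the budget is already exceeded.
    """
    remaining_ms = budget_ms - codec_ms - other_ms
    return max(0.0, remaining_ms) * km_per_ms

BUDGET_MS = 25.0  # upper end of the tolerance quoted for remote musicians
OTHER_MS = 10.0   # hypothetical: hardware, OS, and jitter buffer delays combined

for label, codec_ms in [("CELT at 8 ms", 8.0),
                        ("CELT at 2 ms", 2.0),
                        ("codec at 30 ms", 30.0)]:
    km = max_fiber_distance_km(BUDGET_MS, codec_ms, OTHER_MS)
    print(f"{label}: up to {km:.0f} km of fiber")
```

With these assumed numbers, an 8 ms codec leaves room for well over a thousand kilometers of fiber, while a 30 ms codec has used up the whole budget before the signal leaves the building.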

A stereo system doesn't provide low delay, but it doesn't need to: you aren't interacting with the performers. It is this delay requirement that caused us to create CELT when Xiph.Org already had the Ogg/Vorbis format, which offers great quality but has delays too high for VoIP. In many ways, delay is even more important than quality in creating a realistic impression for teleconferencing.

There is a final reason why you should use CELT, one which is true for all Xiph.Org formats:

Many popular formats for music, like MP3 and AAC, and for speech, like G.729 and AMR-WB, are owned by companies who charge significant fees for people to use them. If someone develops software which uses these formats without paying licensing fees, they and their users are at risk of surprise fees or litigation, even if the software was distributed at no cost. Worse, since the formats cost money, different companies adopt different formats and leverage them against their competition, creating fragmentation and needless incompatibility. We don't begrudge the developers of these formats the ability to profit from their creations, but we'd like to be able to communicate with our friends across the internet without either paying a tax, even indirectly, for the privilege of using some format or breaking the law. We'd like to be able to create music and experience culture without being forced to deal with some technological middleman's fee collection or DRM system. We think you should have that freedom too.

CELT is released under a 'BSD-style' license which allows anyone to use it for any purpose, even in commercial and proprietary software, without paying any fees, obtaining any further permission, or making any other material concession. CELT is assembled with care, predominantly from technological components too old to be patented, and designed with the express intention of avoiding patented techniques. We've invented new, superior mathematical techniques to avoid potentially encumbered ones, and then rediscovered the same 'new' techniques forgotten in decades-old scientific literature. Perhaps most significantly, CELT offers performance for its intended uses significantly better than any of the existing potential competition, free or otherwise. Since there is no large installed base of VoIP users using a patent-encumbered CELT competitor, you can use CELT without the unfortunate cost of incompatibility which has troubled some other free formats.