Difference between revisions of "Ghost"

Latest revision as of 23:00, 12 November 2013
This page is meant to track ideas about low-delay, high-quality audio coding. The work has just started, so don't expect anything in the near future (or at all for that matter).
Signal types
There are many signal types that can be found:
 Sinusoids
 A few pure (or nearly pure) tones
 Harmonic
 Periodic waveforms (e.g. voice)
 Many (sometimes closely spaced) harmonics
 Shaped noise
 Signals that are (or are indistinguishable from) filtered (coloured) white noise
 Transients
 Whatever doesn’t fit above I guess
Signal analysis
Sinusoidal
Good when most of the energy is contained in a few sinusoids. May be problematic for very harmonic signals, e.g. a male voice may have close to a hundred harmonics in the full audio band.
Pitch
Good for harmonic signals. Hard to estimate and code when extra sinusoids and noise are present. At 48 kHz, there is no need for fractional pitch or anything like that, but subband pitch analysis or multi-tap gain is a good idea. Also, there needs to be a way to remove the effect of sinusoids and noise. Even then, removing the "noise" also means removing all excitation to the pitch predictor, so that's a problem.
MDCT
Very general. Can code anything, but not very good at anything. High delay (2x frame size). Could put several "MDCT frames" in each codec frame to make latency smaller.
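A minimal sketch of where the delay comes from, assuming a direct O(N^2) MDCT with a sine window (the function names here are made up for illustration, and nothing below is tuned for speed). Each block of N coefficients is computed from a 2N-sample window, and a sample is only recovered once the two overlapping windows that cover it have both been decoded and added together:

```python
import numpy as np

def mdct_basis(n_coefs):
    """N x 2N cosine basis of the MDCT (N coefficients from 2N samples)."""
    n = np.arange(2 * n_coefs)
    k = np.arange(n_coefs)
    return np.cos(np.pi / n_coefs * np.outer(k + 0.5, n + 0.5 + n_coefs / 2))

def sine_window(n_coefs):
    """Sine window; satisfies w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley)."""
    return np.sin(np.pi * (np.arange(2 * n_coefs) + 0.5) / (2 * n_coefs))

def mdct_roundtrip(x, n_coefs):
    """Analyse and resynthesise x with 50% overlapped blocks (overlap-add)."""
    basis = mdct_basis(n_coefs)
    w = sine_window(n_coefs)
    out = np.zeros(len(x))
    for start in range(0, len(x) - 2 * n_coefs + 1, n_coefs):
        block = x[start:start + 2 * n_coefs]
        coefs = basis @ (w * block)                       # forward MDCT
        recon = (2.0 / n_coefs) * (basis.T @ coefs)       # inverse MDCT (aliased)
        out[start:start + 2 * n_coefs] += w * recon       # aliasing cancels in the overlap
    return out
```

Only the samples covered by two windows come back exactly; the first and last N samples of the output still contain uncancelled aliasing.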
Wavelets
Just a fancy name for subbands with non-uniform width. Probably similar to having an MDCT with few subbands, except that the subbands could follow (roughly) the critical bands.
LPC + stochastic cb
Like CELP with no pitch. Could be used to code the noisy part of the signal with low bitrate. Would need to figure out how to preserve the energy of the noise when going with 1/2 bit per sample and less.
Codec Structure Ideas
Sinusoidal + wavelet
 Preemphasis
 Extract as many sinusoids as possible
 Wavelet transform
 Code wavelet coefs using VQ
Sinusoidal, pitch and noise
 Preemphasis
 Joint pitch + sinusoidal estimation
 LPC analysis
 CELP-like coding of the residual (mainly noise)
Estimation Ideas
Sinusoid Estimation
Very hard to do properly, especially with reasonable complexity and low delay. Some ideas:
Least-squares type matching
Step one: estimate sinusoid frequencies.
Tried so far:
 MUSIC: fails on non-trivial signals and is very complex, although there's an AES paper that recommends first whitening the noise part of the signal before applying the algorithm. Haven't tried that so far.
 ESPRIT: fails on non-trivial signals and is very complex (see above for a possible solution)
 LPC: would probably work, but requires an insanely high order (so impractical), plus it tends to be numerically unstable anyway.
 FFT: poor resolution, but that's all we have left so far. There's an AES paper that describes a sort of time-domain phase unwrapping that could help.
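To give an idea of how far the FFT resolution can be stretched, here is a rough sketch of a common refinement: parabolic interpolation of the log-magnitude around the largest bin of a Hann-windowed FFT. This is not the phase-unwrapping method from the AES paper mentioned above, just a standard trick swapped in for illustration, and the function name is an assumption. It only handles a single dominant sinusoid.

```python
import numpy as np

def estimate_peak_frequency(x, sample_rate):
    """Estimate the frequency (Hz) of the strongest sinusoid in x."""
    n = len(x)
    spectrum = np.abs(np.fft.rfft(x * np.hanning(n)))
    # Skip DC and Nyquist so that both neighbours of the peak exist.
    k = int(np.argmax(spectrum[1:-1])) + 1
    # Fit a parabola through the log-magnitudes of bins k-1, k, k+1.
    a, b, c = np.log(spectrum[k - 1:k + 2] + 1e-12)
    offset = 0.5 * (a - c) / (a - 2.0 * b + c)   # peak position in bins, relative to k
    return (k + offset) * sample_rate / n
```

With a 1024-point FFT at 48 kHz the bins are about 47 Hz apart, so the raw bin index can be off by more than 20 Hz; the interpolated estimate is usually much closer, but it degrades quickly when two sinusoids fall within a few bins of each other.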
Step two: what to match
Step three: solving
Looks like it's possible to solve an NxM least-squares problem in O(N*M) time using an iterative algorithm, as long as the system matrix is near-orthogonal. If we want to solve Ax=b and A^h*A ~= I, then we start with x(0)=A^h*b and then iterate over k:
 x(k+1) = x(k) + A^h*(b - A*x(k))
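A small sketch of that iteration, assuming NumPy and a made-up test system (orthonormal columns plus a small perturbation, so that A^h*A is close to I). Each pass costs two matrix-vector products, hence O(N*M) per iteration; the number of iterations needed depends on how far A^h*A is from the identity.

```python
import numpy as np

def iterative_least_squares(A, b, num_iters=20):
    """Approximate the least-squares solution of A x = b when A^h A ~= I."""
    AH = A.conj().T
    x = AH @ b                       # starting point: x = A^h b
    for _ in range(num_iters):
        x = x + AH @ (b - A @ x)     # correct x using the current residual b - A x
    return x

def demo_system(n=64, m=4, perturbation=0.05, seed=0):
    """Hypothetical near-orthogonal test system, for illustration only."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, m)))   # orthonormal columns
    A = Q + perturbation * rng.standard_normal((n, m)) / np.sqrt(n)
    b = rng.standard_normal(n)
    return A, b
```

The result can be compared against `np.linalg.lstsq(A, b, rcond=None)[0]`. If A^h*A drifts too far from I (roughly, eigenvalues outside the range 0 to 2), the iteration diverges instead of converging.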
Phase lock loop (PLL)
Quantization Ideas
After the sinusoids have been extracted, they have to be quantized. The possible ways are:
 Sort the sinusoids according to energy and transmit only a finite number or only ones with a specific energy or above. The indices of the sinusoids before rearranging will have to be sent.
 I think it's worth checking which is most efficient. Sorting the sinusoids will help quantizing the amplitude, but make it harder to encode frequency. Jmspeex 05:45, 28 June 2006 (PDT)
 Use psychoacoustic properties and remove all the sinusoids that will be masked by other tones.
 Of course, we don't want to encode perceptually irrelevant sinusoids. Actually, we want the resolution (in amplitude, phase and probably frequency) to scale with the amplitude-to-mask ratio or something like that. Jmspeex 05:45, 28 June 2006 (PDT)
 After removing perceptually irrelevant and low-energy tones, the energy in each critical band has to be adjusted to match the initial energy.
 Possibly. I don't know much on that topic. Monty probably has valuable experience. Jmspeex 05:45, 28 June 2006 (PDT)
 Time-differential coding of sinusoids across frames can be used
 Definitely. This is very important if we plan on using short frames. It would be important to minimize inter-frame redundancy, but still make it possible to recover from packet loss. For that, we could either use a leaky predictor (like the pitch in CELP) or use keyframes (like a video codec). Jmspeex 05:45, 28 June 2006 (PDT)
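To make the leaky-predictor option concrete, here is a toy sketch for one sinusoid parameter (say an amplitude in dB) coded differentially across frames. The leak factor, step size and function names are all assumptions picked for illustration. Because the prediction is scaled by a leak factor below 1, any mismatch between encoder and decoder state shrinks by that factor every frame after a lost packet.

```python
def encode(values, leak=0.8, step=0.5):
    """Quantize each value relative to a leaky prediction from the previous frame."""
    pred = 0.0
    indices = []
    for v in values:
        q = round((v - leak * pred) / step)   # quantized prediction residual
        indices.append(q)
        pred = leak * pred + q * step         # track what the decoder reconstructs
    return indices

def decode(indices, leak=0.8, step=0.5, lost=()):
    """Reconstruct values; frames listed in `lost` are concealed with a zero residual."""
    pred = 0.0
    out = []
    for i, q in enumerate(indices):
        if i in lost:
            pred = leak * pred                # packet lost: prediction only
        else:
            pred = leak * pred + q * step
        out.append(pred)
    return out
```

The trade-off is that a smaller leak factor recovers faster but leaves larger residuals to code, which is the same compromise keyframes make at a coarser granularity.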
Quantization of frequencies
 Quantize frequencies of a few selected sinusoids and recreate other values using interpolation.
 How would you do that? (maybe I'm not following here) Jmspeex 05:45, 28 June 2006 (PDT)
Quantization of Amplitudes
 Model the energy curve of the sinusoids – for instance using an exponential curve
 Exponential decay might be a good way to do inter-frame prediction. Jmspeex 05:45, 28 June 2006 (PDT)
 Quantize amplitudes of a few selected sinusoids and recreate other values using interpolation.
 Possibly, but probably not at first (hard problem). Jmspeex 05:45, 28 June 2006 (PDT)
Quantization of phase and modulation parameters
 Can be scalar quantized with the number of bits allocated being proportional to the energy of the sinusoid
 Yes. Also, this is something that can be predicted very well across frames. It's not even necessary to make that one robust to losses, because as long as the phase is continuous, no one will notice. Jmspeex 05:45, 28 June 2006 (PDT)
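A rough sketch of the scalar quantization idea, taking "proportional to the energy" literally: a hypothetical bit budget is split across sinusoids in proportion to their energy, and each phase is then uniformly quantized over one full turn with that many bits. The function names and the rounding rule (floor, which may leave a few bits unused) are assumptions; a real allocation would more likely follow log-energy or the masking threshold.

```python
import numpy as np

def allocate_bits(energies, total_bits):
    """Split total_bits across sinusoids in proportion to their energy."""
    energies = np.asarray(energies, dtype=float)
    return np.floor(total_bits * energies / energies.sum()).astype(int)

def quantize_phase(phase, bits):
    """Uniformly quantize a phase (radians) over one full turn using `bits` bits."""
    levels = 1 << int(bits)
    wrapped = (phase + np.pi) % (2 * np.pi)            # map into [0, 2*pi)
    return min(int(wrapped / (2 * np.pi) * levels), levels - 1)

def dequantize_phase(index, bits):
    """Return the centre of the quantization cell, in [-pi, pi)."""
    levels = 1 << int(bits)
    return (index + 0.5) * 2 * np.pi / levels - np.pi
```

With b bits the worst-case phase error is pi / 2^b, so a sinusoid that gets 0 bits is simply reconstructed with phase 0.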
Quantization of indices
Quantization of energy gains in critical bands
Excitation similarity weighting
The idea behind the ESW technique is to select sinusoids such that each new sinusoid added will provide a maximum incremental gain in matching between the auditory excitation pattern associated with the original signal and the auditory excitation pattern associated with the modeled signal. In order to accomplish this goal, an iterative process is proposed in which each sinusoid extracted during conventional analysis is assigned an excitation similarity weight. During each iteration, the sinusoid having the largest weight is added to the modeled representation. New sinusoids are accumulated until some constraint is exhausted, for example, a bit budget. The algorithm tends to converge as the number of modeled sinusoids increases.
 Not sure I understand here. Any reference? Jmspeex 05:45, 28 June 2006 (PDT)
Trajectory tracking
Once the meaningful sinusoidal peaks and their parameters have been estimated, the peaks are linked into inter-frame trajectories. At each frame, a peak continuation algorithm tries to connect each sinusoidal peak to the trajectories that already exist at the previous frame, resulting in smooth curves of frequencies and amplitudes. The continuation was tested with two algorithms: the traditional one, which uses only the parameters of the sinusoids to obtain smooth trajectories, and one original method, which synthesizes the possible continuations inside certain deviation limits and compares them to the original signal. There are also other systems which use more advanced methods, for example Hidden Markov Models, to track the trajectories. Sinusoidal trajectories contain all the information needed for the reconstruction of the harmonic parts of input signals: amplitudes, frequencies and phases of each trajectory at each frame. To avoid discontinuities at frame boundaries, the amplitudes, frequencies and phases are interpolated from frame to frame.
 Amplitudes are linearly interpolated
 Phase interpolated with cubic polynomials
 Any reference? Jmspeex 05:45, 28 June 2006 (PDT)
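A minimal sketch of the two interpolation rules for a single trajectory between two frames. The cubic phase form used here is the one usually attributed to McAulay and Quatieri's sinusoidal model; whether that is exactly the scheme the text above has in mind is an assumption, as are the function and parameter names. Frequencies are in radians per sample and `hop` is the frame spacing in samples.

```python
import numpy as np

def synthesize_segment(a0, a1, w0, w1, p0, p1, hop):
    """Synthesize `hop` samples of one sinusoid between two analysis frames.

    (a0, w0, p0) and (a1, w1, p1) are amplitude, frequency and phase
    measured at the start and end of the segment.
    """
    t = np.arange(hop)
    amp = a0 + (a1 - a0) * t / hop                      # linear amplitude interpolation
    # Pick the 2*pi multiple that makes the phase track as smooth as possible.
    m = np.round(((p0 + w0 * hop - p1) + (w1 - w0) * hop / 2) / (2 * np.pi))
    d = p1 - p0 - w0 * hop + 2 * np.pi * m
    alpha = 3 * d / hop**2 - (w1 - w0) / hop
    beta = -2 * d / hop**3 + (w1 - w0) / hop**2
    # Cubic phase: matches phase and frequency at both ends of the segment.
    phase = p0 + w0 * t + alpha * t**2 + beta * t**3
    return amp * np.cos(phase)
```

For a stationary sinusoid (same frequency at both frames and a measured end phase consistent with it) the cubic terms vanish and the segment reduces to a plain cosine.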