# Ghost


## Latest revision as of 00:00, 13 November 2013

This page is meant to track ideas about low-delay, high-quality audio coding. The work has just started, so don't expect anything in the near future (or at all for that matter).

## Signal types

There are many signal types that can be found:

- Sinusoids
  - A few pure (or nearly pure) tones
- Harmonic
  - Periodic waveforms (e.g. voice)
  - Many (sometimes closely spaced) harmonics
- Shaped noise
  - Signals that are (or are indistinguishable from) filtered (coloured) white noise
- Transients
  - Whatever doesn't fit above I guess

## Signal analysis

### Sinusoidal

Good when most of the energy is contained in a few sinusoids. May be problematic for very harmonic signals, e.g. a male voice may have close to a hundred harmonics in the full audio band.

### Pitch

Good for harmonic signals. Hard to estimate and code when extra sinusoids and noise are present. At 48 kHz, there is no need for fractional pitch or anything like that, but sub-band pitch analysis or multi-tap gain is a good idea. Also, there needs to be a way to remove the effect of sinusoids and noise. Even then, removing the "noise" also means removing all excitation to the pitch predictor, so that's a problem.

### MDCT

Very general. Can code anything, but not very good at anything. High delay (2x frame size). Could put several "MDCT frames" in each codec frame to make latency smaller.
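To put rough numbers on the delay remark, here is a small sketch. It assumes 50%-overlap MDCT windows (window length twice the hop), that a whole codec frame is buffered before it is sent, and that the only extra look-ahead is one MDCT hop. The 480-sample codec frame and the function name `mdct_delay_ms` are illustrative choices, not something specified on this page.

```python
def mdct_delay_ms(codec_frame, mdcts_per_frame, sample_rate=48000):
    """Algorithmic delay in ms, assuming 50%-overlap MDCT windows of length 2*hop."""
    if codec_frame % mdcts_per_frame:
        raise ValueError("codec frame must split evenly into MDCT frames")
    hop = codec_frame // mdcts_per_frame
    # One codec frame of buffering plus one hop of look-ahead for the overlap.
    delay_samples = codec_frame + hop
    return 1000.0 * delay_samples / sample_rate


# Hypothetical 10 ms codec frame at 48 kHz, split into 1, 2 or 4 MDCT frames.
delays = {k: mdct_delay_ms(480, k) for k in (1, 2, 4)}
```

Under these assumptions, a single MDCT per 480-sample frame gives 20 ms (the "2x frame size" case), while four MDCT frames per codec frame give 12.5 ms.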

### Wavelets

Just a fancy name for sub-bands with non-uniform width. Probably similar to having an MDCT with few sub-bands, except that the sub-bands could follow (roughly) the critical bands.

### LPC + stochastic cb

Like CELP with no pitch. Could be used to code the noisy part of the signal with low bit-rate. Would need to figure out how to preserve the energy of the noise when going with 1/2 bit per sample and less.

## Codec Structure Ideas

### Sinusoidal + wavelet

- Preemphasis
- Extract as many sinusoids as possible
- Wavelet transform
- Code wavelet coefs using VQ

### Sinusoidal, pitch and noise

- Preemphasis
- Joint pitch + sinusoidal estimation
- LPC analysis
- CELP-like coding of the residual (mainly noise)

## Estimation Ideas

### Sinusoid Estimation

Very hard to do properly, especially with reasonable complexity and low delay. Some ideas:

#### Least-square type matching

Step one: estimate sinusoid frequencies.

Tried so far:

- MUSIC: fails on non-trivial signals and is very complex, although there's an AES paper that recommends first whitening the noise part of the signal before applying the algorithm. Haven't tried that so far.
- ESPRIT: fails on non-trivial signals and is very complex (see above for a possible solution).
- LPC: would probably work, but requires an insane order (impractical), plus it tends to be numerically unstable anyway.
- FFT: poor resolution, but that's all we have left so far. There's an AES paper that describes a sort of time-domain phase unwrapping that could help.

Step two: what to match

Step three: solving

Looks like it's possible to solve an $N \times M$ least-squares problem in $O(NM)$ time using an iterative algorithm as long as the system matrix is near-orthogonal. If we want to solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ and $\mathbf{A}^H \mathbf{A} \approx \mathbf{I}$, then we start with $\mathbf{x}^{(0)} = \mathbf{A}^H \mathbf{b}$ and then iterate:

$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{A}^H \left( \mathbf{b} - \mathbf{A}\mathbf{x}^{(k)} \right)$$
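A minimal sketch of this iteration in plain Python, to make the cost per step visible (two matrix-vector products, so O(N*M) each). It is real-valued, so the conjugate transpose is just the transpose. The test matrix (unit-norm cosine/sine columns at three arbitrary frequencies over 256 samples), the iteration count, and all function names are assumptions chosen for illustration.

```python
import math


def matvec(A, x):
    """Compute A @ x for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T @ r (equal to A^H @ r for a real matrix)."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def iterative_lstsq(A, b, iterations=20):
    """Iterate x <- x + A^T (b - A x), starting from x = A^T b."""
    x = rmatvec(A, b)
    for _ in range(iterations):
        residual = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        x = [xj + cj for xj, cj in zip(x, rmatvec(A, residual))]
    return x


def sinusoid_basis(freqs, n_samples):
    """Columns are cos/sin pairs scaled to roughly unit norm (near-orthogonal)."""
    scale = math.sqrt(2.0 / n_samples)
    rows = []
    for n in range(n_samples):
        row = []
        for f in freqs:
            row.append(scale * math.cos(2.0 * math.pi * f * n))
            row.append(scale * math.sin(2.0 * math.pi * f * n))
        rows.append(row)
    return rows


# Hypothetical frequencies in cycles per sample, well separated on purpose.
A = sinusoid_basis([0.0513, 0.1207, 0.2331], 256)
x_true = [1.0, -0.5, 0.3, 0.8, -0.2, 0.1]
b = matvec(A, x_true)
x_est = iterative_lstsq(A, b)
max_err = max(abs(t - e) for t, e in zip(x_true, x_est))
```

The error shrinks by a factor governed by how far $\mathbf{A}^H\mathbf{A}$ is from the identity, so closely spaced frequencies (columns that are far from orthogonal) would need more iterations or may not converge at all.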

#### Phase lock loop (PLL)

## Quantization Ideas

After the sinusoids have been extracted, they have to be quantized. Possible approaches:

- Sort the sinusoids according to energy and transmit only a finite number of them, or only the ones with a specific energy or above. The indices of the sinusoids before rearranging will have to be sent.
  - I think it's worth checking which is most efficient. Sorting the sinusoids will help quantizing the amplitude, but make it harder to encode frequency. Jmspeex 05:45, 28 June 2006 (PDT)
- Use psychoacoustic properties and remove all the sinusoids that will be masked by other tones.
  - Of course, we don't want to encode perceptually irrelevant sinusoids. Actually, we want the resolution (in amplitude, phase and probably frequency) to scale with the amplitude-to-mask ratio or something like that. Jmspeex 05:45, 28 June 2006 (PDT)
- After removing perceptually irrelevant and low-energy tones, the energy in each critical band has to be adjusted to match the initial energy.
  - Possibly -- I don't know much on that topic. Monty probably has valuable experience. Jmspeex 05:45, 28 June 2006 (PDT)
- Time-differential coding of sinusoids across frames can be used.
  - Definitely. This is very important if we plan on using short frames. It would be important to minimize inter-frame redundancy, but still make it possible to recover from packet loss. For that, we could either use a leaky predictor (like the pitch in CELP) or use key-frames (like a video codec). Jmspeex 05:45, 28 June 2006 (PDT)
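The first idea in the list (sort by energy, keep a subset, remember the original indices) could look roughly like this. The function name `select_sinusoids`, the `(frequency, amplitude)` tuple layout, and the use of squared amplitude as "energy" are assumptions made for the sketch.

```python
def select_sinusoids(sinusoids, max_count=None, min_energy=0.0):
    """Keep the strongest sinusoids.

    sinusoids: list of (frequency_hz, amplitude) in frequency order.
    Returns (original_index, frequency_hz, amplitude) tuples sorted by
    decreasing energy, so the decoder can undo the reordering.
    """
    kept = [(i, f, a) for i, (f, a) in enumerate(sinusoids) if a * a >= min_energy]
    kept.sort(key=lambda item: item[2] ** 2, reverse=True)
    if max_count is not None:
        kept = kept[:max_count]
    return kept


# Hypothetical analysis result: four partials of a 220 Hz tone.
peaks = [(220.0, 0.9), (440.0, 0.1), (660.0, 0.5), (880.0, 0.05)]
strongest = select_sinusoids(peaks, max_count=2)
```

This makes the trade-off raised in the reply concrete: amplitudes come out in decreasing order (easy to code), but the frequencies are no longer monotonic and the original indices become side information.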

#### Quantization of frequencies

- Quantize frequencies of a few selected sinusoids and recreate other values using interpolation.
  - How would you do that? (maybe I'm not following here) Jmspeex 05:45, 28 June 2006 (PDT)

#### Quantization of Amplitudes

- Model the energy curve of the sinusoids – for instance using an exponential curve.
  - Exponential decay might be a good way to do inter-frame prediction. Jmspeex 05:45, 28 June 2006 (PDT)
- Quantize amplitudes of a few selected sinusoids and recreate other values using interpolation.
  - Possibly, but probably not at first (hard problem). Jmspeex 05:45, 28 June 2006 (PDT)
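As an illustration of inter-frame prediction for amplitudes, here is a sketch of a leaky predictor of the kind mentioned earlier for packet-loss recovery. It works on amplitudes in dB (so an exponential decay becomes a straight line). The leak factor 0.8, the 1 dB step, the example values, and the concealment rule (use the prediction alone when a frame is lost) are all assumptions.

```python
def encode(values_db, leak=0.8, step=1.0):
    """Quantize the residual against a leaky prediction of the previous frame."""
    state, codes = 0.0, []
    for v in values_db:
        pred = leak * state
        q = round((v - pred) / step)
        codes.append(q)
        state = pred + q * step  # track what the decoder will reconstruct
    return codes


def decode(codes, leak=0.8, step=1.0, lost=()):
    """Reconstruct; for frames listed in `lost`, use the prediction alone."""
    state, out = 0.0, []
    for i, q in enumerate(codes):
        pred = leak * state
        state = pred if i in lost else pred + q * step
        out.append(state)
    return out


# Hypothetical decaying partial, amplitude in dB per frame.
values = [40.0, 38.0, 36.5, 34.0, 33.0, 31.0, 30.0, 28.5, 27.0, 26.0, 24.5, 23.0]
codes = encode(values)
clean = decode(codes)
lossy = decode(codes, lost={3})
drift = [abs(a - b) for a, b in zip(lossy, clean)]
```

Because the prediction is scaled by `leak`, the mismatch caused by the lost frame shrinks by that factor every frame instead of persisting forever, at the cost of somewhat larger residuals than a non-leaky predictor would give.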

#### Quantization of phase and modulation parameters

- Can be scalar quantized, with the number of bits allocated being proportional to the energy of the sinusoid.
  - Yes. Also, this is something that can be predicted very well across frames. It's not even necessary to make that one robust to losses, because as long as the phase is continuous, no one will notice. Jmspeex 05:45, 28 June 2006 (PDT)
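A small sketch of the scalar phase quantizer suggested above. The allocation rule (each sinusoid gets a share of a fixed bit budget proportional to its share of the total energy, rounded down) is only one possible reading of "proportional to the energy"; the names and the 12-bit budget are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def allocate_bits(energies, bit_budget):
    """Give each sinusoid a share of the budget proportional to its energy."""
    total = sum(energies)
    if total <= 0.0:
        return [0] * len(energies)
    return [int(bit_budget * e / total) for e in energies]


def quantize_phase(phase, bits):
    """Uniform quantizer on [0, 2*pi) with 2**bits levels, wrapping around."""
    levels = 1 << bits
    return int(round((phase % TWO_PI) / TWO_PI * levels)) % levels


def dequantize_phase(index, bits):
    return index * TWO_PI / (1 << bits)


def circular_error(a, b):
    """Smallest angular distance between two phases."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


energies = [0.81, 0.25, 0.01]
bits = allocate_bits(energies, bit_budget=12)
phases = [1.0, -2.5, 6.0]
decoded = [dequantize_phase(quantize_phase(p, b), b) for p, b in zip(phases, bits)]
```

With `b` bits the phase error is at most π / 2^b, so the weakest sinusoid here (0 bits) is effectively given an arbitrary phase.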

#### Quantization of indices

#### Quantization of energy gains in critical bands

### Excitation similarity weighting

The idea behind the ESW technique is to select sinusoids such that each new sinusoid added provides the maximum incremental gain in matching between the auditory excitation pattern associated with the original signal and the auditory excitation pattern associated with the modeled signal. To accomplish this, an iterative process is proposed in which each sinusoid extracted during conventional analysis is assigned an excitation similarity weight. During each iteration, the sinusoid having the largest weight is added to the modeled representation. New sinusoids are accumulated until some constraint is exhausted, for example a bit budget. The algorithm tends to converge as the number of modeled sinusoids increases.

-- Not sure I understand here. Any reference? Jmspeex 05:45, 28 June 2006 (PDT)
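Read literally, the description amounts to a greedy selection loop. The sketch below only illustrates that loop: the excitation pattern model (a fixed 10 dB-per-band spreading over 24 bands), the log-domain distance, and the fixed cost of 20 bits per sinusoid are all stand-in assumptions, not the actual ESW definitions.

```python
import math


def excitation_pattern(sinusoids, n_bands=24, slope_db=10.0):
    """Crude stand-in: each sinusoid spreads its energy across nearby bands."""
    pattern = [1e-9] * n_bands  # small floor keeps the logarithm finite
    for band_pos, energy in sinusoids:
        for b in range(n_bands):
            atten_db = slope_db * abs(b - band_pos)
            pattern[b] += energy * 10.0 ** (-atten_db / 10.0)
    return pattern


def pattern_distance(p, q):
    """Sum of absolute log-domain differences between two patterns."""
    return sum(abs(math.log10(x) - math.log10(y)) for x, y in zip(p, q))


def esw_select(sinusoids, bit_budget, bits_per_sinusoid=20):
    """Greedily add the sinusoid that most improves the pattern match."""
    target = excitation_pattern(sinusoids)
    chosen, remaining = [], list(range(len(sinusoids)))
    current = pattern_distance(target, excitation_pattern([]))
    while remaining and (len(chosen) + 1) * bits_per_sinusoid <= bit_budget:
        best_i, best_weight = None, None
        for i in remaining:
            trial = [sinusoids[j] for j in chosen + [i]]
            # Weight = how much closer the modeled pattern gets to the target.
            weight = current - pattern_distance(target, excitation_pattern(trial))
            if best_weight is None or weight > best_weight:
                best_i, best_weight = i, weight
        chosen.append(best_i)
        remaining.remove(best_i)
        current -= best_weight
    return chosen


# Hypothetical (band position, energy) pairs; the second tone sits next to
# a much stronger one and contributes little to the excitation pattern.
tones = [(3.0, 1.0), (3.5, 0.01), (12.0, 0.5), (20.0, 0.2)]
picked = esw_select(tones, bit_budget=40)
```

In this toy setup the weak tone next to the strong one is picked last, which is the kind of masking-aware ordering the technique seems to be after.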

### Trajectory tracking

Once the meaningful sinusoidal peaks and their parameters have been estimated, the peaks are linked into inter-frame trajectories. At each frame, a peak continuation algorithm tries to connect each sinusoidal peak to the trajectories already existing at the previous frame, resulting in smooth curves of frequencies and amplitudes. The continuation was tested with two algorithms: the traditional one, which uses only the parameters of the sinusoids to obtain smooth trajectories, and an original method, which synthesizes the possible continuations within certain deviation limits and compares them to the original signal. There are also other systems that use more advanced methods, for example Hidden Markov Models, to track the trajectories.

Sinusoidal trajectories contain all the information needed for the reconstruction of the harmonic parts of input signals: amplitudes, frequencies and phases of each trajectory at each frame. To avoid discontinuities at frame boundaries, the amplitudes, frequencies and phases are interpolated from frame to frame.

- Amplitudes are linearly interpolated
- Phases are interpolated with cubic polynomials

-- Any reference? Jmspeex 05:45, 28 June 2006 (PDT)
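For illustration, one common way to do this interpolation in the sinusoidal modelling literature is the cubic phase polynomial usually associated with McAulay-Quatieri style synthesis; whether that is what the text above had in mind is an assumption. The sketch synthesizes one frame of one trajectory, with frequencies in radians per sample and hypothetical function names.

```python
import math


def cubic_phase_coeffs(theta0, omega0, theta1, omega1, T):
    """Return (alpha, beta) for theta(n) = theta0 + omega0*n + alpha*n^2 + beta*n^3.

    The polynomial matches phase and frequency at both frame boundaries
    (the end phase up to a multiple of 2*pi).
    """
    # Pick the 2*pi multiple M that gives the smoothest phase track.
    M = round(((theta0 + omega0 * T - theta1) + (omega1 - omega0) * T / 2.0)
              / (2.0 * math.pi))
    D = theta1 - theta0 - omega0 * T + 2.0 * math.pi * M
    alpha = 3.0 * D / T**2 - (omega1 - omega0) / T
    beta = -2.0 * D / T**3 + (omega1 - omega0) / T**2
    return alpha, beta


def synth_frame(a0, a1, theta0, omega0, theta1, omega1, T):
    """Synthesize T samples of one trajectory between two frame boundaries."""
    alpha, beta = cubic_phase_coeffs(theta0, omega0, theta1, omega1, T)
    out = []
    for n in range(T):
        amp = a0 + (a1 - a0) * n / T  # linear amplitude interpolation
        phase = theta0 + omega0 * n + alpha * n**2 + beta * n**3
        out.append(amp * math.cos(phase))
    return out


# Hypothetical trajectory segment: 64-sample frame, slightly rising frequency.
frame = synth_frame(1.0, 0.5, 0.3, 0.20, 1.1, 0.22, 64)
```

Because both phase and frequency are matched at the frame boundaries, consecutive frames join without discontinuities in the waveform or its instantaneous frequency.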