This is the demonstration page of the paper "Audio Decoding by Inverse Problem Solving" with the samples used for the MUSHRA-style listening tests.
We consider audio decoding as an inverse problem and solve it through diffusion posterior sampling. Explicit conditioning functions are developed for input signal measurements provided by an example of a transform domain perceptual audio codec. Viability is demonstrated by evaluating arbitrary pairings of a set of bitrates and task-agnostic prior models. For instance, we observe significant improvements on piano while maintaining speech performance when a speech model is replaced by a joint model trained on both speech and piano. With a more general music model, improved decoding compared to legacy methods is obtained for a broad range of content types and bitrates. The noisy mean model, underlying the proposed derivation of conditioning, enables a significant reduction of gradient evaluations for diffusion posterior sampling, compared to methods based on Tweedie's mean. Combining Tweedie's mean with our conditioning functions improves the objective performance.
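For orientation, the two terms "diffusion posterior sampling" and "Tweedie's mean" used above can be written out as follows. This is a standard textbook sketch, not the paper's own derivation. The symbols $x_0$ (clean signal), $x_t$ (noisy diffusion state), $y$ (codec measurements), and $\sigma_t$ (noise level), as well as the additive noise model $x_t = x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, are assumptions made here for illustration.

```latex
\begin{align}
\nabla_{x_t} \log p_t(x_t \mid y)
  &= \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t), \\
\hat{x}_0(x_t) = \mathbb{E}[x_0 \mid x_t]
  &= x_t + \sigma_t^2 \, \nabla_{x_t} \log p_t(x_t).
\end{align}
```

The first line is Bayes' rule in score form: the prior score comes from the diffusion model, and the second term is the conditioning on the measurements. The second line is Tweedie's mean. Methods that evaluate the measurement likelihood at $\hat{x}_0(x_t)$ generally have to differentiate through the denoiser at each sampling step, which is the kind of gradient cost the comparison above refers to.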
Pedro J. Villasana T., Lars Villemoes, Janusz Klejsa, Per Hedelin (2024). Audio Decoding by Inverse Problem Solving.
We evaluate the proposed methodology on Speech (using the VCTK dataset [1]), Piano (using the Supra dataset [2]), and Critical (using the ODAQ dataset [3]) audio signals.
Two MUSHRA-like listening tests were conducted to compare the legacy decoding method (DEC) against the proposed diffusion-based decoding algorithm (INV). The audio signals of both tests were encoded at 16 kb/s. Also included in the tests were a hidden reference and a 3.5 kHz lowpass anchor (LP35).
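As an illustration of what a lowpass anchor such as LP35 is, the sketch below builds a 3.5 kHz lowpass version of a signal with a Hamming-windowed sinc FIR filter, using only the Python standard library. The filter design, the tap count, the 16 kHz sample rate in the usage note, and the function names (`lowpass_fir`, `apply_fir`, `make_lp35_anchor`, `rms`) are all assumptions chosen for this example; the page does not state how the anchor was actually produced.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Design a Hamming-windowed sinc lowpass FIR filter (num_taps must be odd)."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd for a symmetric filter")
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal lowpass impulse response (sinc), with the k == 0 limit handled.
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC


def apply_fir(signal, taps):
    """Convolve signal with taps, compensating the filter delay ("same" length)."""
    mid = (len(taps) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + mid - j
            if 0 <= idx < len(signal):
                acc += t * signal[idx]
        out.append(acc)
    return out


def make_lp35_anchor(signal, sample_rate_hz):
    """Return a 3.5 kHz lowpass version of signal (hypothetical anchor recipe)."""
    return apply_fir(signal, lowpass_fir(3500.0, sample_rate_hz))


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))
```

With an assumed 16 kHz sample rate, a 1 kHz tone passes through `make_lp35_anchor` almost unchanged, while a 6 kHz tone is strongly attenuated. This loss of high-frequency content is what makes such an anchor a clearly degraded reference point in the test.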
The items listed below were never seen during training.
This section contains the 9 items used in the first MUSHRA-style listening test.
Item | REF | DEC | INVspeech | INVjoint | INVmusic | LP35 |
---|---|---|---|---|---|---|
Speech 1 | ||||||
Speech 2 | ||||||
Speech 3 | ||||||
Piano 1 | ||||||
Piano 2 | ||||||
Piano 3 | ||||||
Critical 1 | ||||||
Critical 2 | ||||||
Critical 3 | ||||||
This section contains the 10 items used in the second MUSHRA-style listening test.
Item | REF | DEC | INVmusic | AAC | Opus | LP35 |
---|---|---|---|---|---|---|
Item 1 | ||||||
Item 2 | ||||||
Item 3 | ||||||
Item 4 | ||||||
Item 5 | ||||||
Item 6 | ||||||
Item 7 | ||||||
Item 8 | ||||||
Item 9 | ||||||
Item 10 | ||||||
[1] Junichi Yamagishi, Christophe Veaux, Kirsten MacDonald, et al., "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2019.
[2] Zhengshan Shi, Craig Sapp, Kumaran Arul, Jerry McBride, and Julius O Smith III, "SUPRA: Digitizing the Stanford University Piano Roll Archive," in ISMIR, 2019, pp. 517–523.
[3] Matteo Torcoli, Chih-Wei Wu, Sascha Dick, Phillip A. Williams, Mhd Modar Halimeh, William Wolcott, and Emanuël A. P. Habets, "ODAQ: Open Dataset of Audio Quality," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 836–840.