Discussion:
[Freetel-codec2] Using Deep Learning to Reconstruct High-Resolution Audio
Ricardo Andere de Mello
2017-06-23 23:11:14 UTC
Permalink
Hi,

Recently I have been working with deep learning, mainly focused on
image recognition.

I found one article related to audio, and I thought you would find it
interesting:

https://blog.insightdatascience.com/using-deep-learning-to-reconstruct-high-resolution-audio-29deee8b7ccd

I was wondering if it would be possible to use codec2 frames as inputs for
training.

[]s, Ricardo Mello
Tomas Härdin
2017-06-26 09:12:18 UTC
Permalink
Hi

Looks like this just extrapolates the upper part of the spectrum, and
not super well judging by those images. codec2 already tries to code
the harmonic content via vector quantization, so it isn't really low-
samplerate in the same way. In fact, I think David is working on
getting 16+ ksps working. If done right then the decoder can "invent"
contents for the 4-8 kHz band even if the original signal was 0-4 kHz
(8 ksps).

I'm somewhat skeptical of applying multi-layer neural networks ("deep
learning") to everything, since you don't really understand what the
hell it's doing. Plus there's always the risk of overfitting. But this
doesn't mean you couldn't do good work with them, especially if the
input has been appropriately preprocessed. Something like an
autoencoder[1] could be useful as part of a whole class of codecs. I
experimented with this in the early 2000s. Another idea is to do some
kind of temporal thing; codec2 has almost no memory at the moment I
think.

There's a corpus of speech samples used to train the vector quantizer,
perhaps you could experiment on that? Turn that inspiration into
perspiration!

/Tomas

[1] https://en.wikipedia.org/wiki/Autoencoder
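[Editor's note] To make the autoencoder idea concrete, here is a toy sketch (a hypothetical illustration, not codec2 code): a linear autoencoder with a 2-unit bottleneck, trained by SGD on 4-dimensional "spectral" frames that have only 2 underlying degrees of freedom, the kind of redundancy across frequency that a codec could exploit.

```python
import random

random.seed(1)

# Toy "spectral" frames: 4 bins driven by only 2 underlying factors,
# mimicking redundancy across frequency that a codec can exploit.
def frame():
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    return [a, 0.9 * a, b, 0.9 * b]

data = [frame() for _ in range(200)]
n_in, n_hid, lr = 4, 2, 0.05

W_enc = [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hid)]
W_dec = [[random.gauss(0, 0.1) for _ in range(n_hid)] for _ in range(n_in)]

def forward(x):
    # encode 4 -> 2, decode 2 -> 4 (linear, no biases)
    h = [sum(W_enc[j][i] * x[i] for i in range(n_in)) for j in range(n_hid)]
    y = [sum(W_dec[i][j] * h[j] for j in range(n_hid)) for i in range(n_in)]
    return h, y

def avg_loss():
    total = 0.0
    for x in data:
        _, y = forward(x)
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / n_in
    return total / len(data)

before = avg_loss()
for epoch in range(50):
    for x in data:
        h, y = forward(x)
        err = [2.0 * (y[i] - x[i]) / n_in for i in range(n_in)]
        # encoder gradient via chain rule, using pre-update decoder weights
        g = [sum(err[i] * W_dec[i][j] for i in range(n_in)) for j in range(n_hid)]
        for i in range(n_in):
            for j in range(n_hid):
                W_dec[i][j] -= lr * err[i] * h[j]
        for j in range(n_hid):
            for i in range(n_in):
                W_enc[j][i] -= lr * g[j] * x[i]
after = avg_loss()
```

Because the data really is 2-dimensional, the 2-unit bottleneck can reconstruct it almost perfectly; the interesting codec question is what survives when the bottleneck is smaller than the true dimensionality.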
Post by Ricardo Andere de Mello
Hi,
Recently I have been working with deep learning, but mainly focused
in image recognition.
 
I found one article related to audio, and I thought you would find it
https://blog.insightdatascience.com/using-deep-learning-to-reconstruct-high-resolution-audio-29deee8b7ccd
I was wondering if it would be possible to use codec2 frames as
inputs for training.
[]s, Ricardo Mello
David Rowe
2017-06-26 21:18:34 UTC
Permalink
Thanks Ricardo,

The demo was interesting, some high frequency reconstruction. I wonder
if they mean "kHz" in the article rather than "kbps" when they referred
to down-sampling. The banner at the bottom made reading the article
really annoying.

As it happens I am playing with some similar ideas right now. Speech
does have some correlation across the frequency axis. So if you know
one part of the spectrum of a frame of speech, you also have some
information about the other part.

This correlation can be found using neural nets, or vector quantisation.
Or we can explicitly remove the correlation using techniques like a
DCT, leaving us much less information to send over the channel.
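[Editor's note] The DCT idea can be sketched in a few lines of Python (a toy illustration, not codec2 code): take the DCT of a smooth log-spectrum and observe that nearly all the energy lands in the first few coefficients, which is what "explicitly removing the correlation" buys you.

```python
import math

def dct2(x):
    """Orthonormal DCT-II; decorrelates smooth (correlated) spectra."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
        out.append((math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)) * s)
    return out

# Toy smooth log-spectrum (dB): neighbouring bins are highly correlated.
spec = [20.0 - 1.5 * i + 4.0 * math.cos(0.5 * i) for i in range(16)]
coeffs = dct2(spec)

# Energy compaction: most of the energy ends up in the first few
# coefficients, so fewer numbers need to be sent over the channel.
total = sum(c * c for c in coeffs)
low = sum(c * c for c in coeffs[:6])
```

The orthonormal scaling means total energy is preserved (Parseval), so the fraction of energy in the leading coefficients directly measures how much of the spectrum survives truncation.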

Cheers,

David
Post by Ricardo Andere de Mello
Hi,
Recently I have been working with deep learning, but mainly focused in
image recognition.
I found one article related to audio, and I thought you would find it
https://blog.insightdatascience.com/using-deep-learning-to-reconstruct-high-resolution-audio-29deee8b7ccd
I was wondering if it would be possible to use codec2 frames as inputs
for training.
[]s, Ricardo Mello
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Freetel-codec2 mailing list
https://lists.sourceforge.net/lists/listinfo/freetel-codec2
glen english
2017-06-26 21:28:37 UTC
Permalink
That demo IS interesting!

The HF extension is a bit like the bandwidth extension used in MPEG-4
AAC-Plus, but the AAC version is, I understand, quite basic in its
method.

I've only ever heard it on music. It's quite convincing, at least on a
small speaker in a non-critical listening environment.

There is a huge intelligibility improvement getting speech from 3 kHz
out to 4.5 kHz, a bit more to 5 kHz; I think diminishing returns set in
after 5.5 kHz.

But rather than synthesising a high end, I would think encoding the HF/
sibilant information would be more rewarding. We only want out to ~
5.5 kHz. Maybe just noise on/noise off, encoded with a couple of bits
of amplitude.

-glen
Post by David Rowe
Thanks Ricardo,
The demo was interesting, some high frequency reconstruction. I
wonder if they mean "kHz" in the article rather than "kbps" when they
referred to down sampling. The banner at the bottom made reading the
article really annoying.
As it happens I playing with some similar ideas right now. Speech
does have some correlation across the frequency axis. So if you know
one p
David Rowe
2017-06-26 22:24:36 UTC
Permalink
As Tomas suggests, I have been working on a wideband mode:

http://www.rowetel.com/?p=5711

I'm currently using a 2D DCT approach (a bit like JPEG) to code a
version of the spectrum. VQ could also be used.

We don't really need to regenerate the upper 4 kHz as it doesn't take
many bits to encode it faithfully (perhaps 20% more than the first
4 kHz). The log(f) response of the ear means there isn't much info there
we can actually perceive.
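[Editor's note] For readers following along, here is a toy version of the 2D DCT idea (an illustrative sketch, not the actual codec2 wideband code) applied to a smooth 8x8 time-frequency block.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a sequence."""
    N = len(x)
    return [(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)) *
            sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
            for k in range(N)]

def idct(X):
    """Inverse of the orthonormal DCT-II."""
    N = len(X)
    return [sum((math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)) *
                X[k] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for k in range(N))
            for n in range(N)]

def dct2d(block):
    """Separable 2D DCT: rows, then columns (JPEG-style)."""
    rows = [dct(r) for r in block]
    return [list(r) for r in zip(*[dct(list(c)) for c in zip(*rows)])]

def idct2d(block):
    """Inverse 2D DCT: undo the column pass, then the row pass."""
    cols = [idct(list(c)) for c in zip(*block)]
    return [idct(list(r)) for r in zip(*cols)]

# Smooth toy "spectrogram" block: 8 frames x 8 frequency samples (dB).
block = [[10.0 * math.cos(0.3 * t) + 5.0 * math.cos(0.4 * f) - float(f)
          for f in range(8)] for t in range(8)]
C = dct2d(block)
```

As with JPEG, for smooth blocks almost all the energy lands in the top-left (low-order) coefficients, so only those need to be quantised and sent.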

- David
Walter Holmes
2017-06-27 00:26:55 UTC
Permalink
Fascinating stuff there David.

I can't wait to see this progress through the development cycles.

All the best..

Walter/K5WH

-----Original Message-----
From: David Rowe [mailto:***@rowetel.com]
Sent: Monday, June 26, 2017 5:25 PM
To: freetel-***@lists.sourceforge.net
Subject: [Freetel-codec2] Codec 2 wideband

As Tomas suggests, I have been working on a wideband mode:

http://www.rowetel.com/?p=5711

I'm currently using a 2D DCT approach (a bit like JPEG) to code a version of
the spectrum. VQ could also be used.

We don't really need to regenerate the upper 4 kHz as it doesn't take many
bits to encode it faithfully (perhaps 20% more than the first 4 kHz). The
log(f) response of the ear means there isn't much info there we can actually
perceive.

- David

Albert Cahalan
2017-06-28 03:57:48 UTC
Permalink
Post by David Rowe
I'm currently using a 2D DCT approach (bit like JPG) to code a version
of the spectrum. VQ could also be used.
We don't really need to regenerate the upper 4kHz as it doesn't take
many bits to encode it faithfully (perhaps 20% more than the first
4kHz). The log(f) response of the ear means there isn't much info there
we can actually perceive.
You might try a non-linear scaling prior to the 2D DCT.
Squish and stretch it as you would when making a log-scale graph.
That is, so that your spectrum plot shows octaves with equal spacing,
such that you could line it up with a musical keyboard.

A transform of that nature lets you handle the higher frequencies
with an appropriate level of detail.
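[Editor's note] Albert's equal-octave spacing can be written down directly (an illustrative sketch; the frequency range and point count here are arbitrary): with a constant frequency ratio per step, 12 steps per octave lines up with the semitones of a keyboard, and the warped spectrum is just the original sampled at these frequencies.

```python
# Octave-equal grid: a constant ratio between neighbouring points means
# every octave spans the same number of points, like keys on a keyboard.
f_lo, f_hi, n = 55.0, 3520.0, 73          # 6 octaves (A1 to A7), 12 steps/octave
ratio = (f_hi / f_lo) ** (1.0 / (n - 1))  # = 2 ** (1/12), one semitone
grid = [f_lo * ratio ** j for j in range(n)]
```

Sampling the spectrum at `grid` before the 2D DCT gives the high frequencies proportionally fewer samples, matching the ear's roughly logarithmic resolution.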
Tomas
2017-06-28 06:46:10 UTC
Permalink
---- Albert Cahalan wrote ----
Post by Albert Cahalan
Post by David Rowe
I'm currently using a 2D DCT approach (bit like JPG) to code a version
of the spectrum. VQ could also be used.
This sounds a bit like cepstral analysis, which was invented partly to analyze harmonics. I played around with it in my master's thesis.
Post by Albert Cahalan
Post by David Rowe
We don't really need to regenerate the upper 4kHz as it doesn't take
many bits to encode it faithfully (perhaps 20% more than the first
4kHz). The log(f) response of the ear means there isn't much info there
we can actually perceive.
You might try a non-linear scaling prior to the 2D DCT.
Squish and stretch it as you would if making a log-normal graph.
That is, so that your spectrum plot shows octaves with equal spacing,
such that you could line it up with a musical keyboard.
Mel or Bark spacing may be more appropriate. Unfortunately, harmonics won't line up as nicely if you do either of those.
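[Editor's note] For reference, the usual closed forms (O'Shaughnessy's mel formula and Traunmüller's Bark approximation), plus the effect Tomas describes, sketched in Python:

```python
import math

def hz_to_mel(f):
    # O'Shaughnessy's mel formula
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_bark(f):
    # Traunmüller's Bark approximation
    return 26.81 * f / (1960.0 + f) - 0.53

# Equal steps on the mel scale give Hz steps that widen with frequency,
# so equally spaced harmonics (multiples of F0) cannot line up with them.
K = 20
freqs = [mel_to_hz(hz_to_mel(4000.0) * k / (K - 1)) for k in range(K)]
```

The widening spacing of `freqs` toward 4 kHz is exactly why harmonics, which sit at fixed multiples of F0, fall between mel-spaced sample points at high frequencies.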

/Tomas
David Rowe
2017-06-28 21:29:02 UTC
Permalink
Hi Tomas and Albert,

700C works by re-sampling the variable-rate L harmonics (where L varies
frame to frame based on the pitch) to a fixed K=20 vector with
mel-spaced samples. Then we VQ the K-sample vector.

This works OK, but it does introduce some distortion, which I traced to
under-sampling of high-frequency spectral peaks.

It turns out that although the ear is sensitive to log(f), we sometimes
generate sound energy with a narrow bandwidth. That needs to be
captured for higher-quality speech.
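[Editor's note] The re-sampling step can be sketched as follows (a hypothetical toy with made-up grid endpoints, not the real 700C code): linear interpolation of the L harmonic amplitudes onto K=20 mel-spaced points. The under-sampling David mentions shows up whenever a narrow peak falls between grid points.

```python
import math

def mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def resample_harmonics(A, F0, K=20, f_lo=200.0, f_hi=3800.0):
    """Resample L harmonic amplitudes A (at frequencies m*F0, m=1..L)
    onto a fixed K-point mel-spaced grid by linear interpolation."""
    L = len(A)
    mh = [mel(m * F0) for m in range(1, L + 1)]   # mel position of each harmonic
    grid = [mel(f_lo) + (mel(f_hi) - mel(f_lo)) * k / (K - 1) for k in range(K)]
    out = []
    for g in grid:
        if g <= mh[0]:
            out.append(A[0])          # below the first harmonic: hold
        elif g >= mh[-1]:
            out.append(A[-1])         # above the last harmonic: hold
        else:
            i = max(m for m in range(L - 1) if mh[m] <= g)
            t = (g - mh[i]) / (mh[i + 1] - mh[i])
            out.append((1.0 - t) * A[i] + t * A[i + 1])
    return out
```

A flat amplitude envelope survives this resampling exactly, but a peak narrower than the local grid spacing does not, which matches the distortion described above.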

Cheers,

David
Post by Albert Cahalan
Post by David Rowe
I'm currently using a 2D DCT approach (bit like JPG) to code a version
of the spectrum. VQ could also be used.
We don't really need to regenerate the upper 4kHz as it doesn't take
many bits to encode it faithfully (perhaps 20% more than the first
4kHz). The log(f) response of the ear means there isn't much info there
we can actually perceive.
You might try a non-linear scaling prior to the 2D DCT.
Squish and stretch it as you would if making a log-normal graph.
That is, so that your spectrum plot shows octaves with equal spacing,
such that you could line it up with a musical keyboard.
A transform of that nature lets you handle the higher frequencies
with an appropriate level of detail.