Discussion:
[Freetel-codec2] Encoding decoding time TIVA
Maxime Guyon
2016-06-16 17:12:04 UTC
Permalink
Hello,

I've tested Codec2 on a TIVA TM4C129, which is a Texas Instruments
Cortex-M4F clocked at 120MHz.
I checked out the 0.5 tag from SVN and took the configuration of your
STM32 build (running the codec at CODEC2_MODE_1300).
It looks like the FPU is enabled on my device.
I've tested with maximum optimization (O3, and also O4) and with
optimize-for-speed.

Here is my profiling code, based on the "c2demo.c" example available in
the src folder.
The input file is "hts1a.wav".
I only added IO toggling to capture the encoding and decoding times:

    while(fread(buf, sizeof(short), nsam, fin) == (size_t)nsam) {
        /* PF4 is high for the duration of the encode */
        GPIOPinWrite(GPIO_PORTF_BASE, GPIO_PIN_4, GPIO_PIN_4);
        codec2_encode(codec2, bits, buf);
        GPIOPinWrite(GPIO_PORTF_BASE, GPIO_PIN_4, 0);
        /* PF5 is high for the duration of the decode */
        GPIOPinWrite(GPIO_PORTF_BASE, GPIO_PIN_5, GPIO_PIN_5);
        codec2_decode(codec2, buf, bits);
        GPIOPinWrite(GPIO_PORTF_BASE, GPIO_PIN_5, 0);
        fwrite(buf, sizeof(short), nsam, fout);
    }

On my scope I measured how long each IO stays high, and here are my results:

-Encoding time is between 25ms and 42ms
-Decoding time is between 39ms and 56ms

The example reads buffers of 320 samples, so taking the worst case for
each function:

-Encoding: 320 samples are encoded in 42ms, so in 1 second I can encode
7619 samples.
-Decoding: 320 samples are decoded in 56ms, so in 1 second I can decode
5714 samples.

So as you can see, I'm not able to do real-time encoding and decoding, since
I need to process at least 8000 samples per second (8kHz sampling rate).

Do you have any suggestion/hint about this?
What is your encoding and decoding time on your STM32F4 board running at
168MHz?


Regards,

Max
glen english
2016-06-16 21:49:33 UTC
Permalink
------------------------------------------------------------------------------
What NetFlow Analyzer can do for you? Monitors network bandwidth and traffic
patterns at an interface-level. Reveals which users, apps, and protocols are
consuming the most bandwidth. Provides multi-vendor support for NetFlow,
J-Flow, sFlow and other flows. Make informed decisions using capacity planning
reports. http://sdm.link/zohomanageengine
David Rowe
2016-06-16 22:06:25 UTC
Permalink
Hi Glen,

I tried the CMSIS library fft functions a few years back however it went
slower than the standard kiss fft. Perhaps I did something wrong.

Apart from ffts, there isn't a lot of general purpose DSP code in Codec
2 that will benefit from a simple exchange of library function calls.

It is also important that Codec 2 remains cross platform, so I am not
inclined to make platform-specific changes to the source to benefit just
one specific platform.

Better to optimise the vanilla C, or just use a faster chip - MIPs are
cheap these days.

Cheers,

David
Post by glen english
Max, are you using the CMSIS optimized DSP libraries for the code?
(i.e. you will need to modify the stock PC code)
The standard code is PC code. It works in the 160 meg ST just out of
grunt, not programming performance...
Expect substantial improvement.
regards
_______________________________________________
Freetel-codec2 mailing list
https://lists.sourceforge.net/lists/listinfo/freetel-codec2
glen english
2016-06-16 22:25:35 UTC
Permalink
Hi David

From an all-platform POV, agreed.

I must try codec 2 on the M7 soon....
David Rowe
2016-06-16 22:05:38 UTC
Permalink
Hi Maxime,

I haven't noted the exact timing but full duplex encode/decode was
possible on a 168MHz STM32F4 Discovery board. The decoder took a little
longer than the encoder. So 320 samples encoded/decoded in roughly 20ms
each.

So even with a 30% reduction in clock speed, you should be doing a
little better.

In the SM1000 (and other digital radio applications) we often do half
duplex, with the modem taking up the other half.

Cheers,

David
Steve
2016-06-17 00:34:02 UTC
Permalink
Just a wild guess, but maybe define NDEBUG (-DNDEBUG) in the make call,
which will get rid of all the assert() check macros, making it a release
rather than a debug build.

Maybe it will save a few ms. I think the FFT is the real cruncher though.
Maxime Guyon
2016-06-17 08:04:20 UTC
Permalink
Hello,

@Steve: Thank you for the hint to define NDEBUG. I tested it, but the timing
remains approximately the same.

@glen english: Thank you for your hint about CMSIS. I will try to compile
CMSIS for my platform, but it does not seem straightforward since I've never
used it before.

@David Rowe: I saw another post where you do say that you have
already tested the CMSIS FFT without any speed improvement...
-Do you still have the code that replaces kissfft with the CMSIS
functions?
-Can you give me some clues on how to test it and the steps to replace kissfft?
-I saw a post (which I cannot find now) where you say that there is a file for
testing the FFT; maybe it's a good starting point for testing the CMSIS speed
improvement.

Regards,

Max
Maxime Guyon
2016-06-17 11:20:40 UTC
Permalink
Hello,

Some news:

I've compiled the CMSIS library for my target.
I've evaluated the performance of the "sinf" and "cosf" functions and their
CMSIS equivalents "arm_sin_f32" and "arm_cos_f32":

The CMSIS functions are about 15% faster than the standard ones.

I've replaced all calls to "sinf" and "cosf" with "arm_sin_f32" and
"arm_cos_f32" in codec2.
There is no noticeable difference. It seems this is because kissFFT
precalculates its cosine and sine tables when the codec is created.

I've found the test you did with CMSIS for the FFT in the STM32 directory,
"fft_test.c".
I'll try to get it working on my target and report my profiling results.

Regards
glen english
2016-06-17 11:27:12 UTC
Permalink
Maxime Guyon
2016-06-17 13:22:50 UTC
Permalink
I've finished profiling kissFFT against its CMSIS equivalent.

The speed of the KissFFT function "kiss_fft" is compared to that of the
CMSIS function "arm_cfft_radix2_f32".

For the test, I used the same file as David, with 1024 input samples.
I noticed that the processing time depends a little on the data
amplitude, less than the data frequency.
The execution times to transform the 1024 samples are:

With random data:
- KissFFT: 1.2282ms
- CMSIS: 1.2168ms -> about 1% speedup

With all-zero samples:
- KissFFT: 1.2366ms
- CMSIS: 1.2094ms -> about 2.2% speedup

With a sinusoid of amplitude 0.5 at 2500Hz (sampling at 8kHz):
- KissFFT: 1.2281ms
- CMSIS: 1.2167ms -> about 1% speedup

With a sinusoid of amplitude 1 at 2500Hz (sampling at 8kHz):
- KissFFT: 1.2076ms
- CMSIS: 1.2089ms -> about 0.1% slowdown

With a sinusoid of amplitude 2 at 2500Hz (sampling at 8kHz):
- KissFFT: 1.2367ms
- CMSIS: 1.2095ms -> about 2.2% speedup

With a sinusoid of amplitude 3 at 2500Hz (sampling at 8kHz):
- KissFFT: 1.2281ms
- CMSIS: 1.2167ms -> about 1% speedup

To conclude, CMSIS is very slightly faster than kissFFT in my tests (contrary
to David's test results, which found CMSIS slower).
But the gain is very small and will not be sufficient for my target to do
decoding in real time...

I still cannot understand why a 120MHz Cortex-M4 from TI is not capable of
doing the same job as your Cortex-M4 @168MHz from ST, which you say is loaded
at only 50%.

To finish, it would be nice if you could measure the encoding and decoding
times on your target with the same "c2demo.c" file and report the results,
just to be sure that my target should be capable of encoding and decoding in
real time...
Maybe your target is loaded at more than 50%?

Regards,

Max.
Post by glen english
Hi Maxime
nice work. Yes I would expect the FFT twiddle factors to be pre calculated.
I expect a fair bit of work would be required to use the CMSIS functions.
I use them extensively in my DSP and audio processing work, and the filter
primitives are nice and fast.
They are certainly worthwhile using.
David is probably right: a lot of the codec2 code is just general C code.
I bet though that with some work performance would improve (to suit the STM32),
there are always improvements on the edges.
.. or use a bigger processor...
cheers
glen
glen english
2016-06-17 21:57:21 UTC
Permalink
David Rowe
2016-06-17 23:12:15 UTC
Permalink
Hi Glen,
Post by glen english
David, can you provide your compiler and link flags?
http://svn.code.sf.net/p/freetel/code/codec2-dev/stm32/Makefile

- David
Steve
2016-06-18 00:03:01 UTC
Permalink
It would be interesting to try -O2 optimization. There's a lot of
voodoo going on in -O3.
Maxime Guyon
2016-06-18 00:08:16 UTC
Permalink
I compiled everything (CMSIS and the program) with O3, thinking that it was
more efficient than O2.

I will test it with O2, but only on Monday if I have the time.

And I will also provide you the source code if you want and if time permits!

Regards

Max
Steve
2016-06-18 03:25:18 UTC
Permalink
Another algorithm that seems to suck a lot of CPU is
phase_synth_zero_order() in decoding, and really the only things in there are
atan2() and floor() (you've already changed the sin/cos). So maybe
CMSIS has a better version of those two?

I know floor() is really a slow algorithm in gcc.

http://stackoverflow.com/questions/824118/why-is-floor-so-slow
glen english
2016-06-18 07:54:05 UTC
Permalink
RRR
I usually find O3 fractionally faster, but a lot of things break that I
don't expect (bad programming habits?). They don't break in O2. Some
unexpected assumptions are made...
Maxime Guyon
2016-06-20 09:05:01 UTC
Permalink
Hello,

Good news: after some work it seems I've got a working solution.

@Steve, note that CMSIS does not provide functions for atan2() and
floor()...

First I tried using only single-precision operations (no double
operations, because the M4 cannot handle double precision on the
hardware FPU).
In Codec2 it seems that a lot of operations involve doubles, which does not
always seem necessary. For example, M_PI and all the other defines whose
values are not written with the 'f' suffix are treated as double by ANSI C.
So operations using those defines are double operations and are not optimized.
The same happens each time you do an operation with a literal without the
'f' suffix. For example, 0.5*x or 0.5+x is a double operation, which I
changed to "0.5f*x" and "0.5f+x".

After fixing this in the codec, I got a speedup of at least 10% for
decoding!!
I cannot say whether this is a good way to speed things up, or whether you
can live with the loss of precision, but if so, maybe this fix can be made in
your main repository.

After that I tested some other compile options and optimizations (O2 and some
inlining) without success.

Finally I tried switching the floating point mode of the target from
"strict" to "relaxed".
See the definition in the wiki:

Relaxed mode prioritizes speed over strict correctness. In relaxed mode,
the compiler may perform speed optimizations at the expense of reducing the
precision of some calculations, typically a tiny amount. For instance,
(X/3) is not precisely equivalent to (X*(1.0/3)), but in relaxed mode, the
compiler is allowed to make this transformation anyway, as multiplication
is much faster than division.
Changing that gave me a speedup of about 45%!!!!
Here are the encoding and decoding times after all the fixes:

-Encoding time without modification was between 25ms and 42ms. After
modification it is between 18ms and 19ms, so the worst-case time is reduced
by about 55%. My processor will be loaded at 48% for encoding sound at 8kHz.
-Decoding time without modification was between 39ms and 56ms. After
modification it is between 23ms and 27ms, so the worst-case time is reduced
by about 52%. My processor will be loaded at 68% for decoding sound at 8kHz.

I've played back the stream encoded at 1200bps and 1300bps and everything
seems okay: I cannot hear any strong difference between the version encoded
with my modifications and the one without.
I hope this will help other people get it working on their
targets.

Regards,

Max
Bruce Perens
2016-06-20 09:18:08 UTC
Permalink
Many, perhaps most, CPUs have a float type that is slower than double,
because their internal hardware is double-only and they convert float to
double and back to float on every operation. Don't change the main source
to float types. Use macros, typedefs, or compiler switches. Also, David
will probably not be happy if the code is made less readable.
Post by Maxime Guyon
Hello,
Good news, after some work it seem I've got a working solution.
@Steve, note that the CMSIS do not provide function for *atan2()* and
*floor()...*
First I've tested to do only single precision operation (do not allow
double operation because M4 cannot handle double precision operation on the
hardware FPU).
In Codec2 it seem that a lot of operation involve double which seem not
always necessary, for example all the define M_PI and other value which are
not defined whith the suffix 'f' are considered by C ANSI code to be double.
So operation with those define are double operation not optimized.
The same occur each time you do an operation with a litteral without the
'f' suffix. For example (0.5*x or 0.5+x is a double operation which I
change to "0.5f*x" and "0.5f+x").
After fixing this in the Codec, I came to a speedup of at least 10% for
the decoding!!
I cannot say if this is a good hint for speed up and if you can live with
the loss of precision but if yes, maybe this fix can be done in your main
repository
After that I tested some other compile option and optimization (O2 and
some other inlining) without success.
Finally I tested to pass the floating point mode of the target from
"strict" to "relaxed".
Relaxed mode prioritizes speed over strict correctness. In relaxed mode,
the compiler may perform speed optimizations at the expense of reducing the
precision of some calculations, typically a tiny amount. For instance,
(X/3) is not precisely equivalent to (X*(1.0/3)), but in relaxed mode, the
compiler is allowed to make this transformation anyway, as multiplication
is much faster than division.
Changing that provide me a speed up of about 45%!!!!
-Encoding time *without *modification was between *25ms *and* 42ms / *After
modification it is between *: ** 18ms *and *19ms *so a speed up of about
55%. My processor will be loaded at *48%* for encoding sound at 8000Khz.
-Decoding time *without* modification was between *39ms *and *56ms*
* / *After modification it is between *: ** 23ms *and *27ms *so a speed
up of about 52%. My processor will be loaded at *68%* for decoding sound
at 8000Khz.
I've played back the encoded stream at 1200bps and 1300bps and everything
seem okay: I cannot hear any strong difference between the encoded version
with modification and without my modification.
Hope that this will help some other people to get it working on their
target.
Regards,
Max
RRR
I usually find O3 fractionally faster, but a lot of things break that I
don't expect (bad programming habits?). They don't break in O2. Some
unexpected assumptions are made...
Post by Steve
Another algorithm that seems to suck a lot of CPU is
phase_synth_zero_order() in decoding, and really the only thing in
there is atan2() and floor(). (you've already changed the sin/cos). So
maybe the CMSIS has a better version for those two?
I know floor() is really a slow algorithm in gcc.
http://stackoverflow.com/questions/824118/why-is-floor-so-slow
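One common workaround for a slow floor() is to truncate to int and correct for negative non-integers. The sketch below is not from the thread or from CMSIS; the function name is hypothetical, it only works for values that fit in an int, and whether it actually helps phase_synth_zero_order() would need to be measured.

```c
#include <assert.h>
#include <limits.h>

/* Floor of x as an int, for x strictly inside the int range. */
static inline int fast_floor_to_int(float x)
{
    assert(x > (float)INT_MIN && x < (float)INT_MAX);
    int i = (int)x;              /* conversion truncates toward zero */
    return i - (x < (float)i);   /* step down by one if truncation rounded up */
}
```

For example, fast_floor_to_int(-2.7f) truncates to -2 and is then corrected to -3, while fast_floor_to_int(2.7f) returns 2 unchanged.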
glen english
2016-06-20 09:56:50 UTC
Permalink
Steve
2016-06-20 10:06:15 UTC
Permalink
Max, that is interesting as I was under the impression
-fsingle-precision-constant compiler switch handled all that.

Great detective work!
Maxime Guyon
2016-06-20 10:24:35 UTC
Permalink
I agree with glen english on the float part: use double only where it's
necessary...

Also Bruce, it seems that many CPUs which can handle double operations can
also handle simultaneous single-precision float operations, which will not
reduce the speed...

Saving cycles is important on embedded platforms, and even if MIPS are cheap,
energy and power are a different story in embedded systems.
Smaller CPUs often provide the lowest standby power consumption and so a
longer working time on battery-powered devices.

Regards.
Bruce Perens
2016-06-20 10:47:39 UTC
Permalink
If you _really_ want high performance and low power consumption, get to
work on that fixed-point port! See you next year!

I think it would be reasonable to require benchmarks across divergent
platforms with compiler switches and other configuration well documented
before making this sort of change. In the meantime, a script for the patch
program, which can be applied trivially, would be the best way to go. It
would allow you to keep in sync with David's changes easily.

And finally, if you are not porting to a product that already has a TIVA in
it, now is the time to substitute the well-tested and supported STM32F405
or higher. At least on Digikey, the STM doesn't cost more than the TIVA.

Thanks

Bruce
glen english
2016-06-21 05:52:12 UTC
Permalink
On the PC, changing from single to double for large datasets starts to
really hurt because the effective doubling of the dataset space starts
to hurt you on cache misses ...
Not such an issue on non cache single-cycle access SRAM type
architectures. (though a 64 bit fetch from a 32 bit wide SRAM memory is
GOING to cost something).

the M_PI is a good point though.... easily overlooked as doubling
everything....

g
David Rowe
2016-06-21 06:11:25 UTC
Permalink
Post by glen english
the M_PI is a good point though.... easily overlooked as doubling
everything....
Would have thought the compiler would pick that up at compile time.
Really need to see this assumption (and the other optimisations) tested
on other machines ....

- David

David Rowe
2016-06-20 20:58:19 UTC
Permalink
Hello Maxime,

Good work on your optimisation! I would be interested to see if similar
changes produce the same performance gains on the STM32F4 platform.

Would anyone like to try those changes, run some benchmarks, and post
the results?

When I performed optimisation work in the past, I wrote code to compare
the output waveform to the original vanilla C, for example making sure
the output samples are within 1 part in 1000 of each other. It is very
important to make sure the algorithm is not affected, and this is tricky
to do with speech signals.
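A comparison of that kind might look like the sketch below. It is not David's actual test code: the function name is hypothetical, and the one-LSB absolute slack for near-zero samples is an added assumption so that tiny rounding differences on quiet samples do not count as failures.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first sample where test differs from ref by
 * more than rel_tol (e.g. 0.001f for 1 part in 1000), or -1 if all
 * n samples agree. */
static long first_mismatch(const short *ref, const short *test,
                           size_t n, float rel_tol)
{
    assert(ref != NULL && test != NULL);
    for (size_t i = 0; i < n; i++) {
        float diff = (float)ref[i] - (float)test[i];
        float mag = (float)ref[i];
        if (diff < 0.0f) diff = -diff;
        if (mag < 0.0f) mag = -mag;
        /* Assumption: allow 1 LSB of absolute slack on top of rel_tol. */
        if (diff > rel_tol * mag + 1.0f)
            return (long)i;
    }
    return -1;
}
```

Running the vanilla and optimised decoders on the same input and passing both output buffers to first_mismatch() would flag the first frame position where the optimisation changes the waveform beyond the tolerance.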

If anyone is interested in this work please contact me.

Thanks,

David