as the optimised ARM FFT on the M4. I also found no performance
improvement when I tried the optimised ARM FFT a few years ago.
I'm inclined to keep malloc/free in codec 2. Happy to look at patches
scratch.
Danilo - I will get back to you on your other points and proposal shortly.
weeks. The FreeDV 1600 decoder is now around 40% faster on the STM32F4
easily apply and test. He has shortened my TODO list - not made it
longer. Very important for me and a fine example to anyone else who
would like to contribute to Codec 2. Thanks Danilo!
Post by Danilo Beuche
Hi,
Since we finished and removed most of the easy to remove performance
hotspots with the exception of the kiss_fft calls which now contribute a
major part to the overall runtime of modem and decoder, I played around
with kiss_fft vs. arm DSP fft on the STM32F4 with some of the newer libs.
Turns out at least for the real fft (kiss_fftr vs. arm_rfft_fast_f32)
there is no time difference when used in our mcHF code. How does
this relate to the measurements of Glen (he measured better performance)?
I believe this is due to the fact that the arm lib stores some of its
data in precomputed flash arrays. Access to flash is slow (5 wait
states), so this reduces the performance. kiss has all its data in RAM.
Since Glen did initialize the arm DSP tables in RAM, he got speed gains
at the expense of RAM. On an STM32F746 this is not as much of an
issue (384K) as it is on the STM32F4 (the default MCU in the mcHF
project has 192K RAM and we have to fit the full SDR firmware RAM needs
into this space). Speed is traded for RAM use reduction. Since we have
reached our goal timewise, memory reduction is now more in focus for us
(but it should not get slower of course).
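
The flash-versus-RAM trade-off described above can be sketched roughly as follows. This is only an illustration, not the CMSIS-DSP API: `twiddle_flash` and `table_to_ram` are hypothetical names, the table contents are placeholders, and on a host PC the `const` array is of course not located in flash.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical precomputed table. On a Cortex-M build, a const array like
 * this is normally placed in flash by the linker. */
static const float twiddle_flash[4] = { 1.0f, 0.7071068f, 0.0f, -0.7071068f };

/* Copy a const table into heap RAM so the hot loop can read it without
 * flash wait states. Costs n * sizeof(float) bytes of RAM.
 * Returns NULL if the allocation fails; the caller frees the copy. */
static float *table_to_ram(const float *src, size_t n)
{
    float *dst = malloc(n * sizeof *dst);
    if (dst != NULL)
        memcpy(dst, src, n * sizeof *dst);
    return dst;
}
```

Whether the copy pays off depends on how much RAM is left, which is exactly the F4 versus F7 difference discussed here.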
Because of that I would like to propose the following approach to keep
the code easily readable while providing efficient solutions for the ARM:
1. We create an abstract interface for running fft in the codec2
sources. Initially this should closely resemble the existing kiss_fft
calls, which makes introducing the interface easy, since we may
use all the existing tests and can verify that introducing the interface
does not change a bit in the output.
2. Once validated, we can now introduce/activate the use of the arm DSP
FFT with some glue code to map between the abstract interface and the
arm DSP interface. Here again we have to validate everything is working
nicely but we will see some slight differences I assume. However, with
it we can produce reference data for step 3.
3. Now we modify the existing code so that we can benefit from some nice
properties of the arm DSP fft (in-place FFT), which will reduce
RAM usage significantly (in relation to 192K RAM).
4. We enable optional use of RAM instead of flash in the ARM code, so
that depending on the amount of available memory you can get some extra
boost.
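
To put a number on the RAM saving from an in-place FFT (step 3), here is a small sketch of the buffer arithmetic. It assumes complex samples stored as two 32-bit floats; `cpx_t` and both function names are made up for this illustration, and the working memory of the FFT library itself is not counted.

```c
#include <stddef.h>

/* One complex sample as two 32-bit floats. */
typedef struct { float re, im; } cpx_t;

/* Out-of-place transform: separate input and output buffers. */
static size_t fft_bytes_out_of_place(size_t nfft)
{
    return 2 * nfft * sizeof(cpx_t);
}

/* In-place transform: the output overwrites the input buffer. */
static size_t fft_bytes_in_place(size_t nfft)
{
    return nfft * sizeof(cpx_t);
}
```

For an illustrative 512-point complex transform this is 8192 bytes versus 4096 bytes per buffer set, which is noticeable against 192K of total RAM.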
For that to work nicely, we have to fix some issues in the existing code
first, so here comes
0. As Glen pointed out, some of the #define constants have poor
names. M is especially nasty (defined for 2 different purposes in
defines.h and fdmdv_internal.h), and so is N in defines.h (there are
some local variables named N, and the stm headers also get confused by
it). So we need to change these to something unambiguous. I think Glen
already suggested names for them.
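
The kind of collision meant in step 0 can be sketched like this. The replacement names and values below are hypothetical placeholders for illustration only; they are not the names Glen suggested.

```c
/* Problem, shown as a comment because it does not compile:
 *
 *   #define N 80
 *   int sum(const int *x, int N);    // expands to "int 80": syntax error
 *
 * Prefixed, descriptive names avoid clashes with local variables and
 * with vendor headers. Names and values here are placeholders. */
#define C2_SAMPLES_PER_FRAME   80   /* stands in for a macro like N */
#define C2_PITCH_WINDOW       320   /* stands in for a macro like M */

/* A parameter named n no longer collides with any macro. */
static int sum(const int *x, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i];
    return acc;
}
```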
And I would like to point out that the use of dynamic memory allocation
(malloc/free) is necessary in our mcHF case, so I would like to keep
this more or less as it is. The mcHF needs the ability to reuse the
memory for other operational modes if FreeDV is not active. This does
not mean I am against removing the internal use of malloc, but then it
should be possible to easily create the required data structures
"outside" the code using malloc. I.e. the use of static data structures
for anything but const data is a no go.
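
One common way to let the application create the data structures "outside" the library is a size query plus an init function that works on caller-owned memory. The sketch below assumes a made-up `demo_state_t`; it is not the codec2 API, and the real state is far larger.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical decoder state, for illustration only. */
typedef struct {
    float history[64];
    int   frames_decoded;
} demo_state_t;

/* Report how much memory the caller must provide. */
static size_t demo_state_size(void)
{
    return sizeof(demo_state_t);
}

/* Initialise the state in caller-owned memory; no malloc inside the
 * library. The memory must be suitably aligned (malloc'd memory is). */
static demo_state_t *demo_state_init(void *mem)
{
    demo_state_t *s = mem;
    if (s != NULL)
        memset(s, 0, sizeof *s);
    return s;
}
```

With this shape the firmware can malloc the block when FreeDV is switched on and free it again when another mode needs the RAM.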
To support that discussion I created a draft suggestion for the interface
(attached to this mail). It is right now defined using inline code for
the sake of simplicity. This may change later, I don't think there
should be any issue with that. It essentially contains 3 functions for
complex fft and real fft each (alloc,fft,free) and the necessary data
structures.
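
The attachment is not reproduced in this archive, so the following is only a guess at the general shape of such an alloc/fft/free interface for the complex case. All names (`c2_cpx`, `c2_fft_cfg`, `c2_fft_alloc`, `c2_fft`, `c2_fft_free`) are hypothetical. To keep the sketch free of external libraries, the stand-in backend is a direct DFT limited to 4 points, where every twiddle factor is ±1 or ±j; a real build would forward these calls to kiss_fft or the arm DSP library.

```c
#include <stdlib.h>

typedef struct { float re, im; } c2_cpx;   /* hypothetical complex sample */
typedef struct { int nfft; } c2_fft_cfg;   /* hypothetical handle         */

/* alloc: returns NULL for unsupported sizes or on allocation failure. */
static c2_fft_cfg *c2_fft_alloc(int nfft)
{
    c2_fft_cfg *cfg;
    if (nfft != 4)               /* limitation of the stand-in backend */
        return NULL;
    cfg = malloc(sizeof *cfg);
    if (cfg != NULL)
        cfg->nfft = nfft;
    return cfg;
}

/* fft: forward transform, out[k] = sum over n of in[n] * W^(k*n),
 * with W = exp(-j*pi/2) for the 4-point case. */
static void c2_fft(const c2_fft_cfg *cfg, const c2_cpx *in, c2_cpx *out)
{
    /* Powers of W: 1, -j, -1, +j. */
    static const c2_cpx w[4] = { {1, 0}, {0, -1}, {-1, 0}, {0, 1} };
    for (int k = 0; k < cfg->nfft; k++) {
        float re = 0.0f, im = 0.0f;
        for (int n = 0; n < cfg->nfft; n++) {
            const c2_cpx t = w[(k * n) % 4];
            re += in[n].re * t.re - in[n].im * t.im;
            im += in[n].re * t.im + in[n].im * t.re;
        }
        out[k].re = re;
        out[k].im = im;
    }
}

/* free: release everything alloc created. */
static void c2_fft_free(c2_fft_cfg *cfg)
{
    free(cfg);
}
```

Because callers only see the three functions, the backend can be swapped without touching the codec code, and the existing tests can confirm the output is unchanged.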
Danilo
Post by Dana Myers
Subject: Re: [Freetel-codec2] more benching and thoughts
Date: Fri, 16 Sep 2016 12:13:05 +1000
Hi Danilo
Yeah, I guess being a very bare metal programmer from the old 128 byte
RAM days, I dislike MALLOCs in embedded code on principle.
I'm similar, though now that we have 32kB, 64kB (or even more) RAM
in embedded chips, they're basically like the systems that malloc()
was initially built on :-)
I still don't trust C++ heap allocators in embedded applications, though.
However, because the heap usage would be deterministic, it should be
fairly safe.
I have a similar project; a 1200 baud modem + TNC stack built in a PSoC 5LP.
to do bandpass heavy-lifting and frequency response correction for the ADC.
I use CMSIS-DSP for the rest of the DSP crunching required, and, wait for it,
wrap the whole thing in FreeRTOS (9.0.0 now). I use q31_t for all the DSP,
as long as I'm careful to avoid blowing out past the +/- 1.0 range, it's probably
every bit as good as single-precision floating point. The PSoC 5LP has a Cortex-M3,
I'm running it at 80MHz.
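
For readers unfamiliar with Q31, here is a minimal sketch of what staying inside the +/- 1.0 range means in practice. The helpers are written out by hand for illustration and are not CMSIS-DSP functions; `q31` is a local stand-in for the library's `q31_t` typedef. The multiply relies on arithmetic right shift of negative values, which mainstream compilers provide.

```c
#include <stdint.h>

typedef int32_t q31;   /* local stand-in for the CMSIS-DSP q31_t typedef */

/* Convert a float to Q31, saturating at the ends of the range. */
static q31 float_to_q31(float x)
{
    if (x >= 1.0f)  return INT32_MAX;
    if (x <= -1.0f) return INT32_MIN;
    return (q31)(x * 2147483648.0f);
}

/* Saturating Q31 multiply: (a * b) >> 31 with a 64-bit intermediate. */
static q31 q31_mul(q31 a, q31 b)
{
    int64_t p = ((int64_t)a * (int64_t)b) >> 31;
    if (p > INT32_MAX)
        p = INT32_MAX;           /* only reached for (-1.0) * (-1.0) */
    return (q31)p;
}

/* Saturating Q31 add: clamps instead of wrapping around. */
static q31 q31_add(q31 a, q31 b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) s = INT32_MAX;
    if (s < INT32_MIN) s = INT32_MIN;
    return (q31)s;
}
```

Multiplying two in-range values always stays in range apart from the one corner case, while sums are where overflow usually creeps in (0.75 + 0.75 clamps to just under 1.0 here).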
My dynamic buffer implementation uses ...
*******************************************************************************
Take a look at the memory management routine heap_2.c in FreeRTOS
(in fact, there are heap_1.c to heap_5.c - a few options... try heap_4.c, also)
-this is a much smarter memory alloc and dealloc routine that is fairly
cheap.
much better than usual brain dead malloc.
********************************************************************************
I'd recommend using that. It looks for blocks same size, existing used etc
heap_4.c and, while I've never explicitly profiled it, I've never had a reason to
suspect the allocator is misbehaving. I commit 32KB to the heap and currently
never use more than about 3KB.
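
A figure like "never more than about 3KB" comes from tracking the heap's high-water mark. As a rough, host-side sketch of that idea (this is not FreeRTOS code; `tracked_malloc`, `tracked_free`, `heap_in_use` and `heap_peak` are hypothetical names, and C11 is assumed for `max_align_t`):

```c
#include <stddef.h>
#include <stdlib.h>

static size_t heap_in_use;   /* bytes currently allocated via tracked_malloc */
static size_t heap_peak;     /* highest value heap_in_use has reached        */

/* Header stored in front of each block; the union keeps the payload
 * aligned for any type. */
typedef union { size_t size; max_align_t align; } block_hdr;

static void *tracked_malloc(size_t size)
{
    block_hdr *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    heap_in_use += size;
    if (heap_in_use > heap_peak)
        heap_peak = heap_in_use;
    return h + 1;                /* payload starts right after the header */
}

static void tracked_free(void *p)
{
    if (p == NULL)
        return;
    block_hdr *h = (block_hdr *)p - 1;
    heap_in_use -= h->size;
    free(h);
}
```

On FreeRTOS itself, heap_4.c provides `xPortGetMinimumEverFreeHeapSize()`, which reports the smallest amount of free heap seen since boot and serves a similar purpose.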
I would expect the same improvements on the F4 as the F7 using the CMSIS
library. The F7 is much faster on that sort of code.
I only got rid of the FFT malloc stuff; the huge stack additions are
still in there, and you could save 50% there ...
Without knowing the details of this application (I'm new here), I am quite
impressed with the quality of CMSIS-DSP, particularly in terms of exploiting
the ARM extensions.
Cheers,
Dana
-glen
Post by Daniele Barzotti
Hi Glen,
nice, would be interesting to see how much the STM32F4 gains by use of
CMSIS FFT routines.
BTW, I am not sure, but I think you mentioned the removal of malloc as
one of your changes. For us with the mcHF it would not be good to have
the memory for FreeDV code statically allocated since FreeDV is just one
operation mode of the mcHF, and we need the memory at other times for
other stuff, especially since it really eats a lot of memory (in
relation to the STM32F4 RAM sizes). Even half of it is still a lot.
Looking forward to gain some more free cycles with your work.
Regards,
Danilo
Post by glen english
Hi Danilo
yeah, you have plenty in hand.
OK so M7 and CMSIS FFT, about 2 x speed (same clock): 7.74 ms (1200bps)
for decode.
Post by Danilo BeucheH
measured 17.3ms per 40ms interval for the voice decode part only (this
is only happening once the modem is synced) and roughly 5ms of
fdmdv_demod per 20ms interval (happens all the time). Which gives us in
total some 27ms per 40ms once synced. This is about 68% load.
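
The load figure follows from the two measurements above: the demodulator runs every 20 ms, so it executes twice per 40 ms voice frame. A small sketch of that arithmetic (the function name is made up for illustration):

```c
/* CPU load in percent over one 40 ms voice frame.
 * decode_ms: voice decode time per 40 ms frame (once synced)
 * demod_ms:  fdmdv_demod time per 20 ms interval (runs twice per frame) */
static double cpu_load_percent(double decode_ms, double demod_ms)
{
    const double frame_ms = 40.0;
    const double demod_runs_per_frame = 2.0;
    double busy_ms = decode_ms + demod_runs_per_frame * demod_ms;
    return 100.0 * busy_ms / frame_ms;
}
```

With 17.3 ms of decode and roughly 5 ms of demod this gives 27.3 ms busy per 40 ms, i.e. the quoted load of about 68%.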
------------------------------------------------------------------------------
_______________________________________________
Freetel-codec2 mailing list
https://lists.sourceforge.net/lists/listinfo/freetel-codec2