Discussion:
[Freetel-codec2] Using codec2 on ATMEL 32-bit RISC microcontroller
Daniele Barzotti
2016-08-29 13:44:38 UTC
Permalink
Hi,

First of all, I'm new to codec2.

I'm developing on a proprietary board based on an ATMEL AT32UC3B0512, an
8 MByte Spansion flash and FreeRTOS.

What I need to do is store 8 hours of voice in the flash.
My input stream is 16-bit PCM at 8 kHz, and I get 20 dwords (1 frame) every
2.5 ms.
I have to compress it to a 2 kbit/s stream, so I only need the encoder
part.
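
As a rough check of the storage target above, the arithmetic can be sketched in C. This is only an illustration: the function names are made up for this example, "8 MByte" is assumed to mean 8 * 1024 * 1024 bytes, and no allowance is made for filesystem or framing overhead.

```c
#include <stdint.h>

/* Bytes needed to store `hours` of speech encoded at `bits_per_second`. */
static uint64_t storage_bytes(uint32_t hours, uint32_t bits_per_second)
{
    uint64_t seconds = (uint64_t)hours * 3600u;
    return seconds * bits_per_second / 8u;
}

/* Highest bit rate (bit/s, rounded down) that still fits `hours` of
 * speech into `flash_bytes` of storage. */
static uint32_t max_bit_rate(uint64_t flash_bytes, uint32_t hours)
{
    return (uint32_t)(flash_bytes * 8u / ((uint64_t)hours * 3600u));
}
```

Under those assumptions, `storage_bytes(8, 2000)` gives 7,200,000 bytes, which fits in the 8,388,608 bytes available, and `max_bit_rate(8ULL * 1024 * 1024, 8)` gives 2330 bit/s. The 1200 and 1300 bit/s modes mentioned later in the thread would need about 4.3 and 4.7 million bytes respectively.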

I added the codec2 source to my project and it compiles with no problems
but, when I run it, it hangs in codec2_create().

I tried to create the CODEC2 structure on the stack, but with no success.

My question is: does anyone know whether it is possible to run codec2 on a
32-bit RISC?

Thanks in advance,
Daniele.

------------------------------------------------------------------------------
Marat Galyamov
2016-08-29 14:16:24 UTC
Permalink
I use the codec on STM32F4/F7. I had problems initializing the codec, specifically when calling codec2_samples_per_frame(codec2);
The problem was solved by increasing the stack and heap sizes in the project settings.
Regards, Marat.
------------------------------------------------------------------------------
_______________________________________________
Freetel-codec2 mailing list
https://lists.sourceforge.net/lists/listinfo/freetel-codec2
Bruce Perens
2016-08-29 14:08:25 UTC
Permalink
You need hardware floating point.

------------------------------------------------------------------------------
Anomadarshi Barua Shuvro
2016-08-29 14:12:40 UTC
Permalink
Check the memory requirements of codec2_create(), because it
requires a good amount of heap memory...
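
One way to find out how much heap an initialization routine needs is to run it on a PC with a counting allocator. The following is a minimal sketch, not part of codec2: `tracked_malloc`, `tracked_free` and `tracked_peak` are hypothetical names, and it assumes you can redirect the codec's malloc/free calls to them (for example with macros in a test build).

```c
#include <stddef.h>
#include <stdlib.h>

/* Each block carries a small header holding its requested size. */
typedef union {
    size_t size;
    max_align_t align;   /* keeps the user pointer suitably aligned */
} block_header_t;

static size_t heap_in_use = 0;   /* bytes currently allocated */
static size_t heap_peak = 0;     /* highest value heap_in_use has reached */

void *tracked_malloc(size_t size)
{
    block_header_t *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    heap_in_use += size;
    if (heap_in_use > heap_peak)
        heap_peak = heap_in_use;
    return h + 1;                /* user data starts after the header */
}

void tracked_free(void *p)
{
    if (p == NULL)
        return;
    block_header_t *h = (block_header_t *)p - 1;
    heap_in_use -= h->size;
    free(h);
}

size_t tracked_peak(void)
{
    return heap_peak;
}
```

Reading `tracked_peak()` after the create call returns gives a lower bound for the heap size to configure on the target; the target allocator adds its own per-block overhead on top.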

------------------------------------------------------------------------------
glen english
2016-08-29 21:29:43 UTC
Permalink
Hi Daniele

I looked at the AT32UC3B0512 manuals.

I think you would be able to run CODEC2 on the versions with the
hardware FPU; there is enough horsepower, I think, but you may need to
spend time on compiler optimizations. Do not waste ANY
memory: if you try to run FreeRTOS, which I am well familiar with,
it will probably consume the valuable memory you need. I would
recommend you try without the RTOS; really, if the codec is all you
are running, there is no benefit, only disadvantage.

It is certainly worth a shot if you want to learn a few things. It might
depend on how well the compiler mates to the CPU, hence the need to
examine the generated assembler.


regards
------------------------------------------------------------------------------
Steve
2016-08-29 22:42:56 UTC
Permalink
If you comment out all the "mode 1200" stuff, you will save beaucoup bytes.

------------------------------------------------------------------------------
Daniele Barzotti
2016-08-30 07:07:58 UTC
Permalink
Hi Glen,

thanks for your suggestions.
Unfortunately I cannot avoid using FreeRTOS because the project is
based entirely on it.
I'm developing on a DMR radio option board and I have to interface it
with the radio, so a lot of code is already written.

Now I will give it a try.
Just to know, without examining the entire source base, do you know which
parts are not related to the encoder?

Thanks,
Daniele.
------------------------------------------------------------------------------
glen english
2016-08-30 07:19:11 UTC
Permalink
Hi Daniele

You should read it all to understand which bits you need.

You will need to understand it sufficiently to have a good chance of
success.

There will be no spare MIPS or RAM to do anything else, I suspect.

cheers
------------------------------------------------------------------------------
Daniele Barzotti
2016-08-30 07:21:31 UTC
Permalink
Hi Glen,

you're right, I will do! :-)

Thanks a lot!

Cheers,
Daniele.

------------------------------------------------------------------------------
e***@vk5kbb.com
2016-08-30 07:32:06 UTC
Permalink
I've just spent a few weeks ploughing through the source code in order
to port it to different hardware and I have only just begun to scratch
the surface.
What you want to do Daniele is not a trivial task and will take
considerable time.
Good luck.

Regards
Eric
------------------------------------------------------------------------------
Steve
2016-08-30 17:38:30 UTC
Permalink
Post by Daniele Barzotti
...without examining the entire source base, you know which
are the parts not related to the encoder?
I haven't tested it yet, but I thought the repository below might give you a
head start on which code does encode and which does decode. My goal is to
make a mode 1300 dynamic library.

Note that this is for mode 1300 only (I'm building a 1300-only application).

I have yet to compare the output to make sure they are the same.

https://github.com/k5okc/vocoder1300

73/steve

------------------------------------------------------------------------------
glen english
2016-09-15 21:55:20 UTC
Permalink
Daniele, how did you go with your port?

I know a lot more about it than 3 weeks ago and might be able to assist or
comment with useful results.

regards

glen
------------------------------------------------------------------------------
glen english
2016-09-15 23:59:55 UTC
Permalink
codec2 1200, 40 ms frames
-O2
All the code appears to be executed; I had a bit of a look in the debugger.
I'm only running at 168 MHz out of a possible 216 MHz...

40 ms frame:
ENCODE 912123 cycles : 5.429 ms
DECODE 1299982 cycles : 7.74 ms
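
The cycle counts and times above are related by the 168 MHz core clock. A small sketch of that conversion follows; the helper names are made up for this example.

```c
#include <stdint.h>

/* Convert a cycle count to microseconds at a core clock given in Hz
 * (rounded down). */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t clock_hz)
{
    return (uint32_t)((uint64_t)cycles * 1000000u / clock_hz);
}

/* Share of a frame period that is spent busy, in tenths of a percent
 * (rounded down). */
static uint32_t frame_share_permille(uint32_t busy_us, uint32_t frame_us)
{
    return (uint32_t)((uint64_t)busy_us * 1000u / frame_us);
}
```

With these, `cycles_to_us(912123, 168000000)` is 5429 and `cycles_to_us(1299982, 168000000)` is 7737, matching the quoted 5.429 ms and 7.74 ms. Against the 40 ms frame that is roughly 13.5% of the frame for encode and 19.3% for decode.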

The M7 excels at this sort of crazy code (small loops, jumps, misc stuff).

FFT execution time: 32843 cycles

But then after my FFT I do a rather gratuitous memcpy, as the FFT is
IN PLACE and I need to get the result into the destination array.
That is another 1630 cycles. Now, that *could* be turned into a DMA mem-to-mem
transfer, or I could just change some of the variable names and there wouldn't
need to be a copy,
but I don't want to go changing too much stuff for the sake of 9
microseconds...
The memcpy can stay. Maybe there is some improvement if I check/fix the
alignment and do the memcpy myself in 64-bit slabs.

Hm, decode is the hard one, eh? Sort of counter to the usual.

The other thing is that, with a little more attention to memory usage,
the RAM requirements could probably be halved. Whether that is useful, I
dunno.
For example, the double buffers around the FFTs are an extra 4k on the stack.

I'm not using the TCM on the device yet either; with it, the data accesses
would not have to go through the cache. Perhaps another 10%....

Out of interest, my old sweetheart the ADI SHARC would do the 512 pt
complex FFT in perhaps 8000 cycles?
(compared to this 32000), so it ain't that fast, but not bad for a GP micro.

Bruce, (I see your authorness) - your packer compiles well.....

g



------------------------------------------------------------------------------
Danilo Beuche
2016-09-16 01:49:42 UTC
Permalink
Hi,

regarding the times on the mcHF (STM32F4, 168 MHz), some clarifications: we
measured 17.3 ms per 40 ms interval for the voice decode part only (this
only happens once the modem is synced) and roughly 5 ms of
fdmdv_demod per 20 ms interval (happens all the time). That gives us in
total some 27 ms per 40 ms once synced, which is about 68% load.
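
The load figure above is the sum of each task's busy time over a common 40 ms window. A minimal sketch of that sum follows, with hypothetical type and function names, and assuming every task period divides the window evenly.

```c
#include <stdint.h>

typedef struct {
    uint32_t busy_us;    /* processing time per invocation */
    uint32_t period_us;  /* interval at which the task runs */
} task_t;

/* Total busy time in microseconds within one window. */
static uint32_t busy_per_window_us(const task_t *tasks, int n, uint32_t window_us)
{
    uint32_t total = 0;
    for (int i = 0; i < n; i++)
        total += tasks[i].busy_us * (window_us / tasks[i].period_us);
    return total;
}

/* Load as a whole percentage of the window (rounded down). */
static uint32_t load_percent(uint32_t busy_us, uint32_t window_us)
{
    return busy_us * 100u / window_us;
}
```

For the figures quoted, the tasks are `{17300, 40000}` for the voice decode and `{5000, 20000}` for fdmdv_demod: that is 27,300 microseconds of work per 40,000 microsecond window, or 68%.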

Regards,

Danilo





------------------------------------------------------------------------------
glen english
2016-09-16 01:53:34 UTC
Permalink
Hi Danilo

yeah, you have plenty in hand.

OK, so M7 plus CMSIS FFT gives about 2x speed (same clock): 7.74 ms (1200 bps)
for decode.
------------------------------------------------------------------------------
Danilo Beuche
2016-09-16 02:06:06 UTC
Permalink
Hi Glen,

nice, it would be interesting to see how much the STM32F4 gains from using the
CMSIS FFT routines.

BTW, I am not sure, but I think you mentioned the removal of malloc as
one of your changes. For us with the mcHF it would not be good to have
the memory for the FreeDV code statically allocated, since FreeDV is just one
operation mode of the mcHF and we need the memory at other times for
other stuff, especially since it really eats a lot of memory (in
relation to the STM32F4 RAM sizes). Even half of it is still a lot.

Looking forward to gaining some more free cycles with your work.

Regards,
Danilo
------------------------------------------------------------------------------
glen english
2016-09-16 02:13:05 UTC
Permalink
Hi Danilo
Yeah, I guess being a very bare metal programmer from the old 128 byte
RAM days, I dislike MALLOCs in embedded code on principle.

However, because the heap usage would be deterministic, it should be
fairly safe.

*******************************************************************************
Take a look at the memory management routine heap_2.c in FreeRTOS
(in fact, there are heap_1.c to heap_5.c, a few options... try heap_4.c also).
This is a much smarter memory alloc and dealloc routine that is fairly
cheap,
much better than the usual brain dead malloc.
********************************************************************************
I'd recommend using that. It looks for blocks of the same size, existing used ones, etc.
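
A small allocation shim is one way to switch between the C library heap and the FreeRTOS heap without touching every call site. The sketch below is an assumption-laden illustration: the `CODEC_USE_FREERTOS_HEAP` switch, the `CODEC_MALLOC`/`CODEC_FREE` macros and `codec_state_t` are hypothetical names, not part of codec2. `pvPortMalloc()` and `vPortFree()` are the FreeRTOS allocation calls implemented by whichever heap_x.c is built into the project.

```c
#include <stddef.h>

#ifdef CODEC_USE_FREERTOS_HEAP            /* hypothetical build switch */
#include "FreeRTOS.h"                      /* declares pvPortMalloc / vPortFree */
#define CODEC_MALLOC(n) pvPortMalloc(n)
#define CODEC_FREE(p)   vPortFree(p)
#else
#include <stdlib.h>
#define CODEC_MALLOC(n) malloc(n)
#define CODEC_FREE(p)   free(p)
#endif

/* Hypothetical codec state, standing in for a structure such as struct CODEC2. */
typedef struct {
    float window[320];
    int   mode;
} codec_state_t;

codec_state_t *codec_state_create(int mode)
{
    codec_state_t *s = CODEC_MALLOC(sizeof *s);
    if (s == NULL)
        return NULL;   /* report failure so the caller can react */
    s->mode = mode;
    return s;
}

void codec_state_destroy(codec_state_t *s)
{
    CODEC_FREE(s);     /* memory goes back to the heap for other uses */
}
```

Because the state is released in `codec_state_destroy()`, the memory is available to other parts of the firmware whenever the codec is not in use.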

I would expect the same improvements on the F4 as the F7 using the CMSIS
library. The F7 is much faster on that sort of code.

I only got rid of the FFT malloc stuff; the huge stack additions are
still in there,
and you could save 50% there...

-glen
------------------------------------------------------------------------------
Danilo Beuche
2016-09-16 02:26:38 UTC
Permalink
Hi Glen,
Post by glen english
Hi Danilo
Yeah, I guess being a very bare metal programmer from the old 128 byte
RAM days, I dislike MALLOCs in embedded code on principle.
Well. Although I also remember that era, and in general do not like
malloc being used by someone who does not know what she/he is doing,
it does have its fair use cases.
Post by glen english
However, because the heap usage would be deterministic, it should be
fairly safe.
However, using the stack is okay by me; it makes monitoring a little harder,
but codec2 already uses a fair share of stack in some places.
My concern was a migration to statically allocated memory which sits
unused most of the time. This would be a true waste.
Post by glen english
*******************************************************************************
Take a look at the memory management routine heap_2.c in FreeRTOS
(in fact, there are heap_1.c to heap_5.c, a few options... try heap_4.c also).
This is a much smarter memory alloc and dealloc routine that is fairly
cheap,
much better than the usual brain dead malloc.
********************************************************************************
I'd recommend using that. It looks for blocks of the same size, existing used ones, etc.
I would expect the same improvements on the F4 as the F7 using the CMSIS
library. The F7 is much faster on that sort of code.
I only got rid of the FFT malloc stuff; the huge stack additions are
still in there,
and you could save 50% there...
Hope to see this soon. In order not to do double work here, we'll wait
for your changes to materialize. The mcHF is now in a state where we can
play with FreeDV. While this currently has some impact on the time
available for tasks like the spectrum display, this is not critical since
we can RX and TX without issues AFAIK.

Regards,
Danilo


------------------------------------------------------------------------------
glen english
2016-09-16 02:39:24 UTC
Permalink
Hi Danilo
I agree with you about the memory constrained environment.

Actually, with the change in FFT, I have reduced the peak memory
requirements.

The new FFT is in-place, and all CMSIS needs is a config structure,
as all the twiddle and bit reversal tables are the same because it's the
same 512 pt FFT everywhere.

The only static vars I've added are :

1x FFT config for all ffts.

struct CODEC2 lives as a static. it's persistent and I don't need more
than one instance.

The old FFT is not in-place, and where it is used there are input and output
complex float arrays of size 4k each, a total of 8 kbytes added to the stack.

I can't see anywhere that the input FFT array is used after the FFT, so I
could get rid of one of those buffers, saving 4 kbytes on the stack;
that's significant.
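
The 4k and 8 kbyte figures follow from the FFT length and the sample type. A minimal sketch of the arithmetic, assuming 4-byte floats and a complex sample stored as two floats (the type and function names are made up for this example):

```c
#include <stddef.h>

#define FFT_POINTS 512   /* FFT length quoted in the thread */

/* One complex sample as two single-precision floats (assumed layout). */
typedef struct {
    float real;
    float imag;
} complex_f32_t;

/* Buffer bytes for an out-of-place FFT: separate input and output arrays. */
static size_t out_of_place_bytes(void)
{
    return 2 * FFT_POINTS * sizeof(complex_f32_t);
}

/* Buffer bytes for an in-place FFT: one array, transformed where it sits. */
static size_t in_place_bytes(void)
{
    return FFT_POINTS * sizeof(complex_f32_t);
}
```

On a target with 4-byte floats this gives 8192 bytes for the two-buffer arrangement and 4096 bytes for the single in-place buffer, so dropping one buffer saves 4096 bytes of stack at each call site.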

The CMSIS FFT is also very, very frugal on memory usage compared with KISS.
Overall I think the memory footprint would be down by 4k to 6 kbytes at least,
not counting how much the kiss-fft eats the stack, but it was rather
reentrant, so... yeah, perhaps 8 kbytes extra in hand, maybe more?


I've also changed a few structs to typedef struct... and named them
structname_t.
I dislike the old C struct syntax. I always do typedef struct {} xyz_t.

-g
------------------------------------------------------------------------------
glen english
2016-09-16 05:51:05 UTC
Permalink
I had not used the CMSIS FFT before, though I had used other functions in the package.

refer to
http://www.keil.com/pack/doc/CMSIS/DSP/html/index.html
in particular : "Using the Library"

you must include <arm_math.h>

and have all the switches set to get the processor to do its fancy work.
In fact, most people I see don't use the VFP switches, so the micro is
often not using the FPU at all!

I am using
arm_cortexM7lfsp_math.lib (Little endian and Single Precision Floating
Point Unit on Cortex-M7)

F4 will use
arm_cortexM4lf_math.lib (Little endian and Floating Point Unit on Cortex-M4)

then refer to

http://www.keil.com/pack/doc/CMSIS/DSP/html/group___complex_f_f_t.html

in particular
http://www.keil.com/pack/doc/CMSIS/DSP/html/group___complex_f_f_t.html#gade0f9c4ff157b6b9c72a1eafd86ebf80

which is arm_cfft_f32()

for each different CONFIG
you only need one struct :

http://www.keil.com/pack/doc/CMSIS/DSP/html/structarm__cfft__instance__f32.html

the twiddle and BR tables must be initialized

I had to rename the global defines N and M as they clashed with CMSIS or
something else,
so I renamed N to NSPF
(from defines.h)
#define NSPF 80 /* number of samples per frame */
and M to M_PAFS
#define M_PAFS 320 /* pitch analysis frame size */


and because there are no spaces around variable names (David ! take
note ! . mutter mutter )

you need to go in every file and manually replace or not....

(if you just do a global find replace , the editor can't tell the
difference 'cause there are no spaces around variable names )
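
A whole-identifier rename avoids that problem: N is only replaced where it is not glued to other identifier characters. The following is a minimal sketch in plain C, not a tool from the codec2 tree; rename_identifier() is a hypothetical helper, and it does not skip string literals or comments, so the result still needs a review.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* True for characters that can be part of a C identifier. */
static int is_ident_char(int c)
{
    return isalnum(c) || c == '_';
}

/* Copy src to dst, replacing every standalone identifier `from` with `to`.
   Returns the number of replacements, or -1 if dst is too small or
   `from` is empty. (Hypothetical helper, for illustration only.) */
int rename_identifier(const char *src, const char *from, const char *to,
                      char *dst, size_t dst_size)
{
    assert(src != NULL && from != NULL && to != NULL && dst != NULL);
    size_t from_len = strlen(from), to_len = strlen(to), out = 0, i = 0;
    int count = 0;

    if (from_len == 0 || dst_size == 0)
        return -1;
    while (src[i] != '\0') {
        /* A match must start right after a non-identifier character... */
        int at_start = (i == 0) || !is_ident_char((unsigned char)src[i - 1]);
        /* ...and must be followed by a non-identifier character. */
        if (at_start && strncmp(src + i, from, from_len) == 0 &&
            !is_ident_char((unsigned char)src[i + from_len])) {
            if (out + to_len >= dst_size)
                return -1;
            memcpy(dst + out, to, to_len);
            out += to_len;
            i += from_len;
            count++;
        } else {
            if (out + 1 >= dst_size)
                return -1;
            dst[out++] = src[i++];
        }
    }
    dst[out] = '\0';
    return count;
}
```

With this rule a line such as Sn[N-1]=M_PI*N+FN; only has its two standalone N tokens renamed; Sn, M_PI and FN are left alone.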

now, next thing :

there is an example here
http://www.keil.com/pack/doc/CMSIS/DSP/html/group___frequency_bin.html

in particular
http://www.keil.com/pack/doc/CMSIS/DSP/html/arm_fft_bin_example_f32_8c-example.html

you'll need to go through and change all the kiss_fft type decs in the
functions and their definitions (or macroize it all) .

everywhere you find

change it to arm_cfft_instance_f32
like the codec2 struct has references to kiss_fft you need to change
it to a arm_cfft_instance_f32 *

also all the instances of the kiss_fft_cpx type: when kiss_fft goes you need
to come up with something different

however in the code there is the COMP type - this is a struct typedef
that does a similar job.
there is a mix of the COMP type, which is two floats, and just float *
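
On the COMP versus float * mix: arm_cfft_f32() works in place on one float array with real and imaginary parts interleaved, so a COMP buffer can be passed as float * as long as the struct has no padding. A minimal sketch, assuming COMP is the plain two-float struct described above; comp_as_floats() is a hypothetical helper, not part of codec2 or CMSIS.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the COMP type mentioned above: one complex sample. */
typedef struct {
    float real;
    float imag;
} COMP;

/* Compile-time checks that an array of COMP is laid out exactly like
   an array of interleaved floats (re0, im0, re1, im1, ...). */
_Static_assert(sizeof(COMP) == 2 * sizeof(float), "COMP must be two packed floats");
_Static_assert(offsetof(COMP, real) == 0, "real must come first");
_Static_assert(offsetof(COMP, imag) == sizeof(float), "imag must follow real");

/* View n complex samples as 2*n floats. This relies on the layout
   checked above; it is the usual embedded idiom rather than something
   the C standard spells out. */
static float *comp_as_floats(COMP *buf)
{
    assert(buf != NULL);
    return (float *)(void *)buf;
}
```

If one of the static checks ever fails on a toolchain, the buffers would have to be copied into a separate float array instead of being cast.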

My command line flags :

gcc/arm-none-eabi/bin/as" --traditional-format -mcpu=cortex-m7
-mlittle-endian -mfloat-abi=softfp -mfpu=fpv5-sp-d16 -mthumb
"F:/projects/kkk/K15-001-TRAD/sw/rev1/K15-001 THUMB Debug/tmpa06908" -o
"K15-001 THUMB Debug/nlp.o"

linker (truncated)
gcc/arm-none-eabi/bin/ld" -X -ereset_handler --omagic
-defsym=__vfprintf=__vfprintf_float_long_long
-defsym=__vfscanf=__vfscanf_float_long_long_cc --fatal-warnings -EL
--gc-sections "-TF:/projects/kkk/K15-001-TRAD/sw/rev1/K15-001 THUMB
Debug/K15-001.ld" -Map "K15-001 THUMB Debug/K15-001.map" -u_vectors -o
"K15-001 THUMB Debug/K15-001.elf" --start-group "K15-001 THUMB
Debug/main.o" "K15-001 THUMB Debug/hw_init.o" "K15-001 THUMB Debug/si4460.o"



attached
some help
glen english
2016-09-16 09:09:34 UTC
Permalink
Hi everybody
just looking through what code is doing what,
and the test and post_process_mbe code appear not to be used

What's the story behind POST_PROCESS_MBE and the undef code segment- unused.

?

I see post_process_sub_multiples is used, instead.

I observe the post_process_mbe path is more intensive.

I should pull out one of my old C++ flowcharting programs, or is there one on
sourceforge these days ????


g



------------------------------------------------------------------------------
Dana Myers
2016-09-16 18:59:52 UTC
Permalink
Subject: Re: [Freetel-codec2] more benching and thoughts
Date: Fri, 16 Sep 2016 12:13:05 +1000
Hi Danilo
Yeah, I guess being a very bare metal programmer from the old 128 byte
RAM days, I dislike MALLOCs in embedded code on principle.
I'm similar, though now that we have 32kB, 64kB (or even more) RAM
in embedded chips, they're basically like the systems that malloc()
was initially built on :-)

I still don't trust C++ heap allocators in embedded applications, though.
However, because the heap usage would be deterministic, it should be
fairly safe.
I have a similar project; a 1200 baud modem + TNC stack built in a PSoC 5LP.
I use the Delta-Sigma ADC @ 9600s/s , 16-bits, the PSoC Digital Filter Block
to do bandpass heavy-lifting and frequency response correction for the ADC.
I use CMSIS-DSP for the rest of the DSP crunching required, and, wait for it,
wrap the whole thing in FreeRTOS (9.0.0 now). I use q31_t for all the DSP,
as long as I'm careful to avoid blowing out past the +/- 1.0 range, it's probably
every bit as good as single-precision floating point. The PSoC 5LP has a Cortex-M3,
I'm running it at 80MHz.
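
To make the +/- 1.0 remark concrete: Q31 stores a value x as the 32-bit integer x * 2^31, so anything at or beyond +1.0 has to saturate. A minimal host-side sketch with hypothetical helper names; CMSIS-DSP has its own q31_t type and conversion routines, which are deliberately not used here.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t q31;   /* local alias for this sketch; CMSIS-DSP calls it q31_t */

/* Convert a real value to Q31, saturating outside [-1.0, 1.0). */
static q31 q31_from_double(double x)
{
    assert(x == x);                            /* NaN has no Q31 representation */
    double scaled = x * 2147483648.0;          /* x * 2^31 */
    if (scaled >= 2147483647.0)
        return INT32_MAX;
    if (scaled <= -2147483648.0)
        return INT32_MIN;
    return (q31)scaled;                        /* truncates toward zero */
}

/* Multiply two Q31 values. The 64-bit product is in Q62, so shift back
   by 31 bits. Assumes >> on a negative value is an arithmetic shift,
   which holds for the usual compilers but is implementation-defined. */
static q31 q31_mul(q31 a, q31 b)
{
    int64_t p = ((int64_t)a * (int64_t)b) >> 31;
    if (p > INT32_MAX)
        p = INT32_MAX;                         /* only (-1.0) * (-1.0) gets here */
    return (q31)p;
}

/* Add with saturation instead of wrapping past +/- 1.0. */
static q31 q31_add_sat(q31 a, q31 b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX)
        return INT32_MAX;
    if (s < INT32_MIN)
        return INT32_MIN;
    return (q31)s;
}
```

Products of two in-range values always stay in range, so the care is needed mainly around additions and accumulations.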

*******************************************************************************
Take a look at the memory management routine heap_2.c in FreeRTOS
(in fact, there are heap_1 to heap_5 .c - a few options... try heap_4, also)
-this is a much smarter memory alloc and dealloc routine that is fairly
cheap.
much better than usual brain dead malloc.
********************************************************************************
I'd recommend using that. It looks for blocks same size, existing used etc
My dynamic buffer implementation uses heap_4.c and, while I've never
explicitly profiled it, I've never had a reason to
suspect the allocator is misbehaving. I commit 32KB to the heap and currently
never use more than about 3KB.
I would expect the same improvements on the F4 as the F7 using the CMSIS
library. The F7 is much faster on that sort of code.
I only got rid of the FFT malloc stuff the huge stack additions are
still in there
and you could save 50% there ...
Without knowing the details of this application (I'm new here), I am quite
impressed with the quality of CMSIS-DSP, particularly in terms of exploiting
the ARM extensions.

Cheers,
Dana
-glen
Post by Daniele Barzotti
Hi Glen,
nice, would be interesting to see how much the STM32F4 gains by use of
CMSIS FFT routines.
BTW, I am not sure, but I think you mentioned the removal of malloc as
one of your changes. For us with the mcHF it would not be good to have
the memory for FreeDV code statically allocated since FreeDV is just one
operation mode of the mcHF, and we need the memory at other times for
other stuff, especially since it really eats a lot of memory (in
relation to the STM32F4 RAM sizes). Even half of it is still a lot.
Looking forward to gain some more free cycles with your work.
Regards,
Danilo
Post by glen english
Hi Danilo
yeah, you have plenty in hand.
OK so M7 and CMSIS FFT, about 2 x speed (same clock) 7.74mS (1200bps)
for decode.
Post by Danilo Beuche
H
measured 17.3ms per 40ms interval for the voice decode part only (this
is only happening once the modem is synced) and roughly 5ms of
fdmdv_demod per 20ms interval (happens all the time). Which gives us in
total some 27ms per 40ms once synced. This is about 68% load.
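
The 68% figure quoted above follows from counting the demodulator twice per decode interval, since it runs every 20 ms while the decoder runs every 40 ms. A minimal sketch of that arithmetic; the function names are made up for illustration.

```c
#include <assert.h>

/* Percent of CPU time used when a task costs cost_ms every period_ms. */
static double load_percent(double cost_ms, double period_ms)
{
    assert(period_ms > 0.0);
    return 100.0 * cost_ms / period_ms;
}

/* Figures quoted above: voice decode 17.3 ms per 40 ms interval,
   fdmdv_demod roughly 5 ms per 20 ms interval. */
static double freedv_rx_load_percent(void)
{
    const double decode_ms_per_40 = 17.3;
    const double demod_ms_per_20  = 5.0;
    /* Two demod calls fit into one 40 ms decode interval. */
    const double total_ms_per_40  = decode_ms_per_40 + 2.0 * demod_ms_per_20;
    return load_percent(total_ms_per_40, 40.0);
}
```

The total comes to 27.3 ms per 40 ms, which matches the "some 27ms" and "about 68%" in the quoted text.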
Danilo Beuche
2016-09-18 00:40:31 UTC
Permalink
Hi,

I'd like to share a few thoughts/ideas as well:

Since we finished and removed most of the easy to remove performance
hotspots with the exception of the kiss_fft calls which now contribute a
major part to the overall runtime of modem and decoder, I played around
with kiss_fft vs. arm DSP fft on the STM32F4 with some of the newer libs.

Turns out at least for the real fft (kiss_fftr vs. arm_rfft_fast_f32)
there is no time difference when used in our mcHF code. How does
this relate to the measurements of Glen (he measured better performance)?
I believe this is due to the fact that the arm lib stores some of its
data in precomputed flash arrays. Access to flash is slow (5 wait
states), so this reduces the performance. kiss has all its data in RAM.
Since Glen did initialize the arm DSP tables in RAM, he got speed gains
at the expense of RAM. On an STM32F746 this is not as much of an
issue (384K RAM) as it is on the STM32F4 (the default MCU in the mcHF
project has 192K RAM and we have to fit the full SDR firmware RAM needs
into this space). Speed is traded for reduced RAM use. Since we have
reached our goal timewise, memory reduction is now more in focus for us
(but it should not get slower of course).

Because of that I would like to propose the following approach to keep
the code easily readable while providing efficient solutions for the ARM
MCUs both with little and more RAM:

1. We create an abstract interface for running fft in the codec2
sources. Initially this should closely resemble the existing kiss_fft
calls, which makes introducing the interface easy, since we may
use all the existing tests and can verify that introducing the interface
does not change a bit in the output.

2. Once validated, we can now introduce/activate the use of the arm DSP
FFT with some glue code to map between the abstract interface and the
arm DSP interface. Here again we have to validate that everything is working
nicely, but we will see some slight differences I assume. However, with
it we can produce reference data for step 3.

3. Now we modify the existing code so that we can benefit from some nice
properties of the arm DSP fft (inplace FFT) which means this will reduce
RAM usage significantly (in relation to 192K RAM).

4. We enable optional use of RAM instead of flash in the ARM code, so
that depending on the amount of available memory you can get some extra
boost.

For that to work nicely, we have to fix some issues in the existing code
first, so here comes

0. As Glen pointed out, some of the #define constants do not have good
names, especially M (defined for 2 different purposes in defines.h and
fdmdv_internal.h) is nasty and also N in defines.h (there are some local
variables N and the stm headers also get confused by it). So we need to
change these to something unambiguous. I think Glen already suggested
names for them.


And I would like to point out, that the use of dynamic memory allocation
(malloc/free) is necessary in our mcHF case, so I would like to keep
this more or less as it is. The mcHF needs the ability to reuse the
memory for other operational modes, if FreeDV is not active. Which does
not mean I am against removing the internal use of malloc, but then it
should be possible to easily create the required data structures
"outside" the code using malloc. I.e. the use of static data structures
for anything but const data is a no go.

To support that discussion I created a draft suggestion for the interface
(attached to this mail). It is right now defined using inline code for
the sake of simplicity. This may change later, I don't think there
should be any issue with that. It essentially contains 3 functions each
for complex fft and real fft (alloc, fft, free) and the necessary data
structures.
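
The three-function shape (alloc, fft, free) could look roughly like the following for the complex case. This is a sketch only and not the draft attached to the mail: the names fft_cfg_sketch, fft_alloc_sketch(), fft_run_sketch() and fft_free_sketch() are hypothetical, and a naive DFT stands in for the real backend so that it runs on a PC. All state lives in a heap object owned by the caller, so the memory can be handed back when FreeDV is not active.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { float real; float imag; } COMP;   /* two floats per sample */

/* Hypothetical config object. A kiss_fft backend would keep a kiss_fft_cfg
   here, a CMSIS backend an arm_cfft_instance_f32; this stand-in keeps a
   twiddle table for a naive DFT. */
typedef struct {
    int   nfft;
    COMP *twiddle;                 /* nfft entries, heap allocated */
} fft_cfg_sketch;

/* Taylor-series sin/cos for |x| <= pi, only to keep the sketch free of libm. */
static void sincos_small(double x, double *s, double *c)
{
    double ts = x, tc = 1.0;
    *s = 0.0;
    *c = 0.0;
    for (int i = 0; i < 12; i++) {
        *s += ts;
        *c += tc;
        tc *= -x * x / ((2.0 * i + 1.0) * (2.0 * i + 2.0));
        ts *= -x * x / ((2.0 * i + 2.0) * (2.0 * i + 3.0));
    }
}

/* alloc: build the config on the heap; returns NULL if memory runs out. */
fft_cfg_sketch *fft_alloc_sketch(int nfft, int inverse)
{
    const double two_pi = 6.283185307179586;
    assert(nfft > 0);
    fft_cfg_sketch *cfg = malloc(sizeof *cfg);
    if (cfg == NULL)
        return NULL;
    cfg->twiddle = malloc((size_t)nfft * sizeof *cfg->twiddle);
    if (cfg->twiddle == NULL) {
        free(cfg);
        return NULL;
    }
    cfg->nfft = nfft;
    for (int k = 0; k < nfft; k++) {
        double angle = two_pi * k / nfft;          /* 0 .. 2*pi */
        double s, c;
        if (angle > two_pi / 2.0)
            angle -= two_pi;                       /* fold into -pi .. pi */
        sincos_small(inverse ? angle : -angle, &s, &c);
        cfg->twiddle[k].real = (float)c;
        cfg->twiddle[k].imag = (float)s;
    }
    return cfg;
}

/* fft: out[k] = sum over n of in[n] * W^(n*k). in and out must not overlap.
   O(n^2) on purpose; a real backend would call its fast routine here. */
void fft_run_sketch(const fft_cfg_sketch *cfg, const COMP *in, COMP *out)
{
    assert(cfg != NULL && in != NULL && out != NULL);
    for (int k = 0; k < cfg->nfft; k++) {
        float re = 0.0f, im = 0.0f;
        for (int n = 0; n < cfg->nfft; n++) {
            const COMP w = cfg->twiddle[(int)(((long)n * k) % cfg->nfft)];
            re += in[n].real * w.real - in[n].imag * w.imag;
            im += in[n].real * w.imag + in[n].imag * w.real;
        }
        out[k].real = re;
        out[k].imag = im;
    }
}

/* free: release everything alloc obtained; NULL is accepted. */
void fft_free_sketch(fft_cfg_sketch *cfg)
{
    if (cfg != NULL) {
        free(cfg->twiddle);
        free(cfg);
    }
}
```

Because callers only see the config pointer and the three functions, swapping the DFT for kiss_fft or the arm DSP routines would not touch the calling code, which is the point of step 1 above.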

Danilo
------------------------------------------------------------------------------
glen english
2016-09-18 01:12:45 UTC
Permalink
Hi Danilo

Good thoughts and points.

while on the RAM subject : required reading for every serious programmer :
"Memory"
https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich

read all 7 parts, 100 pages, but if you only have an hour, just read
"part 2 - cache"

On the fft:

I am not surprised that the ARM lib hand optimized assembler is that
much faster.
more than 2x faster.... in fact.

I don't think kiss-fft is particularly suitable for this sort of platform,
either, I'll hold back what I really think :-) .

The 5WS on flash (actually 6WS I am running @ 168M) does not really
affect the performance too much. In fact I can vary the WS count +/- 2
without much change- the ART and the prefetch and the instruction and
data caches are doing their job, so there is very little difference with
the const values in ram or cache.

In fact, most FFT implementations are very tough on a machine with cache .
Have you read the paper on how FFTW works ? It is very cache aware- and
adaptive to the architecture- that is why it does trial runs and picks
the best.

The M7 is very impressive. It is certainly impressive work by ARM.

However, the M4 is what all of you have to work with so we can stay
focussed on that.

I think also the ram usage will be significantly less with the arm FFT
because of the re-entrant Kiss-fft behaviour.

The m4 is quite a different beast, and no D-cache can improve
performance over the M7 for some (inaptly) written applications (not
this one- but as a generalization for applications grabbing a byte from
memory randomly and all over a large dataset)

Large matrix operations are where cache machines fall over- that is once
the dataset is bigger than the cache....

The question is how much optimization is enough. I am tempted NOT to
optimize any more, although I feel (just by looking at it ) I can get
another 2x out of it..... Why- well there is no real pressing need.
Going too far away from the reference code will island the code a bit.
However, if you run out of modem cycles/ modem ram, then we can probably
get a bit more...

cheers









------------------------------------------------------------------------------
Danilo Beuche
2016-09-18 01:38:43 UTC
Permalink
Hi Glen,
Post by glen english
Hi Danilo
Good thoughts and points.
"Memory"
https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich
read all 7 parts, 100 pages, but if you only have an hour, just read
"part 2 - cache"
I am not surprised that the ARM lib hand optimized assembler is that
much faster.
more that 2x faster.... in fact.
As I said on the m4 kiss fft and arm dsp are on par, not much difference.
And using RAM vs. flash makes a lot of difference on the mcHF M4 Code
for code which uses tables heavily. We gained a lot from removing the
need to go to flash twice in the fir_filter vs. fir_filter2.
Maybe the mcHF startup configuration is not enabling all the caches.
Will check that.
Post by glen english
I don't think kiss-fft is particular suitable for this sort of platform,
either, I'll hold back what I really think :-) .
affect the performance too much. In fact I can vary the WS count +/- 2
without much change- the ART and the prefetch and the instruction and
data caches are doing their job, so there is very little difference with
the const values in ram or cache.
In fact, most FFT implementations are very tough on a machine with cache .
Have you read the paper on how FFTW works ? It is very cache aware- and
adaptive to the architecture- that is why it does trial runs and picks
the best.
The M7 is very impressive. It is certainly impressive work by ARM.
However, the M4 is what all of you have to work with so we can stay
focussed on that.
I think also the ram usage will be significantly less with the arm FFT
because of the re-entrant Kiss-fft behaviour.
The m4 is quite a different beast, and no D-cache can improve
performance over the M7 for some (inaptly) written applications (not
this one- but as a generalization for applications grabbing a byte from
memory randomly and all over a large dataset)
Large matrix operations are where cache machines fall over- that is once
the dataset is bigger than the cache....
The question is how much optimization is enough. I am tempted NOT to
optimize any more, although I feel (just by looking at it ) I can get
another 2x out of it..... Why- well there is no real pressing need.
Going too far away from the reference code will island the code a bit.
However, if you run out of modem cycles/ modem ram, then we can probably
get a bit more...
cheers
Yes, unfortunately we have a M4 at hand, so a little more RAM would be
nice :-(

Regards,
Danilo
------------------------------------------------------------------------------
Danilo Beuche
2016-09-18 02:13:32 UTC
Permalink
Hi Glen,

just checked, it seems to me that we have all the caches running in the
mcHF:

https://github.com/df8oe/mchf-github/blob/670f94a2e69a55a03f099ad25390925c84c09201/mchf-eclipse/cmsis_boot/system_stm32f4xx.c#L424

So even with these caches/buffers enabled, the M4 loses performance on
data reads from flash. Haven't checked the manual/internet but maybe the
flash caching works well for code but not in the same way for data.


Danilo
------------------------------------------------------------------------------
glen english
2016-09-18 03:22:13 UTC
Permalink
Hi Danilo

Yes, there is most certainly a penalty for const access from flash, at
least on the M4.

and of course the instruction cache is no use here; it is just that, only for
instructions

I wonder what the bus matrix penalty is for data fetches from flash.
There is an app-note about it somewhere I once read. It depends on what
else is going on. The processor is pretty smart at interleaving the
accesses so as not to stall the pipeline or bus matrix.

As Danilo I am sure you know : (pointed out for others)
You can force variables into sections (like forcing a static const ) by
using

__attribute__((section("name"))) to assign say into data.
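
A minimal sketch of the section attribute with GCC. The section name .fast_tables is made up for illustration; on a real target the linker script has to map it to RAM (or CCM) and the startup code has to initialise it. The table is left non-const here because GCC can report a section type conflict when const and writable objects end up in the same section.

```c
#include <assert.h>
#include <stddef.h>

/* Apply the attribute only on GCC-compatible ELF toolchains, so the
   sketch also compiles unchanged elsewhere. */
#if defined(__GNUC__) && defined(__ELF__)
#define IN_SECTION(name) __attribute__((section(name)))
#else
#define IN_SECTION(name)
#endif

/* Lookup table placed in a named section instead of the default one.
   ".fast_tables" is a hypothetical name, not one used by codec2 or mcHF. */
static float twiddle_ram[8] IN_SECTION(".fast_tables") = {
    1.0f, 0.0f, 0.0f, -1.0f, -1.0f, 0.0f, 0.0f, 1.0f
};

/* Bounds-checked read from the table. */
float twiddle_lookup(size_t i)
{
    assert(i < sizeof twiddle_ram / sizeof twiddle_ram[0]);
    return twiddle_ram[i];
}
```

The map file shows where the table actually ended up, which is worth checking before trusting any before/after timing.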

I will some time run up the code on an M4 and see what kiss-fft does.

I am very very very surprised , and do not really believe that kissFFT
is as fast as the arm assembler on the M4 - my immediate thoughts are
"you are doing it wrong". so, I will investigate.

g
------------------------------------------------------------------------------
Danilo Beuche
2016-09-18 03:51:00 UTC
Permalink
Hi Glen,
Post by glen english
Hi Danilo
Yes, there is most-certainly a penalty for const access from flash, at
least on the M4.
and of course instruction cache is no use, is just that, only for
instructions
Hence the name :-)
Post by glen english
I wonder what the bus matrix penalty is for data fetches from flash.
There is an app-note about it somewhere I once read. It depends on what
else is going on. The processor is pretty smart and interleaving the
accesses as not to stall the pipeline or bus matrix.
As Danilo I am sure you know : (pointed out for others)
You can force variables into sections (like forcing a static const ) by
using
__attribute__((section("name"))) to assign say into data.
Yes, indeed. We use that extensively at the mcHF to move certain parts
to the CCM memory which is otherwise not so easily accessible (and
should be used with care as it does not support DMA to/from peripherals).
Works great.
Post by glen english
I will some time run up the code on an M4 and see what kiss-fft does.
I am very very very surprised , and do not really believe that kissFFT
is as fast as the arm assembler on the M4 - my immediate thoughts are
"you are doing it wrong". so, I will investigate.
Would be happy if I am wrong. Which would mean we get even faster fft on
the M4 almost for free (minus investigation time that is). No problem at
all with me :-)


Danilo



------------------------------------------------------------------------------
glen english
2016-09-18 03:57:02 UTC
Permalink
kiss fft per standard codec2 code...

-O2, debug level2.
encode 3200 (20mS) frame
encode : 1388964 cycles: 8.26mS
decode : 1807440 cycles : 10.75mS

encode 1200 (40mS frame)
3006497 (17.89mS)
decode
3669321 (21.8mS)

hmm seems pretty slow
I wonder what I am doing wrong ?
Or is that in line with others' measurements ?

kissFFT512 : 155,483 cycles (versus 70,000 on the M7)
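
The millisecond figures above are the cycle counts divided by the core clock. A minimal sketch of the conversion, assuming the 168 MHz clock mentioned earlier in the thread; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to milliseconds at a core clock of cpu_hz. */
static double cycles_to_ms(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0);
    return 1000.0 * (double)cycles / (double)cpu_hz;
}

/* Percent of one frame period (frame_ms) that the given cycle count uses. */
static double frame_budget_percent(uint32_t cycles, uint32_t cpu_hz, double frame_ms)
{
    assert(frame_ms > 0.0);
    return 100.0 * cycles_to_ms(cycles, cpu_hz) / frame_ms;
}
```

For example, cycles_to_ms(1388964, 168000000) gives the 3200 encode time, and frame_budget_percent() with frame_ms = 20.0 relates it to the 20 ms frame.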

Next... arm asm on F4 (STM32F405RGT6)




------------------------------------------------------------------------------
glen english
2016-09-18 04:07:34 UTC
Permalink
arm fft, stm32F405RGT6, 6WS

-O2, debug level2.

encode 1200 (40mS frame)
1745799 : 10.39mS
decode
2292497 : 13.64mS

so, 2x speed of my other runs

let's run 5WS

encode 1200 (40mS frame)
1579922 : 9.4mS
decode
2073979 : 12.34mS

WOW the wait states hurt on the M4 !!!!
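
To put numbers on that, the cycle counts can be compared directly. A minimal sketch with made-up helper names; the speedup against the kiss_fft run assumes both runs used the same wait-state setting, which the posts do not state.

```c
#include <assert.h>
#include <stdint.h>

/* How many times faster the new run is than the old one. */
static double speedup(uint32_t old_cycles, uint32_t new_cycles)
{
    assert(new_cycles > 0);
    return (double)old_cycles / (double)new_cycles;
}

/* How much slower, in percent, the slow run is relative to the fast one. */
static double percent_slower(uint32_t fast_cycles, uint32_t slow_cycles)
{
    assert(fast_cycles > 0);
    return 100.0 * ((double)slow_cycles - (double)fast_cycles)
                 / (double)fast_cycles;
}
```

With the 1200 figures above, percent_slower(1579922, 1745799) and percent_slower(2073979, 2292497) both come out at a little over 10% for the one extra wait state, and speedup(3006497, 1745799) puts the whole encode path at roughly 1.7x rather than a full 2x.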






------------------------------------------------------------------------------
Danilo Beuche
2016-09-18 04:24:53 UTC
Permalink
Hi Glen,

What and how are you measuring? I would like to repeat the measurements in my
environment.

We use the cycle counter of the CORTEX processor (seems you do the
same), interrupt overhead is removed (counter is stopped during the
major interrupt, which is the audio irq).

I have some prerecorded "frames" stored in flash which are fed into the
freedv_comprx routine. Measurements are taken every 50 frames, since
initially the decoder needs to lock on the data.

After 50 frames I get stable measurements.
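
One way to structure such a measurement is sketched below, under assumptions: the caller supplies start and end values read from a free-running 32-bit cycle counter, the first frames are discarded as warm-up, and interrupt cycles are subtracted rather than the counter being stopped. frame_timer and its functions are hypothetical names, not code from the mcHF firmware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cycles between two reads of a free-running 32-bit counter. Unsigned
   subtraction wraps, so the result is right across one counter overflow. */
static uint32_t cycles_between(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

typedef struct {
    uint32_t warmup_frames;   /* frames to ignore while the decoder locks */
    uint32_t frames_seen;
    uint32_t samples;         /* frames that counted towards the mean */
    uint64_t total_cycles;
} frame_timer;

/* Record one frame; isr_cycles is the time spent in interrupts during
   the frame and is removed from the measurement. */
static void frame_timer_record(frame_timer *t, uint32_t start, uint32_t end,
                               uint32_t isr_cycles)
{
    assert(t != NULL);
    uint32_t elapsed = cycles_between(start, end);
    assert(isr_cycles <= elapsed);
    t->frames_seen++;
    if (t->frames_seen <= t->warmup_frames)
        return;                                   /* still warming up */
    t->total_cycles += elapsed - isr_cycles;
    t->samples++;
}

/* Mean cycles per measured frame, or 0 if nothing was measured yet. */
static uint32_t frame_timer_mean(const frame_timer *t)
{
    assert(t != NULL);
    return t->samples ? (uint32_t)(t->total_cycles / t->samples) : 0u;
}
```

Setting warmup_frames to 50 would mirror the setup described above, where the first 50 frames are needed for the decoder to lock.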

I did some hotspot analysis and a single kiss_fft 512 fft call was
taking around 670uS (*168) = 112500 cycles.

So this is in line with your numbers.

My results for kiss_fft vs arm_fft may not be correct, since in this
case I was in fact running kiss_fftr vs. arm_rfft_fast_f32.

Maybe there is a huge penalty for the arm_rfft_fast_f32 code.

Let me try the kiss_fft vs. arm_cfft case to confirm your measurements.

Danilo
------------------------------------------------------------------------------
glen english
2016-09-18 04:35:29 UTC
Permalink
Using environment Rowley CrossStudio for ARM 3.6.4 . GCC 4.9

using cycle counter (yes)

interrupt overhead : (irrelevant, most likely in my setup) (asm) irqs
only set off flags...

for kissfft 5ws F4, I wonder why you have 112500 cycles and I have
141000. Something for me to look at .

hmm

-O2 but I also have a bunch of debug symbol stuff in there dunno I think
it is only symbol data at DB2 which pushes up the image size.




------------------------------------------------------------------------------
Danilo Beuche
2016-09-18 04:58:05 UTC
Permalink
Hi Glen, I would not worry too much:

- Maybe gcc 5.4 vs 4.9: difference is ~-20% (depending on which end
you look from). It is a lot but not unexplainable.

- Maybe it is my test data. I don't know how much jitter there is in the
kiss_fft algorithm when different data is presented. I am running
"artificially" generated audio input (digitally captured codec2 frames
from a single 750Hz sine wave, also generated digitally).

- Maybe it is my strange way of running the mcHF firmware: the mcHF
hardware has a 16MHz XO, but the discovery board which I have here for
testing has an 8MHz XO. I didn't bother to reconfigure the PLL. So
everything takes twice the time. If the flash were asynchronously
coupled, which I doubt (otherwise no need for explicit wait state
settings), it would have an influence. But here I am quite sure this
is not the case. If the caches are asynchronous: maybe. Maybe I should
remeasure with a fixed PLL setup so that the processor runs at true
168MHz. Will do that later and get back with updated numbers.

Danilo
------------------------------------------------------------------------------
glen english
2016-09-18 05:15:19 UTC
Permalink
Hi
OK.
well, anyway, decode 1200 (40mS) takes 12.34mS on my kit, and 19.86
using kiss-fft.
I think you approximated about 14.4mS for a decode 1300 on your kit

so, I will be interested to see what you come up with using cfft

My codec 2 codebase is AUGUST 2015

cheers
------------------------------------------------------------------------------
Danilo Beuche
2016-09-18 06:00:43 UTC
Permalink
Hi Glen,

I just verified: the cycle counts for true 16 MHz are within 1% of the
8 MHz operation (but measurements now take half as long :-) )

Danilo
------------------------------------------------------------------------------
glen english
2016-09-19 02:04:16 UTC
Permalink
good.

Well, it would seem that if decode 1200 + kissfft is taking about 14 ms
on your kit (GCC 5.4), and decode 1200 + arm cfft is taking 12.34 ms on
my kit (GCC 4.9), then we have won at least 2 ms based on that, which is
worthwhile (and a fair bit of RAM). The improvement is likely to be
greater if I ran GCC 5.4.

g
------------------------------------------------------------------------------
Danilo Beuche
2016-09-18 04:09:05 UTC
Permalink
Hi Glen,

Difficult to say.

I have data for freedv_comprx() for FreeDV 1600 (which translates to
decode 1300, because some bits are used for data transmission). It does
a modem decode every 20 ms and a voice decode every 40 ms.

On average I measure 12.2 ms per 20 ms call, i.e. 24.4 ms per 40 ms.
Of that, 2 x 5 ms is the fdmdv_demod part, so decode 1300 takes 14.4 ms.

I am using gcc 5.4, mostly -O2 (some files have -O3 enabled, but none of
the codec2 files).

And I run more or less the newest SVN code (codec2-dev 2875).

If you run somewhat older code, the numbers you gave match the
performance we had a few days ago. As David mentioned, we gained about
40% since r2842 (or so). So they could be right.
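
The budget arithmetic in this message can be written out explicitly. This is a minimal sketch in C that uses only the figures quoted above; the function and parameter names are illustrative and are not part of codec2.

```c
#include <assert.h>

/* Voice-decode time per 40 ms frame, derived from the average time of one
 * freedv_comprx() call (made every 20 ms) and the modem demod share of
 * each call. Names are illustrative, not codec2 API. */
static double voice_decode_ms(double avg_call_ms, double demod_ms_per_call)
{
    assert(avg_call_ms >= demod_ms_per_call);
    const int calls_per_frame = 2;   /* 40 ms frame / 20 ms call interval */
    double total_ms = calls_per_frame * avg_call_ms;
    double demod_ms = calls_per_frame * demod_ms_per_call;
    return total_ms - demod_ms;
}

/* CPU load as a fraction of the frame period. */
static double cpu_load(double busy_ms, double period_ms)
{
    assert(period_ms > 0.0);
    return busy_ms / period_ms;
}
```

With the figures above, voice_decode_ms(12.2, 5.0) gives about 14.4 ms, and cpu_load(24.4, 40.0) gives about 0.61, i.e. roughly 61% of the 40 ms frame period.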

Danilo
Post by glen english
kiss fft per standard codec2 code...
-O2, debug level 2.
3200 mode (20 ms frame)
encode : 1388964 cycles : 8.26 ms
decode : 1807440 cycles : 10.75 ms
1200 mode (40 ms frame)
encode : 3006497 cycles : 17.89 ms
decode : 3669321 cycles : 21.8 ms
Hmm, seems pretty slow.
I wonder what I am doing wrong?
Or is that in line with others' measurements?
kissFFT512 : 155,483 cycles (versus 70,000 on the M7)
Next... ARM asm on F4 (STM32F405RGT6)
------------------------------------------------------------------------------
glen english
2016-09-18 04:17:37 UTC
Permalink
-O2, debug level 2
1200 (40 ms frame)
5 wait states

STM32F405RGT6

kissfft
encode : 2780314 cycles : 16.5 ms
decode : 3335953 cycles : 19.85 ms
kissFFT512cpxfloat : 141,032 cycles

ARMfft
encode : 1579922 cycles : 9.4 ms
decode : 2073979 cycles : 12.34 ms
ARMfft : 50,597 cycles

So I still see more than 2:1 on cycle count for the FFT in favour of
the ARM FFT.

Can one of you guys get a cycle count on kiss-fft?

And the Cortex-M7 is about 2x the speed on that code, for the SAME clock.
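
The cycle counts and times in this post can be cross-checked with a short conversion. This is a sketch in C that assumes a 168 MHz core clock (the quoted times are consistent with that figure, but the post does not state it); CPU_HZ and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_HZ 168000000u   /* assumed core clock: 168 MHz */

/* Convert a cycle count to milliseconds at the assumed core clock. */
static double cycles_to_ms(uint32_t cycles)
{
    return (double)cycles * 1000.0 / (double)CPU_HZ;
}

/* Ratio of two cycle counts, e.g. kiss_fft cycles over ARM FFT cycles. */
static double cycle_ratio(uint32_t slow, uint32_t fast)
{
    assert(fast != 0u);
    return (double)slow / (double)fast;
}
```

Under that assumption, cycles_to_ms(2073979) comes out at roughly 12.35 ms, in line with the 12.34 ms quoted for the ARM FFT decode, and cycle_ratio(141032, 50597) is about 2.79, which matches the "more than 2:1" remark.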
------------------------------------------------------------------------------
David Rowe
2016-09-18 02:16:18 UTC
Permalink
Thanks Danilo,

That's an interesting explanation of why kiss_fft performs just as well
as the optimised ARM FFT on the M4. I also found no performance
improvement when I tried the optimised ARM FFT a few years ago.

I'm inclined to keep malloc/free in codec 2. Happy to look at patches
for alternate memory allocators if it's an itch anyone really wants to
scratch.

Danilo - I will get back to you on your other points and proposal shortly.

To the List in general - I'd like to publicly thank Danilo and the mcHF
team for the fine patches he/they have submitted over the last few
weeks. The FreeDV 1600 decoder is now around 40% faster on the STM32F4.

Importantly - his suggestions have been backed by quality patches I can
easily apply and test. He has shortened my TODO list - not made it
longer. Very important for me and a fine example to anyone else who
would like to contribute to Codec 2. Thanks Danilo!

- David
Post by Danilo Beuche
Hi,
Since we finished and removed most of the easy-to-remove performance
hotspots, with the exception of the kiss_fft calls, which now contribute a
major part of the overall runtime of modem and decoder, I played around
with kiss_fft vs. the ARM DSP FFT on the STM32F4 with some of the newer libs.

It turns out that, at least for the real FFT (kiss_fftr vs.
arm_rfft_fast_f32), there is no time difference when used in our mcHF
code. How does this relate to the measurements of Glen (he measured
better performance)?
I believe this is because the ARM lib stores some of its
data in precomputed flash arrays. Access to flash is slow (5 wait
states), so this reduces the performance. kiss has all its data in RAM.
Since Glen initialized the ARM DSP tables in RAM, he got speed gains
at the expense of RAM. On an STM32F746 this is not as much of an
issue (384K RAM) as it is on the STM32F4 (the default MCU in the mcHF
project has 192K RAM, and we have to fit the full SDR firmware's RAM needs
into this space). Speed is traded for reduced RAM use. Since we have
reached our goal timewise, memory reduction is now more in focus for us
(but it should not get slower, of course).

Because of that, I would like to propose the following approach to keep
the code easily readable while providing efficient solutions for the ARM:

1. We create an abstract interface for running FFTs in the codec2
sources. Initially this should closely resemble the existing kiss_fft
calls, which makes introducing the interface easy, since we can
use all the existing tests and verify that introducing the interface
does not change a bit in the output.
2. Once validated, we can introduce/activate the use of the ARM DSP
FFT with some glue code to map between the abstract interface and the
ARM DSP interface. Here again we have to validate that everything is
working nicely, but I assume we will see some slight differences.
However, with it we can produce reference data for step 3.
3. Now we modify the existing code so that we can benefit from some nice
properties of the ARM DSP FFT (in-place FFT), which will reduce
RAM usage significantly (in relation to 192K RAM).
4. We enable optional use of RAM instead of flash in the ARM code, so
that, depending on the amount of available memory, you can get some extra
boost.

For that to work nicely, we have to fix some issues in the existing code
first, so here comes

0. As Glen pointed out, some of the #define constants have not-so-good
names. Especially M (defined for 2 different purposes in defines.h and
fdmdv_internal.h) is nasty, and also N in defines.h (there are some local
variables named N, and the STM headers also get confused by it). So we
need to change these to something unambiguous. I think Glen already
suggested names for them.
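
Step 4 (optional use of RAM instead of flash for precomputed tables) could look roughly like the following. This is a hedged sketch in C, not code from codec2 or the ARM DSP library: table_flash, table_acquire and the dummy table values are hypothetical, and it assumes a typical embedded GCC build where const arrays are placed in flash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a precomputed table. On a typical embedded build a const
 * array like this ends up in flash (.rodata). The values are dummies. */
static const float table_flash[4] = { 1.0f, 0.5f, 0.25f, 0.125f };

/* Hypothetical helper: return the table pointer the FFT code should use.
 * With use_ram != 0 the table is copied to heap RAM (costs RAM, avoids
 * flash wait states); otherwise the flash copy is used directly. */
static const float *table_acquire(int use_ram, float **ram_copy_out)
{
    assert(ram_copy_out != NULL);
    *ram_copy_out = NULL;
    if (!use_ram)
        return table_flash;
    float *copy = malloc(sizeof table_flash);
    if (copy == NULL)
        return table_flash;          /* no RAM left: fall back to flash */
    memcpy(copy, table_flash, sizeof table_flash);
    *ram_copy_out = copy;            /* caller frees this when done */
    return copy;
}
```

Because the RAM copy comes from the heap, it can be released again when FreeDV is not active, which fits the memory-reuse requirement raised in this thread.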

I would also like to point out that the use of dynamic memory allocation
(malloc/free) is necessary in our mcHF case, so I would like to keep
this more or less as it is. The mcHF needs the ability to reuse the
memory for other operational modes if FreeDV is not active. This does
not mean I am against removing the internal use of malloc, but then it
should be possible to easily create the required data structures
"outside" the code using malloc. I.e. the use of static data structures
for anything but const data is a no-go.

To support that discussion, I created a draft suggestion for the interface
(attached to this mail). It is right now defined using inline code for
the sake of simplicity. This may change later; I don't think there
should be any issue with that. It essentially contains 3 functions each
for complex FFT and real FFT (alloc, fft, free) and the necessary data
structures.
Danilo
Post by Dana Myers
Subject: Re: [Freetel-codec2] more benching and thoughts
Date: Fri, 16 Sep 2016 12:13:05 +1000
Hi Danilo
Yeah, I guess being a very bare metal programmer from the old 128 byte
RAM days, I dislike MALLOCs in embedded code on principle.
I'm similar, though now that we have 32kB, 64kB (or even more) RAM
in embedded chips, they're basically like the systems that malloc()
was initially built on :-)
I still don't trust C++ heap allocators in embedded applications, though.
However, because the heap usage would be deterministic, it should be
fairly safe.
I have a similar project; a 1200 baud modem + TNC stack built in a PSoC 5LP.
to do bandpass heavy-lifting and frequency response correction for the ADC.
I use CMSIS-DSP for the rest of the DSP crunching required, and, wait for it,
wrap the whole thing in FreeRTOS (9.0.0 now). I use q31_t for all the DSP,
as long as I'm careful to avoid blowing out past the +/- 1.0 range, it's probably
every bit as good as single-precision floating point. The PSoC 5LP has a Cortex-M3,
I'm running it at 80MHz.
My dynamic buffer implementation uses ...
*******************************************************************************
Take a look at the memory management routine heap_2.c in FreeRTOS
(in fact, there are heap_1.c to heap_5.c, a few options... try heap_4.c,
also). This is a much smarter memory alloc and dealloc routine that is
fairly cheap, much better than the usual brain-dead malloc.
********************************************************************************
I'd recommend using that. It looks for blocks same size, existing used etc
heap_4.c and, while I've never explicitly profiled it, I've never had a reason to
suspect the allocator is misbehaving. I commit 32KB to the heap and currently
never use more than about 3KB.
I would expect the same improvements on the F4 as the F7 using the CMSIS
library. The F7 is much faster on that sort of code.
I only got rid of the FFT malloc stuff the huge stack additions are
still in there
and you could save 50% there ...
Without knowing the details of this application (I'm new here), I am quite
impressed with the quality of CMSIS-DSP, particularly in terms of exploiting
the ARM extensions.
Cheers,
Dana
-glen
Post by Danilo Beuche
Hi Glen,
nice, would be interesting to see how much the STM32F4 gains by use of
CMSIS FFT routines.
BTW, I am not sure, but I think you mentioned the removal of malloc as
one of your changes. For us with the mcHF it would not be good to have
the memory for FreeDV code statically allocated since FreeDV is just one
operation mode of the mcHF, and we need the memory at other times for
other stuff, especially since it really eats a lot of memory (in
relation to the STM32F4 RAM sizes). Even half of it is still a lot.
Looking forward to gain some more free cycles with your work.
Regards,
Danilo
Post by glen english
Hi Danilo
yeah, you have plenty in hand.
OK so M7 and CMSIS FFT, about 2 x speed (same clock) 7.74mS (1200bps)
for decode.
Post by Danilo Beuche
H
measured 17.3ms per 40ms interval for the voice decode part only (this
is only happening once the modem is synced) and roughly 5ms of
fdmdv_demod per 20ms interval (happens all the time). Which gives us in
total some 27ms per 40ms once synced. This is about 68% load.
------------------------------------------------------------------------------
glen english
2016-09-18 03:25:07 UTC
Permalink
David
Have a look at

http://www.freertos.org/a00111.html

I use heap_2.c for most things... it works on the premise of repeated
requests and returns of things the same size.

heap_4.c is good, also.
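
One way to try a FreeRTOS heap without touching every call site is a thin allocation wrapper. This is a sketch in C under stated assumptions: USE_FREERTOS_HEAP, C2_MALLOC, C2_FREE, c2_alloc and c2_release are hypothetical names (codec2 itself calls malloc/free directly), while pvPortMalloc() and vPortFree() are the standard FreeRTOS heap calls provided by whichever heap_x.c is linked in.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_FREERTOS_HEAP             /* hypothetical build switch */
#include "FreeRTOS.h"                /* pvPortMalloc() / vPortFree() */
#define C2_MALLOC(sz) pvPortMalloc(sz)
#define C2_FREE(p)    vPortFree(p)
#else
#include <stdlib.h>                  /* host build: plain malloc/free */
#define C2_MALLOC(sz) malloc(sz)
#define C2_FREE(p)    free(p)
#endif

/* Hypothetical wrapper: allocate from the selected heap. */
static void *c2_alloc(size_t size)
{
    assert(size > 0);
    return C2_MALLOC(size);
}

/* Hypothetical wrapper: return a block to the selected heap. */
static void c2_release(void *ptr)
{
    if (ptr != NULL)
        C2_FREE(ptr);
}
```

The allocator choice then becomes a build-time decision, and memory is still handed back when FreeDV is not active.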



------------------------------------------------------------------------------