The Path to Performance - Part 1
Normally I'm more focused on adding new features than blogging. But we made a lot of changes under the hood to enable triple buffering for image capture, and I'd like to share! But first...
GroupGets updated the PureThermal OpenMV Cam campaign! There's a lot more information on the product available now. The complete feature list, schematic, board outline, and more are posted! We put a lot of love and work into the product and would love it if you could back it! It's the most feature-packed OpenMV Cam to date. Please watch the video of me explaining it below:
Right now the price is a little high, but once the chip shortage eases (no joke, parts are more than 2x more expensive right now) and production ramps up, we hope to drive the price of this feature-packed system down. Please back the PureThermal OpenMV today!
Now, time for a deep dive on the technical topic for today.
Part 1 - Memory Alignment
When we first started developing the OpenMV Cam we were working on the STM32F4 architecture. Like many other microcontrollers, it features the Cortex-M4 at the heart of the system. The Cortex-M4 is a straightforward processor which can read/write 8/16/32 bits at a time without side effects, making coding with it easy. What you program is what you get. So much so that we developed most of our original code with the assumption that we just needed to maintain 4-byte alignment when allocating memory - or maybe 8-byte alignment to support 64-bit values.
Enter DMA (Direct Memory Access)
DMA Controllers are tricky beasts. Until recently, our firmware avoided using them. If you've been programming microcontroller firmware lately you've probably avoided using them too. They are overkill for most applications - the processor can generally do all that you need to do. But if you've been avoiding them like we were, then you've been leaving a massive amount of performance on the table. Using a DMA controller in your application is not straightforward. There's a big challenge you need to solve first, and it will keep tripping you up until you do - memory alignment.
On the STM32 line of microcontrollers, DMA Controllers have 16-byte deep FIFOs that can hold four 32-bit values, eight 16-bit values, or 16 bytes. The DMA controllers work by filling their FIFOs with data coming from a peripheral like the camera interface, 32 bits at a time, before flushing their internal FIFO to memory. Now, the DMA Controllers are not sophisticated. They work at the system bus level of the hardware. Meaning, they will not automagically abstract away complexity like the Cortex-M4 processor does to make your life easier. In particular, there are two rules you must follow when using the STM32 DMA Controllers:
- The AHB bus allows burst transactions of 4, 8, or 16 beats (these are the most efficient types of transactions, as an address arbitration per element can cut bus bandwidth in half or more). This maps directly onto the four 32-bit values, eight 16-bit values, or 16 bytes that the DMA Controller's FIFO can hold. So, to get the best performance you're going to want to make your data buffer size a multiple of the FIFO size... which is 16 bytes.
- The AHB bus wraps all burst transactions at 1KB boundaries. To prevent this, we must ensure that a burst transaction never crosses a 1KB boundary. Luckily this is pretty simple since we are always transferring 16 bytes at a time... so we just need to ensure that our memory buffer is 16-byte address aligned. Since 1024 is a multiple of 16, a 16-byte burst that starts on a 16-byte aligned address always ends before the next 1KB boundary.
If you follow the two rules above, then DMA is easy to use on the STM32. It will work as expected without much fuss. But... this is easier said than done. If you've been developing lots of code without respect for these two rules, then you're going to be in for a lot of work when you try to turn DMA on, like we were.
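To make the two rules concrete, here is a minimal host-side sketch in C that checks a buffer against them. The function names and the example addresses are made up for illustration and are not taken from the OpenMV firmware. It only models the arithmetic, assuming the DMA controller always issues full 16-byte bursts starting at the buffer address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BURST_BYTES   16u    /* one full FIFO flush: four 32-bit beats */
#define AHB_BOUNDARY  1024u  /* a burst must not cross a 1KB boundary */

/* The alignment trick only works because 1024 is a multiple of 16. */
static_assert(AHB_BOUNDARY % BURST_BYTES == 0u, "burst size must divide 1KB");

/* True if a 16-byte burst starting at addr stays inside one 1KB block. */
bool burst_stays_in_block(uint32_t addr)
{
    uint32_t first = addr / AHB_BOUNDARY;
    uint32_t last  = (addr + BURST_BYTES - 1u) / AHB_BOUNDARY;
    return first == last;
}

/* Rule check: address and size are both multiples of 16 bytes. */
bool dma_buffer_ok(uint32_t addr, uint32_t size)
{
    return (addr % BURST_BYTES == 0u) && (size % BURST_BYTES == 0u);
}

/* Count the bursts in [addr, addr + size) that would cross a 1KB boundary. */
unsigned count_crossing_bursts(uint32_t addr, uint32_t size)
{
    unsigned crossings = 0;
    for (uint32_t off = 0; off + BURST_BYTES <= size; off += BURST_BYTES) {
        if (!burst_stays_in_block(addr + off)) {
            crossings++;
        }
    }
    return crossings;
}
```

With a 4KB buffer at the (hypothetical) address 0x20000000, `dma_buffer_ok` passes and no burst crosses a boundary. Shift the same buffer by just 4 bytes and one burst per 1KB block straddles the boundary.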
Enter the Cortex-M7 and the Cache
But the OpenMV Cam M7/H7 are powered by the Cortex-M7, which features a cache! The cache automagically makes your code run a lot faster - but using it with DMA is challenging. While it hides a lot of system complexity from you, it does not play nicely with DMA hardware.
The cache on the Cortex-M7 works by reading/writing cache lines, which are 32 bytes in size. Note that it can only read/write whole cache lines. So, anytime it reads/writes main memory, it will always be a 32-byte chunk aligned to a 32-byte address.
Additionally, as a cache, by definition it only reads main memory when something is not already in the cache, and it only writes to main memory when it has to flush a line (or lines) from the cache. So, DMA updates to main memory are invisible to the processor unless you invalidate the cache lines covering the memory buffer DMA is writing to, forcing the cache to read the updated memory. Similarly, processor writes will be invisible to DMA unless you flush (clean, in ARM's terminology) the cache lines covering the memory buffer DMA is about to read. While more complex microprocessors have cache coherency built into the hardware to handle this for you, the Cortex-M7 does not, so you must deal with it yourself.
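Because the cache can only operate on whole lines, any invalidate or flush you issue really applies to the 32-byte lines that cover your buffer. The sketch below is a host-side illustration of that rounding. `cache_range_cover` and `cache_range_is_exact` are hypothetical helper names, not part of any library. The CMSIS-Core calls are only mentioned in a comment, and their exact pointer type differs between CMSIS versions, so check your `core_cm7.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* Cortex-M7 cache line size in bytes */

typedef struct {
    uintptr_t start;  /* buffer address rounded down to a cache line */
    size_t    size;   /* length rounded up to whole cache lines */
} cache_range_t;

/* Expand [addr, addr + len) to the whole cache lines that cover it. */
cache_range_t cache_range_cover(uintptr_t addr, size_t len)
{
    assert(len > 0u);  /* maintenance on an empty range is meaningless */
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end   = (addr + len + CACHE_LINE - 1u) & mask;
    cache_range_t r = { start, (size_t)(end - start) };
    return r;
}

/* True if cache maintenance on this buffer touches no bytes outside it. */
int cache_range_is_exact(uintptr_t addr, size_t len)
{
    cache_range_t r = cache_range_cover(addr, len);
    return r.start == addr && r.size == len;
}

/*
 * On real hardware the covered range would be passed to the CMSIS-Core
 * helpers from core_cm7.h, roughly like this:
 *
 *   Before the CPU reads a buffer that DMA has written:
 *       SCB_InvalidateDCache_by_Addr((void *)r.start, (int32_t)r.size);
 *
 *   Before DMA reads a buffer that the CPU has written:
 *       SCB_CleanDCache_by_Addr((void *)r.start, (int32_t)r.size);
 */
```

A 40-byte buffer at the (made-up) address 0x20000004 ends up covered by 64 bytes starting at 0x20000000. Invalidating that range also throws away any cached processor writes to the 24 neighboring bytes, which is exactly the kind of bug you do not want to chase.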
Anyway, given the cache line rule, we must again extend our memory allocation requirements: memory buffers must be multiples of 32 bytes in size and 32-byte address aligned. Since 32 is a multiple of 16, this also satisfies the two DMA rules from before. If you follow this rule then working with the Cortex-M7 and DMA is a breeze. Things will just work!
And... if you don't, you will experience some of the most challenging bugs in your code, created by race conditions between the cache and the DMA Controllers.
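For what it's worth, here is one way the 32-byte rule could look in plain C11. This is a sketch using the standard `aligned_alloc` and `_Alignas`, not how the OpenMV firmware allocates its frame buffers. `dma_round_up`, `dma_alloc`, and `line_buffer` are names invented for the example, and whether your embedded C library provides `aligned_alloc` is something you'd need to check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* 32 bytes satisfies both the 16-byte burst rule and the cache line rule. */
#define DMA_ALIGN 32u

/* Round n up to the next multiple of DMA_ALIGN. */
size_t dma_round_up(size_t n)
{
    return (n + (size_t)DMA_ALIGN - 1u) & ~((size_t)DMA_ALIGN - 1u);
}

/*
 * Allocate a buffer whose address and size are both multiples of DMA_ALIGN.
 * The padded size is stored in *actual_size when that pointer is not NULL.
 * Returns NULL on failure or when n is 0. Release the buffer with free().
 */
void *dma_alloc(size_t n, size_t *actual_size)
{
    if (n == 0u) {
        return NULL;
    }
    size_t size = dma_round_up(n);
    /* C11 aligned_alloc wants the size to be a multiple of the alignment. */
    void *p = aligned_alloc(DMA_ALIGN, size);
    if (p == NULL) {
        return NULL;
    }
    assert(((uintptr_t)p % DMA_ALIGN) == 0u);
    if (actual_size != NULL) {
        *actual_size = size;
    }
    return p;
}

/* Statically allocated alternative: aligned by the compiler, sized in whole lines. */
_Alignas(DMA_ALIGN) uint8_t line_buffer[10u * DMA_ALIGN];
```

Asking `dma_alloc` for 100 bytes hands back a 128-byte buffer on a 32-byte boundary, so no cache line or DMA burst is ever shared with a neighboring allocation.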
Next Week - DMA Buffer Locality
Did you know the STM32H7 is a SoC (system-on-a-chip)? Next week we'll cover DMA buffer locality and its effect on performance.
Thanks for reading, that's all folks!