The Path to Performance - Part 3
All right! Three blog posts in three weeks! Who knew I had the time.
Part 3 - The Master DMA Controller
Last week I talked about the importance of DMA memory buffer locality and putting the DMA memory buffers in SRAM in the same AHB matrix region as the DMA controller. However, SRAMs are limited in size. How can you move chunks of data from the SRAM buffer into a larger buffer in SDRAM? Certainly you can use the processor, but on the STM32H7 you have access to the Master DMA controller, which can do this for you without loading the CPU.
The Master DMA (MDMA) controller is a high performance DMA controller on the STM32H7 with A LOT more functionality for data movement than the standard DMA controllers on STM32 microcontrollers. In particular, it can be triggered by other DMA controllers when they finish moving data, allowing it to act like a mini-processor that's interrupted to memcpy() data from one buffer to another.
How we use it on the STM32H7
Like many pieces of hardware on the STM32H7, explaining what modules can do doesn't really give any insight on how to use them. So, instead, I'll explain how our camera driver achieves 100% image capture offload for the CPU using MDMA. Buckle up!
Line Capture using DMA
There's quite a bit of complexity in our new camera image capture driver. We've really pushed it to the max this last year. But it's pretty straightforward to explain how we used MDMA here. First, we use a DMA controller to receive lines of pixels coming from the camera. Lines are loaded into the same memory buffer at the same starting address over and over again.
As mentioned in previous blog posts, we have to follow all the memory alignment and data size rules for the DMA controllers here. So, the line buffer is 16-byte aligned and we move the image in 16-byte chunks to keep the DMA controller happy (each line only has to be 4-byte aligned, but the total image must be 16-byte aligned).
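As a rough illustration of those rules (not the actual driver code), the checks can be written as a small helper. The names IMAGE_ALIGN, LINE_ALIGN, and capture_geometry_ok are made up for this sketch, and it assumes that "the total image must be 16-byte aligned" means the total image size is a multiple of 16 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants mirroring the rules described in the text. */
#define IMAGE_ALIGN 16u /* buffer address and total image size: multiples of 16 */
#define LINE_ALIGN   4u /* each line: a multiple of 4 bytes */

/* True when value is a multiple of align (align must be a power of two). */
static bool is_multiple_of(size_t value, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value & (align - 1)) == 0;
}

/* Check a capture geometry against the alignment rules (assumed reading). */
static bool capture_geometry_ok(uintptr_t buffer_addr, size_t line_bytes, size_t lines)
{
    return is_multiple_of((size_t)buffer_addr, IMAGE_ALIGN)
        && is_multiple_of(line_bytes, LINE_ALIGN)
        && is_multiple_of(line_bytes * lines, IMAGE_ALIGN);
}
```

For example, a 640-byte line repeated 480 times passes all three checks, while three lines of 644 bytes fail because the 1932-byte total is not a multiple of 16.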
That said, to enable architecturally efficient cropping when you want to crop more than 4 bytes' worth of pixels per line, we tell the DCMI hardware to drop the first 1-3 bytes of each line so that the first pixel of the cropped image lands on a 32-bit boundary. Once this is done, we can crop the image just by changing the starting address that MDMA grabs pixels from, while still maintaining 4-byte alignment, which is critical for keeping performance up on the 32-bit AHB bus.
The DCMI hardware also takes care of vertically cropping the image by dropping lines before a starting line and after an ending line. So, along with the above trick, image cropping is fully offloaded to the hardware.
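One way to picture the horizontal crop split described above is the arithmetic below. This is a sketch under my own assumptions, not the driver's code: crop_split_t and split_horizontal_crop are hypothetical names, and the split simply hands the sub-word remainder to DCMI and the word-aligned rest to MDMA.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result: how a horizontal crop is shared between DCMI and MDMA. */
typedef struct {
    uint32_t dcmi_skip_bytes; /* 0-3 bytes DCMI drops at the start of each line */
    uint32_t mdma_src_offset; /* 4-byte aligned read offset into the line buffer */
} crop_split_t;

static crop_split_t split_horizontal_crop(uint32_t crop_x_pixels, uint32_t bytes_per_pixel)
{
    uint32_t crop_bytes = crop_x_pixels * bytes_per_pixel;
    crop_split_t s;

    /* DCMI absorbs the part of the crop that is not a whole 32-bit word. */
    s.dcmi_skip_bytes = crop_bytes % 4u;

    /* After DCMI drops those bytes, the first wanted pixel sits here. */
    s.mdma_src_offset = crop_bytes - s.dcmi_skip_bytes;

    assert(s.mdma_src_offset % 4u == 0);
    return s;
}
```

With RGB565 (2 bytes per pixel) and a crop starting at pixel 5, the crop begins 10 bytes into the line, so DCMI would drop 2 bytes and MDMA would start reading at offset 8.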
MDMA to Frame Buffer
Next, MDMA moves each line of the image to the frame buffer. As mentioned above, to handle cropping we simply program MDMA with the number of lines it needs to move, the size of each line, and a starting address offset into the line buffer. It then takes care of the rest, triggering each time DMA2 completes a line transfer of the image. Once the MDMA controller is done, it generates a transfer complete interrupt to let us know it's finished writing the image!
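To make the data flow concrete, here is a software stand-in for what the hardware does per line. It is only a model: line_mover_t and line_mover_on_line_complete are hypothetical names, and memcpy() plays the role of the MDMA transfer that the real hardware performs without the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one MDMA job, as programmed per frame. */
typedef struct {
    const uint8_t *line_buffer; /* SRAM buffer the line DMA keeps refilling */
    size_t src_offset;          /* crop offset into the line buffer */
    size_t line_bytes;          /* bytes to move per line */
    size_t line_count;          /* lines to move per frame */
    uint8_t *frame_buffer;      /* destination, e.g. in SDRAM */
    size_t lines_done;          /* progress counter */
} line_mover_t;

/* Stand-in for the hardware trigger: call once per completed line DMA.
 * Returns 1 when the whole frame has been moved (the "transfer complete"
 * point), otherwise 0. */
static int line_mover_on_line_complete(line_mover_t *m)
{
    assert(m->lines_done < m->line_count);
    memcpy(m->frame_buffer + m->lines_done * m->line_bytes,
           m->line_buffer + m->src_offset,
           m->line_bytes);
    m->lines_done++;
    return m->lines_done == m->line_count;
}
```

The source address never changes between lines because the line DMA always writes to the same place; only the destination advances through the frame buffer.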
Now, the real magic with MDMA is in its memcpy() features. Image data isn't directly usable all the time. In particular, some cameras send us byte-reversed RGB565 pixels that the processor would normally have to un-reverse. But the STM32H7 designers foresaw this issue and gave the MDMA controller the ability to byte-reverse, half-word-reverse, and word-reverse the data it's moving!
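For reference, this is the CPU work that the byte-reverse feature saves. The function names are hypothetical; the sketch just shows the per-pixel swap the processor would otherwise run over every RGB565 pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of one 16-bit value. */
static uint16_t swap_bytes16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Software equivalent of copying RGB565 pixels with byte reversal enabled. */
static void rgb565_unreverse(uint16_t *dst, const uint16_t *src, size_t pixel_count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < pixel_count; i++) {
        dst[i] = swap_bytes16(src[i]);
    }
}
```

A pixel that arrives as 0x00F8 comes out as 0xF800, which is pure red in RGB565.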
Next, sometimes we have to extract the Y channel from YUV422 images to get a grayscale image out of the camera. This takes quite a bit of processor bandwidth as it can't be done very efficiently. But MDMA can do this too! It supports flexible source and destination data sizes and address increments, allowing us to program it to grab one byte every two bytes to extract the Y channel from YUV422 images (YUV422 images are organized in a repeating YUYV byte pattern).
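In software terms, the transfer described above is a strided copy: read one byte, skip one byte. The sketch below uses a hypothetical function name and only models the effect of the increment settings, not how MDMA is actually programmed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy every other byte of a YUYV stream: source advances by 2, destination by 1. */
static void yuyv_extract_y(uint8_t *dst, const uint8_t *src, size_t pixel_count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < pixel_count; i++) {
        dst[i] = src[2 * i]; /* even bytes are Y, odd bytes alternate U and V */
    }
}
```

Four pixels arrive as 8 bytes (Y0 U Y1 V Y2 U Y3 V) and leave as 4 grayscale bytes (Y0 Y1 Y2 Y3).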
Finally, the best part of MDMA is how much smarter it is than the regular STM32 DMA controllers. Based on the line byte offset and width, we pick the best source/destination data/increment/burst sizes to move data over the system buses as efficiently as possible.
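The exact selection logic isn't spelled out here, so the following is only a guess at the general idea: pick the widest transfer unit that both the offset and the line width allow. The function name is hypothetical, it assumes transfer units of 1, 2, 4, or 8 bytes, and it leaves out burst size selection entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Return the widest transfer size in bytes (8, 4, 2, or 1) that evenly
 * divides both the source offset and the line width. */
static uint32_t pick_transfer_size(uint32_t src_offset, uint32_t line_bytes)
{
    assert(line_bytes > 0);
    uint32_t size = 8u;
    /* For a power-of-two size, (a | b) % size == 0 only if both a and b
     * are multiples of size. */
    while (size > 1u && ((src_offset | line_bytes) % size) != 0u) {
        size /= 2u;
    }
    return size;
}
```

A 640-byte line at offset 0 can move in 8-byte units, the same line at offset 4 drops to 4-byte units, and an odd offset forces byte transfers.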
Wrapping it all up
After the image has been fully transferred we enqueue it into our flexible frame buffer architecture (using pointers). The processor has to do this part. Then, when a new image is received by the DCMI hardware, we start the process all over again to receive the next frame. All this happens in the background while we're running your code. In triple buffer mode (which is the default on the OpenMV Cam H7 Plus, the OpenMV PureThermal, and the Arduino Portenta H7) we're able to constantly receive images in the background and store them to SDRAM with effectively ZERO processor overhead. Then, when you call snapshot(), you're just setting the frame buffer to point to the latest frame that was captured, so you always get the most recent image (the only other work is invalidating the cache over the image, which sits in a 32-byte aligned buffer whose size is a multiple of 32 bytes).
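A toy model of the pointer juggling in triple buffer mode might look like the sketch below. This is not the OpenMV frame buffer code: triple_buffer_t, tb_frame_done, tb_snapshot, and cache_aligned_size are hypothetical names, and on real hardware the snapshot path would also invalidate the data cache over the returned buffer (for example with a CMSIS cache maintenance call).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical triple buffer bookkeeping; indices refer to buffers[]. */
typedef struct {
    uint8_t *buffers[3];
    int capturing; /* buffer the hardware is currently filling */
    int latest;    /* most recent complete frame, -1 if none */
    int in_use;    /* frame handed to user code, -1 if none */
} triple_buffer_t;

/* Called when a frame finishes (the transfer complete interrupt). */
static void tb_frame_done(triple_buffer_t *tb)
{
    tb->latest = tb->capturing;
    /* Capture next into the one buffer that is neither newest nor in use. */
    for (int i = 0; i < 3; i++) {
        if (i != tb->latest && i != tb->in_use) {
            tb->capturing = i;
            break;
        }
    }
}

/* Hand the newest complete frame to the caller, or NULL if none is ready. */
static uint8_t *tb_snapshot(triple_buffer_t *tb)
{
    if (tb->latest < 0) {
        return NULL;
    }
    tb->in_use = tb->latest;
    tb->latest = -1;
    return tb->buffers[tb->in_use];
}

/* Round a size up to a whole number of 32-byte cache lines. */
static size_t cache_aligned_size(size_t bytes)
{
    return (bytes + 31u) & ~(size_t)31u;
}
```

With three buffers there is always a free one to capture into, so the hardware never has to wait for user code to release a frame.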
Anyway, thanks for reading! That's all folks!
(What about the OpenMV PureThermal? I don't have any new updates about it this week. Please back it on GroupGets though! The new high performance camera driver architecture is made possible by companies like GroupGets investing in OpenMV. Support us and GroupGets by backing the OpenMV PureThermal).