The Path to Performance - Part 2

Posted by Kwabena Agyeman on May 30, 2021

Hi everyone,

Time for the next blog post. I'm going to try to post an update for this series every week. But first:

PureThermal OpenMV Price drop!

The PureThermal OpenMV is now $259.99! We managed to shave $30 off the BOM looking for cost savings (we wanted to get it to $249.99 but couldn't get it that low thanks to the current chip market).

We're going to be building 250 of these things for our first production run and go from there. Please back the campaign and lock in your spot now!

Yes, these blog posts are a vehicle for me to keep blasting the email list about the PureThermal OpenMV. But, I'm also writing genuinely useful content below. Maybe I'll have a demo video next week showing off something cool onboard.

Part 2 - DMA Buffer Locality

Where you put DMA buffers in RAM matters. It determines how data flows through the on-chip system buses and which resources are under load. For example, with the PureThermal OpenMV we're able to do:

Constant Image Capture with the OV5640 (80 MB/s)
Constant Display Buffer Update (50 MB/s)
Constant Display Update at 1280x720 @ 60 Hz (111 MB/s)
Constant SPI Bus Input of the FLIR Lepton 3.5 (2.5 MB/s)
Constant SPI Bus Output for a TV Shield (10 MB/s)
Constant WiFi Output (1.25 MB/s)
Constant USB Output (1.25 MB/s)
Constant SDIO Output (12.5 MB/s)

At the same time! When we first tried to do all this at once our code fell over. DMA FIFOs overflowed, things locked up, nothing worked. But, we found the answer when looking at the system bus architecture.
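
To put a number on the combined load, the rates in the list above can simply be added up. The sketch below does that in Python. The dictionary keys are illustrative labels, and the display check assumes a 16-bit frame buffer (2 bytes per pixel), which the post doesn't state but which lines up with the 111 MB/s figure.

```python
# Sustained data rates quoted in the list above, in MB/s (1 MB = 10**6 bytes).
# The keys are illustrative labels, not identifiers from the firmware.
STREAMS_MB_S = {
    "OV5640 image capture": 80.0,
    "display buffer update": 50.0,
    "display update (1280x720 @ 60 Hz)": 111.0,
    "FLIR Lepton 3.5 SPI input": 2.5,
    "TV Shield SPI output": 10.0,
    "WiFi output": 1.25,
    "USB output": 1.25,
    "SDIO output": 12.5,
}

# Combined load when every stream runs at once.
total_mb_s = sum(STREAMS_MB_S.values())


def display_scanout_mb_s(width, height, refresh_hz, bytes_per_pixel=2):
    """Bandwidth needed to scan out a frame buffer, in MB/s.

    bytes_per_pixel=2 is an assumption (e.g. RGB565), not stated in the post.
    """
    return width * height * refresh_hz * bytes_per_pixel / 1e6


print(f"total load: {total_mb_s} MB/s")
print(f"1280x720 @ 60 Hz: {display_scanout_mb_s(1280, 720, 60):.1f} MB/s")
```

That works out to 268.5 MB/s of sustained traffic across all eight streams, with the display update alone accounting for about 110.6 MB/s under the 2-bytes-per-pixel assumption.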

The STM32F4/STM32F7 System Bus

If you dig into the STM32F427 reference manual you'll find the below system bus architecture.

When we first started developing our firmware on the STM32F427 we didn't have to worry too much about the location of the DMA buffers in SRAM. The camera was slower, we weren't using SDRAM, and the processor was simpler. So, we made no effort to locate DMA buffers optimally.

Now, here's how to read the picture above. You'll notice there are three SRAM banks. This lets up to three masters read/write at the same time, as long as each one is using a different bank. The bus masters are the devices at the top of the matrix while the bus targets are the devices on the right. The bus matrix on the STM32F427 lets all masters read/write their targets simultaneously as long as no two masters are trying to share a target. Finally, the dots show which targets each master can access. For example, if you look carefully you'll notice that most bus masters can't access the AHB1/2 peripherals - just RAM.

Moving on, even on the STM32F765 ST kept the same type of architecture:

Like the STM32F427 system bus, there's one main matrix with three SRAMs available for use: two SRAMs on the main system bus matrix, plus the DTCM SRAM, which all bus masters can access via the AHB slave port on the ARM Cortex-M7 processor. Because of this, our firmware originally stayed roughly the same between the STM32F4 and STM32F7.

Enter the STM32H7 - A System-on-a-Chip

Now, the STM32H7 is quite a different chip from the STM32F4/F7. While the STM32F4/F7 chips look like very high performance microcontrollers, the STM32H7 is clearly a System-on-a-Chip:

It's got three system bus matrices, including a 64-bit AXI bus domain (there are three domains because each can be shut down to save power). What's AXI? Well, it's a split-transaction bus architecture that lets masters issue read/write requests in a way that minimizes locking of the bus. On an AHB bus, a master locks the bus exclusively for the time it takes to complete the read/write. If one master is doing a read that takes a long time to complete, another master cannot execute a write even though the bus is sitting idle waiting for the read response. With AXI you actually have five transaction channels between each master and target: write requests, write data, write responses, read requests, and read responses. This allows a target to receive multiple read/write requests at the same time, choose how to handle them for the best performance, and respond to the transactions without blocking.
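
The difference between a locked bus and a split-transaction bus can be illustrated with a toy timing model. This is only a sketch under simplifying assumptions, not a model of the real STM32H7 interconnect: each request and each response takes one clock, every read sees the same target latency, and in the split case the target is assumed to overlap the latencies of all outstanding reads perfectly (which real memories only approximate). The function names are hypothetical.

```python
REQUEST_CLOCKS = 1   # assumed cost of sending one read request
RESPONSE_CLOCKS = 1  # assumed cost of returning one read response


def locked_bus_clocks(num_reads, target_latency):
    """Clocks to finish num_reads when each read holds the bus from
    request until response (the AHB-style behaviour described above)."""
    per_read = REQUEST_CLOCKS + target_latency + RESPONSE_CLOCKS
    return num_reads * per_read


def split_bus_clocks(num_reads, target_latency):
    """Clocks to finish num_reads when requests are issued back to back
    and the target overlaps their latencies (idealised AXI-style)."""
    if num_reads == 0:
        return 0
    # The last request goes out after num_reads request slots; its
    # response arrives one latency plus one response slot later.
    return num_reads * REQUEST_CLOCKS + target_latency + RESPONSE_CLOCKS


for reads in (1, 4, 16):
    print(reads, locked_bus_clocks(reads, 100), split_bus_clocks(reads, 100))
```

With a single read both models agree. As the number of outstanding reads grows, the locked bus pays the full latency every time, while the split bus pays it roughly once.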

With AXI, you'd think you're not going to have a bandwidth problem on the STM32H7. It's running at 240 MHz with a 64-bit data bus for 1.92 GB/s of memory bandwidth. But you'd be wrong, because not all bus masters are in the AXI domain - some are still in the AHB domain.
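
The 1.92 GB/s figure is just clock rate times bus width. As a quick check, here is a minimal sketch that assumes one full-width transfer on every clock, which is a theoretical peak rather than something a real workload sustains. The function name is illustrative.

```python
def peak_bandwidth_bytes_per_s(clock_hz, bus_width_bits):
    """Theoretical peak bandwidth: one full-width transfer per clock."""
    return clock_hz * bus_width_bits // 8


axi_64bit = peak_bandwidth_bytes_per_s(240_000_000, 64)
bus_32bit = peak_bandwidth_bytes_per_s(240_000_000, 32)

print(f"64-bit bus at 240 MHz: {axi_64bit / 1e9:.2f} GB/s")
print(f"32-bit bus at 240 MHz: {bus_32bit / 1e6:.0f} MB/s")
```

A 32-bit bus at the same clock comes out at half that, 960 MB/s.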

The Choke Point

To link the AXI Bus domain to the AHB Bus domain ST chose to use AHB buses. There's the D1-to-D2 AHB Bus and the D2-to-D1 AHB Bus which allow bus masters to communicate across domains. These 32-bit buses run at 240 MHz for 960 MB/s of bandwidth. But, unlike AXI, AHB buses are locked when a master performs a read/write. For example, if DMA2 wants to read from SDRAM it must:

Win arbitration access to the D2-to-D1 AHB Bus.
Use the D2-to-D1 AHB Bus to send a transaction to the SDRAM.
Wait for the SDRAM to respond (might be a while - 100s of clocks).
Return the result over the D2-to-D1 AHB Bus.

And... during the time above no other bus master may use the D2-to-D1 AHB Bus. If you recall from the previous blog post, DMA engines on the STM32 line of microcontrollers only have 16 bytes of onboard FIFO space. If these FIFOs are constantly receiving data from a peripheral, they will overflow when reads/writes take a very long time to complete. So, if you try to write image data from a camera to SDRAM while using DMA to pull another frame from SDRAM to send over SPI, things will crash.
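
To see how little slack a 16-byte FIFO gives, here's a rough back-of-the-envelope sketch. It assumes the FIFO starts empty, the peripheral delivers data at a steady rate, nothing drains the FIFO while the bus is stalled, and the bus runs at the 240 MHz quoted earlier. The function names are illustrative, and real DMA burst behaviour will shift the exact numbers.

```python
FIFO_BYTES = 16             # DMA FIFO size quoted in the post
BUS_CLOCK_HZ = 240_000_000  # bus clock quoted in the post


def fifo_fill_time_ns(rate_bytes_per_s, fifo_bytes=FIFO_BYTES):
    """Time for a steady input stream to fill an empty FIFO, in ns."""
    return fifo_bytes * 1e9 / rate_bytes_per_s


def fifo_fill_clocks(rate_bytes_per_s, fifo_bytes=FIFO_BYTES,
                     clock_hz=BUS_CLOCK_HZ):
    """Same fill time expressed in bus clocks."""
    return fifo_bytes * clock_hz / rate_bytes_per_s


def overflows(stall_clocks, rate_bytes_per_s):
    """True if a bus stall of stall_clocks outlasts the FIFO."""
    return stall_clocks > fifo_fill_clocks(rate_bytes_per_s)


camera_rate = 80_000_000  # 80 MB/s camera capture
print(fifo_fill_time_ns(camera_rate), "ns")
print(fifo_fill_clocks(camera_rate), "clocks")
print(overflows(100, camera_rate))
```

At 80 MB/s the FIFO fills in 200 ns, or 48 bus clocks, so a stall of even 100 clocks on the shared bus is enough to lose camera data under these assumptions.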

The Solution - Use the Architecture Features!

Back to my original observation: the chip designers at ST put SRAM blocks in different domains. This is on purpose, to solve this exact problem. DMA1/2 are designed to target peripherals and SRAM1/2/3 while BDMA is designed to target SRAM4. By locating DMA buffers in their local SRAM banks you can significantly reduce system bus congestion, ensuring that the bandwidth you need is available.

So, the rule is simple. If you've got a real-time DMA transaction that you cannot back-pressure, locate its DMA buffer in the local SRAM near that DMA controller. Do this and things will just work.
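
One way to keep yourself honest about this rule is a small host-side check that takes buffer addresses (for example, from a linker map) and confirms each one sits in an SRAM local to its DMA controller. The sketch below is only an illustration: the address ranges and domain names assume the STM32H743 memory map and should be verified against the reference manual for your part, and all names in the code are hypothetical rather than taken from the OpenMV firmware.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str
    domain: str
    base: int
    size: int


# Assumed STM32H743 SRAM map - verify against your part's reference manual.
REGIONS = [
    Region("AXI SRAM", "D1", 0x2400_0000, 512 * 1024),
    Region("SRAM1", "D2", 0x3000_0000, 128 * 1024),
    Region("SRAM2", "D2", 0x3002_0000, 128 * 1024),
    Region("SRAM3", "D2", 0x3004_0000, 32 * 1024),
    Region("SRAM4", "D3", 0x3800_0000, 64 * 1024),
]

# Home domain of each DMA controller (DMA1/2 sit with SRAM1/2/3,
# BDMA sits with SRAM4, as described above).
DMA_DOMAIN = {"DMA1": "D2", "DMA2": "D2", "BDMA": "D3"}


def region_of(addr: int, length: int) -> Optional[Region]:
    """Return the SRAM region that fully contains the buffer, if any."""
    for region in REGIONS:
        if region.base <= addr and addr + length <= region.base + region.size:
            return region
    return None


def is_local(dma: str, addr: int, length: int) -> bool:
    """True if the buffer lies entirely in an SRAM in the DMA's own domain."""
    region = region_of(addr, length)
    return region is not None and region.domain == DMA_DOMAIN[dma]


print(is_local("DMA2", 0x3000_0000, 1024))  # buffer in SRAM1
print(is_local("DMA2", 0xC000_0000, 1024))  # buffer in external memory
print(is_local("BDMA", 0x3800_0000, 256))   # buffer in SRAM4
```

Buffers in external memory such as SDRAM don't match any of the listed regions, so they are reported as non-local, which is exactly the case the rule is meant to catch.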

Next Week - MDMA

There's another DMA controller on the STM32H7: the Master DMA (MDMA) controller. In the next blog post I'll explain how to use it.

Thanks for reading, that's all folks!
