
Designer’s Journey: Navigating The Transition To Versal ACAP

Mercury Systems’ close collaboration with Xilinx during ACAP’s development allowed it to bring a deployable ACAP product line to market early – the new SCFE6931 Dual Versal AI Core FPGA Processing Board.

Ever-increasing data volumes, rising computation demands and real-time performance expectations can no longer be satisfied by traditional solutions.

With the introduction of ACAP (Adaptive Compute Acceleration Platform) technology, Xilinx has enabled an innovative approach for the next era of specialized computing. This highly dense, next-generation chip solution combines multiple types of processing elements to form a whole new category of dramatically faster devices that step beyond the current CPU/GPU/FPGA paradigm. Utilizing this technology, we can now solve the most advanced radar, cognitive EW and AI challenges – all on a single board.


  • 43x faster than today’s fastest CPUs.
  • 3x faster than today’s fastest GPUs.
  • Up to 20x faster than today’s fastest FPGAs.

Today’s processing challenges typically fall into one of three categories, all of which can be addressed by the Versal ACAP.

Processing challenges addressed by Versal ACAP.

Mercury Systems has benefited greatly from close collaboration with Xilinx during the ACAP’s development, allowing them to bring a deployable ACAP product line to the market early – the new SCFE6931 Dual Versal AI Core FPGA Processing Board.

This white paper follows a Mercury Systems design engineering team’s journey toward ACAP development methodologies. By starting simply, our team was able to better understand the tools and technology behind the ACAP architecture before taking on more complex implementations.

The following engineer-to-engineer designer’s journey is intended to assist other development teams as they adopt ACAP design.

SCFE6931 Dual Versal AI Core FPGA Processing Board.


Adopting the Versal ACAP seemed challenging for our team at first. Because our background was primarily in traditional FPGA and DSP development, the idea of programming AI Engine (AIE) processors using high-level languages was unfamiliar to us. In addition, we did not yet understand the available methods of defining the dataflow into and out of the AIE array. To dispel our worries, we decided to start small and build up our experience with the AIE.

At a high level, the AIE array is similar to a GPU in that it consists of hundreds of vector processors. Each AIE processor can perform up to eight complex multiplications per cycle and has its own memory scratch pad for temporary storage of work. Data inputs and outputs are AXI4-Streams and can flow from the programmable logic into multiple AIE processors before being output from the AIE array. The functions executed by the AIEs are called kernels, and a single AIE can share its time between different kernels.


Jump-start development with the Model 8258 low-cost 6U VPX platform to build, run and debug applications on the SCFE6931 Dual Versal ACAP processing module. Providing power and cooling to match the SCFE6931 in a small desktop footprint, the chassis allows access to all required front-panel interfaces and the optional rear-panel connectors to support 100 GigE. Mercury’s Navigator® FPGA design kit (FDK) and board support package (BSP) complete the preconfigured development platform.

Model 8258 6U VPX platform for Dual Versal ACAP

AIE Array Diagram

Each AIE processor has two physical stream inputs and outputs; however, these interfaces can be multiplexed to accommodate a higher number of “virtual streams.”


We selected a problem to solve using AIE and created a small test application consisting of a single kernel. This kernel would perform a common DSP function: beamforming.

We began by studying the AIE architecture manual before coding the test beamformer kernel in C++. This kernel would take in multiple AXI4-Streams for element data and weights, producing an output stream of a single complex beam.

For the first design, we settled on two input streams of interleaved element samples, with another input stream for weights. These streams were continuously read into double-buffered memory within the AIE. The initial C++ code for the kernel function looked like this:

ACAP code diagram of input streams.

We initially chose the input data width to be 64 receive elements, as this represents a common beamforming application. However, we soon discovered that routing 64 streams to a single AIE was not feasible.

As you will see in the next section, we overcame this obstacle by interleaving our element samples into two streams.
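To make the two-stream idea concrete, here is a minimal sketch in plain C++ (not the AIE kernel API). It assumes that even-numbered elements travel on one stream and odd-numbered elements on the other; the actual ordering used in the design is not stated, and the names `interleave_snapshot` and `deinterleave_snapshot` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Split one snapshot (one sample per receive element) across two streams.
// Assumed layout: even-numbered elements on stream 0, odd-numbered on stream 1.
std::array<std::vector<cfloat>, 2> interleave_snapshot(const std::vector<cfloat>& snapshot) {
    assert(snapshot.size() % 2 == 0);
    std::array<std::vector<cfloat>, 2> streams;
    for (std::size_t i = 0; i < snapshot.size(); ++i) {
        streams[i % 2].push_back(snapshot[i]);
    }
    return streams;
}

// Kernel-side view: rebuild the snapshot by reading the two streams alternately.
std::vector<cfloat> deinterleave_snapshot(const std::array<std::vector<cfloat>, 2>& streams) {
    assert(streams[0].size() == streams[1].size());
    std::vector<cfloat> snapshot;
    snapshot.reserve(2 * streams[0].size());
    for (std::size_t k = 0; k < streams[0].size(); ++k) {
        snapshot.push_back(streams[0][k]);
        snapshot.push_back(streams[1][k]);
    }
    return snapshot;
}
```

Under this assumed layout, a 64-element snapshot becomes 32 samples on each stream, so the kernel needs only two element inputs instead of 64.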


Beamforming – also referred to as spatial filtering – is a signal processing technique used in sensor arrays for directional signal transmission or reception. This is achieved by combining elements in an antenna array in such a way that signals at particular angles experience constructive interference while others experience destructive interference.
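As an illustration, the combining step can be written as a weighted sum over the array elements. The sketch below is plain C++ rather than AIE kernel code, and it assumes the weights already include any conjugation or phase steering; `form_beam` is a hypothetical name.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Form one beam sample from one snapshot of array data:
// the sum over all elements of weight[i] * element[i].
cfloat form_beam(const std::vector<cfloat>& elements,
                 const std::vector<cfloat>& weights) {
    assert(elements.size() == weights.size());
    cfloat beam{0.0f, 0.0f};
    for (std::size_t i = 0; i < elements.size(); ++i) {
        beam += weights[i] * elements[i];
    }
    return beam;
}
```

When a weight cancels the phase of its element, the products line up and add constructively; signals arriving from other angles carry different phases and tend to cancel.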

beamforming diagram.


Each AIE application consists of a dataflow graph that describes a set of kernels and their associated inputs, outputs and interconnections. For standalone simulation of an AIE graph, these input and output ports reference test vector files. We began testing our first application by generating test element data and weights with MATLAB®. These test vectors represented the input data that would normally flow from the programmable logic fabric.

Initial Design: Simple Beamformer Kernel Stream Ports Diagram

Beamformer Kernel Stream Debug Diagram.


After simulating our AIE kernel, we used the Vitis™ Analyzer tool to display the trace data generated. This timeline display allowed us to see the activity of each AIE in the array and how effectively it was being utilized.

As shown in the Vitis Analyzer screen below, our first kernel spent a substantial amount of its time idle. This is because the AIE was able to compute the output beam faster than the I/O throughput rate.

Vitis Analyzer Tool.


To make better use of the processor’s time, we experimented with increasing the number of beam outputs to discover how throughput would be affected.

Improved Design: Simple Beamformer Kernel Stream Ports Diagram


As shown below, the single AIE processor is now utilized more than twice as heavily. Reusing the element data with more sets of weights to produce more beams greatly increased our efficiency. However, it also reduced the input throughput, because the kernel was now compute bound rather than I/O bound.

AIE processor doubled processing time.

Charting the throughput for designs with different numbers of beam outputs illustrates the trade-offs that should be considered when designing applications.
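The shape of this trade-off can be reproduced with a toy timing model. The sketch below is an assumption-laden simplification, not a measurement: it supposes that input transfer and computation overlap through double buffering, so the slower of the two sets the snapshot period. The function name `model_timing` and all cycle counts are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

struct KernelTiming {
    double period_cycles;  // cycles between successive input snapshots
    double utilization;    // fraction of each period spent computing
};

// Toy model: a snapshot takes io_cycles to stream in and
// beams * cycles_per_beam to process. With overlapped I/O and compute,
// the larger of the two determines how often a new snapshot is accepted.
KernelTiming model_timing(double io_cycles, double cycles_per_beam, int beams) {
    assert(io_cycles > 0.0 && cycles_per_beam > 0.0 && beams > 0);
    const double compute_cycles = cycles_per_beam * beams;
    const double period = std::max(io_cycles, compute_cycles);
    return {period, compute_cycles / period};
}
```

With hypothetical figures of 32 input cycles per snapshot and 8 compute cycles per beam, one beam leaves the processor 25% utilized, four beams balance I/O and compute, and eight beams keep the processor fully busy but double the snapshot period, halving the input rate.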

AIE processor input rate.
AIE processor output rate.


At this point in the design, the beamforming kernel received its weights from an input AXI4-Stream. Since these weights did not need to be updated frequently, we found the opportunity to further improve the kernel by using run-time parameters (RTPs). RTPs can be single values or entire arrays that are passed from either the processing system (PS) or another kernel.

Using RTPs to store weights alongside the kernel removed the need to stream them from the programmable logic (PL), simplifying the design. This approach can improve design throughput by reducing the number of data streams contending for routing resources within the AIE array.
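The effect of this change can be sketched in plain C++ as a kernel object that keeps its weights as state and receives only element data per call. This is a stand-in for the concept, not the Vitis RTP API; the class name `RtpBeamformer` and its methods are hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cfloat = std::complex<float>;

// Plain C++ stand-in for a kernel whose weights are a run-time parameter.
class RtpBeamformer {
public:
    explicit RtpBeamformer(std::vector<cfloat> weights) : weights_(std::move(weights)) {}

    // Stands in for an infrequent parameter update from the PS or another kernel.
    void update_weights(std::vector<cfloat> weights) {
        assert(weights.size() == weights_.size());
        weights_ = std::move(weights);
    }

    // Per-snapshot work: only element data arrives on the streaming path.
    cfloat process(const std::vector<cfloat>& elements) const {
        assert(elements.size() == weights_.size());
        cfloat beam{0.0f, 0.0f};
        for (std::size_t i = 0; i < elements.size(); ++i) {
            beam += weights_[i] * elements[i];
        }
        return beam;
    }

private:
    std::vector<cfloat> weights_;
};
```

Because the weights change rarely, moving them off the streaming path frees an input stream without altering the per-snapshot arithmetic.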

Further Improvement: Single Kernel While Using Run-Time Parameters

Single kernel beamformer code.


So far, we have explored several example AIE kernels. But what about larger applications? To effectively use the AIE array, designers must consider how to divide their application into multiple kernels that work together.

To demonstrate this, we created a graph with 16 kernels in which each kernel processes a subset of the input elements. The intermediate results are passed to the next kernel in the AIE array through a cascade path. The last kernel finishes the calculations and outputs the data to the FPGA fabric.
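The partial-sum chaining described above can be sketched in plain C++ as follows. This models only the arithmetic, not the AIE cascade interface, and it assumes the elements divide evenly among the stages; `cascade_stage` and `cascaded_beam` are hypothetical names.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// One stage of the chain: add this stage's slice of the weighted sum
// to the partial result handed over from the previous stage.
cfloat cascade_stage(const std::vector<cfloat>& elements,
                     const std::vector<cfloat>& weights,
                     std::size_t first, std::size_t count,
                     cfloat partial_in) {
    assert(first + count <= elements.size());
    cfloat acc = partial_in;
    for (std::size_t i = first; i < first + count; ++i) {
        acc += weights[i] * elements[i];
    }
    return acc;
}

// Chain `stages` stages; the value leaving the last stage is the finished beam.
cfloat cascaded_beam(const std::vector<cfloat>& elements,
                     const std::vector<cfloat>& weights,
                     std::size_t stages) {
    assert(elements.size() == weights.size());
    assert(stages > 0 && elements.size() % stages == 0);
    const std::size_t per_stage = elements.size() / stages;
    cfloat partial{0.0f, 0.0f};
    for (std::size_t s = 0; s < stages; ++s) {
        partial = cascade_stage(elements, weights, s * per_stage, per_stage, partial);
    }
    return partial;
}
```

With 64 elements and 16 stages, each stage handles four elements, and only a single running sum moves from one stage to the next.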

Multi-Kernel Graph


For the most demanding applications, designers should consider how to structure graphs so they can scale efficiently across many AIEs. The physical location of kernels and I/O interfaces is also important. A good starting point is to map the dataflow of the application, as this will guide the other aspects of the AIE design.

Input data should flow directly upward from the logic fabric through the AIE array. This is because the AIE array’s AXI4-Stream interconnect is non-symmetrical, with more paths traveling north than any other direction (see the AIE Array Diagram).

If one of the input streams is broadcast to many kernels, it will occupy more routing as it branches out to each of the destinations.

Within the application, designers should take advantage of the cascade path to forward data between kernels when possible. To transfer low-bandwidth data, designers should consider using RTPs, which can be passed between kernels as well as from the processing system. These techniques will reduce the total number of data streams and make the application more flexible and easier to implement.


Necessity is most assuredly the mother of invention. Today’s exploding data volumes, combined with the increasing need for energy efficiency, require a new generation of processing solutions. The Xilinx Versal ACAP meets those demands. Now a single, hardened, heterogeneous silicon chip provides the computational performance of multiple devices while using much less energy.

The landscape has changed and the journey has just begun toward more complex, secure and purpose-built solutions and systems for the next generation in aerospace and defense capabilities.


The authors would like to recognize the valuable contributions and support given by Kok Lee, Berk Adanur, and Don Stickels.

About Mercury Systems

Mercury Systems (Nasdaq: MRCY) is a leading technology company serving the aerospace and defense industry, positioned at the intersection of high tech and defense. Headquartered in Andover, MA, we deliver solutions that power a broad range of aerospace and defense programs, optimized for mission success in some of the most challenging and demanding environments. We envision, create and deliver innovative technology solutions purpose-built to meet our customers’ most-pressing high-tech needs.
