Designer’s Journey: Navigating The Transition To Versal ACAP
Mercury Systems’ close collaboration with Xilinx during ACAP’s development allows the company to bring a deployable ACAP product line to market early – the new SCFE6931 Dual Versal AI Core FPGA Processing Board.
Ever-increasing data volumes, rising computation demands and real-time performance expectations can no longer be satisfied by traditional solutions.
With the introduction of ACAP (Adaptive Compute Acceleration Platform) technology, Xilinx has enabled an innovative approach for the next era of specialized computing. This highly dense, next-generation chip solution combines multiple types of processing elements to form a whole new category of dramatically faster devices that step beyond the current CPU/GPU/FPGA paradigm. Utilizing this technology, we can now solve the most advanced radar, cognitive EW and AI challenges – all on a single board.
ADAPTIVE COMPUTE ACCELERATION PLATFORM (ACAP) IS:
- 43x faster than today’s fastest CPUs.
- 3x faster than today’s fastest GPUs.
- Up to 20x faster than today’s fastest FPGAs.
Today’s processing challenges typically fall into one of three categories, all of which can be addressed by the Versal ACAP.
This white paper follows a Mercury Systems design engineering team’s journey toward ACAP development methodologies. By starting simply, our team was able to better understand the tools and technology behind the ACAP architecture before taking on more complex implementations.
The following engineer-to-engineer designer’s journey is intended to assist other development teams as they adopt ACAP design.
THE JOURNEY BEGINS
Adopting the Versal ACAP seemed challenging to our team at first. Because our primary background was in traditional FPGA and DSP development, the idea of programming AIE processors using high-level languages was unfamiliar to us. In addition, we did not yet understand the available methods of defining the dataflow into and out of the AIE array. To dispel our worries, we decided to start small and build up our experience with the AIE.
At a high level, the AIE array is similar to a GPU in that it consists of hundreds of vector processors. Each AIE processor can perform up to eight complex multiplications per cycle and has its own scratchpad memory for temporary storage of work in progress. Data inputs and outputs are AXI4-Streams and can flow from the programmable logic into multiple AIE processors before being output from the AIE array. The functions executed by the AIEs are called kernels, and a single AIE can share its time between different kernels.
THE POWER OF VERSAL ACAP TECHNOLOGY IN A READY-TO-RUN, PROVEN AND TESTED PLATFORM
Jump-start development with the Model 8258 low-cost 6U VPX platform to build, run and debug applications on the SCFE6931 Dual Versal ACAP processing module. Providing power and cooling to match the SCFE6931 in a small desktop footprint, the chassis allows access to all required front-panel interfaces and the optional rear-panel connectors to support 100 GigE. Mercury’s Navigator® FPGA design kit (FDK) and board support package (BSP) complete the preconfigured development platform.
AIE Array Diagram
LEARNING BY EXAMPLE
We selected a problem to solve using AIE and created a small test application consisting of a single kernel. This kernel would perform a common DSP function: beamforming.
We began by studying the AIE architecture manual before coding the test beamformer kernel in C++. This kernel would take in multiple AXI4-Streams for element data and weights, producing an output stream of a single complex beam.
For the first design, we settled on two input streams of interleaved element samples, plus a third input stream for weights. These streams were continuously read into double-buffered memory within the AIE, where the initial C++ kernel function processed them.
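The original kernel listing is not reproduced here, but a plain C++ sketch conveys the arithmetic. This is our illustration, not Mercury’s actual AIE code: a real kernel is written against the Vitis AIE stream APIs and vector intrinsics, and the interleaving convention (even-indexed elements on one stream, odd-indexed on the other) is an assumption.

```cpp
#include <complex>
#include <vector>

using cfloat = std::complex<float>;

// Scalar model of the first kernel: two streams carry interleaved
// element samples (even-indexed elements on stream A, odd-indexed on
// stream B), and a third stream carries the weights. One output beam
// sample is the weighted sum over all elements.
cfloat beamform_sample(const std::vector<cfloat>& streamA,  // elements 0, 2, 4, ...
                       const std::vector<cfloat>& streamB,  // elements 1, 3, 5, ...
                       const std::vector<cfloat>& weights)  // one weight per element
{
    cfloat acc{0.0f, 0.0f};
    for (std::size_t i = 0; i < streamA.size(); ++i) {
        acc += weights[2 * i]     * streamA[i];  // even-indexed element
        acc += weights[2 * i + 1] * streamB[i];  // odd-indexed element
    }
    return acc;
}
```

On the actual hardware this multiply-accumulate loop is vectorized, with up to eight complex multiplications retired per cycle.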
We initially chose an input of 64 receive elements, as this represents a common beamforming configuration. However, we soon discovered that routing 64 separate streams to a single AIE was not feasible.
As you will see in the next section, we overcame this obstacle by interleaving our element samples into two streams.
WHAT IS BEAMFORMING?
Beamforming – also referred to as spatial filtering – is a signal processing technique used in sensor arrays for directional signal transmission or reception. This is achieved by combining elements in an antenna array in such a way that signals at particular angles experience constructive interference while others experience destructive interference.
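The interference effect can be shown numerically. The following toy model (our illustration, not from the white paper; the array geometry and function name are assumptions) steers a uniform linear array with phase weights and evaluates the combined response.

```cpp
#include <cmath>
#include <complex>

// Toy model of an N-element uniform linear array with half-wavelength
// spacing. Element n receives a unit plane wave from signal_angle with
// phase n*pi*sin(signal_angle); the steering weight conjugates that
// phase for steer_angle. When the angles match, every term adds in
// phase (constructive interference); otherwise the rotating phases
// largely cancel (destructive interference).
double array_response(int n_elements, double signal_angle, double steer_angle)
{
    const double pi = std::acos(-1.0);
    const std::complex<double> j{0.0, 1.0};
    std::complex<double> acc{0.0, 0.0};
    for (int n = 0; n < n_elements; ++n) {
        std::complex<double> sample = std::exp(j * (pi * n * std::sin(signal_angle)));
        std::complex<double> weight = std::exp(-j * (pi * n * std::sin(steer_angle)));
        acc += weight * sample;  // weighted sum across the array
    }
    return std::abs(acc) / n_elements;  // normalized: 1.0 when perfectly steered
}
```

With 16 elements, steering at the signal’s angle yields a normalized response of 1.0, while an angle on the other side of boresight yields only a small residual.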
IMPLEMENTING THE DESIGN
Each AIE application consists of a dataflow graph that describes a set of kernels and their associated inputs, outputs and interconnections. For standalone simulation of an AIE graph, these input and output ports reference test vector files. We began testing our first application by generating test element data and weights with MATLAB®. These test vectors represented the input data that would normally flow from the programmable logic fabric.
Initial Design: Simple Beamformer Kernel Stream Ports Diagram
REVIEWING OUR INITIAL AIE DESIGN
After simulating our AIE kernel, we used the Vitis™ Analyzer tool to display the trace data generated. This timeline display allowed us to see the activity of each AIE in the array and how effectively it was being utilized.
As shown in the Vitis Analyzer screen below, our first kernel spent a substantial amount of its time idle. This is because the AIE was able to compute the output beam faster than the I/O throughput rate.
INCREASING OUR UTILIZATION
To make better use of the processor’s time, we experimented with increasing the number of beam outputs to discover how throughput would be affected.
Improved Design: Simple Beamformer Kernel Stream Ports Diagram
As shown below, the single AIE processor now spends more than twice as much of its time doing useful work. By reusing the element data with more sets of weights to produce more beams, we greatly increased our efficiency. However, this also meant that the input throughput was reduced because the kernel was now compute bound.
Charting the throughput for designs with different numbers of beam outputs illustrates the trade-offs that should be considered when designing applications.
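The reuse pattern behind this trade-off can be sketched in plain C++ (an illustrative scalar model, not the actual kernel; the `form_beams` shape is our own):

```cpp
#include <complex>
#include <vector>

using cfloat = std::complex<float>;

// Forming several beams from one set of element samples. The element
// data is read once and reused against each weight set, so the compute
// per input sample grows with the number of beams while the input
// bandwidth stays fixed -- more work per byte of I/O, until the kernel
// becomes compute bound.
std::vector<cfloat> form_beams(const std::vector<cfloat>& elements,
                               const std::vector<std::vector<cfloat>>& weight_sets)
{
    std::vector<cfloat> beams;
    beams.reserve(weight_sets.size());
    for (const auto& w : weight_sets) {   // one weight set per output beam
        cfloat acc{0.0f, 0.0f};
        for (std::size_t e = 0; e < elements.size(); ++e)
            acc += w[e] * elements[e];    // same element data reused for every beam
        beams.push_back(acc);
    }
    return beams;
}
```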
EXPERIMENTING WITH PARAMETERS
At this point in the design, the beamforming kernel received its weights from an input AXI4-Stream. Since these weights did not need to be updated frequently, we found the opportunity to further improve the kernel by using run-time parameters (RTPs). RTPs can be single values or entire arrays that are passed from either the processing system (PS) or another kernel.
Using RTPs to store weights alongside the kernel eliminated the need for them to be streamed from the programmable logic (PL), simplifying the design. This approach can improve design throughput by reducing the number of data streams contending for routing resources within the AIE array.
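The idea can be modeled in plain C++ (a sketch only; the real mechanism uses run-time parameter ports declared in the AIE graph, and this struct and its method names are our own):

```cpp
#include <complex>
#include <vector>

using cfloat = std::complex<float>;

// RTP-style weight handling: the weights live alongside the kernel and
// are updated occasionally by the host (or another kernel), rather than
// arriving on a per-sample input stream. Only element data now contends
// for stream routing in the AIE array.
struct BeamformKernel {
    std::vector<cfloat> weights;  // updated via an RTP-style call, not streamed

    void update_weights(const std::vector<cfloat>& w) { weights = w; }

    cfloat process(const std::vector<cfloat>& elements) const {
        cfloat acc{0.0f, 0.0f};
        for (std::size_t e = 0; e < elements.size(); ++e)
            acc += weights[e] * elements[e];
        return acc;
    }
};
```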
Further Improvement: Single Kernel While Using Run-Time Parameters
So far, we have explored several example AIE kernels. But what about larger applications? To effectively use the AIE array, designers must consider how to divide their application into multiple kernels that work together.
To demonstrate this, we created a graph with 16 kernels where each of the kernels computes part of the input elements. The intermediate results are passed to the next kernel in the AIE array through a cascade path. The last kernel finishes the calculations and outputs the data to the FPGA fabric.
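The cascade arrangement can be modeled as follows (a scalar sketch, not the deployed graph; the real design forwards partial results over the dedicated cascade ports between neighboring AIE tiles, and the 16-way split of 64 elements is an assumption based on the earlier configuration):

```cpp
#include <complex>
#include <vector>

using cfloat = std::complex<float>;

// Cascade pattern: each of n_kernels "kernels" handles a slice of the
// elements, computes a partial weighted sum, and adds it to the value
// arriving on the cascade from its neighbor. The last kernel's output
// is the finished beam, written to the FPGA fabric.
cfloat cascade_beamform(const std::vector<cfloat>& elements,
                        const std::vector<cfloat>& weights,
                        int n_kernels)
{
    const std::size_t per_kernel = elements.size() / n_kernels;
    cfloat cascade{0.0f, 0.0f};                    // first kernel starts from zero
    for (int k = 0; k < n_kernels; ++k) {          // one iteration per AIE kernel
        const std::size_t start = static_cast<std::size_t>(k) * per_kernel;
        cfloat partial{0.0f, 0.0f};
        for (std::size_t i = start; i < start + per_kernel; ++i)
            partial += weights[i] * elements[i];   // this kernel's slice
        cascade += partial;                        // forwarded on the cascade path
    }
    return cascade;                                // last kernel outputs the beam
}
```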
For the most demanding applications, designers should consider how to structure graphs so they can scale efficiently across many AIEs. The physical location of kernels and I/O interfaces is also important. A good starting point is to map the dataflow of the application, as this will guide the other aspects of the AIE design.
Input data should flow directly upward from the logic fabric through the AIE array. This is because the AIE array’s AXI4-Stream interconnect is non-symmetrical, with more paths traveling north than any other direction (see the AIE Array Diagram).
If one of the input streams is broadcast to many kernels, it will occupy more routing as it branches out to each of the destinations.
Within the application, designers should take advantage of the cascade path to forward data between kernels whenever possible. To transfer low-bandwidth data, designers should consider using RTPs, which can be passed between kernels as well as to and from the processing system. These techniques reduce the total number of data streams and make the application more flexible and easier to implement.
Necessity is most assuredly the mother of invention. Today’s exploding data volumes, combined with the increasing need for energy efficiency, require a new generation of processing solutions. The Xilinx Versal ACAP meets those demands. Now a single, hardened, heterogeneous silicon chip provides the computational performance of multiple devices while using much less energy.
The landscape has changed and the journey has just begun toward more complex, secure and purpose-built solutions and systems for the next generation in aerospace and defense capabilities.
The authors would like to recognize the valuable contributions and support given by Kok Lee, Berk Adanur, and Don Stickels.
About Mercury Systems
Mercury Systems (Nasdaq: MRCY) is a leading technology company serving the aerospace and defense industry, positioned at the intersection of high tech and defense. Headquartered in Andover, MA, we deliver solutions that power a broad range of aerospace and defense programs, optimized for mission success in some of the most challenging and demanding environments. We envision, create and deliver innovative technology solutions purpose-built to meet our customers’ most-pressing high-tech needs.