In Part 1 of this series, we looked at the new high-end FPGA families from Achronix, Intel, and Xilinx. We compared the underlying semiconductor processes, the type and amount of programmable-logic LUT fabric, the type and amount of DSP/arithmetic resources and their applicability to AI inference acceleration tasks, the claimed TOPS/FLOPS performance capabilities, and on-chip interconnect such as FPGA routing resources and networks-on-chip (NoCs). From those comparisons it is clear that each of these vendors’ offerings has unique and interesting capabilities that would make it shine in particular application areas. We also inadvertently highlighted just how difficult it is to do a meaningful analysis of such complex semiconductor devices.
All three vendors – Xilinx, Intel, and Achronix – discussed our assumptions and analysis with us and provided invaluable insight for this series.
This week, we’ll tackle memory architectures, in-package integration, and high-speed serial I/O capabilities. Here again, we’ll see that this generation of FPGAs far outshines even its immediate predecessors, and we’ll show further evidence that these are probably the most sophisticated chips ever created. We are at a fascinating time in the history of semiconductor evolution, with Moore’s Law coming to an economic end, a new generation of AI technology and applications demanding an entirely new approach to computing, and enormous competitive stakes at play as vast new markets open up for these amazing devices.
The real-world performance of FPGAs is just as dependent on memory architecture as on compute resources and internal bandwidth. In today’s computing environment, the data is the thing – and moving, processing, and storing that data efficiently in the flow of the computation is key. Today, the global data infrastructure spans a landscape from small, sensor-laden endpoints, to the network edge with its local storage and computing, back to cloud data centers with vast computation and storage resources, and then back out through that same chain to the edge again. The role of FPGAs in that round trip is enormous, with FPGAs contributing heavily to storage, networking, memory, and computation.
We should point out that Xilinx maintains that their Versal series of devices is a separate category from FPGAs – one they have dubbed “ACAP,” for “Adaptive Compute Acceleration Platform.” As we understand it, the linchpin of that claim is that Versal is aimed at a different audience from traditional FPGAs – an audience of application developers who may not have FPGA expertise and who require an interaction model that does not begin with configuring FPGA fabric with a bitstream. In fact, they point out, Versal can be booted and operated without ever configuring the FPGA fabric at all. This, combined with features such as the vector processing engines and the network-on-chip (NoC), is the basis for their argument that Versal devices are “ACAPs” rather than “FPGAs.”
For our purposes here, however, we will continue to evaluate Versal against these other, very similar FPGA families. We believe these three offerings will frequently be competing for the same sockets. Furthermore, our audience has always contained a large contingent of FPGA design experts, dating back to pre-2009 when we were known as “FPGA Journal.” We understand the motivation behind Xilinx’s marketing position. They want to attract a new market – customers for whom “FPGA” may be an intimidating or confusing label. Xilinx took a similar approach with their “Zynq” families of devices, referring to them as “SoCs” rather than “FPGAs.” But “ACAP” is a harder sell: the SoC category already existed, with a large number of competing offerings for Zynq to join, whereas creating a new category of one is a tall order. We’ll see if it catches on. We are still waiting for the first competitor to build a device it identifies as an “ACAP.”
Each of these competing families takes a different and interesting stab at optimizing the memory architecture for its envisioned target applications. Unlike conventional CPU or GPU architectures, FPGAs uniquely allow the memory hierarchy itself to be reconfigured to match the task at hand. This can have a staggering impact on the throughput, latency, and power efficiency of the resulting application. FPGA memory architectures allow us to partition our application so that each use of memory gets the best trade-off between locality/bandwidth and density.
Starting at the lowest density but highest bandwidth are memory resources within the LUTs themselves (often called distributed RAM or LUT RAM). There, logic has direct, hard-wired access to small amounts of stored data, creating the most efficient path possible for data flow. All FPGA architectures have LUT-based memory as a core feature, and the amount of LUT memory is roughly proportional to the LUT count, which we discussed last week. While this storage is hyper-local and delivers essentially optimal bandwidth for the associated logic, most applications have memory requirements that far outstrip these meager but precious resources.
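To make that concrete, here is a minimal sketch (ours, not from any vendor’s documentation) of the kind of structure that naturally lands in LUT memory: a tiny coefficient table read by the surrounding arithmetic on every cycle. We use Vitis-HLS-style C++ purely for illustration; the fir16 function is a hypothetical example, and an ordinary C++ compiler simply ignores the pragmas.

```cpp
#include <cstdint>

// Sum-of-products over a 16-entry coefficient table. A table this small
// typically maps to LUT-based "distributed" memory, so the arithmetic
// reads it over a direct, hard-wired path rather than fabric block RAM.
int16_t fir16(const int16_t sample[16]) {
    static const int16_t coeffs[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                                       8, 7, 6, 5, 4, 3, 2, 1};
// Illustrative pragma: explicitly request a LUT-RAM-based ROM.
#pragma HLS bind_storage variable=coeffs type=rom_1p impl=lutram
    int32_t acc = 0;
    for (int i = 0; i < 16; ++i) {
#pragma HLS pipeline II=1
        acc += sample[i] * coeffs[i];  // one coefficient read per cycle
    }
    return static_cast<int16_t>(acc >> 4);  // scale back to 16 bits
}
```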
Moving up one level in density and down one in bandwidth, then, we have various architectures for “block” memory in the FPGA fabric. Block structures, as the name implies, are dedicated, hardened memory areas whose data paths must span more of the FPGA interconnect. Each vendor has its own strategy for partitioning these on-chip memory resources. They have exhaustively modeled various types of applications and their memory needs, weighed the trade-offs between distribution and density, and come up with a tiered approach that they feel best solves the broadest set of problems, with particular emphasis on the primary targeted application types.
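Continuing the same illustrative sketch (again, our assumption of how such a design might be written, not vendor code), a buffer too large for LUT memory can be pushed one tier down into hardened block RAM; the blur_rows function is likewise hypothetical:

```cpp
#include <cstdint>

// Two-row vertical average over a 1920x1080 frame. The 1920-byte line
// buffer is far too large to spend LUT memory on, so we bind it to a
// hardened block RAM: denser storage, reached across more interconnect.
void blur_rows(const uint8_t in[1080][1920], uint8_t out[1080][1920]) {
    static uint8_t line[1920] = {0};  // holds the previous row (zeros at start)
// Illustrative pragma: dual-port block RAM, one read + one write per cycle.
#pragma HLS bind_storage variable=line type=ram_2p impl=bram
    for (int r = 0; r < 1080; ++r) {
        for (int c = 0; c < 1920; ++c) {
#pragma HLS pipeline II=1
            uint8_t prev = line[c];   // pixel from the row above
            uint8_t cur  = in[r][c];
            line[c] = cur;            // save current row for the next pass
            out[r][c] = static_cast<uint8_t>((prev + cur) / 2);
        }
    }
}
```

The pragma syntax is beside the point; what matters is that the same source can steer each buffer to the tier whose density-versus-bandwidth trade-off fits its role, which is exactly the flexibility that fixed CPU and GPU cache hierarchies do not offer.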