At the other end of the AI computing spectrum are inference tasks, some of which can be computed rather quickly with smaller neural network architectures. Furthermore, with a highly optimized, application-specific AI model trained against a carefully chosen dataset, the level of compute required for inference can be reduced to the point where reliance on the cloud is no longer necessary. Much of the recent work in the software world and in the research literature has focused on optimizing neural network architectures for very specific tasks so that both inference and training are accelerated.
While AI has long been the domain of software developers, the electronics industry has worked to move AI compute capabilities onto embedded systems with unique chipsets, model optimization, and even unique transistor architectures that mimic analog circuits. Today and going into the future, the focus has shifted, and will continue to shift, towards performing AI tasks on end-user devices with powerful embedded processors. FPGAs are some of the best contenders for implementing AI compute on embedded devices without the need for unique custom silicon.
Challenges in Embedded AI
Moving AI tasks away from the data center and onto an end device has the potential to enable a whole new range of computing applications. Industries like robotics, security, defense, aerospace, healthcare, finance, and many more are implementing AI compute capabilities. However, new products in these areas continue to require a constant connection back to the cloud, or at least to an edge server, to implement the basic data preprocessing, inference, and training steps required in any AI-capable system.
AI is High-Compute
AI processing is computationally intensive, so many production-grade AI services perform training and inference in a data center. These existing services and applications are largely software driven, delivering experiences to users through a platform or end devices over the internet. Portions of these services are very high compute, particularly training, although inference in some tasks can involve comparably large datasets processed in parallel.
AI computation on a small device can also require significant computing power, especially when adapted to applications like vision and sensor fusion. Embedded AI capabilities have recently become available either by mirroring the compute architecture used in data centers in miniaturized form on an end device, or by adding a single accelerator chip that provides small-scale AI training and inference capabilities. GPU products from semiconductor vendors and small accelerators have been available for some types of AI applications, but these can be very bulky and inefficient (in the case of GPUs) or highly specialized (in the case of accelerators).
Real-Time AI Requires Low Latency
Some AI applications require latency low enough that inference cannot be performed in the data center. The time required to transmit data from the end device, process it in the cloud, and return results to the end device may simply be too long for some advanced application areas; a vision pipeline running at 30 frames per second, for example, has only about 33 ms per frame, which a round trip to the cloud can easily consume. Conversely, tasks like training against large datasets, such as those aggregated from a network of distributed devices, may take too long to perform on the end device and are best done in the cloud. Time-critical tasks that act on the results of an AI inference model should be performed on the end device.
To solve this problem, a cloud or edge connection can be relied on for higher-compute training tasks involving very large datasets, while faster inference tasks should be performed on an end-user device or embedded system. For these systems, getting production-grade vision capabilities to run in real time on an end device has been challenging with current chipsets. An alternative system architecture is needed for time-critical inference and data pre-processing on an embedded device.
Separating Inference and Training
Of the two, inference is the lower-compute task; it only requires a single input dataset rather than an entire training corpus. Pre-processing may be required on captured data or collected signals, but this is generally much less computationally intensive than the inference itself. Because a forward pass through a simple neural network reduces to a set of successive multiply-add operations, simple models can be implemented on an embedded processor (MCU/MPU) and used for inference within reasonable timeframes, as the sketch below illustrates.
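As a concrete illustration, the following is a minimal sketch in plain C++ of a single fully connected layer, with the layer sizes and ReLU activation chosen purely for illustration rather than taken from any particular model. An MCU or MPU executes these nested multiply-accumulate loops serially, which is exactly the workload that on-device inference amounts to.

```cpp
#include <cstddef>

// Forward pass of one fully connected layer: output = activation(W * input + bias).
// Sizes and the ReLU activation are illustrative assumptions.
void dense_forward(const float* weights,   // [n_outputs x n_inputs], row-major
                   const float* bias,      // [n_outputs]
                   const float* input,     // [n_inputs]
                   float* output,          // [n_outputs]
                   std::size_t n_inputs,
                   std::size_t n_outputs)
{
    for (std::size_t o = 0; o < n_outputs; ++o) {
        float acc = bias[o];
        // The core of inference: a chain of multiply-add operations.
        for (std::size_t i = 0; i < n_inputs; ++i) {
            acc += weights[o * n_inputs + i] * input[i];
        }
        // ReLU activation keeps the example simple.
        output[o] = (acc > 0.0f) ? acc : 0.0f;
    }
}
```

On an FPGA, the inner loop can be unrolled into parallel multiply-accumulate blocks, while an MCU steps through it one operation at a time; that difference is what sets the latency floor for on-device inference.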
In contrast, training is much more data intensive, requiring significant compute resources when implemented in production-grade AI-capable systems. Even for a small dataset and small neural network architecture, the computational expense of training can be difficult for some embedded chipsets to implement quickly. The compute resources required for training might be available in the cloud, but not on every end-user device. The desire for an all-in-one solution to training and inference at the edge has helped drive development of accelerator chips, GPU solutions, and cloud infrastructure to support on-device AI.
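To make the asymmetry concrete, the toy sketch below (plain C++, illustrative hyperparameters, no vendor libraries assumed) trains a single logistic neuron with gradient descent. Every epoch repeats a forward and a backward pass over the entire dataset, whereas inference runs the forward pass once; this is why training compute scales so much faster than inference compute as datasets grow.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Gradient-descent training of a single logistic neuron. Even this toy model
// repeats a forward and a backward pass for every sample in every epoch.
// Dataset shape, learning rate, and epoch count are illustrative.
void train_logistic(const std::vector<std::vector<float>>& X,  // [n_samples][n_features]
                    const std::vector<float>& y,               // labels in {0, 1}
                    std::vector<float>& w,                     // [n_features], updated in place
                    float& b,
                    float learning_rate = 0.01f,
                    std::size_t epochs = 100)
{
    for (std::size_t e = 0; e < epochs; ++e) {
        for (std::size_t s = 0; s < X.size(); ++s) {
            // Forward pass: the same multiply-add chain used for inference.
            float z = b;
            for (std::size_t i = 0; i < w.size(); ++i) {
                z += w[i] * X[s][i];
            }
            float pred = 1.0f / (1.0f + std::exp(-z));

            // Backward pass: gradient of the cross-entropy loss, applied per sample.
            float err = pred - y[s];
            for (std::size_t i = 0; i < w.size(); ++i) {
                w[i] -= learning_rate * err * X[s][i];
            }
            b -= learning_rate * err;
        }
    }
}
```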
The Embedded AI Landscape
All systems that implement AI capabilities need to balance on-device compute capability against access to peripherals and I/O.
Embedded AI Chipset Options
The current range of chipsets that can support certain embedded AI tasks with acceptable latency looks very familiar to most engineers. These include:
Microcontroller (MCU)
These mainstays of embedded computing are suitable for some AI inference tasks and training against smaller datasets. Some semiconductor vendors provide developer resources and libraries that allow model quantization to speed up simpler inference tasks.
Microprocessor (MPU)
These processors scale up the capabilities available on MCUs to allow more complex pre-processing and training against larger datasets as long as the system architecture can support it. Inference may be faster on these devices as well thanks to faster clock speeds.
AI Accelerator
These chips implement a compute architecture that is optimized for specific AI inference tasks. Accelerators are available for inference on short voice streams or still images, as long as a trained neural network model is available. Parallelization is limited with these devices, which may require multiple accelerators in the same system.
Graphics Processing Unit (GPU)
These processors are often used for AI training and inference in data center settings, or in rugged embedded computing settings such as in industrial and aerospace systems. Smaller GPUs are available that can be deployed on an embedded device, most notably on a single-board computer or module.
Field-Programmable Gate Array (FPGA)
These programmable logic devices allow implementation of a totally custom compute architecture. AI-specific accelerator blocks can be implemented that are optimized for specific AI tasks. The compute architecture in these devices can also be highly parallelized, to a degree that is difficult to match with the other chipsets listed here.
Among these devices, FPGAs are the most adaptable, most customizable options for incorporating AI capabilities into an embedded system. The industry is now realizing this, and new MCU/MPU components that incorporate an FPGA-based processor block are being produced by major semiconductor manufacturers. These additional processor blocks are being added specifically to address the demand for AI capabilities on an embedded device.
The next natural step for a systems designer is to use an FPGA as the main system host controller, where everything needed to run the system is instantiated in the FPGA. A similar option is to add an FPGA as an additional component, used as an accelerator and interface controller alongside a conventional MCU/MPU acting as the main chipset. While the FPGA development effort can be significant for some designers, FPGAs provide many benefits that are not found in other processor architectures.
Benefits of FPGAs in AI Systems
In each of the application areas above, an FPGA provides very specific benefits in an edge compute application:
- Fully reconfigurable logic allows repeated updating of logic architecture to fully optimize AI inference and training
- Optimized logic architecture minimizes power consumption by eliminating unused interfaces and the repetitive add-multiply-shift operations found in traditional architectures (a sketch of this arithmetic follows this list)
- Highly parallelizable logic can be implemented to speed up AI inference, even with very large input data sets
- Vendor IP can be used to instantiate multiple high-speed interfaces that receive sensor data or digital data from other components
- AI can be implemented on top of a standard SoC core like RISC-V, which can support an embedded OS and user applications
- FPGA footprints can be comparable to or smaller than traditional processors or AI accelerators
- Universal configurability of FPGAs reduces supply chain risk and allows the functions of external components to be instantiated in the system logic
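The add-multiply-shift pattern referenced in the list above is what a quantized (int8) model executes for every output value, and it is also roughly what MCU vendor quantization libraries generate. The sketch below is a minimal, illustrative version in plain C++; the zero point handling and the power-of-two requantization shift are simplifying assumptions, since real quantization schemes typically use per-layer scale factors rather than a bare shift.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// One output value of a quantized (int8) fully connected layer, written out
// as an explicit add-multiply-shift sequence. Zero points and the shift-based
// requantization are illustrative simplifications.
int8_t quantized_mac(const int8_t* weights,
                     const int8_t* input,
                     std::size_t n_inputs,
                     int32_t bias,
                     int32_t input_zero_point,
                     int requant_shift)
{
    int32_t acc = bias;
    for (std::size_t i = 0; i < n_inputs; ++i) {
        // Multiply and add in a widened 32-bit accumulator.
        acc += static_cast<int32_t>(weights[i]) *
               (static_cast<int32_t>(input[i]) - input_zero_point);
    }
    // Shift back down to the int8 output range (requantization).
    acc >>= requant_shift;
    return static_cast<int8_t>(std::clamp(acc, int32_t{-128}, int32_t{127}));
}
```

On an FPGA, this arithmetic can be mapped to exactly as many parallel multiply-accumulate blocks as the model requires, rather than being looped on a general-purpose ALU, which is where the power and latency savings come from.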
For AI applications built on top of FPGAs and vendor IP, systems designers can deploy a new AI-capable system quickly with a fully optimized, fully reconfigurable architecture. Continuous system reconfiguration matters in AI because neural networks are periodically retrained and redeployed to the embedded processor. By continuously optimizing the system’s logic architecture, power consumption and processing time can be kept as low as possible; this is not possible with accelerators that have a static logic architecture.
Where FPGAs Excel in AI
Whether performing inference on the end device or acting as an acceleration element in a data center or alongside another component, FPGAs are being used in a variety of systems and application areas:
Application | Characteristics |
---|---|
Vision systems | FPGAs are an excellent replacement for GPUs in vision systems; they can implement tensor computations in parallel, but in a smaller package with less power usage. FPGAs can also instantiate the required vision interfaces, DSP, and embedded application directly in the programmable fabric, whereas a GPU requires an external processor and a high-bandwidth PCIe interface to receive vision data. |
Sensor fusion | Because FPGAs can have high I/O counts for multiple digital interfaces, they are a preferred platform for sensor fusion. FPGA-based systems can capture multiple data streams, implement on-chip DSP, and aggregate data into an on-chip inference model as part of a larger application. This is all done in parallel, making it much faster than MCU/MPU computation. Compared to GPUs, this happens on a much smaller footprint with less power. |
Interoperable systems | Industrial systems and mil-aero are two areas where interoperability between diverse system elements can be achieved with FPGAs. With FPGAs being reconfigurable and allowing customization of interfaces, they can be used as the primary processor element in distributed systems requiring interoperability. This also allows FPGAs to be used in legacy systems that cannot be easily upgraded with newer components; the FPGA is simply adapted to interface with the existing system. |
Wearables and mobile | Small wearables and mobile devices are another area where system miniaturization is at odds with increasing feature density. FPGAs are advantageous here, as they allow feature density to be increased with new AI capabilities that are not possible on MCU-based systems. These systems may also not have room for a discrete AI accelerator, but the same function can be implemented in an FPGA acting as the main processor. |
5G and telecom | It is often said that 5G will be a great enabler of embedded AI, but low-latency AI-enabled applications requiring 5G connectivity can only be implemented with an appropriately fast chipset. FPGAs can play a role here as a heavily optimized accelerator element deployed in the data center, at the edge, or on an end-user’s device. A reconfigurable AI accelerator like an FPGA would be key to the service delivery demanded by 5G users. |
Add-in acceleration | FPGAs have already been used as dedicated AI acceleration elements in data centers and edge servers as add-in modules. The same functionality can be implemented on smaller embedded devices with SoC IP provided by some semiconductor vendors. Further acceleration is also possible at the firmware or model level with TensorFlow Lite and open-source libraries (a minimal sketch appears at the end of this section). |
FPGAs can be used in multiple ways in the above application areas, such as the primary processor in smaller devices or as a dedicated accelerator in larger systems. For legacy systems, FPGAs can be used as an add-in element that provides the required AI capabilities with minimal system overhead. Across the application areas listed in this table, the main advantages of FPGAs are reconfigurability, customization, and high I/O counts with high-bandwidth interface options.
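As an illustration of the model-level acceleration route mentioned in the table, the sketch below shows roughly how a quantized TensorFlow Lite model could be run through the TensorFlow Lite for Microcontrollers C++ interpreter on an embedded core, such as a soft core instantiated in the FPGA fabric. The model array `g_model_data`, the arena size, the registered operators, and the tensor shapes are placeholders, and the exact constructor arguments vary between TFLite Micro versions, so treat this as a sketch under those assumptions rather than a drop-in implementation.

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// g_model_data is a placeholder for a flatbuffer produced by the TensorFlow
// Lite converter (usually embedded as a C array); it is not defined here.
extern const unsigned char g_model_data[];

namespace {
// Scratch memory for the interpreter; the size is an assumption and must be
// tuned to the actual model.
constexpr int kArenaSize = 64 * 1024;
uint8_t tensor_arena[kArenaSize];
}  // namespace

// Runs one inference on a quantized (int8) model and returns the index of the
// highest-scoring output class, or -1 on error.
int RunInferenceOnce(const int8_t* input_bytes, int input_len) {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only the operators the model is assumed to use.
  tflite::MicroMutableOpResolver<3> resolver;
  resolver.AddFullyConnected();
  resolver.AddSoftmax();
  resolver.AddReshape();

  tflite::MicroInterpreter interpreter(model, resolver, tensor_arena, kArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) {
    return -1;
  }

  // Copy quantized input data into the model's input tensor.
  TfLiteTensor* input = interpreter.input(0);
  for (int i = 0; i < input_len && i < static_cast<int>(input->bytes); ++i) {
    input->data.int8[i] = input_bytes[i];
  }

  if (interpreter.Invoke() != kTfLiteOk) {
    return -1;
  }

  // Pick the largest output value as the predicted class.
  TfLiteTensor* output = interpreter.output(0);
  int n_classes = output->dims->data[output->dims->size - 1];
  int best = 0;
  for (int i = 1; i < n_classes; ++i) {
    if (output->data.int8[i] > output->data.int8[best]) best = i;
  }
  return best;
}
```

Because the model is retrained and reconverted off-device and only the flatbuffer changes, this style of deployment pairs naturally with the reconfigurability argument above: the fabric-level accelerator and interfaces can be re-optimized independently of the model update cycle.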