AI Transformer Models Enable Machine Vision Object Detection

A system-level view of machine vision will be essential to move the technology forward.


The object detection required for machine vision applications such as autonomous driving, smart manufacturing, and surveillance depends on AI modeling. The goal now is to improve the models and simplify their development.

Over the years, many AI models have been introduced, including YOLO, Faster R-CNN, Mask R-CNN, and RetinaNet, to detect objects in images or video, interpret them, and make appropriate predictions. In recent years, AI transformer models have emerged as better object detection solutions, so it is worth looking at how they work and what advantages they have over traditional models.

Object detection in machine vision
The human eye can see an object and quickly determine its size, color, and depth. In addition, the brain can tell what the objects are — a person in motion, an animal standing still, or a fire hydrant — by filtering out the background visuals and focusing only on the foreground objects. For example, a driver will focus on a traffic light and any nearby pedestrians, but will ignore scenery such as trees and mountains. Ideally, an AI model in this situation would act similarly. It has to capture the important target objects and filter out the background, as well as classify the objects. The AI model must predict what the perceived objects are based on its training.

“Today, machines can ‘see’ with an image sensor and lens that feed to an SoC with a special image signal processing (ISP) block that helps clean the image for machine vision needs,” said Alexander Zyazin, senior product manager in the automotive line of business at Arm. “The output of this ISP block is fed to either an accelerator or general-purpose CPU for further pre- and post-processing of images.”

Design requirements vary quite a bit depending on the use-case. “In surveillance and factory scenarios, machine vision can be used for use-cases related to people counting for better planning purposes or to spot defects in factory production lines,” Zyazin noted. “In automotive, machine vision is used today in advanced driver assistance systems (ADAS), where it provides inputs from a few sensors to single functions like automatic emergency braking or lane-keep assist.”

Technology advancements are paving the way for autonomous vehicles where all the inputs are provided by sensors and no human input is needed. “That, however, will require many sensors around the car, generating huge amounts of data that must be managed and processed with very low latency,” he said. “It is a highly complex system design from both a hardware and software perspective.”

Transformer architecture
In recent years, new transformer models have been introduced, including Oriented Object Detection with Transformer (O2DETR, a 2021 research paper) and DEtection TRansformer (DETR, released by Meta in 2020). The transformer method has a number of advantages over traditional models like Faster R-CNN, including a simpler design. [This article will use Meta’s DETR to illustrate how the transformer model works. DETR training code also is available to developers.]

Fig. 1: DETR transformer model compares its prediction with the ground truth. When there is no match, it would yield a “no object.” A match would validate an object. Source: “End-to-End Object Detection with Transformers,” Facebook AI

Most object detection models make initial predictions, then fine-tune them to produce the final predictions. DETR uses single-pass, end-to-end object detection with transformer encoding and decoding. The two key DETR components are (1) a set prediction loss, which forces a one-to-one matching between the predictions and the ground truth, and (2) an architecture that predicts a set of objects and models the relations among them. Ground truth refers to the actual situation on the ground, as shown in the picture on the left of figure 1. In this case, it is two separate birds of the same kind. Without “checking back” to the ground truth, a poorly designed algorithm may end up predicting two different birds or a bird with two heads.
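The matching step can be sketched in a few lines of Python. This is a simplified illustration, not DETR's actual implementation. The cost below combines only a class term and an L1 box distance, and it searches all assignments by brute force, whereas the DETR paper uses the Hungarian algorithm. The names `match_cost` and `match_predictions` and the sample numbers are hypothetical.

```python
from itertools import permutations

NO_OBJECT = "no object"

def l1_distance(box_a, box_b):
    """Sum of absolute differences between two (cx, cy, w, h) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))

def match_cost(pred, truth):
    """Lower cost means the prediction fits this ground-truth object better."""
    class_cost = 1.0 - pred["probs"].get(truth["label"], 0.0)
    return class_cost + l1_distance(pred["box"], truth["box"])

def match_predictions(preds, truths):
    """Return, for each prediction slot, its matched label or NO_OBJECT."""
    if len(truths) > len(preds):
        raise ValueError("need at least as many prediction slots as objects")
    best_cost, best_slots = float("inf"), ()
    # Each candidate assigns every ground-truth object to a distinct slot.
    for slots in permutations(range(len(preds)), len(truths)):
        cost = sum(match_cost(preds[s], t) for s, t in zip(slots, truths))
        if cost < best_cost:
            best_cost, best_slots = cost, slots
    targets = [NO_OBJECT] * len(preds)
    for slot, truth in zip(best_slots, truths):
        targets[slot] = truth["label"]
    return targets

# Two birds in the ground truth, three prediction slots (illustrative values).
truths = [
    {"label": "bird", "box": (0.30, 0.50, 0.20, 0.40)},
    {"label": "bird", "box": (0.70, 0.50, 0.20, 0.40)},
]
preds = [
    {"probs": {"bird": 0.9}, "box": (0.72, 0.50, 0.20, 0.40)},  # near right bird
    {"probs": {"bird": 0.6}, "box": (0.50, 0.50, 0.60, 0.40)},  # spans both birds
    {"probs": {"bird": 0.8}, "box": (0.30, 0.52, 0.20, 0.40)},  # near left bird
]
print(match_predictions(preds, truths))  # ['bird', 'no object', 'bird']
```

Each bird is claimed by exactly one prediction, and the leftover prediction that spans both birds is assigned “no object.” That is how a set-based loss discourages a merged, “two-headed” detection.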

Fig. 2: DETR transformer model. Source: “End-to-End Object Detection with Transformers,” Facebook AI

The human brain recognizes an object by processing information from an image based on prior knowledge. Machine vision has to learn everything and convert an image into digital data. As shown in figure 2, convolutional neural networks (CNNs) are commonly used to process the data. DETR uses a conventional CNN backbone to extract features from the image. Then it sends the data through a transformer encoding and decoding process. Finally, the data goes to a shared feed-forward network (FFN) that predicts either an object detection or a “no object.”

Instead of processing anchor boxes in sequence, as traditional models do, DETR takes an end-to-end approach, processing data in parallel. Simply put, DETR looks at the whole picture and starts making predictions. Then, during training, it compares the small patches with the ground truth. If DETR “sees” a bird’s head and finds the same in the ground truth, then it knows it has a match, as shown in the yellow boxes to the right of figure 1. Otherwise, it will yield a “no object” as shown in the green box to the right of figure 1.
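To make the single-pass readout concrete, the sketch below assumes the model has already produced one vector of class scores and one box per prediction slot, and simply reads out every slot independently. The class list, the `decode_slots` helper, and the 0.5 score cutoff are illustrative assumptions rather than DETR's published code.

```python
import math

# Hypothetical label set; the last entry plays the role of "no object".
CLASSES = ["bird", "person", "no object"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_slots(slot_logits, slot_boxes, min_score=0.5):
    """Keep slots whose most likely class is a real object above min_score."""
    detections = []
    for logits, box in zip(slot_logits, slot_boxes):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if CLASSES[best] != "no object" and probs[best] >= min_score:
            detections.append((CLASSES[best], probs[best], box))
    return detections

# Three slots: two confident birds and one slot that predicts "no object".
slot_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0], [3.5, 0.0, 0.5]]
slot_boxes = [(0.30, 0.5, 0.2, 0.4), (0.50, 0.5, 0.6, 0.4), (0.70, 0.5, 0.2, 0.4)]
for label, score, box in decode_slots(slot_logits, slot_boxes):
    print(label, round(score, 2), box)
```

Because no slot depends on the output of another, all of them can be evaluated at once, and slots labeled “no object” are simply dropped.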

Additionally, DETR can process overlapping objects without “anchor boxes” or “non-maximum suppression.”

Anchor boxes are used in traditional object detection models. To zero in on the objects of interest, the algorithm generates boxes around them. Later, they will be used as reference points for size and location prediction.
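As a rough illustration of how anchor boxes are laid out, the following sketch tiles an image with boxes of several sizes and aspect ratios. The function name, stride, scales, and ratios are assumptions chosen for the example, not the settings of any particular detector.

```python
import math

def generate_anchors(image_w, image_h, stride, scales, ratios):
    """Return anchor boxes as (x_min, y_min, x_max, y_max) tuples.

    One group of anchors is centered on every grid cell of size `stride`.
    Each anchor has area scale**2 and aspect ratio `ratio` (width / height).
    """
    anchors = []
    for cy in range(stride // 2, image_h, stride):
        for cx in range(stride // 2, image_w, stride):
            for scale in scales:
                for ratio in ratios:
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# A 64x64 image with a 32-pixel grid: 4 centers x 2 scales x 3 ratios.
anchors = generate_anchors(64, 64, stride=32, scales=[16, 32], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 24
```

Even this tiny image yields 24 candidate boxes. On a full-resolution image the count grows into the thousands, and each candidate must then be scored and refined.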

What happens if there are multiple overlapping objects? Assume the two birds are standing very close to each other with one bird blocking part of the second bird. A process called non-maximum suppression is used to select and predict with maximum confidence two separate birds while suppressing all other predictions.
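A minimal greedy version of non-maximum suppression is sketched below, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max) and an intersection-over-union (IoU) threshold of 0.5. The box coordinates and scores are made-up values for the two-bird scenario.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box in each cluster of overlapping boxes."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        # Drop this box if it overlaps too much with one already kept.
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

# Two neighboring birds, each detected twice with slightly shifted boxes.
detections = [
    {"box": (10, 10, 50, 90), "score": 0.90},  # bird A
    {"box": (12, 12, 52, 92), "score": 0.70},  # duplicate of bird A
    {"box": (40, 10, 80, 90), "score": 0.85},  # bird B, partly behind bird A
    {"box": (42, 8, 82, 88), "score": 0.60},   # duplicate of bird B
]
for det in non_max_suppression(detections):
    print(det)
```

The two duplicates are suppressed, while birds A and B both survive because their mutual overlap stays below the threshold. If the birds overlapped more heavily, the same rule could wrongly discard one of them, which is one reason a hand-tuned threshold is a fragile step.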

Traditional AI models use anchor boxes and non-maximum suppression to process the information. Bypassing these steps enables DETR to be more efficient than traditional models.

AI everywhere, but optimized for different applications
Object detection using machine vision requires an AI model/algorithm to run on top of AI chips, FPGAs, or modules. Together, these are commonly referred to as an “AI engine.” After first being trained, the AI model can then be deployed to run on the appropriate hardware to make predictions and/or decisions, commonly known as “inference.” It is important to ensure that hardware development can keep up with innovation in AI models.

“If all we need to do is to detect objects, a non-transformer model such as YOLO may be sufficient,” explained Cheng Wang, CTO and senior vice president for software and architecture at Flex Logix. “But we are going into a space that is changing rapidly. The transformer model, started three years ago for classification and detection purposes, is now a requirement for generative AI and generative AI vision. All these operations are things we traditionally have not accounted for in the previous AI hardware or the AI chip.”

Further, it is never enough to have AI hardware that runs a benchmark well, because benchmarks are five years old and software models are changing every few months. For this, Wang said AI hardware such as an eFPGA is needed. It is software-adaptable to keep up with the latest transformer models, which makes it flexible.

“In other words, it is not enough to have great performance today,” he said. “You will need to future-proof your design.”

And in so many areas, uses of various types of AI are on the rise.

Looking at the use of AI in the endpoint and edge computing, Sailesh Chittipeddi, executive vice president, general manager of embedded processing, digital power and signal chain solutions group at Renesas noted during a recent panel discussion at Semicon West, “75% of all the data that’s generated by 2025 is going to come from the edge and the endpoint of the network. It’s not being generated in the cloud. So despite all the hype that you hear, most of the activity actually when it comes to AI occurs on the edge of the endpoint. Another interesting statistic put forth was that 90% of the data that goes into the enterprise from all these devices actually gets discarded. So, there’s dark data. Where’s the first point where you can actually intercept the data that’s being produced to make it useful? That’s the edge of the endpoint. It’s really about the ability to predict what happens at the edge of the endpoint of the network, and what makes a tremendous amount of difference.”

Thinking about compute typically means thinking about microcontrollers, microprocessors, CPUs, and GPUs, although the latest buzz is all about GPUs and what’s happening with GPT-3 and GPT-4, and what comes beyond that.

“But remember, those are large language models,” Chittipeddi said. “Most datasets don’t require such tremendous processing power. The data that’s required is a lot less so in the edge of the endpoint, and what typically ends up happening is there is a need to process data quickly with very low latency. Latency, security, the ability to be able to process the data locally, and to be able to make it actionable — that’s the first point at the edge.”

Put in perspective, processing is being distributed well beyond its traditional markets, and the amount of data generated by AI — and the need for faster results — are key to this shift.

“The market has been very focused on traditional applications, like networking, PCs, and ERP, and those markets will continue to grow, of course,” said Alex Wei, vice president of flash marketing at Winbond. “But people are looking for new applications, too, and those new applications will really lead us to the next era. That’s why NVIDIA is generating so much business with AI, and why you see AMD following with its own GPUs. These new applications require more components, and more density will be needed for everything. AI is like mapping information in your brain. But if you’re driving a car and you see people walking on the street, you have to try to ignore them to bypass them. That’s neural learning, and it consumes a lot of memory. And that’s just the beginning.”

Fig. 3: DETR deployment with InferX compiler. DETR is broken down into 100 layers. The InferX compiler automatically maximizes fast SRAM accesses, minimizes slow DRAM accesses, and generates the configuration bits for running each of the layers. Source: Flex Logix


Machine vision is another key technology, and today AI and machine vision interact in a few ways. “First, machine vision output is fed to an AI engine to perform functions such as people counting, object recognition, etc., to make decisions,” said Arm’s Zyazin. “Second, AI is used to provide better quality images with AI-based de-noising, which then assists with decision-making. An example could be an automotive application where a combination of AI and machine vision can recognize a speed limit sign earlier and adjust the speed accordingly.”

But what happens in an autonomous driving situation, for example, if the AI model receives conflicting visual signals from defective sensors? The best rule is to err on the safe side.

Thomas Andersen, vice president for AI and machine learning at Synopsys, said that in this situation, it depends on the actual application and its criticality toward system failure. “For that reason, multiple systems need to be used to double and triple check information. If a conflict occurs, the decision may be difficult. As an example, for a self-driving car, one may always err on the side of caution and automatically brake if, say, the radar sensor detects an object, yet the camera does not. At the same time, this so-called ‘phantom braking’ can cause accidents, as well. One should always remember that there will never be a perfect solution, and that humans make many mistakes, as well.”
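The cautious policy Andersen describes can be written down as a small decision rule. The sketch below is a toy illustration with hypothetical names (`Action`, `decide`) and only two boolean sensor readings. A production system would fuse many more signals and would not reduce the choice to a single flag.

```python
from enum import Enum

class Action(Enum):
    CONTINUE = "continue"
    BRAKE = "brake"

def decide(radar_sees_object, camera_sees_object):
    """Return (action, conflict) for one pair of sensor readings."""
    conflict = radar_sees_object != camera_sees_object
    # Cautious policy: a report from either sensor is enough to brake.
    if radar_sees_object or camera_sees_object:
        return Action.BRAKE, conflict
    return Action.CONTINUE, conflict

# Radar reports an object but the camera does not: brake and flag the conflict.
print(decide(radar_sees_object=True, camera_sees_object=False))
```

Returning the conflict flag alongside the action lets a system log disagreements for later analysis. That matters because braking on every disagreement is exactly what produces the phantom braking mentioned above.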

While AI models are improving overall, the importance of accuracy in object detection and prediction by AI should never be overlooked.

“Just like any application, the acceptable false positive rate depends on the application,” said Amol Borkar, director of product management, marketing, and business development for Tensilica Vision and AI DSPs at Cadence. “For a consumer application, incorrectly recognizing a person as a couch is not critical. However, a misclassification of a pedestrian in an automotive application or incorrectly diagnosing a medical condition can be both critical and even fatal. This is more of an AI/classification/detection problem. AI advancements have become more accurate in automatically recognizing complex patterns in imaging data and providing quantitative, rather than qualitative, assessments of radiographic characteristics.”
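The tradeoff behind an “acceptable” error rate can be made concrete with a small calculation. The sketch below uses made-up detector scores and labels to show how moving the decision threshold trades false positives against missed detections. The function names and numbers are hypothetical.

```python
def false_positive_rate(scores, labels, threshold):
    """Share of true negatives that the detector wrongly flags as positive."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    return fp / (fp + tn) if (fp + tn) else 0.0

def false_negative_rate(scores, labels, threshold):
    """Share of true positives (e.g. pedestrians) that the detector misses."""
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    return fn / (fn + tp) if (fn + tp) else 0.0

# Detector confidence per sample, and whether a pedestrian was really there.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

for threshold in (0.5, 0.25):
    fpr = false_positive_rate(scores, labels, threshold)
    fnr = false_negative_rate(scores, labels, threshold)
    print(f"threshold={threshold}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

With these numbers, lowering the threshold from 0.5 to 0.25 eliminates the missed pedestrian but doubles the false positive rate. A safety-critical application would likely accept that cost, while a consumer application might not.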

While Borkar sees AI improving a lot of things, he acknowledges it adds many more compute requirements to the platform, such as crunching a great number of convolutions and neural network layers. “For the AI based models to work well, a large amount of synthetic data is needed to train and validate the models. Going a step further, modifying the perception stack to consume event camera data may provide hyper-sensitivity to the most minimal of motions compared to traditional rolling/global shutter-based sensors. This may improve system accuracy and is broadly applicable. As is the case for any AI model to work well, this approach will need a lot of data to train or validate before it’s ready for primetime.”

Security concerns
Good data is essential for a good result, and safeguarding that data and the systems that process and store it is critical, as well.

Machine vision systems need to be secured at all times, said Ron Lowman, strategic marketing manager at Synopsys. “Security is imperative in the case of AI versus AI. Hardware threat analysis used to be based on bad actors and their threat vectors, but AI can multiply the attack vectors and the number of devices attacked — making security within everything necessary. For years, security was done in software because it was cheaper. But software-only security clearly isn’t enough, so we’ve seen an uptake of both required security standards and implementations of hardware root of trust IP. Good examples are PCIe and Bluetooth. In the case of Bluetooth, there are voluntary standards to encrypt the data, but nobody did it because there was a cost associated. Slowly, the industry is improving this situation. In the case of PCIe, a new standard was adopted to introduce security into the communications interface. In a short amount of time, this has driven a large number of companies to adopt PCIe IDE, and we see this quickly transforming the entire interface IP requirements moving forward.”

Andy Nightingale, vice president of product marketing at Arteris, agreed. “Security is essential in any technology application, and machine vision is no exception. Machine vision systems often involve sensitive data and processes, such as surveillance footage, medical imaging, or autonomous vehicle control, making security particularly critical.”

Nightingale pointed to four areas where security is essential in machine vision applications:

  • Data privacy. Machine vision systems often process large amounts of data, including sensitive personal or commercial information. It’s essential to protect this data from unauthorized access or disclosure. This can be achieved through encryption, access control, and data anonymization.
  • System integrity. Machine vision systems can be vulnerable to attacks that manipulate or disrupt their operation. It’s essential to protect the system components and data from tampering or hacking attempts. This can be achieved through secure boot, system hardening, and intrusion detection.
  • Authentication. Machine vision systems often rely on sensors, cameras, and other devices subject to spoofing or impersonation attacks. It is essential to ensure that these devices are authenticated and that the system can detect and prevent unauthorized access. This can be achieved through biometric authentication, device certificates, and network segmentation.
  • Compliance. Machine vision systems may be subject to regulatory or industry-specific requirements related to security and privacy. Ensuring that the system design and operation comply with these requirements is essential. This can involve techniques such as risk assessment, audit trails, and data retention policies.

“Security should be addressed throughout the SoC design using industry standards such as Platform Security Architecture (PSA), and through end device deployment and operation,” Nightingale added. “By implementing appropriate security measures, machine vision systems can be used effectively while protecting the data, methods, and individuals involved.”

Looking ahead
As AI models continue to evolve, they will become more efficient, as in the case of emerging transformer models. Developers will need to balance software and hardware in future designs. Many factors, including flexible hardware, management of conflicts, accuracy, and security, will need to be included in the design considerations.

“For future architectures there will be a system-level view of machine vision,” said Synopsys’ Lowman. “Certain tradeoffs will need to be considered — for instance, system costs, memory availability in a disaggregated architecture, or memory bandwidths both within and off chip, how many processors, what type of processors for different stages, bit widths within each AI stage, and a whole host of other parameters. These can only be optimized via sophisticated tools and configurable and optimized IP, be it memories, interface, security or processor IP.”

In addition, machine vision will continue to expand into new applications as new AI and generative models become available.

“There are a few main directions for machine vision, including cloud computing to scale deep-learning solutions, automated ML architectures to improve the ML pipeline, transformer architectures that optimize computer vision (a superset of machine vision), and mobile devices incorporating computer vision technology on the edge,” Synopsys’ Andersen said.

— Additional reporting by Ann Mutschler and Ed Sperling.


