AI-boosted vision processing and new image sensors are providing strong building blocks for imaging systems.
Vision systems are rapidly becoming ubiquitous, driven by big improvements in image sensors as well as new types of sensors.
While the sensor itself often is developed using mature-node silicon, increasingly it is connected to vision processors developed at the most advanced process nodes. That allows for the highest performance per watt, and it also allows designs to incorporate AI accelerators using AI pre-trained models while still being small enough and cool enough to be used in AR/VR headsets, mobile phones, and automotive in-cabin sensing, where multiple cameras frequently work together.
“Ten years ago, there were no widely available processors targeting computer vision. Today, there are dozens,” said Jeff Bier, general chairman of the Embedded Vision Summit, founder of the Edge AI and Vision Alliance, and president of BDTI. “This matters because computer vision algorithms are very suitable to acceleration using parallel processing. As a result, processors incorporating specialized architectures are easily able to achieve 100X better performance and efficiency than general-purpose processors. This huge boost in processor efficiency has made it feasible to deploy computer vision in thousands of new applications.”
In the past, all of this had to be developed from scratch. But these systems are maturing. Today, semiconductor and AI companies are offering platforms with pre-established vision processing models and trained data sets that developers can use as a basis for new systems. This is evident in the medical field, for example, where computer vision is being used to help radiologists interpret X-rays.
“The advent of AI allows for the understanding of the image and being able to start to take load off of people, such as radiologists, to make them more productive,” said Sam Fuller, senior director of marketing at Flex Logix. “There’s a lot of work in the area of algorithm development, and taking that and converting it into a product.”
“Adoption of AI/ML with edge computing is gaining momentum, to minimize the amount of video data transmission, thus reducing power consumption and improving system efficiency,” said Ashraf Takla, founder and CEO of Mixel.
Smaller, smarter, more efficient cameras are everywhere. Vision systems run the gamut from cameras in mobile phones to industrial automation, automotive, medical, security, surveillance, drones, robotics, AR/VR, and much more. With all of these use cases, demand is healthy for vision and image system components.
The vision processing semiconductor market is split between vision processors and image signal processors (ISPs). The compound annual growth rate for both is projected to be 7%, according to Mordor Intelligence. Asia-Pacific will see the highest demand, followed by North America and Europe. STMicroelectronics, Texas Instruments, Sigma Corp., Semiconductor Components Industries, and Fujitsu are some of the larger semiconductor companies in the market, but plenty of established and newer players are competing as well. Flex Logix, Ambarella, SiMa.ai, Hailo.ai, and Brainchip are making strides. And some of the EDA companies are providing vision- and AI-enabled IP.
The demand for image sensors also is undeniable, but CMOS image sensors may hit a temporary slowdown in demand due to a slump in smartphone sales, IC Insights predicts.
Fig. 1: Demand for CMOS image sensors is expected to slow because of a slump in smartphone and portable computer sales now that COVID-era work-from-home demand has eased. Source: IC Insights
Meanwhile, the global image sensor market is forecast to grow at a CAGR of 8.43% from 2022 to 2028, reaching $30.12 billion by 2028, according to SkyQuest Technology Consulting. The top six companies with market share in image sensors are Sony, Samsung, OmniVision, STMicroelectronics, ON Semiconductor, and Panasonic.
New image sensors
In electronic vision systems, the image sensor converts light into electrical signals or bits using optics, pixels, and photosensitive elements (a photodiode or photogate). A data pipe then delivers the signals to a processor, where vision processing turns them into a digital image.
Image sensors have a layer of optics right on the sensor itself to focus the light, but another more sophisticated layer of optics can be added on top, depending on the use case. “What that means is basically the sensor just sends the raw bare data and then there is a host processor. For example, a Qualcomm-type or Ambarella-type processor will do all the backend processing, converting to YUV, JPEG for video recording,” said Devang Patel, marketing director of IoT/emerging segment at OmniVision Technologies.
Two common image sensor types, the CMOS (complementary metal oxide semiconductor) and the CCD (charge coupled device), convert light into an electric charge and process it into electronic signals. The CCD uses high-voltage analog circuits and, as described by TEL, its pixel array moves charge in bucket-brigade fashion up to the top row, which serves as a readout register. The readout register then outputs the electric charge to an off-sensor processor.
Image sensors come in two flavors, frontside illumination (FSI) and backside illumination (BSI). The difference is whether light reaches the photodiodes through the front (wiring) side of the die or through its thinned back side.
“All the high-performing sensors are pretty much all BSI,” said OmniVision’s Patel.
Fig. 2: A diagram of BSI and FSI image sensors. Source: Cmglee (Own work, via Wikipedia/CC BY-SA 4.0)
The CMOS sensor is cheaper to produce than a CCD because CMOS can be made on existing semiconductor manufacturing equipment. CMOS also uses less energy. Each pixel in a CMOS image sensor has its own photodiode that outputs the signal directly.
CCDs, in contrast, are based on NMOS, and they are considered less noisy than CMOS sensors. As a result, they are favored for scientific equipment and high-resolution scanners. CMOS is used in most other applications because it is lower power and cheaper, and its resolution and accuracy keep improving.
The basic unit that captures image data is the pixel. Adding more pixels to an image sensor is one way to increase the resolution of its images. Smaller pixels mean more pixels fit into the same area, and the pixel pitch (the spacing between adjacent pixels) keeps shrinking, so the density of pixels on image sensors continues to rise.
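As a rough illustration of the relationship between pixel pitch and resolution, the sketch below estimates how many pixels fit on a sensor as the pitch shrinks. The sensor dimensions and pitch values are made-up examples, not figures for any particular product.

```python
# Rough sketch: how pixel pitch relates to pixel count (resolution).
# Sensor dimensions and pitches are illustrative only.

def pixel_count(width_mm: float, height_mm: float, pitch_um: float) -> int:
    """Approximate number of pixels that fit in a width_mm x height_mm active area."""
    cols = int(width_mm * 1000 / pitch_um)   # pixels per row
    rows = int(height_mm * 1000 / pitch_um)  # pixels per column
    return cols * rows

for pitch_um in (2.0, 1.4, 1.0, 0.7):        # pixel pitch in micrometers
    megapixels = pixel_count(6.4, 4.8, pitch_um) / 1e6
    print(f"{pitch_um:.1f} um pitch -> ~{megapixels:.1f} MP on a 6.4 x 4.8 mm sensor")
```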
Adding pixels does affect the testing of image sensors. “In recent years, there has been an explosive increase in pixel count for CMOS image sensor (CIS) devices, with the result that testing of CIS devices tends to take longer,” said Advantest’s Chiezo Saito, from the T2000 group, who wrote a paper about speeding up image processing to test image sensors.
Shrinking sensor dimensions also affect testing. OmniVision has shipped tiny two-layer image sensors for years. “The testing gets challenged as you make [the sensors] smaller and smaller, because the test equipment and hardware have to scale.”
The consumer of an image can be a human, a computer, or a machine. Consequently, the choice of image sensor depends on the end use and the end user. If a human is not viewing the image, or the purpose is only to detect a specific event, new types of vision sensors such as the event sensor are being considered. In low-light situations and automotive in-cabin monitoring, for example, the event sensor detects only changes in intensity. A pixel is activated only when the light it sees changes; otherwise, no image data is collected.
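To make the event-sensor idea concrete, here is a minimal software model of change detection: only pixels whose intensity changes by more than a threshold produce any output at all. It is a simplification for illustration, not vendor code, and the threshold value is arbitrary.

```python
import numpy as np

def generate_events(prev_frame: np.ndarray, new_frame: np.ndarray, threshold: int = 15):
    """Return (x, y, polarity) for pixels whose intensity changed by more than threshold.

    Mimics an event (change-detection) sensor: unchanged pixels produce no data.
    """
    diff = new_frame.astype(np.int16) - prev_frame.astype(np.int16)
    ys, xs = np.where(np.abs(diff) > threshold)
    polarity = np.sign(diff[ys, xs])          # +1 brighter, -1 darker
    return list(zip(xs.tolist(), ys.tolist(), polarity.tolist()))

# Two nearly identical 8-bit frames yield only a handful of events.
prev = np.full((480, 640), 100, dtype=np.uint8)
new = prev.copy()
new[200:203, 300:303] = 140                   # a small bright change
events = generate_events(prev, new)
print(f"{len(events)} events instead of {prev.size} pixel values")
```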
Fig. 3: A CMOS (complementary metal oxide semiconductor) image sensor (CIS) converts light, via pixels and their photodiodes, into an electrical signal output. Source: TEL Nanotec Museum
Image sensors are manufactured on mature nodes in 200mm and 300mm fabs. Some image and video sensors include an on-die image or video processing block; these are SoC sensors. Image sensors that do not have a video processor are called RAW sensors.
“In terms of popularity, the RAW sensor represents the majority of [vision] sensors worldwide,” said OmniVision’s Patel. “The video processor is something proprietary to a lot of corporations — their own algorithms, recipe, and know how. It’s always better to use an external vision processor for those reasons. And you can also use an advanced process node to integrate more advanced algorithms for image processing. That’s pretty much what you see, but there are still some SoC sensors available in the market.”
Sometimes having the image sensor separate from the processing elements is preferred for other reasons. “In the automotive sector, there are use cases where it is preferred to have video processing separate,” said Patel. “The rear-view camera in your car is way in the back in your trunk, but you’re displaying the video on the console. That is a long distance. The designers would rather have either digital output, or a traditional analog output like NTSC or PAL video output so you can use a long wire for the display.”
The image data is moved off the sensor using either a parallel or a serial output. It may seem counterintuitive, but serial output has become more popular than parallel, with MIPI interfaces dominating.
“In most vision applications, power and latency are key parameters,” said Mixel’s Takla. “MIPI was optimized to address this kind of application, since its original target has been communication of video data between sensors, processors, and displays. Because of that, MIPI standards evolved to minimize power and latency, and to address the asymmetric nature of a typical video link. We see wide interest in the C-PHY and an accelerating adoption in video applications, primarily in sensors. Adoption on the display side is growing but lagging sensors.”
A serial interface beats out parallel interfaces in vision. “For the parallel interface, if it’s 8-bit output or 10-bit output, you have essentially 12 lines and then some other signals to go with it. That’s the traditional data bus, where you parallelize. The serial interfaces are broadly the MIPI Alliance’s. They basically use serial differential pins and outputs. One lane is a differential data pair, and then you have a clock pair, so four pins, but you’re able to output much higher throughput from those four pins than from the parallel bus,” said Patel.
The parallel output has a transistor-transistor logic (TTL) swing that adds time to the transfer. “When you swing from zero voltage to a higher voltage, whether it’s 1.2 or 1.5, it takes time for the signal to rise and come down,” said Patel. “The challenge would be the speed that you can achieve, but from a simplicity point of view, parallel is the simplest way to output because most of the microcontrollers in the industry would have some sort of parallel input data bus. And you can just connect those directly into those.”
When it comes to high-resolution, high-frame output where the speeds go up, parallel falls short again. “If you want to output high frame rate at high resolutions, say 4K or 2K at 30 frames or 60 frames, the parallel port would be fairly challenging to output such a high frame rate. That will be a lot of power. You’re also talking about EMI. So those are all addressed with serial interface, which is the MIPI interface,” said Patel. “We see more and more new microcontrollers are also adding MIPI interface.”
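As a back-of-the-envelope illustration of why high resolutions and frame rates push designs toward serial links, the sketch below computes raw pixel bandwidth for a few formats, the pixel clock a parallel bus would need, and the per-lane rate if the same payload were split across a few serial lanes. The formats and lane counts are assumptions for illustration, not MIPI specification figures, and blanking and protocol overhead are ignored.

```python
def raw_bandwidth_gbps(width: int, height: int, fps: int, bits_per_pixel: int) -> float:
    """Uncompressed pixel bandwidth in Gbit/s (blanking and protocol overhead ignored)."""
    return width * height * fps * bits_per_pixel / 1e9

formats = [
    ("1080p30, 10-bit", 1920, 1080, 30, 10),
    ("4K30, 10-bit",    3840, 2160, 30, 10),
    ("4K60, 10-bit",    3840, 2160, 60, 10),
]

for name, w, h, fps, bpp in formats:
    bw = raw_bandwidth_gbps(w, h, fps, bpp)
    pixel_clock_mhz = w * h * fps / 1e6       # parallel bus: every data pin toggles at this rate
    print(f"{name}: {bw:.2f} Gb/s total, "
          f"parallel pixel clock ~{pixel_clock_mhz:.0f} MHz, "
          f"~{bw / 2:.2f} Gb/s per lane on 2 lanes, ~{bw / 4:.2f} Gb/s on 4 lanes")
```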
Inside the vision or image signal processor, the data may be handled differently than within an SoC. “You’re using vision as a kind of a sensor, which is actually very data-intensive, and being able to take that kind of information and make sure that you can capture it, and that the information is connected and the data is flowing from that part of the chip to the rest so you can compute it, resolve, and learn from it,” said Michal Siwinski, Arteris IP’s chief marketing officer. “Vision is probably generating more data than other kinds of sensors. And by us being able to handle that really well, everything else is basically kind of straightforward. We provide basically that underlying system on chip infrastructure.”
Vision and image signal processors, vision IP
Once the electric signals leave the image sensor, they are processed in a vision processor (VPU) or an image signal processor (ISP), or both.
A typical computer vision application starts with cameras (and their image sensors). The camera does some pre-processing, captures the data frames, detects objects, performs tracking and recognition, and then does some post-processing. System designers have used microprocessors, CPUs, and GPUs to process images. The graphics processor, designed for the needs of 3D and gaming graphics, initially had some advantage for processing images. But now dedicated vision processors exist that process images using AI accelerators. The AI extracts only the necessary data from the signals and can be programmed for image recognition, or for whatever helps end users find meaningful data or improve image quality for their use case.
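The stages described above (capture, pre-processing, detection, and post-processing) map to a fairly standard software skeleton. The minimal sketch below uses OpenCV with a stock Haar face detector purely as a stand-in for whatever model a real system would run; it is illustrative, not a production pipeline.

```python
import cv2

# Minimal vision pipeline skeleton: capture -> pre-process -> detect -> post-process.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"  # placeholder detector
)

cap = cv2.VideoCapture(0)                     # the camera and its image sensor
while cap.isOpened():
    ok, frame = cap.read()                    # capture a data frame
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # pre-processing
    gray = cv2.equalizeHist(gray)
    boxes = detector.detectMultiScale(gray, 1.1, 5)  # detection
    for (x, y, w, h) in boxes:                       # post-processing: annotate the frame
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("vision pipeline", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```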
Examples of VPUs include offerings from companies such as Ambarella, Hailo.ai, and SiMa.ai.
The ISP traditionally is used in digital cameras to produce a digital image that is optimized for human viewing. The ISP may apply white balance, raw data correction, lens corrections, dynamic range corrections, noise reduction, sharpening, digital image stabilization, and all sorts of other adjustments. The ISP also is being adapted to computer-vision applications. It can be an IC, an IP core used in an SoC, or a block on the image sensor itself, and it is often used in combination with vision processors. Examples are Arm’s Mali-C52 and Mali-C32 ISPs for real-time, higher image quality in IoT devices.
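A heavily simplified software model of a few of those ISP stages (black-level correction, white balance, and gamma) is sketched below. The coefficients are made up, and a real ISP implements these and many more steps, typically in fixed-function hardware.

```python
import numpy as np

def toy_isp(rgb: np.ndarray) -> np.ndarray:
    """Toy ISP stages on an already-demosaiced RGB image with values in 0..1."""
    img = rgb.astype(np.float32)
    img = np.clip(img - 0.02, 0.0, None)       # black-level correction (illustrative offset)
    wb_gains = np.array([1.8, 1.0, 1.5])       # per-channel white-balance gains (made up)
    img = np.clip(img * wb_gains, 0.0, 1.0)
    img = np.power(img, 1.0 / 2.2)             # gamma curve for human viewing
    return (img * 255).astype(np.uint8)

frame = np.random.rand(480, 640, 3)            # stand-in for sensor output
out = toy_isp(frame)
print(out.shape, out.dtype)                    # (480, 640, 3) uint8
```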
Used together, an image signal processor and a vision processor can produce more efficient computer vision than a vision processor on its own.
In addition, the EDA industry offers IP for vision systems, as well as tools for creating and testing vision system designs, with the applications covered varying by vendor.
Pre-trained models for AI inferencing of images can be a cost-effective way to pull meaning out of images. Flex Logix is offering pre-trained models for vision applications that handle specific tasks, such as hard-hat detection, license plate reading, vehicle detection, drone detection, and PCB defect inspection, among others. Called EasyVision, the AI models work with Flex Logix’s InferX AI edge accelerators.
“The algorithms exist, the training has occurred, but you want to build it into something that is robust and cost-effective,” Fuller said. “That’s where this whole kind of service makes a lot of sense, because the scientific development, converted to engineered product, is a process that still needs to be done, and something that we’re really focused on helping customers do.”
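Flex Logix’s own tooling is proprietary, but the general workflow of running a pre-trained vision model at the edge looks roughly like the ONNX Runtime sketch below. The model file name, input shape, and normalization are placeholders, not an actual EasyVision artifact.

```python
import numpy as np
import onnxruntime as ort

# Generic edge-inference sketch with a pre-trained detection model.
# "hard_hat_detector.onnx" and the 1x3x640x640 input are hypothetical placeholders.
session = ort.InferenceSession("hard_hat_detector.onnx")
input_name = session.get_inputs()[0].name

frame = np.random.rand(1, 3, 640, 640).astype(np.float32)   # stand-in for a camera frame
outputs = session.run(None, {input_name: frame})            # raw detections from the model

print([o.shape for o in outputs])   # thresholding / NMS post-processing would follow here
```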
Vision processing on the edge
While the image sensor may not be built into edge processors, or vice versa, image sensors are being paired with new processor and SoC combinations that use AI at the edge for image processing. Flex Logix is among the startups tackling edge-AI image processing, as mentioned. Another is SiMa.ai, which created its MLSoC (Machine Learning System on Chip) platform for computer vision using Synopsys’ DesignWare ARC Embedded Vision Processor and Arm compute IP. Other companies working on edge AI for vision include Ambarella, Brainchip, and Hailo.ai.
An example of an edge vision device with AI is a tiny AR/VR camera from Au-Zone built around NXP’s i.MX RT1064 MCU, which is based on an Arm Cortex-M7 core running at up to 600 MHz. Au-Zone’s camera also is designed for IoT uses. The i.MX RT1064 is part of NXP’s i.MX RT crossover series for IoT and edge AI, which targets audio as well as vision applications.
But the ultimate edge is still augmented reality/virtual reality (AR/VR). That market is just starting to hit the consumer electronics industry’s radar after more than two decades of R&D, in large part due to faster chips, much lower power, and price points that are starting to make these devices attractive.
Miniaturization, multiplication
Cameras and their ICs are getting smaller and more complex, with image sensors gaining multiple layers. One driving force is AR/VR glasses and headsets, where multiple tiny cameras are needed, each focused on a particular aspect of human interaction. Separate cameras may be needed to track a person’s lips, eyes, and hand gestures, for example.
With AR/VR, multiple cameras will be needed in the bezel of eyeglasses. “The need for performance plus size is very important,” said OmniVision’s Patel. “We are sampling our global shutter image sensor, called OG0TB, which uses three-layer stacking.”
Fig. 4: OmniVision’s three-layer OG0TB image sensor, which adds analog, logic, ADC, and MIPI/C-PHY circuitry to the pixel wafer. Source: OmniVision
“The three-layer stacking allowed us to make a camera size that is 1.6 x 1.6mm, and that compactness will allow you to put those in next-generation AR/VR devices,” said Patel. In the future, the layers may enable the sensor to avoid transmitting unneeded data. “The long-term vision is one of the reasons we went to layer stacking. You can imagine in the future that we could put some block, a CNN network, or some other digital functions on the sensor, instead of needing to send out all the RAW data. If the purpose is just to track your eye, why send all the RAW data? Instead, just send the x,y coordinate. If you had this processing block on the sensor itself, and we were able to achieve that goal, it would help reduce the bandwidth going to the host, because you can imagine how many cameras are fighting for the interface.”
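A quick back-of-the-envelope comparison shows why shipping only eye coordinates instead of RAW frames is attractive. The frame size, bit depth, frame rate, and coordinate format below are assumptions for illustration only.

```python
# Illustrative comparison: streaming RAW frames vs. on-sensor eye-tracking output.
width, height, bits_per_pixel, fps = 400, 400, 10, 120   # assumed eye-tracking camera format

raw_bps = width * height * bits_per_pixel * fps          # full RAW stream to the host
coord_bps = 2 * 16 * fps                                  # one 16-bit (x, y) pair per frame

print(f"RAW stream:  {raw_bps / 1e6:.1f} Mb/s")
print(f"Coordinates: {coord_bps / 1e3:.2f} kb/s  (~{raw_bps / coord_bps:,.0f}x less data)")
```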
AI handling of data from multiple cameras is being designed into chips. Renesas just launched an AI-capable microprocessor in its RZ/V series that can do AI processing of image data from multiple cameras, boosting the accuracy of image recognition for vision AI applications such as AI-equipped gateways, video servers, security gates, POS terminals, and robotic arms.
In bigger systems, such as automotive, drones, robotics, multiple cameras and vision systems are the norm, although they aren’t always miniature. “Often there is more than one vision system. Usually, you’re feeding data through an array of cameras and so forth,” said Paul Graykowski, a senior technical marketing manager for Arteris IP at the time of this interview. “You’ve got to ensure that you have the bandwidth and the latencies to deal with that.”
A drone flying autonomously uses cameras as vision sensors to avoid obstacles. “If it’s flying and can’t process the data fast enough, then boom, it hits a tree. We’ve got to ensure that we have that tight coupling within the data path to make sure we’re feeding those processors,” said Graykowski. “We obviously don’t want them sitting idle, either. That’s one of the key things when you’re dealing with vision — optimizing for performance to get the data through in a timely manner, so that you can do your machine learning on that data and then react accordingly. And that’s the case whether it’s ADAS, a drone, or robotics. In any type of vision system it is very important to get almost real time.”
Reacting as close to real time as possible is the ideal and the intent of the system.
“It’s being able to optimize the configuration of the NoCs and having that flexibility with it to specify, ‘I’ve got to move this data from point A to point B, it’s got to be there in this time,’ and just all those quality-of-service parameters to ensure that you’re hitting that goal,” said Graykowski.
Real time is not always needed. Sometimes letting a system “think” is okay. A system designer needs to know what data must be sensed, moved through AI accelerators, and crunched through CPUs to respond, and how close to real time that has to happen. An autonomous drone is a good example. It may take a little while for the drone to reroute as it thinks through the possibilities. A drone in the air might not be in as much immediate danger, though, as a car on the ground, whose ADAS must react in real time to avoid a crash.
“Vision has a lot of data coming from multiple sources. It is pretty much taking that data from all those sources and getting it to the processing units, or the AI units, very quickly so they can do their analysis on it,” said Graykowski. “Once they have an analysis and have identified what’s going on, that’s going to be passed to the CPU to do some thinking about, ‘Hey, this is how we’re going to react to it.’ You need a very fast data link to move the data in a timely manner, because most vision systems are reacting to things in real time. That’s the key with optimizing around vision — moving the data on time. Obviously area, power, and performance always come into it, but the reality is, in this type of thing, you have to ensure that you’re designing the NoC to not starve your AI engines so they can do the compute that they need to.”
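One way to think about keeping a system near real time is as a per-frame latency budget: capture, transport across the NoC, inference, and the CPU’s decision all have to fit inside one frame period, or the accelerators stall and the system falls behind. The numbers in this sketch are illustrative placeholders, not measured figures.

```python
# Illustrative per-frame latency budget for a 30 fps vision pipeline.
fps = 30
frame_budget_ms = 1000 / fps          # ~33.3 ms available per frame

stage_latency_ms = {                  # made-up stage latencies for illustration
    "sensor readout": 8.0,
    "ISP": 4.0,
    "NoC transfer to accelerator": 2.0,
    "NN inference": 15.0,
    "CPU decision": 3.0,
}

total = sum(stage_latency_ms.values())
verdict = "meets" if total <= frame_budget_ms else "misses"
print(f"Budget {frame_budget_ms:.1f} ms, used {total:.1f} ms -> {verdict} the real-time target")
```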
Conclusion
Image sensors are getting smaller while gaining pixels and layers, and vision processing SoCs now are designed with AI accelerators and edge AI processing. Cameras are ending up everywhere, and they are smaller, smarter, and more efficient. Vision systems run the gamut from cameras in mobile phones to industrial automation, automotive, medical, security, and on and on.
But understanding whether the viewer is a computer, a machine, or a human makes a big difference in system design, because it determines what data can be discarded. And new choices in sensors and AI/ML vision processing will help teams build leaner vision systems and obtain higher-quality images.
Related stories:
Scaling CMOS Image Sensors
Manufacturing issues grow as cameras become more sophisticated.
Detecting Spatial Blotches In Image Sensor Devices
A mixed statistical model to evaluate a common image sensor defect.