Fine-Tuning Humanoid Vision And Movement

Ongoing innovations are enabling humanoids to see and move more like humans, while smell and taste technologies are also being explored.


Key Takeaways:

  • Humanoids and autonomous vehicles share a zonal architecture, a need for FuSa, and a reliance on radar technology to see around corners and through objects.
  • Multiple cameras with different fidelities can better replicate a human’s field of view; for example, some can provide wide fields of view at low grade, low fidelity, while others strive for infinite fidelity in a very small space.
  • Natural movement requires calculating forward kinematics, inverse kinematics, distance, and depth in real time, and this must be achieved efficiently to not burn through battery power.

Humanoid robots still have a way to go to move safely among humans in a range of settings, including hospitals and homes. While cameras and radar enable good vision, there is little room for error, especially when children and the elderly are nearby. With autonomous vehicles already deployed on city streets, robotics developers are taking note of their corner-case mistakes and building vision and movement technology to avoid accidents.

Humanoids are one type of physical AI, and technology is progressing on every front, from advanced compute to miniaturized sensors, to ready these systems to be mass-deployed in the world sooner rather than later.

“You can break down the robots into many elements, from the main compute to zone, motor control, battery, charging, and environmental sensing,” observed Adam White, division president of power and sensor systems at Infineon Technologies. “Robots are similar to automotive vehicles in a few ways, including zonal control. Maybe you need FuSa, for example, with functional safety for the microcontrollers. You may need special Ethernet. Then there’s the motor control. We can reduce the size of the joints by using gallium nitride for the motors. The other benefit of that is you will be able to save battery life. For the environmental sensing/sensing the physical world, we are now doing a lot more reference designs to make sure we can offer integrators and robotics companies the best possible sensor, whether it’s radar, for example, with your eyes, whether it is CO2 or other gases for your nose, or whether it’s capacitive sensing where you touch something.”


Fig. 1: A humanoid hand. Source: Infineon

The system-level architecture of a robot is largely derived from autonomous vehicles. At the front end is a perception-sensing suite, which typically comprises cameras, at a minimum.

“Most vehicle OEMs also integrate radar sensing to compensate for the cameras’ inability to perform well at night or in low-light conditions, followed by the addition of lidar in certain classes of vehicles,” said Amit Kumar, director of automotive product management for Tensilica products at Cadence. “Looking ahead, thermal sensing is being introduced to add redundancy for nighttime driving. Robots are not very different, as they also need to operate in a mix of challenging weather and lighting conditions, making these sensors essential. However, when the form factor is humanoid, we must consider additional sensing modalities.”

Those include microphones (voice), tactile/haptic sensing (touch), and potentially sensor arrays for smell in the future.

Safety is one of the main overlaps between automotive and robotics. “Whether it’s industrial robotics, a humanoid, an arm, or some complex robotic device that doesn’t look anything like a human and is going to find its way into industry, all have safety consequences,” said Matthew Bubis, director of product management at Imagination Technologies. “These are heavy, quite dangerous machines, and they also have a whole range of camera inputs and lidar-type technology, if it’s required, to process complex mechanical machinery. This means the chip designs between robotics and automotive are very overlapped. We see customers starting in automotive on their chip designs to prepare for robotics. They take our GPU IP for conventional compute, such as edge detection or working out object classification in simple sensors, as well as complex AI. They ask, ‘Do I need to have an LLM-based system for a complex, chain-of-thought approach to working out what I need to do next?’ They can use that chip for automotive, and they’re also thinking about whether they need to apply all the safety standards that automotive GPUs have.”

Meanwhile, the robotics industry is waiting to see which form factors and use cases gain popularity.

“There’s a whole range of robotics startups at the moment,” said Bubis. “The assumption is that a portion of them might take off and have a mass production order for a particular robotic set. Our customers provide their chips to those startups along with nascent robotics companies in the hope that the company they partnered with succeeds. It could take five years because the company’s got to solve all sorts of mechanical challenges, as well as the AI software challenges of robotics. But once they solve those problems, they’ve got the chips already in that supply chain. In the meantime, the companies designing those chips are sustaining themselves through automotive channels.”

While humanoid robot progress is fast, there is still much that needs to be done to make humanoids more human. “There are many sensor challenges that must be addressed to achieve deterministic perception in unpredictable physical environments subject to variable lighting, vibration, occlusion, reflective materials, uneven terrain, and close human interaction,” said Edo Cohen, chair of the MIPI Alliance Physical AI Birds of a Feather (BoF) Group. “One major challenge is synchronizing the many different sensor types used within humanoids, including cameras, depth perception, tactile sensors, audio, force/torque sensing, and joint-position feedback. The data from these sensors must be reliably fused for balance, manipulation, and safe motion. Another significant challenge is efficiently moving and processing the large volumes of data generated by numerous high-, mid-, and low-bandwidth sensors with low and predictable latency, accurate timestamping, and assured data integrity between each sensor and its associated compute engine.”
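The timestamp-alignment part of that synchronization problem can be illustrated with a minimal Python sketch. It pairs each camera frame with the nearest sample from a faster sensor and rejects pairs whose skew exceeds a tolerance. The function names, sample rates, and 5 ms tolerance are illustrative assumptions, not taken from any MIPI specification or vendor stack.

```python
from bisect import bisect_left


def nearest_sample(timestamps, t):
    """Index of the entry in sorted `timestamps` that is closest to time t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose whichever neighbor is closer in time.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(camera_ts, imu_ts, max_skew_s=0.005):
    """Pair camera frames with the nearest IMU sample, dropping large skews."""
    pairs = []
    for cam_idx, t in enumerate(camera_ts):
        imu_idx = nearest_sample(imu_ts, t)
        if abs(imu_ts[imu_idx] - t) <= max_skew_s:
            pairs.append((cam_idx, imu_idx))
    return pairs


# Hypothetical timestamps in seconds: a ~30 Hz camera and a 100 Hz IMU
# whose stream ends early, so the last camera frame has no usable match.
camera_ts = [0.000, 0.033, 0.066, 0.100]
imu_ts = [0.000, 0.010, 0.020, 0.030, 0.040, 0.050, 0.060, 0.070]
pairs = align(camera_ts, imu_ts)
print(pairs)
```

A production system would more likely rely on hardware timestamps and interpolation rather than nearest-neighbor matching, but the rejection of stale pairs shows why predictable latency matters for fusion.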

Just as the internet and new connectivity technology brought an explosion of mobile phones and other devices, physical AI will bring a boom of devices at the edge. “We’re at that point in our thinking of, ‘Here comes physical AI. What’s the big killer application?’” said James Prior, head of marketing at MIPS, a GlobalFoundries company. “The cycle of innovation is going to be way shorter. Innovation is going way, way faster. In the next 5 to 10 years, we’ll see an explosion of devices shipping.”

As noted in a recent Boston Consulting Group report, “The challenge is not whether progress is occurring — it is how to interpret it. Improvements in perception, dexterity, planning, and reasoning are unfolding at different speeds, and highly visible demonstrations can obscure which capabilities are mature and which remain experimental.”

Mobility arguably has the furthest to go. “Vision, speech and basic manipulation are the most mature today because they benefit from the AI and sensor ecosystem,” said Chen Su, head of edge AI product marketing at Nvidia, which recently announced a full-stack, open robotics safety system as an extension of its Halos autonomous vehicle safety. “The harder problems are touch, whole-body reasoning, and mobility in unstructured environments where the robot has to interpret force, balance, contact, and reasoning in real-time. Capabilities like climbing stairs reliably, handling fragile objects or adapting to unexpected human behavior remain significantly more difficult than a task like recognizing speech.”

Vision
While a significant number of engineering hours have already been dedicated to solving vision for autonomous vehicles and other systems, there is always room for improvement, and humanoids have special needs due to their proximity to humans.

“Vision, sound, and acoustics are more important than voice in industrial,” said John Weil, vice president and general manager for IoT and edge AI processor business at Synaptics. “Vision systems in consumer are showing the way, and industrial will catch up quickly. There are industrial cameras, where a camera is mounted to some object. It could be a robot arm or an inspection system. Where things are going is the merger of multiple cameras to create a more complex field of view.”

This fusion of multimodal sensing data enables new functions. “The simplest one is continuous vision at a reduced fidelity,” said Weil. “It generates less data to be able to process at very high speed, then it knows if something is important and time-syncs that to higher resolution, or even a different light space. You might use one camera at one resolution to inspect something very fast. You may have another camera in the same field of view at a different color space, or different light, and it may be very high resolution. Then, you use the low-resolution camera to identify the object and then point the high-resolution camera at that field of view within the bigger space. From a robotics point of view, there’s a lot of flexibility in how people implement these capabilities, and we’ll see that happening.”
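As a rough illustration of the hand-off Weil describes, the Python sketch below maps a bounding box found in a low-resolution frame onto the pixel grid of a high-resolution camera. It assumes both cameras cover the same, already aligned field of view, which real systems would have to establish through calibration. The function name, frame sizes, and 10% margin are hypothetical.

```python
def scale_box(box, src_size, dst_size, margin=0.1):
    """Map an (x, y, w, h) box from the low-res frame to the high-res frame.

    A padding margin is added around the box, and the result is clamped to
    the bounds of the destination frame.
    """
    x, y, w, h = box
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    pad_w, pad_h = w * margin, h * margin
    left = max(0, int((x - pad_w) * sx))
    top = max(0, int((y - pad_h) * sy))
    right = min(dst_size[0], int((x + w + pad_w) * sx))
    bottom = min(dst_size[1], int((y + h + pad_h) * sy))
    return left, top, right - left, bottom - top


# Hypothetical detection in a 640x360 preview, mapped to a 3840x2160 sensor.
roi = scale_box((100, 50, 40, 20), (640, 360), (3840, 2160))
print(roi)
```

Only the returned region would then need to be read out or processed at full resolution, which is where the data and power savings come from.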

MCUs are generally enough to handle vision tasks, but it will depend on specs. “There’s a lot of vision use cases and vision systems with different resolutions, speeds, and specmanship,” Weil noted. “Is it 4K 120 hertz cameras? 4K 30? 2K 200? Different people want different things today.”
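To see why those specs matter, the following sketch estimates the uncompressed sensor data rate for the three configurations Weil mentions. The pixel dimensions (3840x2160 for 4K, 2048x1080 for 2K) and the 12 bits per pixel are assumptions made for illustration, since the quote does not define them.

```python
def raw_gbps(width, height, fps, bits_per_pixel=12):
    """Uncompressed video data rate in gigabits per second."""
    return width * height * fps * bits_per_pixel / 1e9


# Assumed resolutions; bits_per_pixel defaults to a hypothetical 12-bit raw format.
configs = {
    "4K 120": (3840, 2160, 120),
    "4K 30": (3840, 2160, 30),
    "2K 200": (2048, 1080, 200),
}
for name, (w, h, fps) in configs.items():
    print(f"{name}: {raw_gbps(w, h, fps):.2f} Gbit/s")
```

Under these assumptions the 4K 120 Hz stream carries four times the data of the 4K 30 Hz stream, so the choice of resolution and frame rate directly sets the interface and compute budget.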

Even if processors could run all images at high resolution, multiple cameras at different resolutions are still likely. “You’re going to have multiple perceptions,” said Weil. “Human vision fidelity gets better as it moves closer to their center. It’s hard to match that with a camera on a computer. But when you think about power and the amount of energy and how we look at the image, we have a low-power perceptive view that sees the world but doesn’t see it in a lot of detail. As we increase the detail, it takes more compute. That’s why we’re going to see multiple cameras and multiple compute levels, because we’re going to be able to have these really wide fields of view, and really perceptive environments at low grade, low fidelity. Then we’ll have scenarios when we want to know infinite fidelity in a very small space. Multi-modality means you have multiple cameras, and you’ll be aligning all that information, but you’ll be aligning it at a different fidelity.”

Industrial robots
In an industrial robotics setting, vision is a primary sense for inspection, quality control, and counting. In this Arcturus demo, object detection and color analysis are used to identify pills and detect contaminants on a conveyor belt.

Most OEMs are mainly using cameras right now. “Cameras have limitations,” said Giovanni Campanella, robotics and industrial automation general manager at Texas Instruments. “For example, in dark conditions, when there’s smoke or dust, under sunlight, or for very, very high-speed objects. By adding additional sensors, such as radar, sensor fusion allows the robot to detect those blind spots. ASIL-certified radar devices can create a safety bubble around the robot, where they can detect any objects or human kids very accurately in a fail-safe way.”

Humanoids borrow vision safety features from automotive. “We’re seeing robots with electromagnetic radars, the same as in a vehicle,” said Matt Commens, director of product management at Synopsys. “You might be driving toward the sun, and you can’t see. Your camera can’t see. But electromagnetic waves can see. So we’re seeing robots with optical sensors, as well as electromagnetic sensors.”

Vision simulation for robots, autonomous vehicles, and drones includes light, as well as lidar or radar. “How do you interpret the information coming back from sensors with a camera or an antenna, and how do you interpret that into a picture of the real world?” said Commens. “There’s technical simulation to figure out where the edges of the road are, where people cross the road. Then, the next layer up, we have a tool that looks at the human perception of light. When you’re doing any sort of lighting design, you’re doing it for people, typically, and our eyes are very nonlinear, very adaptive. The question is, I can see what the physical specs are, but what will this look like in my cockpit or my car screen, and what if there’s sunlight shining on it? Will I still be able to see it? With these textures of materials, what would this actually look like? The tool simulates a picture of what it will look like to a human, which is a form of simulation that’s different than the pure physical.”

Among the recent vision research:

Technical Paper | Research Organizations
VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands | Tsinghua University, HKUST, Xiaomi Robotics Lab
Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation | MARS Lab, Nanyang Technological University
A Modular Vision System for Practical Object Detection on Resource-Constrained Humanoid Robots | Laurentian University
Lightweight High-Performance CNN Design for Mobile Robotic Vision Systems | University of Leicester
A Lightweight Temporal Attention Mamba Network for Self-Supervised Monocular Depth Estimation | Jishou University
YOLO with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection with trustworthy multimodal AI in computer vision perception | University of Bath, Zhejiang University

Movement
Humanoid robots have evolved from automated storage and retrieval systems (ASRS) to autonomous mobile robots (AMR). The next phase is general-purpose robots. Once industry has solved the challenge of putting an upper body on an AMR with arms, fingers, and opposable thumbs, the next task is general-purpose movement. AI and the ability to run processing on the edge are enabling these next literal steps.

“If I want a robot to pick something up, I calculate forward kinematics, inverse kinematics, the distance and the depth. I go there and meet it slightly here or slightly there,” said Tapan Pattnayak, distinguished scientist at Addverb. “But does a human being do that? Does the human brain calculate, what is the XYZ there, and then do it? No. Humans think, ‘There is something. Maybe I will incline my leg slightly there and bend my knee slightly there, then I pick it up and come back here.’ That’s constant intelligence, and that is what we want to crack right now. That is physical AI.”
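For readers unfamiliar with the terms, the classical calculation Pattnayak contrasts with physical AI can be sketched for the simplest case, a two-link planar arm. This is a textbook illustration, not Addverb's method. The link lengths and target point are hypothetical, and the inverse function returns only one of the two possible joint solutions.

```python
import math


def forward(l1, l2, q1, q2):
    """Forward kinematics: joint angles (radians) to end-effector (x, y)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def inverse(l1, l2, x, y):
    """Inverse kinematics: target (x, y) to joint angles, taking q2 >= 0."""
    d2 = x * x + y * y
    # Law of cosines gives the elbow angle.
    c2 = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= c2 <= 1.0:
        raise ValueError("target is out of reach")
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


# Hypothetical arm with 0.4 m and 0.3 m links reaching for (0.5, 0.3).
q1, q2 = inverse(0.4, 0.3, 0.5, 0.3)
x, y = forward(0.4, 0.3, q1, q2)
print(f"q1={q1:.3f} rad, q2={q2:.3f} rad -> x={x:.3f}, y={y:.3f}")
```

A humanoid has dozens of joints rather than two, so closed-form solutions like this one generally give way to numerical or learned methods, which is part of the real-time compute burden described above.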

Hardware and software are equally important to humanoid functionality. “It’s 50/50. They are both important,” said TI’s Campanella. “The hardware piece includes the sensors, motor drives, and actuators. The majority of the hardware is actuators, and they are expensive. Especially in the hands, there may be up to 30 actuators in the hands, along with the BMS piece and the power interfaces. There’s a lot of hardware, and the software piece brings everything together.”

Once humanoids have full articulation and mobility, domestic cleaners could be one of the most desirable use cases, as seen in this demonstration of two Figure AI humanoids making a bed. Figure AI holds numerous humanoid-related patents, with many focused on movement and kinematics.

Recent movement research:

Technical Paper | Research Organizations
LEGO: Latent‑space Exploration for Geometry‑aware Optimization of Humanoid Kinematic Design | Korea University
Learning Social Navigation from Positive and Negative Demonstrations and Rule-Based Specifications | Korea University, Yonsei University, Carnegie Mellon University, Queen’s University
Spatio-Temporal Motion Retargeting for Quadruped Robots | Korea University, ETH Zurich, UCLA
Learn to Quantify Social Interaction with Constraints for Pedestrian Walking | Stockholm University
TAVEN: Task-driven Adaptive Viewpoint Exploration for Training-Free 3D Spatial Reasoning and Understanding | Singapore University of Technology and Design, The Chinese University of Hong Kong
Receding Horizon Trajectory Optimization Through Waypoints and Path Segments | TU Wien
MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction | Allen Institute for AI, University of Washington, UNC-Chapel Hill
Balancing Holding Torque and Dynamic Performance: Air-Lubricated, Friction-Utilizing Shoulder Joint for PAM-Driven Humanoids | Kyoto University
A Dynamic Motion Planner for Trajectory Tracking in HRC | Ruhr University Bochum, University of Siegen
MARCH: Model-Assisted Reinforcement Learning for the Perceptive Control of Humanoids over Sparse Footholds | Tufts University

Power
The next challenge is efficiently powering robots to move. “If you want every robot to work for one hour, but it can only operate for 10 minutes before it has to go and change its battery, that’s not the right approach,” said Sathishkumar Balasubramanian, head of products at Siemens EDA. “Energy is going to be the biggest problem. For all the other senses, we have seen enough papers and research that we appear to be figuring it out at a very fast pace.”

For a robot, the battery power consumption will depend on its end application. While an efficient chip can help reduce power use, the main consideration is the force those robots need to exert, which determines how many actuators and motors they will be equipped with. “Typically on such devices, the biggest power consumption will be generated by the motors, not by the silicon,” said Nebu Philips, senior director of strategy and business development at Synaptics. “In the case of touch, low power could be pretty important, because if you would like to provide a lot of data from multiple points, having this sensing as efficient as possible will improve the lifetime of the robot or the battery charge. Even though you have to provide a lot of data and pre-process it, this is not the main driver on the system. It’s hard to say how long on the battery a robot could really survive without knowing what it is doing.”
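A back-of-the-envelope power budget makes the point about motors concrete. In the sketch below, every number (a 1,000 Wh battery, 400 W average motor draw, 60 W for compute, 5 W for sensing) is a made-up placeholder rather than a figure from any vendor, and the model ignores conversion losses and duty cycles.

```python
def runtime_hours(battery_wh, loads_w):
    """Estimated runtime: battery energy divided by total average draw."""
    return battery_wh / sum(loads_w.values())


# Hypothetical average loads in watts.
loads_w = {"motors": 400.0, "compute": 60.0, "sensing": 5.0}
hours = runtime_hours(1000.0, loads_w)
motor_share = loads_w["motors"] / sum(loads_w.values())
print(f"{hours:.2f} h runtime, motors draw {motor_share:.0%} of the budget")

# Halving the sensing load barely moves the result, while halving motor
# draw changes it substantially.
print(f"{runtime_hours(1000.0, {**loads_w, 'sensing': 2.5}):.2f} h")
print(f"{runtime_hours(1000.0, {**loads_w, 'motors': 200.0}):.2f} h")
```

As Philips notes, the real answer depends on what the robot is doing, so the average motor draw is the term that would need to be measured per application.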

Power is as critical in automotive as it is in robotics. “It used to be, and still is in some cases, that if you want to provide this powerful compute capability in a car, you’re going to need quite an expensive cooling system, and you’re going to start draining the battery,” said Imagination’s Bubis. “Neither of those things is positive in the auto industry. They want to have air-cooled chips, and they want to have nothing draining their battery because the competition for range is really critical. In robotics, if machines are not connected by direct power or wired connections (and you can imagine them in a factory setting, wanting to move around that factory environment), then you’re going to want your onboard chips to be as low power as possible when they’re processing that compute.”

The need for low power even affects what data types customers are asking to be supported. “Lower precision data types mean less power demanded, less memory, smaller memory systems, and smaller chips,” Bubis noted.
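The memory side of that trade-off is simple arithmetic. The sketch below computes the weight storage needed for a model at several common precisions. The one-billion-parameter model size is a hypothetical example, and the calculation covers storage only, not the power savings Bubis also mentions.

```python
# Bits per value for some widely used numeric formats.
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_gigabytes(num_params, fmt):
    """Storage for model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BITS[fmt] / 8 / 1e9


params = 1_000_000_000  # hypothetical model size
footprint = {fmt: weight_gigabytes(params, fmt) for fmt in BITS}
for fmt, gb in footprint.items():
    print(f"{fmt}: {gb:.1f} GB")
```

Each halving of precision halves the memory needed for weights, which in turn shrinks the memory system a chip has to carry and feed.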


Fig. 2: Humanoid robot system, including power management. Source: Infineon

Smell and taste
The human senses of smell and taste are essential not only for enjoyment, but also for steering people toward safe food and away from toxic gases. These are use cases where humanoids can help.

“Looking ahead, as robots become more integrated into daily life, another sensing modality may emerge — the sense of smell,” said Cadence’s Kumar. “Gas sensors are already widely used in industrial applications, but adapting them to function as a robotic ‘nose’ is still an evolving area. For example, a gas sensor array with varied sensitivity patterns could act as olfactory receptors. If combined with neural network models trained on datasets representing common human-recognizable smells, it may be possible to replicate a basic sense of smell in robots.”
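The idea of treating a sensor array's response pattern as a fingerprint can be shown with a toy classifier. The sketch below swaps in a nearest-centroid method for the neural network models Kumar describes, purely to keep the example short. The four-sensor readings and the smell labels are invented data.

```python
import math


def normalize(v):
    """Scale a response vector to unit length so pattern matters, not strength."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def train(samples):
    """Compute one centroid of normalized responses per smell label."""
    centroids = {}
    for label, vectors in samples.items():
        normed = [normalize(v) for v in vectors]
        centroids[label] = [sum(col) / len(normed) for col in zip(*normed)]
    return centroids


def classify(centroids, reading):
    """Return the label whose centroid is closest to the normalized reading."""
    r = normalize(reading)
    return min(centroids, key=lambda label: math.dist(r, centroids[label]))


# Invented responses from a hypothetical array of four gas sensors.
samples = {
    "coffee": [[0.9, 0.2, 0.1, 0.4], [0.8, 0.3, 0.1, 0.5]],
    "smoke": [[0.1, 0.9, 0.7, 0.2], [0.2, 0.8, 0.8, 0.1]],
}
centroids = train(samples)
print(classify(centroids, [1.7, 0.5, 0.3, 0.9]))  # stronger coffee-like pattern
print(classify(centroids, [0.1, 0.5, 0.4, 0.1]))  # weaker smoke-like pattern
```

Normalizing each reading is one simple way to recognize the same smell at different concentrations. A trained neural network would be expected to handle overlapping and drifting responses far better than this sketch.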

At Harvard University’s School of Engineering and Applied Sciences, graduate student Haritosh Patel and his team developed an e-nose that can detect a wide range of gases, and even “sniff” via a system that has mini fans mounted on a CPU, with an MCU for AI/ML processing, and oxide-based sensors. One application is to help robots find people trapped in a dangerous place.

“Robots can be paired with these sensors to find if someone needs rescuing,” said Patel. “It can be used for first response, such as firefighters. Oftentimes they are making quick decisions on the fly — a lot of difficult decisions around how to save someone in a building. The e-nose can be used as a sense, and along with VR glasses and AR glasses, it can guide the firefighter to the optimal path to the person they want to rescue. That’s not always a straight path. It might be a path that’s a bit more tortuous, but it’s safer from a chemical perspective, and what they’d be ingesting and inhaling. From different applications, an e-nose stems from the fundamental question of, can we translate how we smell — the sense of smell — into a device and technology?”

The same gas detection capabilities are helpful in an industrial setting, where mobile robots can get closer to leaks than fixed sensor devices or humans.

As for taste, research is emerging at various institutes and showing up more in the media. Use cases include profiling high-value food such as coffee, wine, and cheese, or formulating medicine.

Recent smell and taste research:

Technical Paper | Research Organizations
Volatile, Electronic Tongue (e-tongue) and General Analysis | Washington State University
A matter of taste: Electronic tongue reveals AI inner thoughts | Penn State
A Soft and Flexible Artificial Tongue for Pungency Perception | East China University of Science and Technology
Highly Sensitive, Low-Energy-Consumption Biomimetic Olfactory Synaptic Transistors Based on the Aggregation of the Semiconductor Films | Hefei University of Techn
Artificial Intelligence and Olfaction: A Survey on the Sense of Smell for Robotics | UT Dallas
Flexible electronics in humanoid five senses for the era of artificial intelligence of things, AIoT | Southeast University, Nanjing, Southeast University Suzhou Campus, Xiaomi Smart Home Appliances, et al.

Conclusion
To interact safely with humans, humanoid robots need to see well, calculate accurately, and move naturally without consuming copious amounts of power. In many ways, humanoids are like autonomous vehicles, but they are more complex.

“You receive lots of information from radar, and you have to make decisions,” said Synopsys’ Commens. “The car can stop. And to turn, it’s going to turn the wheels. Robots are the same concept, but they have more moving parts. They have to make more decisions, and they are not operating on very strict, true streets.”

Beyond agile, real-time movement, humanoids may eventually replicate the full range of human senses, opening up an ever-wider range of use cases, from sommelier to caregiver for the elderly.


Related Articles

Humanoid Touch And Voice Are Improving Rapidly
General-purpose humanoid robots need all their senses to function equally well; vision and movement are the farthest along, but others are catching up.

Flexible ICs, MEMS, Metal Oxides Solve Fresh Problems
Existing technology is being upcycled and deployed in new ways as companies seek to measure more data, more accurately.

Increasing Roles For Robotics In Fabs
AI and robotics are taking on bigger, more complex, and increasingly autonomous tasks, but integration with existing equipment and processes remains a formidable challenge.


