There’s a lot more to boosting system performance than throwing cores or memory at the problem.
By Frank Ferro
Now that the iPhone 5 hype is quieting down, the discussion has turned to the A6 chip that is powering this must-have device. There is much speculation on what is inside the A6 processor. Is it a dual-core A15 or a custom architecture? Is it a ‘big.LITTLE’ architecture? What speed are the cores running at: 1.2GHz? Others argue that the graphics processor is of equal importance to the CPU for the overall user experience. In any case, Apple is boasting a 2x performance improvement over the previous-generation iPhone.
This discussion, as expected, has expanded to rival CPUs like Qualcomm’s custom Krait core, used in Snapdragon, or Intel’s Atom processor. With all the talk of processor performance (CPU and GPU), I found it interesting that there was only one brief reference to the memory architecture of the A6. At MemCon last week, Martin Lund, senior vice president at Cadence, mentioned in his keynote presentation the importance of ‘compute, interconnect and storage.’ He then went on to discuss the time and energy engineering teams spend optimizing the memory interface to minimize latency. Of course, at a memory conference we expect the focus to be on memory, but the point is well taken. The CPU is only one part of the equation.
For the best system performance, the interconnect and memory architecture need proper consideration as well, or the CPU’s speed and internal architectural efficiency will not be realized in the system. Getting the right SoC efficiency across the CPU/GPU, interconnect and memory system makes a big difference in battery life as well as in user-observed performance.
From my biased view of the world, I would agree that the on-chip interconnect, and to a lesser degree the memory subsystem, do not get the attention they deserve when considering the performance of these SoCs. It is certainly easier and more intuitive to talk about how raw megahertz and the number of CPU cores translate into a better product than about how the performance of the interconnect or the memory affects the system. For SoC architects and design engineers, however, it is another story. They are by necessity “getting it.” All the CPU and graphics performance in the world does not translate into better system performance in a heterogeneous compute environment if you have a poor interconnect and memory subsystem design.
Multi-threading and QoS techniques have been around for a long time as a way to improve system concurrency and memory efficiency. Even so, these architectural enhancements have not been widely used in embedded consumer SoCs. As I was scanning articles it was interesting to see that the new RAZR i phone uses a 2GHz single-core Medfield Atom processor with Hyper-Threading Technology. This technology allows a single core to do “two things at once,” and by taking advantage of concurrent processing they claim much lower power consumption compared to “ramping up dual cores.” Having two ‘virtual cores’ is a good illustration of how efficiency is gained by fully utilizing the existing hardware rather than throwing more cores at the problem.
The same concept of multi-threading can also be used—and given applications processor performance requirements really must be used—to get maximum utilization of the on-chip communications network. (I intentionally changed the terminology here, because we are now talking about more than a simple interconnect.) The use of multiple threads or virtual channels optimizes the utilization of connections in the network by combining physical connections that are under-utilized. Now one physical link appears as multiple logical links. This of course saves wires, but there are other advantages (keep reading).
Blocking and tackling. To help manage traffic from multiple processors (CPU, GPU, etc.) competing for DRAM resources, a quality-of-service (QoS) or priority setting is applied to the data. This can work fine, but trouble often arises when each core asserts that it has the highest-priority traffic, effectively giving no one priority. This is known as a panic failure. Other failures can occur, such as ‘head of line blocking,’ in which higher-priority traffic gets stuck behind lower-priority traffic. Having virtual channels at their disposal helps system designers avoid these blocking scenarios because a virtual channel provides a ‘passing lane’ for more important traffic. When one logical connection cannot make progress, other data that is not blocked is free to flow over the same physical link. The advantage here is much better system concurrency because processors spend much less time waiting (i.e., being blocked) for resources.
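To make the ‘passing lane’ idea concrete, here is a minimal sketch in Python. It is a toy model, not a description of any vendor's interconnect: it assumes a link that moves one packet per cycle, and the packet names, target names, stall times and the two helper functions are all hypothetical. It compares a single shared queue, where a stalled head packet blocks everything behind it, with two virtual channels sharing the same physical link.

```python
from collections import deque

def simulate_single_fifo(packets, stalled_until):
    """One shared queue: a blocked head packet stalls everything behind it."""
    queue = deque(packets)
    delivered = {}
    cycle = 0
    while queue:
        name, target = queue[0]
        # The head packet can only move once its target is no longer stalled.
        if cycle >= stalled_until.get(target, 0):
            queue.popleft()
            delivered[name] = cycle
        cycle += 1
    return delivered

def simulate_virtual_channels(packets, stalled_until, channel_of):
    """One physical link shared by several logical queues (virtual channels)."""
    channels = {}
    for name, target in packets:
        channels.setdefault(channel_of[name], deque()).append((name, target))
    delivered = {}
    cycle = 0
    while any(channels.values()):
        # Assumption: a lower channel number means higher priority.
        for ch in sorted(channels):
            q = channels[ch]
            if q and cycle >= stalled_until.get(q[0][1], 0):
                name, _ = q.popleft()
                delivered[name] = cycle
                break  # the physical link carries one packet per cycle
        cycle += 1
    return delivered

# Hypothetical traffic: a bulk GPU request arrives first and targets a DRAM
# bank that is busy until cycle 5; an urgent CPU request arrives right behind it.
packets = [("gpu_bulk", "dram_bank0"), ("cpu_urgent", "dram_bank1")]
stalled_until = {"dram_bank0": 5}
channel_of = {"cpu_urgent": 0, "gpu_bulk": 1}

fifo_result = simulate_single_fifo(packets, stalled_until)
vc_result = simulate_virtual_channels(packets, stalled_until, channel_of)

print("single FIFO:     ", fifo_result)
print("virtual channels:", vc_result)
```

In this toy run the urgent CPU request waits until cycle 6 behind the stalled GPU request in the single queue, but goes out at cycle 0 when it has its own virtual channel. The bulk request is delivered at cycle 5 either way, so the passing lane costs the lower-priority traffic nothing here.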
Get what you want and not what you need. At a recent sales refresher course I was reminded that customers will not buy a product, no matter how good it is, if they don’t think they need it. This is a perfect summary of why many of the non-blocking concurrency techniques that have been around for a long time are only now starting to be implemented on a broader scale in embedded SoCs. Throwing more processor cores at the problem can become impractical for power- and cost-sensitive applications, and DRAM bandwidth efficiency is not moving forward fast enough from a mobile perspective. So SoC designers are now being forced to get more efficiency out of the system in order to take full advantage of all the CPU power available to them. Now they both want and need solutions like non-blocking virtual channels to deliver the system concurrency that today’s smart devices demand.
—Frank Ferro is director of marketing at Sonics.