OoOE

You’re going to come across the phrase out-of-order execution (OoOE) a lot here, so let’s go through a quick refresher on what that is and why it matters.

At a high level, the role of a CPU is to read instructions from whatever program it’s running, determine what they’re telling the machine to do, execute them and write the result back out to memory.

The program counter within a CPU points to the address in memory of the next instruction to be executed. The CPU’s fetch logic grabs instructions in order. Those instructions are decoded into an internally understood format (a single architectural instruction sometimes decodes into multiple smaller instructions). Once decoded, all necessary operands are fetched from memory (if they’re not already in local registers) and the combination of instruction + operands is issued for execution. The results are committed to memory (registers/cache/DRAM) and it’s on to the next one.

In-order architectures complete this pipeline in order, from start to finish. The obvious problem is that many steps within the pipeline are dependent on having the right operands immediately available. For a number of reasons, this isn’t always possible. Operands could depend on other earlier instructions that may not have finished executing, or they might be located in main memory - hundreds of cycles away from the CPU. In these cases, a bubble is inserted into the processor’s pipeline and the machine’s overall efficiency drops as no work is being done until those operands are available.

Out-of-order architectures attempt to fix this problem by allowing independent instructions to execute ahead of others that are stalled waiting for data. In both cases instructions are fetched and retired in-order, but in an OoO architecture instructions can be executed out-of-order to improve overall utilization of execution resources.
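To make the difference concrete, here is a minimal sketch of a toy single-issue machine in Python. Everything in it is an assumption for illustration: the `Instr` type, the `run` function, the register names and the latencies (a 100 cycle load standing in for a trip to main memory) are hypothetical and do not model Silvermont or any real core. It also assumes each register is written at most once, as if renaming had already happened, and it doesn't model in-order retirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str        # register written by this instruction
    srcs: tuple      # registers read by this instruction
    latency: int     # cycles from issue until the result is available


def run(program, out_of_order):
    """Return the cycle at which the last result becomes available."""
    # Results of not-yet-issued instructions are unavailable (infinity).
    # Registers never written by the program are treated as ready at cycle 0.
    ready_at = {ins.dest: float("inf") for ins in program}
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        # In-order: only the oldest waiting instruction may issue.
        # Out-of-order: the oldest *ready* instruction may issue.
        candidates = pending if out_of_order else pending[:1]
        for ins in candidates:
            if all(ready_at.get(s, 0) <= cycle for s in ins.srcs):
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, cycle + ins.latency)
                pending.remove(ins)
                break  # single-issue: at most one instruction per cycle
        cycle += 1
    return finish


# Hypothetical program: a load that misses the caches, one instruction
# that depends on it, and two instructions that do not.
PROGRAM = [
    Instr("load", "r1", (), 100),
    Instr("add", "r2", ("r1", "r0"), 1),
    Instr("mul", "r3", ("r4", "r5"), 3),
    Instr("add", "r6", ("r3", "r4"), 1),
]

in_order_cycles = run(PROGRAM, out_of_order=False)
ooo_cycles = run(PROGRAM, out_of_order=True)
print(in_order_cycles, ooo_cycles)
```

In the in-order run the independent multiply and its dependent add sit behind the stalled add until the load returns. In the out-of-order run they execute in the shadow of the miss, so the program finishes a few cycles sooner. The gap is small here because the toy program has so little independent work to hide under the miss.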

The move to an OoO paradigm generally comes with penalties to die area and power consumption, which is one reason the earliest mobile CPU architectures were in-order designs. The ARM11, ARM’s Cortex A8, Intel’s original Atom (Bonnell) and Qualcomm’s Scorpion core were all in-order. As performance demands continued to rise and smaller, lower power transistors became available, all of the players here started introducing OoO variants of their architectures. Although often referred to as out-of-order designs, ARM’s Cortex A9 and Qualcomm’s Krait 200/300 are only mildly OoO compared to Cortex A15. Intel’s Silvermont joins the ranks of the Cortex A15 as a fully out-of-order design by modern day standards. The move to OoO alone should be good for around a 30% increase in single threaded performance vs. Bonnell.

Pipeline

Silvermont changes the Atom pipeline slightly. Bonnell featured a 16 stage in-order pipeline. One side effect of the design was that all operations, including those that didn’t need cache accesses (e.g. operations whose operands were in registers), had to go through three data cache access stages even though nothing happened during those stages. In going out-of-order, Silvermont allows instructions to bypass those stages if they don’t need data from memory, effectively shortening the mispredict penalty from 13 stages down to 10. The integer pipeline depth now varies depending on the type of instruction, but you’re looking at a range of 14 - 17 stages.

Branch prediction, a staple of any progressive microprocessor architecture, improves tremendously with Silvermont. Silvermont takes the gshare branch predictor of Bonnell and significantly increases the size of all associated data structures. Silvermont also adds an indirect branch predictor. The combination of the larger predictors and the new indirect predictor should increase branch prediction accuracy.
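For readers unfamiliar with gshare, the textbook scheme XORs the branch address with a register of recent branch outcomes and uses the result to index a table of 2-bit saturating counters. The Python sketch below shows that general idea only. The class name, the `index_bits` parameter and the table size are assumptions for illustration; Intel hasn't disclosed the sizes of the structures in Bonnell or Silvermont, and this is not their implementation.

```python
class GsharePredictor:
    """Textbook gshare: (PC xor global history) indexes 2-bit counters."""

    def __init__(self, index_bits=10):
        # index_bits is a hypothetical size, not a Bonnell/Silvermont figure.
        self.mask = (1 << index_bits) - 1
        self.history = 0                           # recent outcomes, newest in bit 0
        self.counters = [1] * (1 << index_bits)    # start at "weakly not taken"

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the real outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def count_mispredicts(predictor, pc, outcomes):
    """Feed a sequence of outcomes for one branch, return the miss count."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


predictor = GsharePredictor(index_bits=4)
pattern = [True, False] * 50                             # taken, not taken, ...
warmup_misses = count_mispredicts(predictor, 0x40, pattern)
steady_misses = count_mispredicts(predictor, 0x40, pattern)
print(warmup_misses, steady_misses)
```

The history register is what lets a single branch with an alternating pattern be predicted correctly once the counters warm up. A larger table, which is the direction Silvermont moved in, reduces the chance that unrelated branches land on the same counter.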

Couple better branch prediction with a lower mispredict latency and you’re talking about another 5 - 10% increase in IPC over Bonnell.
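To see how accuracy and penalty compound, here is a back-of-the-envelope model in Python. Only the 13 and 10 cycle penalties come from the pipeline description above, and treating one stage as one cycle is itself an assumption. The base CPI, branch frequency and mispredict rates are made-up illustrative values, not Intel data, and `cpi_with_mispredicts` is a hypothetical helper.

```python
def cpi_with_mispredicts(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once mispredict stalls are added in."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Illustrative assumptions: 1 in 5 instructions is a branch, base CPI of 1.0.
BASE_CPI = 1.0
BRANCH_FRACTION = 0.20

# Hypothetical mispredict rates; the penalties are the 13 and 10 stage figures.
old_cpi = cpi_with_mispredicts(BASE_CPI, BRANCH_FRACTION, 0.08, 13)
new_cpi = cpi_with_mispredicts(BASE_CPI, BRANCH_FRACTION, 0.05, 10)

# IPC is the reciprocal of CPI, so the relative IPC gain is old_cpi / new_cpi - 1.
ipc_gain = old_cpi / new_cpi - 1.0
print(f"IPC gain: {ipc_gain:.1%}")
```

With these particular made-up inputs the gain lands just under 10%, but the result moves a lot with the assumed branch mix and mispredict rates, so treat it as a way to reason about the two levers rather than a prediction.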


174 Comments


  • t.s. - Tuesday, May 7, 2013 - link

    If Intel play fair few years back, maybe now we have competitive offerings from AMD. That practice Intel's doing hurt AMD alot. Until now.
  • Homeles - Monday, May 6, 2013 - link

    I'm sure Anand would be drawing plenty of comparisons if he had a Temash tablet in hand.
  • Bob Todd - Monday, May 6, 2013 - link

    As an owner of two Bobcat systems (laptop/mini-itx), I don't think a 25% boost from Jaguar is going to get us into the realm of "good enough" cpu performance for general computing in Windows. The same goes for Intel unless Silvermont is significantly faster than Jaguar. I'm excited that Intel is finally bringing something interesting to the table, even if we end up two to three generations away from a good experience in Windows with their (and AMD's) mobile offerings. This sounds like it will make for a beastly dual core Android phone though, even at lower clocks.
  • jjj - Monday, May 6, 2013 - link

    Hilarious difference in attitude when it comes to Intel.
    Tegra 4 gets into phones by "aggressively limiting frequency." while Intel " Max clock speeds should be lower than what’s possible in a tablet, but not by all that much thanks to good power management. "
    Objectivity at it's best.
  • Homeles - Monday, May 6, 2013 - link

    Your scenario is a false equivalency.
  • Krysto - Monday, May 6, 2013 - link

    Is it? I wouldn't accuse Anand of "objectivity" when it comes to Intel, whether it's on purpose, or involuntary.
  • nunomoreira10 - Monday, May 6, 2013 - link

    The point is tegra 4 was not exactly made for phones while Intel was, for that you have tegra4i

    its not exacly nvidia fault, everybody complained that tegra 3 was lacking, now tegra 4 which is competitive consumes to much, atleast there is a choice.
  • Homeles - Monday, May 6, 2013 - link

    A15s are big cores in relation to its relatives. The only way to fit not 2, not 4, but *5* of them in a phone on 28nm is to downclock them agressively. Just like the only way to fit Ivy Bridge in a tablet is to downclock it agressively.

    Anand did point out that the "the only A15 SoCs we've seen have been very leaky designs optimized for high frequency," and that if power consumption were prioritized instead (which I believe Tegra 4i is supposed to be), it would be less of a blowout.

    It's silly getting defensive about stock ARM cores anyways. It's not an attack on Nvidia by saying their stock ARM cores aren't all too spectacular -- it's not like they poured blood, sweat and tears into making their A15s the best thing ever.

    Finally, Tegra 4 is on a process that is rather significantly inferior to Intel's 22nm process. You think Nvidia would have to downclock agressively if they were on a level playing field and using Intel's 22nm process? I sure don't. But jjj and others here feel the need to get defensive whenever songs of praise are being sung about Intel, even when it's well deserved.
  • extide - Tuesday, May 7, 2013 - link

    I am in agreeance with what you said, but I do believe Tegra 4i is Cortex A9, not A15 like Tegra 4.
  • Wilco1 - Tuesday, May 7, 2013 - link

    The Korean Galaxy S4 has a 1.8GHz Exynos Octa, Tegra 4 does 1.9GHz. In what way are these "aggressively downclocked"? They run at their maximum frequency!
