Apple's Cyclone Microarchitecture Detailed
by Anand Lal Shimpi on March 31, 2014 2:10 AM EST
The most challenging part of last year's iPhone 5s review was piecing together details about Apple's A7 without any internal Apple assistance. I had less than a week to turn the review around and limited access to tools (much less time to develop them on my own) to figure out what Apple had done to double CPU performance without scaling frequency. The end result was an (incorrect) assumption that Apple had simply evolved its first ARMv7 architecture (codename: Swift). Based on the limited information I had at the time I assumed Apple simply addressed some low hanging fruit (e.g. memory access latency) in building Cyclone, its first 64-bit ARMv8 core. By the time the iPad Air review rolled around, I had more knowledge of what was underneath the hood:
As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.
With Swift, I had the luxury of Apple committing LLVM changes that not only gave me the code name but also confirmed the size of the machine (3-wide OoO core, 2 ALUs, 1 load/store unit). With Cyclone however, Apple held off on any public commits. Figuring out the codename and its architecture required a lot of digging.
Last week, the same reader who pointed me at the Swift details let me know that Apple revealed Cyclone microarchitectural details in LLVM commits made a few days ago (thanks again R!). Although I empirically verified many of Cyclone's features in advance of the iPad Air review last year, today we have some more concrete information on what Apple's first 64-bit ARMv8 architecture looks like.
Note that everything below is based on Apple's LLVM commits (and confirmed by my own testing where possible).
Apple Custom CPU Core Comparison
| | Apple A6 | Apple A7 |
|---|---|---|
| CPU Codename | Swift | Cyclone |
| ARM ISA | ARMv7-A (32-bit) | ARMv8-A (32/64-bit) |
| Issue Width | 3 micro-ops | 6 micro-ops |
| Reorder Buffer Size | 45 micro-ops | 192 micro-ops |
| Branch Mispredict Penalty | 14 cycles | 16 cycles (14 - 19) |
| Integer ALUs | 2 | 4 |
| Load/Store Units | 1 | 2 |
| Load Latency | 3 cycles | 4 cycles |
| Branch Units | 1 | 2 |
| Indirect Branch Units | 0 | 1 |
| FP/NEON ALUs | ? | 3 |
| L1 Cache | 32KB I$ + 32KB D$ | 64KB I$ + 64KB D$ |
| L2 Cache | 1MB | 1MB |
| L3 Cache | - | 4MB |
As I mentioned in the iPad Air review, Cyclone is a wide machine. It can decode, issue, execute and retire up to 6 instructions/micro-ops per clock. I verified this during my iPad Air review by executing four integer adds and two FP adds in parallel. The same test on Swift actually yields fewer than 3 concurrent operations, likely because of an inability to issue to all integer and FP pipes in parallel. Similar limits exist with Krait.
I also noted an increase in overall machine size in my initial tinkering with Cyclone. Apple's LLVM commits indicate a massive 192 entry reorder buffer (coincidentally the same size as Haswell's ROB). Mispredict penalty goes up slightly compared to Swift, but Apple does present a range of values (14 - 19 cycles). This also happens to be the same range as Sandy Bridge and later Intel Core architectures (including Haswell). Given how much larger Cyclone is, a doubling of L1 cache sizes makes a lot of sense.
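To put the reorder buffer and mispredict figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers from the comparison table; the two helper functions and the idea of counting "issue slots lost" are simplifications introduced here for illustration, not something taken from Apple's commits. Treat the second figure as an upper bound, since a real pipeline rarely issues at peak width to begin with.

```python
# Figures taken from the comparison table (based on Apple's LLVM commits).
# Cyclone's mispredict penalty uses the 16-cycle figure from its 14 - 19 range.
CORES = {
    "Swift":   {"issue_width": 3, "rob_entries": 45,  "mispredict_penalty": 14},
    "Cyclone": {"issue_width": 6, "rob_entries": 192, "mispredict_penalty": 16},
}


def rob_cycles_at_peak(core):
    """Cycles of peak-rate issue that fit in the reorder buffer."""
    return core["rob_entries"] / core["issue_width"]


def slots_lost_per_mispredict(core):
    """Upper bound on issue slots left empty while the pipeline refills.

    Simplifying assumption: the core would otherwise have issued at full
    width on every cycle of the penalty.
    """
    return core["issue_width"] * core["mispredict_penalty"]


if __name__ == "__main__":
    for name, core in CORES.items():
        print(
            f"{name}: ROB covers {rob_cycles_at_peak(core):.0f} cycles at peak issue, "
            f"up to {slots_lost_per_mispredict(core)} issue slots lost per mispredict"
        )
```

Under these assumptions Cyclone's window covers roughly twice as many cycles of peak issue as Swift's, but each mispredict also wastes more potential work, which is one reason a wide core depends on good branch prediction.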
On the execution side Cyclone doubles the number of integer ALUs, load/store units and branch units. Cyclone also adds a unit for indirect branches and at least one more FP pipe. Cyclone can sustain three FP operations in parallel (including 3 FP/NEON adds). The third FP/NEON pipe is used for div and sqrt operations; the machine can only execute two FP/NEON muls in parallel.
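The per-clock limits described above can be written down as a small feasibility check. This is a minimal sketch, not a model of Apple's actual scheduler: the class names, the `can_issue_together` helper and the one-per-clock limit on div/sqrt (inferred from those operations living on the third FP/NEON pipe) are assumptions made for illustration. Branch units are left out because it isn't clear how they map onto the issue ports.

```python
ISSUE_WIDTH = 6  # micro-ops per clock

# Per-class limits per clock. The fp_div_sqrt limit is an inference from the
# text (div/sqrt use the third FP/NEON pipe), not a confirmed figure.
LIMITS = {
    "int": 4,          # integer ALUs
    "load_store": 2,   # load/store units
    "fp_add": 3,       # all three FP/NEON pipes can add
    "fp_mul": 2,       # only two pipes can multiply
    "fp_div_sqrt": 1,  # assumed: third pipe only
}
FP_PIPES = 3
FP_CLASSES = ("fp_add", "fp_mul", "fp_div_sqrt")


def can_issue_together(mix):
    """Return True if the given {op_class: count} mix fits in one clock."""
    if sum(mix.values()) > ISSUE_WIDTH:
        return False
    if any(count > LIMITS[op] for op, count in mix.items()):
        return False
    # All FP classes share the same three FP/NEON pipes.
    fp_total = sum(mix.get(op, 0) for op in FP_CLASSES)
    return fp_total <= FP_PIPES


if __name__ == "__main__":
    # The mix used in the iPad Air review test: four integer adds + two FP adds.
    print(can_issue_together({"int": 4, "fp_add": 2}))
    # Three FP multiplies exceed the two multiply-capable pipes.
    print(can_issue_together({"fp_mul": 3}))
```

The four-integer-add plus two-FP-add mix from the review test fits exactly within the 6-wide limit, while adding a third FP add to it would not.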
I also found references to buffer sizes for each unit, which I'm assuming are the number of micro-ops that feed each unit. I don't believe Cyclone has a unified scheduler ahead of all of its execution units and instead has statically partitioned buffers in front of each port. I've put all of this information into the crude diagram below:
Unfortunately I don't have enough data on Swift to really produce a decent comparison image. With six decoders and nine ports to execution units, Cyclone is big. As I mentioned before, it's bigger than anything else that goes in a phone. Apple didn't build a Krait/Silvermont competitor; it built something much closer to Intel's big cores. At the launch of the iPhone 5s, Apple referred to the A7 as being "desktop class" - it turns out that wasn't an exaggeration.
Cyclone is a bold move by Apple, but not one that is without its challenges. I still find that there are almost no applications on iOS that really take advantage of the CPU power underneath the hood. More than anything Apple needs first party software that really demonstrates what's possible. The challenge is that at full tilt a pair of Cyclone cores can consume quite a bit of power. So for now, Cyclone's performance is really used to exploit race to sleep and get the device into a low power state as quickly as possible. The other problem I see is that although Cyclone is incredibly forward looking, it launched in devices with only 1GB of RAM. It's very likely that you'll run into memory limits before you hit CPU performance limits if you plan on keeping your device for a long time.
It wasn't until I wrote this piece that Apple's codenames started to make sense. Swift was quick, but Cyclone really does stir everything up. The earlier than expected introduction of a consumer 64-bit ARMv8 SoC caught pretty much everyone off guard (e.g. Qualcomm's shift to vanilla ARM cores for more of its product stack).
The real question is where does Apple go from here? By now we know to expect an "A8" branded Apple SoC in the iPhone 6 and iPad Air successors later this year. There's little benefit in going substantially wider than Cyclone, but there's still a ton of room to improve performance. One obvious example would be through frequency scaling. Cyclone is clocked very conservatively (1.3GHz in the 5s/iPad mini with Retina Display and 1.4GHz in the iPad Air). Assuming Apple moves to a 20nm process later this year, it should be possible to gain some performance by raising clock speeds without a power penalty. I suspect Apple has more tricks up its sleeve than that however. Swift and Cyclone were two tocks in a row by Intel's definition; a third in 3 years would be unusual but not impossible (Intel sort of committed to doing the same with Saltwell/Silvermont/Airmont in 2012 - 2014).
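For a sense of scale, the frequency scaling argument reduces to simple arithmetic: peak throughput is issue width times clock. The sketch below uses the shipping clocks quoted above; the 1600MHz target is a purely hypothetical value chosen for illustration, not a prediction or a leaked figure. The result is an upper bound, since memory latency does not shrink with core clock and real workloads scale less than linearly.

```python
ISSUE_WIDTH = 6  # micro-ops per clock, from Apple's LLVM commits

# Shipping Cyclone clocks quoted in the article, in MHz.
SHIPPING_CLOCKS_MHZ = {
    "iPhone 5s / iPad mini with Retina Display": 1300,
    "iPad Air": 1400,
}

# Hypothetical target clock, used only to illustrate the arithmetic.
HYPOTHETICAL_CLOCK_MHZ = 1600


def peak_uops_per_second(clock_mhz, width=ISSUE_WIDTH):
    """Theoretical peak micro-ops per second for one core."""
    return clock_mhz * 1_000_000 * width


def clock_speedup(base_mhz, new_mhz):
    """Best-case speedup from a clock increase alone (linear scaling)."""
    return new_mhz / base_mhz


if __name__ == "__main__":
    for device, mhz in SHIPPING_CLOCKS_MHZ.items():
        peak = peak_uops_per_second(mhz) / 1e9
        gain = clock_speedup(mhz, HYPOTHETICAL_CLOCK_MHZ)
        print(f"{device}: {peak:.1f} G micro-ops/s peak per core, "
              f"at most {gain:.2f}x at {HYPOTHETICAL_CLOCK_MHZ}MHz")
```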
Looking at Cyclone makes one thing very clear: the rest of the players in the ultra mobile CPU space didn't aim high enough. I wonder what happens next round.
182 Comments
grahaman27 - Monday, March 31, 2014 - link
Rumors are rumors. I doubt it, there is little real world benefit (in mobile) and they already missed the quad-core hype window.
Azurael - Monday, March 31, 2014 - link
Of course Qualcomm got it wrong, four narrower cores vs. two wider cores, when most applications aren't even using 2 threads on mobile... But this is all irrelevant until I can get an Apple Ax chip in an Android/Linux device. Right now, S800 is more than adequate for me...
blanarahul - Monday, March 31, 2014 - link
Blame NVIDIA and Samsung. First to the market with quad core phones. Qualcomm was basically forced to move to Quad Core to compete in marketing. That's why the 8960Pro exists. It's basically a Snapdragon 600 without 2 cores + on die modem.
Snapdragon 800 as a result was designed with Quad Core from the get go. But, then the market moved to Octa Core. Thankfully, Qualcomm appears to be sticking with 4 cores for their flagship SoCs.
dylan522p - Monday, March 31, 2014 - link
BS. Qualcomm was sticking with Krait without increasing IPC at all since 2010 till 2015. Their "rushed quad core" was the S4 Pro which was fundamentally broken. The S600 was basically a fixed version and S800 is just more of the same they have done for the past 4 years (changing process node and higher clocks). This is just CPU, they do some really great stuff with their modems and GPU and other areas.
Anders CT - Tuesday, April 1, 2014 - link
It would be much better to have two twice as fast cores than four slow ones. But in reality twice the performance comes at a much higher cost in silicon and power than x2.
easp - Tuesday, April 1, 2014 - link
Yes, but doubling core counts doesn't double performance, and having twice as many cores that don't do much also has a high cost in terms of silicon.
Apple's business model and the margins that go with it mean that a higher silicon cost is often acceptable. My guess is that their bigger concern is available fab capacity, at any price. Putting it another way, if a bigger chip means doubling their silicon cost, that might be acceptable. On the other hand, if it means they can only get half as many chips, and sell half as many iPhones, that's completely unacceptable.
Higher power consumption is also a complicated issue. Higher performance per watt is quite desirable, but the positive or negative impact is offset by the other participants in device power budget. On tablets, it's generally the display, by a long shot. On a phone, the display is less significant.
It seems, though, that Apple thinks a big, complex CPU offers sufficient advantages in terms of performance relative to the power consumption impacts.
tigmd99 - Monday, March 31, 2014 - link
RAM = battery life + costs. Battery life is an issue for phones with small batteries.
Wolfpup - Monday, March 31, 2014 - link
That's an amazing chip...so much better than I was expecting. I do wonder what it is Apple's planning on doing with them...
Also wish every iOS device since the third or fourth year had at least 2x the RAM. It's surprising it works as well as it does, but it's still horribly RAM starved. These current devices should have at least 2GB, and 4GB would be more reasonable than 1 IMO.
name99 - Monday, March 31, 2014 - link
Where does Apple go? Frequency scaling is obvious but uninteresting.
More interesting is that there remains an awful lot that we don't know about the micro-architecture, and which may have room for improvement. This includes, for example,
- quality of the branch prediction (including indirect branches)
- quality of the prefetchers
- quality of the load/store pipeline, including things like how many stores can be queued up pending completion, how aggressively loads are moved past stores, and how fast misprediction is recovered
Personally I think big changes in any of these areas are for the A9. For the A8, what needs to be improved is the uncore: Apple-controlled GPU much more tightly integrated with the CPUs, low latency (and larger) L3 shared by CPUs and GPU, ring gluing these all together. Even better (but maybe too aggressive for this iteration) would be a much smarter memory controller using some of the latest academic ideas like virtual write queue.
If you want aggressive scaling, at some point we'll probably get a third core added. MAYBE even with the A8, IF Apple go to 20nm. It seems to me that more than 3 cores is unnecessary right now (heck, there are few circumstances where even three would be used, unless they've made some great strides forward in parallelizing web browsing and PDF rendering), and that hyper threading is a distraction compared to other things they can do to improve the core that generate more bang for the buck.
General frequency scaling is, I think, less essential and less desirable than better turbo support. Don't design the CPU to run for a long time at 2.5GHz (for either power or overheating reasons) but allow it to hit those frequencies in brief bursts of a second or less to provide for higher snappiness.
Also, of course, much less sexy but coming will be the M8, Apple's own ultra-low-power sensor controller core (as opposed to the M7 which is a rebranded third-party core). It would be interesting to know something about that, but it'll be a long time if ever before we learn anything beyond a few crumbs. (Apple probably will boast at the iPhone6 release about how little power it uses, and in just how many circumstances it can do the job without waking the main core.)
Swift2001 - Tuesday, April 1, 2014 - link
Why? Given the parsimonious way that Apple parcels out memory and doesn't allow you to keep more than one app open all the time (somewhat changed right now, for background processes, audio streams and the like, and a stateful little stub to start up an app quickly, but not really "multi-tasking" like leaving some code to compile while you watch a movie and talk to your girlfriend), what would the extra RAM have done for a) speed and fluidity (very little) and b) battery life? RAM degrades battery quickly.