We only recently reported on the story that Amazon was designing a custom server SoC based on Arm’s Neoverse N1 CPU platform, and now Amazon has officially announced the new Graviton2 processor, along with AWS instances based on the new hardware.

(Image: AWS Re:Invent event, via Twitter)

The new Graviton2 SoC is a custom design by Amazon’s own in-house silicon design teams and is the successor to the first-generation Graviton chip. The new chip quadruples the core count from 16 to 64 cores and employs Arm’s newest Neoverse N1 cores. Amazon is using the highest-performance configuration available: 1MB of L2 cache per core, all 64 cores connected by a mesh fabric with 2TB/s of aggregate bandwidth, and 32MB of integrated L3 cache.

Amazon claims the new Graviton2 chip can deliver up to 7x higher total performance across all cores than the first-generation Graviton powering the A1 instances, up to 2x the performance per core, and up to 5x faster memory access than its predecessor. The chip comes in at a massive 30 billion transistors on a 7nm manufacturing node. If Amazon is using high-density libraries similar to those of mobile chips (there is no obvious reason to use HPC libraries), then I estimate the die to fall around 300-350mm², if forced to put out a figure.
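As a rough check on that estimate, the arithmetic is simple. Note that the ~90-100 MTr/mm² effective densities below are my assumptions for TSMC 7nm high-density libraries, not figures Amazon has disclosed:

```python
# Back-of-the-envelope die-area estimate from the stated 30B transistor count.
# The 90-100 MTr/mm^2 effective densities are assumed values for TSMC 7nm
# high-density (mobile-style) libraries, not Amazon-confirmed numbers.
transistors = 30e9

for density_mtr_per_mm2 in (90, 100):
    area_mm2 = transistors / (density_mtr_per_mm2 * 1e6)
    print(f"{density_mtr_per_mm2} MTr/mm^2 -> ~{area_mm2:.0f} mm^2")
```

The two density assumptions bracket the 300-350mm² range quoted above.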

The memory subsystem of the new chip is fed by 8 DDR4-3200 channels with support for hardware AES-256 memory encryption, while peripherals are served by 64 PCIe 4.0 lanes.

Powered by the new-generation processor, Amazon also detailed its new 6th-generation instances M6g, R6g and C6g, offering various configurations up to the full 64 cores of the chip and up to 512GB of RAM for the memory-optimised instance variants. The instances feature 25Gbps “enhanced networking” connectivity, as well as 18Gbps of bandwidth to EBS (Elastic Block Storage).

Amazon is also making some very impressive benchmark comparisons against its fifth-generation instances, which run on Intel Xeon Platinum 8175 processors at up to 2.5GHz:

All of these performance enhancements come together to give these new instances a significant performance benefit over the 5th generation (M5, C5, R5) of EC2 instances. Our initial benchmarks show the following per-vCPU performance improvements over the M5 instances:
  • SPECjvm® 2008: +43% (estimated)
  • SPEC CPU® 2017 integer: +44% (estimated)
  • SPEC CPU 2017 floating point: +24% (estimated)
  • HTTPS load balancing with Nginx: +24%
  • Memcached: +43% performance, at lower latency
  • X.264 video encoding: +26%
  • EDA simulation with Cadence Xcellium: +54%
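For a rough single-number summary of Amazon's quoted gains, the geometric mean of the listed per-vCPU improvements (my own aggregation, not a figure Amazon published) works out to roughly +36%:

```python
import math

# Amazon's quoted per-vCPU improvements over M5, expressed as ratios
# (from the list above: +43%, +44%, +24%, +24%, +43%, +26%, +54%).
gains = [1.43, 1.44, 1.24, 1.24, 1.43, 1.26, 1.54]

# Geometric mean is the conventional way to aggregate benchmark ratios.
geomean = math.exp(sum(math.log(g) for g in gains) / len(gains))
print(f"geomean uplift: +{(geomean - 1) * 100:.0f}%")  # roughly +36%
```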

Amazon is making M6g instances with the new Graviton2 processor available in preview for non-production workloads, with a wider rollout expected in 2020.

The announcement is a big win for Amazon, and especially for Arm’s endeavours in the server space, as the company tries to surpass the value that the x86 incumbents are able to offer. Amazon states that the new 6g instances offer 40% higher performance per dollar than the existing 5th-generation x86 platforms, which represents drastic cost savings for the company and its customers.
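To make the perf/$ claim concrete, here is the arithmetic with hypothetical numbers. The 1.2x performance and 0.85x price ratios below are illustrative assumptions of mine; only the ~40% figure comes from Amazon:

```python
# Hypothetical example of how a ~40% perf/$ advantage can arise: moderately
# higher per-vCPU performance combined with a lower hourly instance price.
perf_ratio = 1.20   # assumed: Graviton2 vCPU vs x86 vCPU performance
price_ratio = 0.85  # assumed: Graviton2 vs x86 hourly instance price

perf_per_dollar_gain = perf_ratio / price_ratio - 1
print(f"perf/$ improvement: +{perf_per_dollar_gain * 100:.0f}%")  # ~+41%
```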


Comments

  • Gondalf - Friday, December 6, 2019 - link

    TSMC 5nm does not give enough of an area advantage over 7nm to allow Poseidon. 5nm is more like a half node.
  • mode_13h - Tuesday, December 3, 2019 - link

    Just to nit pick, Purley is Intel's LGA 3647-based platform spec - not the core uArch or anything like that.
  • techbug - Friday, December 6, 2019 - link

    How the per-vCPU figure is calculated is totally over my head. Is it the total score on the Intel processor divided by its number of hardware threads (96, i.e. 2 sockets × 48 threads/socket), compared against the Arm processor's score divided by 128 (2 × 64 threads/socket)?
  • name99 - Tuesday, December 3, 2019 - link

    How many people buying AWS services care about latency rather than throughput?
    Sure, you need to hit a minimum per-core performance level, but once that's achieved what matters is the throughput/dollar (including eg rack volume and watts).

    Judging a design like this by metrics appropriate to the desktop is just silly.
  • ksec - Wednesday, December 4, 2019 - link

    It doesn't matter: you get 1 thread per Intel vCPU and 1 core per Arm vCPU. The units are the same. Not to mention a lot of clients and workloads like to have HT disabled.

    As long as the Arm vCPU is cheaper (which it is) and provides comparable performance (which it does; according to AWS it is 30% faster than a single 3.1GHz Skylake thread), then that is all that matters.
  • Sychonut - Tuesday, December 3, 2019 - link

    Now imagine this, but on 14++++++.
  • name99 - Tuesday, December 3, 2019 - link

    The numbers seem a bit strange, Andrei. I assume we all agree that, while this is a nice step forward in the Arm server space, the individual cores are no Lightnings.
    So let's look at area; TSMC 7nm so basically like with like:

    IF one chip has 32 cores (per yesterday's article) then one core (+support ie L3 etc) is ~10mm^2.
    Meanwhile Apple is about 16mm^2 (eyeballing it as about 1/6th of the die for 2 large+small cores,+ L2s + large system cache).
    So Apple seems to be getting a LOT more out of their die... Even putting aside the small cores, their area per big core (+LOTS of cache) is ~8mm^2.

    Of course DRAM PHYs take some space, but mainly around the edges.
    So possibilities?
    - 64 cores on the die, not 32? AND/OR
    - LOTS of IO? A few ethernet phy's, some flash controllers, some USB and PCIe?
    - lots of the die devoted to GPU/NPU?

    The only way I can square it is likely all three are true. Half the die is IO+GPU/NPU (which gets us to 5mm^2/core) AND there are actually 64 cores? WikiChip says an N1+L2 is supposed to be around 1.4mm^2 on 7nm, so throw in L3 and the numbers kinda work out.
  • ksec - Wednesday, December 4, 2019 - link

    They are 32 Core, not 64

    I/O takes up more space and does not scale well with node changes. Yes, there are a lot of I/O needs for servers, especially PCIe lanes.
  • Wilco1 - Wednesday, December 4, 2019 - link

    The chip has 64 cores, 8 DDR interfaces and 64 PCI lanes.

    I can't see the confusion about core count: a 48-core Centriq has 18 billion transistors, while this has 30 billion for 64 cores.
  • name99 - Wednesday, December 4, 2019 - link

    The "confusion" is that this article claimed 32 cores.

    And it's not a confusion, it's an attempt to confirm various points that would appear to be obvious (the number of cores, the amount of IO, AND --- you left this out --- the amount of non-CPU logic [GPU or NPU]) but which were omitted by this article or, apparently, simply incorrect in an earlier article.
