Monday, October 30, 2017

x86 4+ sockets systems, CPU interconnect, and why Bull


The advantages of a “glueless” architecture:
  • no specific development or expertise is required from the server manufacturer; any server maker can build an 8-socket server
  • the cost of 4-socket and 8-socket servers is therefore lower
The disadvantages of a “glueless” architecture:
  • the TCO goes up when scaling out
  • limited to 8-socket servers
  • cache coherency becomes difficult to maintain as the socket count increases
  • performance does not increase linearly
  • price/performance ratio decreases
  • efficiency not optimal when running large VMs
  • up to 65% of Intel QPI link bandwidth is consumed by the QPI source broadcast snoopy protocol
The primary workloads affected by the Intel QPI source broadcast snoopy issue are:
  • Java applications
  • large databases
  • latency-sensitive applications
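The way broadcast snoop traffic grows with socket count can be sketched with a toy model. It assumes that every cache miss sends one snoop request to each other socket and gets one response back from each. The function name and the counting model are illustrative assumptions, not Intel's accounting, and the sketch does not try to reproduce the 65% figure.

```python
def snoop_messages_per_miss(sockets):
    """Toy model of source-broadcast snooping.

    Assumption: one snoop request goes to every other socket and
    one snoop response comes back from each of them.
    """
    peers = sockets - 1
    return 2 * peers


for sockets in (2, 4, 8):
    count = snoop_messages_per_miss(sockets)
    print(f"{sockets} sockets: {count} snoop messages per cache miss")
```

In this model the coherency traffic per miss grows linearly with the number of sockets, while the useful payload (one cache line) stays constant.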
Useful reading about snoopy protocols (MESI/MESIF/QPI):

Glued architecture

1) BULL BCS2 - 2 hops, 250 ns

BCS2 provides 7 XQPI links to connect to up to 7 other modules, for a system of up to 16 sockets.
Bandwidth:
• 1 XQPI link: 14 GT/s in each direction
• 1 transfer = 2 bytes, so 14 GT/s = 28 GB/s = 224 Gb/s
• Transfer rate between 2 modules:
  • 4 sockets (4 XQPI links): equivalent to ~88x 10 GigE ports
  • 8 sockets (2 XQPI links): equivalent to ~44x 10 GigE ports
  • 16 sockets (1 XQPI link): equivalent to ~22x 10 GigE ports
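The bandwidth arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the figures from the list (14 GT/s, 2 bytes per transfer); the constant and function names are made up for illustration, and a 10 GigE port is taken at its nominal 10 Gb/s.

```python
GT_PER_S = 14            # giga-transfers per second per XQPI link (from the text)
BYTES_PER_TRANSFER = 2   # bytes moved per transfer (from the text)
TEN_GIGE_GBPS = 10       # nominal line rate of one 10 GigE port (assumption)


def link_gbps():
    """Bandwidth of one XQPI link in Gb/s, per direction."""
    return GT_PER_S * BYTES_PER_TRANSFER * 8


def ten_gige_equivalent(links):
    """Number of 10 GigE ports matching `links` XQPI links."""
    return links * link_gbps() / TEN_GIGE_GBPS


for links in (4, 2, 1):
    total = links * link_gbps()
    ports = ten_gige_equivalent(links)
    print(f"{links} link(s): {total} Gb/s, about {ports:.1f} x 10 GigE")
```

One link works out to 224 Gb/s, or about 22.4 ports. The rounded figures in the list (~22, ~44, ~88) presumably round the per-link value down to 22 before multiplying.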

2) HPE Superdome X sx3000 crossbar - up to 8 hops, 486 ns

3) SGI NUMAlink 7 - up to 500 ns

4) Huawei KunLun - NCM > 2 hops

5) UBox XNC - eXternal Node Controller/UNC - 3rd generation of BCS - Bullion Sequana S1600 - S3200

The UBox is a 5U chassis embedding several UPI Node Controllers (UNC). The UNC is the 6th generation of eXternal Node Controller (XNC) designed and developed by Atos for Intel processor-based servers. It is a VLSI (Very Large-Scale Integration) integrated circuit derived from mainframe technologies and tuned for High Performance Computing. This innovative and unique Atos technology makes it possible to interconnect up to sixteen 2-socket modules, allowing SMP systems of up to 32 sockets in a Cache Coherent Non-Uniform Memory Access (CCNUMA) architecture.

Up to 8 nodes: classic Intel glueless.

8+ nodes:
- with UBox: 4+ socket servers with Intel Xeon Gold processors
- transparent Cascade Lake support
2 hops in the 16-socket Bullion Sequana S1600 system, at full bandwidth.

Topology: full mesh
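Hop counts like those quoted in this post depend on the interconnect topology. A minimal sketch, assuming each node is a vertex and each link an edge, computes the worst-case hop count with a breadth-first search. The helper names (`diameter`, `full_mesh`, `ring`, `hypercube`) are hypothetical, and the sketch counts node-to-node links only. How the vendors count hops (for example, whether the leg from a socket to its node controller is included) is not stated here, so the results are not meant to reproduce the quoted figures.

```python
from collections import deque


def diameter(adj):
    """Worst-case shortest path, in hops, over all node pairs.

    Assumes the graph described by `adj` is connected.
    """
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


def full_mesh(n):
    """Every node is linked directly to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def ring(n):
    """Each node is linked to its two neighbours only."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def hypercube(dim):
    """2**dim nodes, each linked to the nodes differing in one bit."""
    return {i: [i ^ (1 << b) for b in range(dim)] for i in range(1 << dim)}


print("full mesh, 8 nodes:", diameter(full_mesh(8)))
print("ring, 8 nodes:", diameter(ring(8)))
print("3-cube, 8 nodes:", diameter(hypercube(3)))
```

With 8 nodes, the full mesh needs 7 links per node but keeps every pair one hop apart, while topologies with fewer links per node pay for it in extra hops.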

Frankly, it is definitely not the 3rd generation of BCS. It is a third-party solution, a product of Numascale.

Core technology: SCI (Scalable Coherent Interface).

Single socket, single rail

Dual socket, single rail

Dual socket, single rail

Quad socket

Single Chassis

Dual Chassis

Quad Chassis Topology

Eight Chassis Topology


8 Sockets, Double and Single data planes

To meet customer application requirements, two UBox models are offered:
• Enterprise: the standard configuration, providing an all-to-all topology between CPUs. It provides both the performance and the high availability needed for memory-intensive applications such as SAP HANA.
• High Performance: well suited to High Performance Computing, doubling the bandwidth in the all-to-all topology between CPUs. It provides exceptional performance for CPU-intensive workloads.
The UBox is autonomous in terms of power, cooling and local management.

Enterprise mode:

High performance mode:

6) HPE Superdome Flex/NUMAlink 8 - up to 32 sockets - 400 ns

Advantages - 4+ socket server with Intel Xeon Gold processors

• 210 GB/s of bisection crossbar bandwidth at 8 sockets
• 425+ GB/s at 16 sockets
• 850+ GB/s at 32 sockets
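Dividing the quoted bandwidth figures by the socket count gives a rough idea of how they scale. This is a back-of-the-envelope sketch: the dictionary and function names are illustrative, and the "+" values are treated as plain lower bounds.

```python
# Quoted crossbar bandwidth in GB/s, keyed by socket count (from the text).
BISECTION_GB_S = {8: 210, 16: 425, 32: 850}


def per_socket_gb_s(sockets):
    """Quoted bandwidth divided evenly across the sockets."""
    return BISECTION_GB_S[sockets] / sockets


for sockets in sorted(BISECTION_GB_S):
    print(f"{sockets} sockets: {per_socket_gb_s(sockets):.2f} GB/s per socket")
```

The per-socket value stays at roughly 26 GB/s across all three sizes, which suggests that the quoted bandwidth grows about linearly with the number of sockets.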

Disadvantages - no transparent Cascade Lake support.
HPE Superdome Flex, figure 1
Superdome Flex ASIC

Glueless architecture

5) Intel Xeon Scalable 2-8 socket topologies - up to 2 hops, with reduced bandwidth.

6) Lenovo 8-socket topology - up to 2 hops, with reduced bandwidth.