Sunday, March 19, 2017

Some remarks on SAP AS ABAP single-thread performance on Intel Xeon E5-26xx v3/v4 and Power8

The source material for this post is my observation of the single-thread performance of SAP kernel 7.49 PL100 on different platforms. I have three platforms for tests: Intel Xeon E5-2670v3, Intel Xeon E5-2643v4 and Power8 (10-core 3.42 GHz POWER8 Processor Card). My single-thread task is an ABAP program that generates a detailed monthly general ledger report, so this post describes single-thread ABAP execution on different CPU generations (Intel) and architectures (Power).
The performance profile of my task (screenshot for the Intel E5-2643v4):
The test was performed on three different instances: the PAS on the E5-2670v3 (HT enabled), an AAS on the E5-2643v4 (HT enabled) and an AAS on Power8.
Starting conditions:
Software: PAS/AAS:
Intel platform:
  •  VMware 6.0
  •  RHEL 6.8
  •  SAP kernel 7.49 PL100
Power8 platform:
  •  SLES 11 SP4
  •  SAP kernel 7.49 PL100
HANA appliance:
  •  RHEL 6.5
  •  HANA PLATFORM Ed.1.0, SPS10
Intel platform: 8 vCPU, 64 GB vRAM (physical: Dell 730xD and M630)
Power8 platform: 2 x POWER8 Processor Card, 256 GB RAM (S822)
The HANA database runs on a certified appliance, a Dell R930 (1.5 TB RAM, dual E7-8880v3).
The AAS on Power8 runs in an LPAR with the full resources of the physical server.
Given these starting conditions and a single-thread task, speeding up execution requires a higher per-core frequency and larger data caches (at all levels) to minimize CPU-RAM traffic. All application servers on the Intel platform run in a virtualized environment, so I need to consider vNUMA and VM settings to avoid or minimize interleaving across NUMA nodes.
A brief digression on NUMA/vNUMA:
In ESXi 5, virtual NUMA, or vNUMA, was introduced. vNUMA exposes the NUMA topology of the ESXi host to the guest OS of the VM. This enables NUMA-aware guest OSs and applications to make the most efficient use of the underlying hardware’s NUMA architecture. This can provide significant performance benefits when virtualizing OSs and applications that make use of NUMA optimizations. When a VM is running on an ESXi 5 host and has been upgraded to virtual hardware version 8 or above, the VM will be able to make use of vNUMA.
What we must consider:
For NUMA scheduling purposes, a NUMA client (Physical Proximity Domain, PPD) is created per virtual machine and assigned a NUMA home node. A Virtual Proximity Domain (VPD) is presented to the guest as a NUMA node. We must avoid spanning VPDs across PPDs, if possible.
If you are not sure about the right manual NUMA layout, you can configure N virtual sockets with 1 core each and use this parameter:
  • numa.autosize
By default, NUMA optimization does not count hyperthreads (HT) when determining whether the virtual machine fits inside the NUMA home node. You can control this with the parameter:
  • numa.vcpu.preferHT
Note that the NUMA scheduler focuses on consuming as much local memory as possible and tries to avoid consuming remote memory. By default, vNUMA is enabled only for virtual machines with 9 or more vCPUs. This policy can be overridden with the following advanced virtual machine attribute:
  • numa.vcpu.min
NUMA alignment for non-wide VMs: wherever possible, size the VM so that its vCPU count divides evenly into the core count of a NUMA node. For example, on a hex-core system, use 2, 3, or 6 vCPUs.
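The sizing rule above can be sketched in a few lines of Python. This is only an illustration of the "divides evenly" reading of the rule; the function name `aligned_vcpu_counts` is hypothetical and not part of any VMware tool.

```python
def aligned_vcpu_counts(cores_per_numa_node: int) -> list[int]:
    """Return the vCPU counts that divide a NUMA node's core count evenly.

    Assumption: "NUMA aligned" is read here as "the node's core count is
    a whole multiple of the VM's vCPU count".
    """
    return [
        vcpus
        for vcpus in range(1, cores_per_numa_node + 1)
        if cores_per_numa_node % vcpus == 0
    ]


if __name__ == "__main__":
    # Hex-core NUMA node, as in the example above (1 vCPU fits trivially).
    print(aligned_vcpu_counts(6))
```

For a hex-core node the list contains 2, 3 and 6, matching the sizes named above, plus the trivial single-vCPU case.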
The NUMA/vNUMA topic is much larger, so I plan to cover it in a future article on running SAP applications in a VMware environment.

Back to our task: on the E5-2670v3 we have no NUMA problems, because the AS is not a wide VM; it has only 8 vCPUs on 12-core pCPU sockets:
But on the E5-2643v4 we have 6-core pCPUs, and we see 2 NUMA nodes for our VM:
To place the VM on a single NUMA node and avoid interleaving, I need to enable the preferHT option:
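The placement outcome on both Intel hosts can be reproduced with a deliberately simplified model. It only compares the vCPU count with the number of cores (or, with preferHT, hardware threads) per NUMA node and ignores memory size and everything else the real ESXi scheduler takes into account. The function name and the default of 2 threads per core are assumptions made for illustration.

```python
import math


def numa_nodes_needed(vcpus: int, cores_per_node: int,
                      threads_per_core: int = 2,
                      prefer_ht: bool = False) -> int:
    """Estimate how many NUMA nodes a VM spans (simplified model).

    Without prefer_ht only physical cores count toward node capacity;
    with prefer_ht hardware threads count as well. Memory is ignored.
    """
    capacity = cores_per_node * (threads_per_core if prefer_ht else 1)
    return math.ceil(vcpus / capacity)


if __name__ == "__main__":
    # E5-2670v3: 8 vCPUs on a 12-core socket.
    print(numa_nodes_needed(8, 12))
    # E5-2643v4: 8 vCPUs on a 6-core socket, default behaviour.
    print(numa_nodes_needed(8, 6))
    # E5-2643v4 with hyperthreads counted (preferHT enabled).
    print(numa_nodes_needed(8, 6, prefer_ht=True))
```

In this model the 8-vCPU VM needs one node on the 12-core host, two nodes on the 6-core host, and one node again once hyperthreads are counted.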
For the Power8 system I enabled Fixed Maximum Frequency mode with 3.6 GHz per core and SMT=off.
For the E5-26xx systems I enabled High Performance mode on the host. On the E5-2643v4, the preferHT option makes a 27.5% difference in the results: without it, the 8-vCPU VM ran across 2 NUMA nodes with interleaving (the work process can be rescheduled during execution, but virtual memory pages do not move between NUMA nodes as quickly as vCPU execution does; refer to the vSphere CPU scheduler document), giving 80 minutes of execution against 58 minutes on a single NUMA node.
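The 27.5% figure follows directly from the two run times. A minimal check, with a hypothetical helper name:

```python
def runtime_saving_percent(slow_minutes: float, fast_minutes: float) -> float:
    """Percentage of run time saved relative to the slower run."""
    return (slow_minutes - fast_minutes) / slow_minutes * 100


if __name__ == "__main__":
    # 80 minutes across 2 NUMA nodes vs 58 minutes on a single node.
    print(round(runtime_saving_percent(80, 58), 1))  # 27.5
```

The percentage is taken relative to the slower, two-node run.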
There are also some recommendations for aligning SAP instances in a vSphere environment:

All statistics for the Power8 system were collected with the nmon utility; materials on using it and building reports can be found here:
Power8 cpu load – no waits:
IO wait time of dialog processes during execution on Power8:
CPU time of dialog processes during execution on Power8:
Final result table: the E5-2643v4 is the leader. Yes, the comparison is not entirely fair, because I ignored the NUMA placement of the AAS on the Power8 platform, but given the higher CPU-RAM bandwidth of the Power8 system the effect cannot be significant: maximum memory bandwidth is 76.8 GB/s on the E5-2643v4 vs 204 GB/s on Power8.
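For reference, the Intel figure can be derived from the memory configuration. The sketch below assumes four DDR4-2400 channels per socket with 8 bytes per transfer; these inputs are not stated in this post and are assumptions. The Power8 value is simply taken from the text above.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1000 MB)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


if __name__ == "__main__":
    # Assumption: 4 channels of DDR4-2400 on the E5-2643v4.
    intel = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=2400)
    power8 = 204.0  # GB/s, figure quoted in the post
    print(f"E5-2643v4: {intel:.1f} GB/s")
    print(f"Power8 / E5-2643v4 ratio: {power8 / intel:.2f}")
```

Under these assumptions the Power8 system has roughly 2.7 times the peak memory bandwidth, which is why NUMA placement on Power8 is unlikely to change the ranking.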
The other tested t-codes are not as ABAP/CPU intensive; their metrics are for information only.
Overall conclusion: unfortunately, the Power8 platform offers no advantage for single-thread ABAP execution, because high-frequency Intel processors with 6 and 8 physical cores (E5-2643v4, E5-1680v4) are now available.
