
Identify VMware CPU Load Balancing Techniques


The VMkernel CPU scheduler is crucial to providing good performance in a consolidated environment. Most processors today have multiple cores per socket, and controlling, managing and scheduling these multi-way processors is essential. The scheduler's job is to assign execution contexts (such as vCPUs) to physical processors.

The CPU Scheduler

The CPU Scheduler has the following features

  • Schedules the vCPUs on physical CPUs
  • Enforces the proportional-share algorithm for CPU usage
  • Supports SMP VMs
  • Uses relaxed co-scheduling for SMP VMs
  • Uses NUMA
  • Processor Topology/Cache aware
  • Hyperthreading

Schedules the vCPUs on physical CPUs

The scheduler checks physical CPU utilisation every 2-40 ms and migrates vCPUs as necessary.

Enforces the proportional-share algorithm for CPU usage

When CPUs are over-committed, the host time-slices the physical CPUs across all VMs, with each vCPU prioritised according to its resource allocation settings (Shares, Reservations and Limits).
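By way of illustration, here is a minimal Python sketch of the proportional-share idea (this is not the VMkernel's actual algorithm; the function, VM names and MHz figures are invented for the example). Shares determine each VM's slice of an over-committed CPU, while reservations act as a floor and limits as a ceiling:

# Minimal sketch of proportional-share CPU allocation (illustrative only,
# not the real VMkernel scheduler). Entitlements are in MHz.
def allocate_cpu(total_mhz, vms):
    """vms: list of dicts with 'name', 'shares', 'reservation', 'limit' keys."""
    total_shares = sum(vm["shares"] for vm in vms)
    allocations = {}
    for vm in vms:
        entitlement = total_mhz * vm["shares"] / total_shares  # share-based slice
        entitlement = max(entitlement, vm["reservation"])      # reservation = guaranteed floor
        entitlement = min(entitlement, vm["limit"])            # limit = hard ceiling
        allocations[vm["name"]] = round(entitlement)
    return allocations

print(allocate_cpu(10000, [
    {"name": "web01", "shares": 2000, "reservation": 1000, "limit": 8000},
    {"name": "db01",  "shares": 1000, "reservation": 0,    "limit": 4000},
]))
# {'web01': 6667, 'db01': 3333}

Note that this simplified version does not redistribute capacity freed up by limits or claimed by reservations; the real scheduler continually rebalances entitlements.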

Supports SMP VMs

If a VM is configured with multiple processors, it believes it is running on a dedicated physical multiprocessor. ESXi maintains this illusion by co-scheduling the vCPUs.

Co-scheduling is a technique for scheduling, descheduling, preempting and blocking related execution contexts (here, a VM's vCPUs) across multiple processors. Without it, vCPUs would be scheduled independently, breaking the guest's assumption that all of its processors make uniform progress.

The CPU scheduler takes "skew" into account when scheduling vCPUs. Skew is the difference in execution rates between two or more vCPUs in an SMP VM. The scheduler maintains a fine-grained cumulative skew value for each vCPU in a VM. Time spent in the hypervisor is excluded from the skew calculation because some hypervisor operations do not benefit from being co-scheduled. A vCPU is considered skewed if its cumulative skew value exceeds a configurable threshold, typically a few milliseconds.

Uses relaxed co-scheduling for SMP VMs

Relaxed co-scheduling means that when vCPUs have become skewed, only those that have fallen behind must be co-started: when any lagging vCPU is scheduled, the scheduler ensures that the other vCPUs that are behind are scheduled as well, rather than requiring every vCPU in the VM to run simultaneously.

The vCPUs that move too far ahead are stopped and wait for the other vCPUs to catch up. An idle vCPU does not accumulate skew and is treated as if it were running normally.
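To tie the skew and relaxed co-scheduling descriptions together, here is a hedged Python sketch (purely illustrative; the class, the 5 ms threshold and the tick values are assumptions, not VMkernel internals):

SKEW_THRESHOLD_MS = 5  # hypothetical configurable threshold

class SmpVm:
    def __init__(self, num_vcpus):
        # guest-visible progress per vCPU; differences between entries are the skew
        self.progress_ms = [0] * num_vcpus

    def record_run(self, vcpu, ran_ms):
        """Credit guest-visible progress to one vCPU (hypervisor time excluded)."""
        self.progress_ms[vcpu] += ran_ms

    def skewed_vcpus(self):
        """vCPUs that have fallen behind the leader by more than the threshold."""
        leader = max(self.progress_ms)
        return [i for i, p in enumerate(self.progress_ms)
                if leader - p > SKEW_THRESHOLD_MS]

    def must_costart(self):
        """Relaxed co-scheduling: only the lagging vCPUs must be co-started,
        while the leading vCPUs are stopped until the laggards catch up."""
        return self.skewed_vcpus()

vm = SmpVm(4)
vm.record_run(0, 20)   # vCPU 0 races ahead
vm.record_run(1, 18)
print(vm.must_costart())   # [2, 3] -> schedule these together on the next pass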

Uses NUMA

Please see this blog post for more information on NUMA

http://www.electricmonk.org.uk/2012/03/01/numa/

Processor Topology/Cache aware

The CPU scheduler uses processor topology information (socket, core and logical processor) to calculate and optimise the placement of vCPUs onto different sockets.

The CPU scheduler also takes advantage of the shared last-level cache (LLC) that exists between cores on the same socket. Because this cache sits on the processor itself, accesses to it bypass the main memory bus and are served far faster than accesses to main memory.

In some situations the CPU scheduler will spread the load across all sockets, and in others it can be beneficial to schedule all vCPUs onto the same socket so that they share the LLC; which is better depends on the workload and on whether the system is over- or under-committed.
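The following Python sketch illustrates that trade-off (it is not ESXi's placement logic; the socket counts and the packing/spreading heuristics are assumptions made for the example):

# Consolidate a VM's vCPUs onto one socket to share the last-level cache,
# or spread them across sockets for more aggregate cache and memory bandwidth.
def place_vcpus(num_vcpus, sockets, cores_per_socket, prefer_cache_sharing):
    placement = {}
    for v in range(num_vcpus):
        if prefer_cache_sharing:
            # pack siblings together so they share the LLC
            socket = (v // cores_per_socket) % sockets
        else:
            # round-robin across sockets for maximum aggregate bandwidth
            socket = v % sockets
        placement[f"vcpu{v}"] = f"socket{socket}"
    return placement

print(place_vcpus(4, sockets=2, cores_per_socket=8, prefer_cache_sharing=True))
# {'vcpu0': 'socket0', 'vcpu1': 'socket0', 'vcpu2': 'socket0', 'vcpu3': 'socket0'}
print(place_vcpus(4, sockets=2, cores_per_socket=8, prefer_cache_sharing=False))
# {'vcpu0': 'socket0', 'vcpu1': 'socket1', 'vcpu2': 'socket0', 'vcpu3': 'socket1'}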

Hyperthreading

Hyper-threading allows two threads to run on a single physical core. When the thread currently running on the core stalls or halts, hyper-threading lets the core work on a second thread instead. It makes the OS see twice as many logical processors and often yields a performance improvement.

The applications most likely to benefit are 3D rendering programs, heavy-duty audio/video transcoding apps, and scientific applications built for maximum multi-threaded performance, but you may also see a boost when encoding audio files in iTunes, playing 3D games and zipping/unzipping folders. The boost in performance can be up to 30%, although there are also situations where hyper-threading provides no boost at all.

NUMA

Intro

For the past decade, processor clock speeds have skyrocketed at rates exceeding even the predictions of Moore's Law. A multi-gigahertz CPU, however, needs to be supplied with an enormous amount of memory bandwidth in order to do its processing effectively.

Even a single CPU running a memory-intensive workload, such as a scientific computing application, can find itself constrained by memory bandwidth.

These problems are amplified many times over on symmetric multiprocessing (SMP) systems where many processors must compete for bandwidth on the same system bus.

What is NUMA?

NUMA stands for Non-Uniform Memory Access.

NUMA is an alternative approach that links several small, cost-effective nodes via a high performance interconnect. Each node contains both processors and memory, much like a small SMP system. However, an advanced memory controller allows a node to use memory on all other nodes, creating a single system image. When a processor accesses memory that does not lie within its own node (remote memory), the data must be transferred over the NUMA interconnect, which is slower than accessing local memory. Thus, memory access times are “non-uniform,” depending on the location of the memory, as the technology’s name implies

So what does Non-Uniform Memory Access really mean?

Non-Uniform Memory Access means that it will take longer to access some regions of memory than others. This is due to the fact that some regions of memory are on physically different busses from other regions

Imagine that you are baking a cake. You have a group of ingredients (= memory pages) that you need to complete the recipe (= process). Some of the ingredients you may have in your cabinet (= local memory), but some of the ingredients you might not have, and have to ask a neighbor for (= remote memory). The general idea is to try and have as many of the ingredients in your own cabinet as possible, since this reduces your time and effort in making the cake.
You also have to remember that your cabinets can only hold a fixed amount of ingredients (= physical nodal memory). If you try to buy more but have no room to store it, you may have to ask your neighbor to keep it in his/her cabinet until you need it (= local memory full, so allocate pages remotely).

What is meant by Local and Remote Memory?

The terms local memory and remote memory are typically used in reference to a currently running process. Local memory is then defined to be the memory on the same node as the CPU currently running the process; any memory that does not belong to that node is, by that definition, remote.
Local and remote memory can also be used in reference to things other than the currently running process. In interrupt context there is technically no currently executing process, but memory on the node containing the CPU handling the interrupt is still called local memory. The terms can also be used in relation to a device: for example, if a disk attached to node 1 is doing DMA, the memory it is reading or writing is called remote if it is located on another node (e.g. node 0).
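A toy Python model makes the local/remote distinction concrete (the latency figures and the 80/20 split are assumptions for illustration, not measurements from any real system):

LOCAL_NS = 100      # assumed latency to memory on the same node
NUMA_RATIO = 3.0    # assumed remote/local latency ratio

def access_latency_ns(cpu_node, page_node):
    """Local access if the page lives on the CPU's node, otherwise pay the interconnect penalty."""
    return LOCAL_NS if cpu_node == page_node else LOCAL_NS * NUMA_RATIO

# Process running on node 0, with 80% of its pages local and 20% remote:
avg = 0.8 * access_latency_ns(0, 0) + 0.2 * access_latency_ns(0, 1)
print(avg)  # 140.0 ns on average, versus 100 ns if every page were local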

What is the difference between NUMA and SMP?

The NUMA architecture was designed to surpass the scalability limits of the SMP architecture. With SMP (Symmetric Multi-Processing), all memory accesses are posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but the problem with the shared bus appears when you have dozens, even hundreds, of CPUs competing for access to it. NUMA alleviates these bottlenecks by limiting the number of CPUs on any one memory bus and connecting the various nodes by means of a high-speed interconnect.

Why should I use NUMA? What are the benefits of NUMA?

The main benefit of NUMA is, as mentioned above, scalability. It is extremely difficult to scale SMP past 8-12 CPUs; at that point the memory bus is under heavy contention. NUMA is one way of reducing the number of CPUs competing for access to a shared memory bus, accomplished by having several memory buses with only a small number of CPUs on each. There are other ways of building massively multiprocessor machines.

Issues

The high latency of remote memory accesses can leave the processors under-utilised, constantly waiting for data to be transferred to the local node, and the NUMA interconnect can also become a bottleneck for applications with high memory bandwidth demands.

Furthermore, performance on such a system may be highly variable: for example, an application may have its memory located locally on one benchmarking run, while a subsequent run happens to place all of that memory on a remote node. This phenomenon can make capacity planning much more difficult. Finally, processor clocks may not be synchronised between nodes, so applications that read the clock directly may behave incorrectly.

Typical Four-Processor NUMA Node Architecture

High-end servers are designed to support more than one system bus. One design approach is to create a number of nodes where each node contains some processors, some memory and, in some cases, an I/O subsystem, as shown in the diagram below.


Two Four-Processor NUMA Nodes Connected as an Eight-Processor NUMA System

To increase system capacity, additional nodes are connected using the high-speed cache-coherent system interconnect, as shown

In the diagram, all eight processors can access memory in both nodes coherently. For example:

  • A processor in Node 1 can access memory within Node 1, (that is, local or “near” memory) using a direct path through the memory controller in Node 1.
  • For the same processor to access memory in Node 2 (that is, “remote” or “far” memory), the path taken is through the memory controller in Node 1, out through the system interconnect, and then through the memory controller in Node 2.

It takes more time to access memory in another node than it takes to access local memory. This difference in memory access times is the origin of the name for these systems: non-uniform memory architecture (NUMA).

The ratio of the time taken to access near memory to the time taken to access far memory is referred to as the NUMA ratio. The higher the NUMA ratio value — that is, the greater the disparity between the time it takes to access far memory as compared to near memory — the greater the effect that NUMA characteristics may have on software performance.

A ratio of around 3:1 or lower is generally considered acceptable.
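As a small worked example (the latency values are assumptions, not figures from any particular system):

near_ns = 100   # time to reach local ("near") memory
far_ns = 300    # time to reach remote ("far") memory over the interconnect

numa_ratio = far_ns / near_ns
print(f"{numa_ratio:.0f}:1")   # 3:1, at the edge of the commonly quoted acceptable range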