
NUMA

Intro

For the past decade, processor clock speeds have risen dramatically, at rates exceeding even the predictions of Moore's Law. A multi-gigahertz CPU, however, needs to be supplied with an enormous amount of memory bandwidth in order to do its processing effectively.

Even a single CPU running a memory-intensive workload, such as a scientific computing application, can find itself constrained by memory bandwidth.

These problems are amplified many times over on symmetric multiprocessing (SMP) systems, where many processors must compete for bandwidth on the same system bus.

What is NUMA?

NUMA stands for Non-Uniform Memory Access.

NUMA is an alternative approach that links several small, cost-effective nodes via a high-performance interconnect. Each node contains both processors and memory, much like a small SMP system. However, an advanced memory controller allows a node to use memory on all other nodes, creating a single system image. When a processor accesses memory that does not lie within its own node (remote memory), the data must be transferred over the NUMA interconnect, which is slower than accessing local memory. Thus, memory access times are “non-uniform,” depending on the location of the memory, as the technology’s name implies.

So what does Non-Uniform Memory Access really mean?

Non-Uniform Memory Access means that it will take longer to access some regions of memory than others. This is because some regions of memory are on physically different buses from other regions.

Imagine that you are baking a cake. You have a group of ingredients (= memory pages) that you need to complete the recipe (= the process). Some of the ingredients you may have in your cabinet (= local memory), but some you might not have and will need to ask a neighbor for (= remote memory). The general idea is to try to keep as many of the ingredients in your own cabinet as possible, since this reduces the time and effort of making the cake.
You also have to remember that your cabinets can only hold a fixed amount of ingredients (= physical nodal memory). If you try to buy more but have no room to store it, you may have to ask your neighbor to keep it in his/her cabinet until you need it (= local memory full, so allocate pages remotely).

What is meant by Local and Remote Memory?

The terms local memory and remote memory are typically used in reference to a currently running process. That said, local memory is typically defined to be the memory that is on the same node as the CPU currently running the process. Any memory that does not belong to the node on which the process is currently running is then, by that definition, remote.
Local and remote memory can also be used in reference to things other than the currently running process. When in interrupt context, there technically is no currently executing process, but memory on the node containing the CPU handling the interrupt is still called local memory. Local and remote memory can also be used in terms of a disk: for example, if a disk attached to node 1 were performing DMA, the memory it is reading or writing would be called remote if it were located on another node (i.e., node 0).
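
One concrete way to see this split on Linux is to look at /proc/<pid>/numa_maps, which reports how many of a process's pages sit on each node. A minimal sketch, assuming a NUMA-enabled Linux kernel (the N0=123 style tokens are part of that kernel interface, not something from this post):

```python
# Sketch: count a process's resident pages per NUMA node by parsing
# /proc/<pid>/numa_maps on Linux (assumes a NUMA-enabled kernel).
from collections import defaultdict

def pages_per_node(pid="self"):
    counts = defaultdict(int)
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            for token in line.split():
                # Tokens such as "N0=123" mean 123 pages of this mapping
                # are resident on node 0.
                if token.startswith("N") and "=" in token:
                    node, _, pages = token.partition("=")
                    if node[1:].isdigit():
                        counts[int(node[1:])] += int(pages)
    return dict(counts)

if __name__ == "__main__":
    print(pages_per_node())   # e.g. {0: 15234, 1: 312} -> mostly on node 0
```

Pages counted against the node whose CPU is running the process are its local memory; everything else is remote.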

What is the difference between NUMA and SMP?

The NUMA architecture was designed to surpass the scalability limits of the SMP architecture. With SMP, which stands for Symmetric Multi-Processing, all memory accesses are posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but the problem with the shared bus appears when you have dozens, even hundreds, of CPUs competing for access to the shared memory bus. NUMA alleviates these bottlenecks by limiting the number of CPUs on any one memory bus and connecting the various nodes by means of a high-speed interconnect.

Why should I use NUMA? What are the benefits of NUMA?

The main benefit of NUMA is, as mentioned above, scalability. It is extremely difficult to scale SMP past 8-12 CPUs; at that number of CPUs, the memory bus is under heavy contention. NUMA is one way of reducing the number of CPUs competing for access to a shared memory bus. This is accomplished by having several memory buses and only a small number of CPUs on each of those buses. There are other ways of building massively multiprocessor machines.

Issues

The high latency of remote memory accesses can leave the processors under-utilized, constantly waiting for data to be transferred to the local node, and the NUMA interconnect can also become a bottleneck for applications with high memory-bandwidth demands.

Furthermore, performance on such a system may be highly variable; for example, an application may have its memory located locally on one benchmarking run, while a subsequent run happens to place all that memory on a remote node. This phenomenon can make capacity planning much more difficult. Finally, processor clocks may not be synchronised between nodes, so applications that read the clock directly may behave incorrectly.
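
One common way to tame that run-to-run variability is to pin a workload to a single node. A minimal sketch of the CPU side of this, assuming a Linux host that exposes /sys/devices/system/node (binding the memory as well would additionally need numactl or libnuma, which is not shown):

```python
# Sketch: pin the current process to the CPUs of NUMA node 0 on Linux.
# Memory binding would also need numactl/libnuma (set_mempolicy); not shown.
import os

def cpus_of_node(node=0):
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()          # e.g. "0-7,16-23"
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

if __name__ == "__main__":
    os.sched_setaffinity(0, cpus_of_node(0))   # 0 = the calling process
    print("Now restricted to CPUs:", sorted(os.sched_getaffinity(0)))
```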

Typical Four-Processor NUMA Node Architecture

High-end servers are designed to support more than one system bus. One design approach is to create a number of nodes where each node contains some processors, some memory, and, in some cases, an I/O subsystem, as shown in the picture below.


Two Four-Processor NUMA Nodes Connected as an Eight-Processor NUMA System

To increase system capacity, additional nodes are connected using the high-speed cache-coherent system interconnect, as shown below.

In the diagram, all eight processors can access memory in both nodes coherently. For example:

  • A processor in Node 1 can access memory within Node 1 (that is, local or “near” memory) using a direct path through the memory controller in Node 1.
  • For the same processor to access memory in Node 2 (that is, “remote” or “far” memory), the path taken is through the memory controller in Node 1, out through the system interconnect, and then through the memory controller in Node 2.

It takes more time to access memory in another node than it takes to access local memory. This difference in memory access times is the origin of the name for these systems: non-uniform memory architecture (NUMA).

The ratio of the time taken to access near memory to the time taken to access far memory is referred to as the NUMA ratio. The higher the NUMA ratio value — that is, the greater the disparity between the time it takes to access far memory as compared to near memory — the greater the effect that NUMA characteristics may have on software performance.

A NUMA ratio of around 3:1 or lower is generally considered a good target.
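
On Linux you can get a rough feel for this ratio from the ACPI SLIT distance table the kernel exposes in sysfs, where the local distance is normalised to 10. A minimal sketch, assuming /sys/devices/system/node is present:

```python
# Sketch: estimate the NUMA ratio from the Linux sysfs distance table.
# Local distance is normally reported as 10 (ACPI SLIT convention), so a
# remote distance of 21 corresponds to roughly a 2.1:1 NUMA ratio.
import glob

def numa_ratios():
    ratios = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node*/distance")):
        node = path.split("/")[-2]                 # e.g. "node0"
        with open(path) as f:
            distances = [int(d) for d in f.read().split()]
        local = min(distances)                     # distance to itself
        ratios[node] = [round(d / local, 2) for d in distances]
    return ratios

if __name__ == "__main__":
    for node, values in numa_ratios().items():
        print(node, values)   # e.g. node0 [1.0, 2.1]   node1 [2.1, 1.0]
```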

Intel-VT and AMD-V Technology

Early virtualization efforts relied on software emulation to replace hardware functionality. But software emulation can be a slow and inefficient process. Because many virtualization tasks were handled through software, VM behavior and resource control were often poor, resulting in unacceptable VM performance on the server.

Processors lacked the internal microcode to handle intensive virtualization tasks in hardware. Both Intel Corp. and AMD addressed this problem by creating processor extensions that could offload the repetitive and inefficient work from the software. By handling these tasks through processor extensions, the trapping and emulation of virtualization tasks through the operating system was essentially eliminated, vastly improving VM performance on the physical server.

AMD

AMD-V (AMD Virtualization) is a set of hardware extensions for the x86 processor architecture. Advanced Micro Devices (AMD) designed the extensions to perform repetitive tasks normally performed by software, improving resource use and virtual machine (VM) performance.

AMD Virtualization (AMD-V) technology was first announced in 2004 and added to AMD’s Pacifica 64-bit x86 processor designs. By 2006, AMD’s Athlon 64 X2 and Athlon 64 FX processors appeared with AMD-V technology, and today the technology is available on Turion 64 X2, second- and third-generation Opteron, Phenom and Phenom II processors.

Intel-VT

Intel VT (Virtualization Technology) is the company’s hardware assistance for processors running virtualization platforms.

Intel VT includes a series of extensions for hardware virtualization. The Intel VT-x extensions are probably the best recognized, adding migration, priority and memory handling capabilities to a wide range of Intel processors. By comparison, the VT-d extensions add virtualization support to Intel chipsets so that specific I/O devices can be assigned to specific virtual machines (VMs), while the VT-c extensions bring better virtualization support to I/O devices such as network switches.

Three alternative techniques now exist for handling sensitive and privileged instructions to virtualize the CPU on the x86 architecture:

  1. Full virtualization using binary translation
  2. OS assisted virtualization or paravirtualization
  3. Hardware assisted virtualization (first generation)

Full virtualization using binary translation

x86 operating systems are designed to run directly on the bare-metal hardware, so they naturally assume they fully ‘own’ the computer hardware. As shown in the figure below, the x86 architecture offers four levels of privilege, known as Rings 0, 1, 2 and 3, to operating systems and applications to manage access to the computer hardware.

While user level applications typically run in Ring 3, the operating system needs to have direct access to the memory and hardware and must execute its privileged instructions in Ring 0. Virtualizing the x86 architecture requires placing a virtualization layer under the operating system (which expects to be in the most privileged Ring 0) to create and manage the virtual machines that deliver shared resources.
Further complicating the situation, some sensitive instructions can’t effectively be virtualized as they have different semantics when they are not executed in Ring 0. The difficulty in trapping and translating these sensitive and privileged instruction requests at runtime was the challenge that originally made x86 architecture virtualization look impossible.
VMware resolved the challenge in 1998, developing binary translation techniques that allow the VMM to run in Ring 0 for isolation and performance, while moving the operating system to a user level ring with greater privilege than applications in Ring 3 but less privilege than the virtual machine monitor in Ring 0.

OS Assisted Virtualization or Paravirtualization

“Para-“ is an English affix of Greek origin that means “beside,” “with,” or “alongside.” Given the meaning “alongside virtualization,” paravirtualization refers to communication between the guest OS and the hypervisor to improve performance and efficiency.
Paravirtualization, as shown in the picture below, involves modifying the OS kernel to replace non-virtualizable instructions with hypercalls that communicate directly with the virtualization layer hypervisor. The hypervisor also provides hypercall interfaces for other critical kernel operations such as memory management, interrupt handling and timekeeping. Paravirtualization differs from full virtualization, where the unmodified OS does not know it is virtualized and sensitive OS calls are trapped using binary translation. The value proposition of paravirtualization is lower virtualization overhead, but the performance advantage of paravirtualization over full virtualization can vary greatly depending on the workload.

Hardware assisted virtualization (first generation)

Going back to the first descriptions of the processors’ hardware-assist capabilities: hardware vendors are rapidly embracing virtualization and developing new features to simplify virtualization techniques. First-generation enhancements include Intel Virtualization Technology (VT-x) and AMD’s AMD-V, which both target privileged instructions with a new CPU execution mode feature that allows the VMM to run in a new root mode below Ring 0. As depicted in the figure below, privileged and sensitive calls are set to automatically trap to the hypervisor, removing the need for either binary translation or paravirtualization. The guest state is stored in Virtual Machine Control Structures (VT-x) or Virtual Machine Control Blocks (AMD-V).
Processors with Intel VT and AMD-V became available in 2006, so only newer systems contain these hardware assist features.
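
A quick way to check whether a given host advertises these extensions is to look at the CPU flags the Linux kernel reports; a minimal sketch, assuming a Linux host (the vmx and svm flag names correspond to Intel VT-x and AMD-V respectively):

```python
# Sketch: report whether the CPU advertises hardware virtualization support
# by looking for the "vmx" (Intel VT-x) or "svm" (AMD-V) flag in /proc/cpuinfo.
def hw_virt_support():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                if "vmx" in flags:
                    return "Intel VT-x (vmx)"
                if "svm" in flags:
                    return "AMD-V (svm)"
    return "no hardware virtualization flag found"

if __name__ == "__main__":
    print(hw_virt_support())
```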

VMware document describing full virtualization, paravirtualization and hardware assist:

http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf

Page Files

If there were no such thing as virtual memory, then once you filled up the available RAM your computer would have to say, “Sorry, you cannot load any more applications. Please close another application to load a new one.”

With virtual memory, what the computer can do is look at RAM for areas that have not been used recently and copy them onto the hard disk. This frees up space in RAM to load the new application.

The read/write speed of a hard drive is much slower than RAM, and the technology of a hard drive is not geared toward accessing small pieces of data at a time. If your system has to rely too heavily on virtual memory, you will notice a significant performance drop. The key is to have enough RAM to handle everything you tend to work on simultaneously; then the only time you “feel” the slowness of virtual memory is when there’s a slight pause as you change tasks. When that’s the case, virtual memory is perfect.

When it is not the case, the operating system has to constantly swap information back and forth between RAM and the hard disk. This is called thrashing, and it can make your computer feel incredibly slow.

The area of the hard disk that stores the RAM image is called a page file. It holds pages of RAM on the hard disk, and the operating system moves data back and forth between the page file and RAM. On older consumer versions of Windows the swap file had a .SWP extension; on modern Windows the page file is a hidden file named pagefile.sys in the root of the volume.

On Linux it is a separate partition (i.e., a logically independent section of a HDD) that is set up during installation of the operating system and which is referred to as the swap partition.
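
A minimal sketch of how you could list the active swap areas on such a Linux system by reading /proc/swaps (the same information that swapon -s prints):

```python
# Sketch: list active Linux swap areas (partitions or files) from /proc/swaps.
def swap_areas():
    areas = []
    with open("/proc/swaps") as f:
        next(f)                                   # skip the header row
        for line in f:
            name, kind, size_kb, used_kb, prio = line.split()
            areas.append({
                "device": name,                   # e.g. /dev/sda2
                "type": kind,                     # "partition" or "file"
                "size_mb": int(size_kb) // 1024,
                "used_mb": int(used_kb) // 1024,
            })
    return areas

if __name__ == "__main__":
    for area in swap_areas():
        print(area)
```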

A common recommendation is to set the page-file size at 1.5 times the system’s RAM. In reality, the more RAM a system has, the less page file it requires. You should instead base your page-file size on the maximum amount of memory your system commits: your page-file size should equal your system’s peak commit value (which covers the unlikely situation in which all the committed pages have to be written to the disk-based page files).
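
To make that arithmetic concrete, a small sketch comparing the old 1.5x rule of thumb with the peak-commit-based sizing recommended above (the peak-commit value has to be measured, as described under “Finding Committed Memory” below; the 32GB/20GB figures are purely hypothetical):

```python
# Sketch: compare the "1.5 x RAM" rule of thumb with the peak-commit-based
# sizing recommended above. Sizes are in GB for readability.
def pagefile_recommendations(ram_gb, peak_commit_gb):
    rule_of_thumb = 1.5 * ram_gb
    # Recommended: size the page file to cover the system's peak commit charge,
    # i.e. the worst case where all committed pages had to be backed on disk.
    peak_commit_based = peak_commit_gb
    return {"1.5x_rule_gb": rule_of_thumb, "peak_commit_gb": peak_commit_based}

if __name__ == "__main__":
    # Hypothetical server: 32GB of RAM, observed peak commit of 20GB.
    print(pagefile_recommendations(ram_gb=32, peak_commit_gb=20))
    # -> {'1.5x_rule_gb': 48.0, 'peak_commit_gb': 20}
```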

Locating the Page File (Windows)

Paging file configuration is in the System properties, which you can get to by typing “sysdm.cpl” into the Run dialog, clicking on the Advanced tab, clicking on the Performance Options button, clicking on the Advanced tab (this is really advanced), and then clicking on the Change button:

You’ll notice that the default configuration is for Windows to automatically manage the page file size.

Finding Committed Memory

In Windows XP and Server 2003, you can find the peak-commit value under the Task Manager Performance tab.

However, this option wasn’t included in Windows Server 2008 and Vista. To determine Server 2008 and Vista peak-commit values, you have two options:

  1. Download Process Explorer from the Microsoft “Process Explorer v11.20” web page. Open the .zip file and double-click procexp.exe. Click View on the toolbar and select System Information. Under Commit Charge (K), find the Peak value.
  2. Use Performance Monitor to log the Memory – Committed Bytes counter, and review the log to find the Maximum value (a scripted way to sample the same counter is sketched below).

Make sure you run the server with all of its expected workloads, to ensure it’s using the maximum amount of memory while you’re monitoring.
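
As an additional, scripted option (my own suggestion, not part of the list above), the same Memory\Committed Bytes counter can be sampled from the command line with Windows’ built-in typeperf tool; a minimal sketch, assuming typeperf.exe is on the PATH:

```python
# Sketch: sample the "\Memory\Committed Bytes" performance counter on Windows
# via typeperf.exe and report the highest value seen during the sample window.
import subprocess

def peak_committed_bytes(samples=60, interval_s=1):
    out = subprocess.run(
        ["typeperf", r"\Memory\Committed Bytes",
         "-si", str(interval_s), "-sc", str(samples)],
        capture_output=True, text=True, check=True,
    ).stdout
    values = []
    for line in out.splitlines():
        parts = [p.strip('"') for p in line.split('","')]
        # Data rows look like: "04/01/2023 10:00:01.000","123456789.000000"
        if len(parts) == 2:
            try:
                values.append(float(parts[1]))
            except ValueError:
                pass                                # skip the CSV header row
    return max(values) if values else None

if __name__ == "__main__":
    peak = peak_committed_bytes()
    print(f"Peak commit during the sample window: {peak / 2**30:.1f} GB")
```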

Maximum Page File Sizes

Windows XP/2003

When that option is set on Windows XP and Server 2003, Windows creates a single paging file whose minimum size is 1.5 times RAM if RAM is less than 1GB (3 times RAM if it is greater than 1GB), and whose maximum size is three times RAM.

Windows Vista/2008

On Windows Vista and Server 2008, the minimum is intended to be large enough to hold a kernel-memory crash dump and is RAM plus 300MB or 1GB, whichever is larger. The maximum is either three times the size of RAM or 4GB, whichever is larger.
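
To make those default-sizing rules concrete, here is a small sketch that simply encodes them as stated above (values in GB, and purely illustrative):

```python
# Sketch: the automatic page-file sizing rules described above, in GB.
def default_pagefile_size(ram_gb, windows="vista_2008"):
    if windows == "xp_2003":
        # XP / Server 2003: minimum of 1.5 x RAM below 1GB of RAM
        # (3 x RAM above 1GB, per the text above); maximum of 3 x RAM.
        minimum = 1.5 * ram_gb if ram_gb < 1 else 3 * ram_gb
        maximum = 3 * ram_gb
    else:
        # Vista / Server 2008: minimum of RAM + 300MB (~0.3GB) or 1GB,
        # whichever is larger; maximum of 3 x RAM or 4GB, whichever is larger.
        minimum = max(ram_gb + 0.3, 1)
        maximum = max(3 * ram_gb, 4)
    return minimum, maximum

if __name__ == "__main__":
    print(default_pagefile_size(8))                     # (8.3, 24)
    print(default_pagefile_size(8, windows="xp_2003"))  # (24, 24)
```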

Limits

Limits related to virtual memory are the maximum size and number of paging files supported by Windows.

32-bit Windows has a maximum paging-file size of 16TB (4GB if, for some reason, you run in non-PAE mode). Physical Address Extension (PAE) is a feature that allows 32-bit x86 processors to access a physical address space (including RAM and memory-mapped devices) larger than 4GB.

64-bit Windows can have paging files that are up to 16TB on x64 and 32TB on IA64. For all versions, Windows supports up to 16 paging files, each of which must be on a separate volume.

Some feel that having no paging file results in better performance, but in general having a paging file means Windows can write pages on the modified list (pages that aren’t being accessed actively but have not been saved to disk) out to the paging file, making that memory available for more useful purposes (processes or file cache). So while there may be some workloads that perform better with no paging file, in general having one means more usable memory is available to the system (never mind that Windows won’t be able to write kernel crash dumps without a paging file sized large enough to hold them).

VMware and Page Files

When creating VMs in VMware, whether Linux or Windows guests, VMware by default makes the VM’s swap file the same size as the assigned memory: a 1:1 mapping.

E.g. a 60GB disk + a 32GB swap file = 92GB of total storage taken.

This came up in a meeting we had to discuss why some of our VMs, which were assigned 255GB of memory, were taking up so much storage space!

The swap file on VMware is called VM-NAME.vswp; you can see it if you look in the Datastore Browser for a VM.

From a Forum

*.vswp file – This is the VM swap file (earlier ESX versions had a per-host swap file) and is created to allow for memory overcommitment on an ESX server. The file is created when a VM is powered on and deleted when it is powered off. By default, when you create a VM the memory reservation is set to zero, meaning no memory is reserved for the VM and it can potentially be 100% overcommitted. As a result, a vswp file is created equal to the amount of memory that the VM is assigned minus the memory reservation that is configured for the VM. So a VM that is configured with 2GB of memory will create a 2GB vswp file when it is powered on; if you set a memory reservation of 1GB, then it will only create a 1GB vswp file. If you specify a 2GB reservation then it creates a 0-byte file that it does not use. When you do specify a memory reservation, physical RAM from the host will be reserved for the VM and not be usable by any other VMs on that host. A VM will not use its vswp file as long as physical RAM is available on the host. Once all physical RAM on the host is used by its VMs and the host becomes overcommitted, VMs start to use their vswp files instead of physical memory. Since the vswp file is a disk file, it will affect the performance of the VM when this happens.
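
The sizing rule from that explanation is simple enough to express directly; a small sketch (the figures mirror the 2GB example above and the 60GB + 32GB case from earlier):

```python
# Sketch: size of a VM's .vswp file and its total datastore footprint,
# following the rule "vswp = configured memory - memory reservation".
def vswp_size_gb(configured_mem_gb, reservation_gb=0):
    return max(configured_mem_gb - reservation_gb, 0)

def datastore_footprint_gb(disk_gb, configured_mem_gb, reservation_gb=0):
    return disk_gb + vswp_size_gb(configured_mem_gb, reservation_gb)

if __name__ == "__main__":
    print(vswp_size_gb(2, 0))              # 2GB vswp, as in the example above
    print(vswp_size_gb(2, 1))              # 1GB vswp
    print(vswp_size_gb(2, 2))              # 0 bytes - memory fully reserved
    print(datastore_footprint_gb(60, 32))  # 92GB, the "60GB + 32GB" case
```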

VMware Visio Action Pack

Overview

http://xtravirt.com/visio-action-pack-re-released-free-member-download

This excellent Visio icon pack for VMware & virtualization has been re-released as a free member download. It contains over 70 unique icons together with a user guide and sample diagram templates.

It is designed for Windows operating systems and runs on Microsoft Visio 2003 and Microsoft Visio 2007, although it also works in OmniGraffle on the Mac. The icons are not compatible with Microsoft Visio 2000 or 2002.

Dilbert

What's the difference between VMware vCLI and VMware PowerCLI?

To automate the management of an ESXi deployment, VMware has created easy-to-use scripting tools for managing day-to-day operations. You can write scripts with the same functionality as the vSphere Client to automate manual tasks, allowing you to manage small- to large-scale environments efficiently. These tools work well with both ESXi and ESX hosts, allowing you to easily administer mixed environments.

Both PowerCLI and vCLI are built on the same interface as the vSphere Client. They can be pointed directly at an ESXi host, or they can be pointed at vCenter. When pointed at a host, they execute commands directly on that ESXi host, similar to how a command in the Console OS of ESX operates on only that host; local authentication is required in this case. Alternatively, when communicating through vCenter, vCLI and PowerCLI commands benefit from the same authentication (e.g. Active Directory), roles and privileges, and event logging as vSphere Client interactions. This provides a much more secure and auditable management framework.
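
Neither of the tools described below is Python-based, but purely as an illustration of the same idea (a script that authenticates against vCenter and then acts on the inventory it manages), here is a minimal sketch using the open-source pyvmomi SDK. pyvmomi is my own example and not one of the tools covered in this post; the host name and credentials are placeholders:

```python
# Sketch: connect to vCenter and list every VM it manages, using the
# open-source pyvmomi SDK (pip install pyvmomi). Host and credentials
# below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def list_vms(host, user, pwd):
    ctx = ssl._create_unverified_context()      # lab use only: skips cert checks
    si = SmartConnect(host=host, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            print(vm.name, vm.runtime.powerState)
        view.DestroyView()
    finally:
        Disconnect(si)

if __name__ == "__main__":
    list_vms("vcenter.example.com", "administrator@vsphere.local", "secret")
```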

VMware vSphere™ PowerCLI

VMware vSphere PowerCLI is a powerful command line tool for automating all aspects of vSphere management, including host, network, storage, VM, guest OS and more. PowerCLI is distributed as a Windows PowerShell snapin, and includes more than 150 PowerShell cmdlets, along with documentation and samples. PowerCLI seamlessly blends the vSphere platform with Windows and .NET, which means you can use PowerCLI by itself or within many different 3rd-party tools

VMware vSphere™ Command Line Interface (vCLI)

VMware vSphere™ Command Line Interface (vCLI) is a set of command-line utilities that help you provision, configure and maintain your ESX and ESXi hosts. The vCLI command set allows you to run common system administration commands against VMware ESXi systems from any machine with network access to those systems. You can run most vCLI commands against a vCenter Server system and target any ESXi system that the vCenter Server system manages. There are commands that can completely automate the initial configuration of an ESXi host and others that provide troubleshooting and diagnostic capabilities. VMware provides vCLI packages for installation on both Windows and Linux systems

VMware vSphere™ Management Assistant (vMA)

The VMware vSphere™ Management Assistant (vMA) is a virtual appliance that brings together all the tools you need to manage vSphere. vMA packages the vSphere Command Line Interface, the vSphere SDK for Perl, as well as logging and authentication modules into one convenient bundle. vMA can also host 3rd-party agents for added management power.

And another Dilbert
