Xen CPUID masking
To be able to live migrate a virtual machine between slightly different types of hardware, it may be necessary to hide some of the cpu features of the newer types. By preventing the newly added features from being used when starting a virtual machine on the newer type of hardware, the virtual machine can still be migrated back to the older hardware in the same cluster.
This technique is commonly referred to as “CPU feature masking”, “CPU feature levelling” or “CPUID masking”. Technically, it’s done by intercepting calls to the CPUID cpu instruction, and having the hypervisor modify the returned result on the fly.
This page describes how to construct the magic cpuid bitmasks that go in the xl configuration file for a virtual machine, when using Xen in paravirtualization mode.
My use case
As an example I’ll be using a collection of different types of HP Proliant dl360 gen8 and gen9 servers, that have been added to a xen cluster over the last few years. Different cpus that are used inside the server models are Intel Xeon E5-2650 (Sandy Bridge), E5-2650 v2 (Ivy Bridge), E5-2650 v3 (Haswell) and E5-2660 v3 (Haswell).
In the end I want to be able to first start a Xen domU on any of those servers and then be able to live migrate it to any other one, without experiencing crashes.
The software used for this test:
- Xen 4.4.1, since the hypervisors are running Debian Jessie
- Dom0 running linux kernel 3.16.7-ckt20-1+deb8u3 from Debian Jessie
- Linux kernel 4.7-rc7 for the virtual machine (running Debian Stretch)
In order to be able to do live migration, the disks of virtual machines are put on shared iSCSI attached block storage with LVM on top.
Warning: During the cpuid testing as shown below, I’m using manual commands for migration etc. in a test environment because I need to influence the configuration files used when doing so. Since the scope of this post is the cpuid handling, I’m happily ignoring writing anything about serializing of LVM commands and proper protection against activating a logical volume multiple times in different places.
Seeing it fail without feature masking
First of all, let’s see what happens when the difference in available cpu features is ignored, and we try to live migrate to an older server.
Here’s an example configuration file to start my test virtual machine:
    E5-2660-v3 -# cat debug-xen-stretch
    kernel = "/usr/lib/pvgrub2/grub-x86_64-xen-xvda-fire-ze-missile"
    name = "debug-xen-stretch"
    memory = 1024
    vcpus = 4
    vif = [ "mac=02:00:0a:8c:29:66,bridge=ovs0.41", ]
    disk = [ "/dev/fsociety/debug-xen-stretch,raw,xvda,rw", ]
    E5-2660-v3 -# xl create -c debug-xen-stretch
    Parsing config from debug-xen-stretch
    Loading Linux ...
    Loading initial ramdisk ...
    [ 0.000000] Linux version 4.7.0-rc7-amd64 (firstname.lastname@example.org) (gcc version 5.4.0 20160609 (Debian 5.4.0-6) ) #1 SMP Debian 4.7~rc7-1~exp1 (2016-07-14)
    [ 0.000000] Command line: root=/dev/xvda ro elevator=noop
    ..etc..
    Debian GNU/Linux stretch/sid debug-xen-stretch hvc0
    debug-xen-stretch login:
Live migrate to a E5-2650 v2 host:
    E5-2660-v3 -# xl migrate -C debug-xen-stretch -s "" debug-xen-stretch "socat - TCP:E5-2650-v2:8002"
    Saving to migration stream new xl format (info 0x0/0x0/677)
    migration sender: Target has acknowledged transfer.
    migration sender: Giving target permission to start.
    migration sender: Target reports successful startup.
    Migration successful.
Now have a look at the target, the E5-2650 v2 based host:
    E5-2650-v2 -# xl console debug-xen-stretch
    [ 83.068161] xen:grant_table: Grant tables using version 1 layout
    [ 83.068161] PM: noirq restore of devices complete after 0.057 msecs
    [ 83.068161] PM: early restore of devices complete after 0.042 msecs
    [ 83.071403] PM: restore of devices complete after 2.572 msecs
    [ 83.071437] Restarting tasks ... done.
    Debian GNU/Linux stretch/sid debug-xen-stretch hvc0
    debug-xen-stretch login:
    [ 89.821718] traps: sshd trap invalid opcode ip:7fa3def6f7a3 sp:7ffd3dff1580 error:0 in libcrypto.so.1.0.2[7fa3deeec000+234000]
    [ 89.822721] traps: systemd trap invalid opcode ip:7fc02896d340 sp:7ffd5d884858 error:0 in libc-2.23.so[7fc028822000+197000]
Boom! Invalid opcode! Not good! Software is crashing all over the place because it’s using cpu instructions that are not present on the hardware it’s currently running on. The virtual machine is quite unusable and I have to use xl destroy to get rid of it.
Apparently we need to apply some magic to make this actually work.
Gather information about your CPUs
Before starting to figure out the CPU feature masking settings, get a good grasp of the hardware and CPU types that you’re dealing with. For the Intel Xeon types I mentioned, information pages which list functionality are easy to find. Compare the information to get an idea of which new functionality was added when.
Next, log in to one of your Xen dom0s, and run cpuid:
    E5-2650 -# cpuid -1r
    CPU:
       0x00000000 0x00: eax=0x0000000d ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
       0x00000001 0x00: eax=0x000206d7 ebx=0x26200800 ecx=0x1fbee3ff edx=0xbfebfbff
       0x00000002 0x00: eax=0x76035a01 ebx=0x00f0b0ff ecx=0x00000000 edx=0x00ca0000
       0x00000003 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
       0x00000004 0x00: eax=0x3c004121 ebx=0x01c0003f ecx=0x0000003f edx=0x00000000
       0x00000004 0x01: eax=0x3c004122 ebx=0x01c0003f ecx=0x0000003f edx=0x00000000
       0x00000004 0x02: eax=0x3c004143 ebx=0x01c0003f ecx=0x000001ff edx=0x00000000
       0x00000004 0x03: eax=0x3c07c163 ebx=0x04c0003f ecx=0x00003fff edx=0x00000006
       0x00000005 0x00: eax=0x00000040 ebx=0x00000040 ecx=0x00000003 edx=0x00021120
       0x00000006 0x00: eax=0x00000077 ebx=0x00000002 ecx=0x00000009 edx=0x00000000
       0x00000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
       0x00000008 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
       0x00000009 0x00: eax=0x00000001 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
       0x0000000a 0x00: eax=0x07300803 ebx=0x00000000 ecx=0x00000000 edx=0x00000603
       0x0000000b 0x00: eax=0x00000001 ebx=0x00000002 ecx=0x00000100 edx=0x00000026
       0x0000000b 0x01: eax=0x00000005 ebx=0x00000010 ecx=0x00000201 edx=0x00000026
       0x0000000c 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
       0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
       0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
       0x0000000d 0x3e: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
       0x80000000 0x00: eax=0x80000008 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
       0x80000001 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000001 edx=0x2c100800
       0x80000002 0x00: eax=0x20202020 ebx=0x49202020 ecx=0x6c65746e edx=0x20295228
       0x80000003 0x00: eax=0x6e6f6558 ebx=0x20295228 ecx=0x20555043 edx=0x322d3545
       0x80000004 0x00: eax=0x20303536 ebx=0x20402030 ecx=0x30302e32 edx=0x007a4847
       0x80000005 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
       0x80000006 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x01006040 edx=0x00000000
       0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000100
       0x80000008 0x00: eax=0x0000302e ebx=0x00000000 ecx=0x00000000 edx=0x00000000
       0x80860000 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
       0xc0000000 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
Whoa! What does all of this mean?
Running cpuid without the r, like cpuid -1, provides an interpreted listing of all the information above. Next, have a look at the Wikipedia page about CPUID. Reading that page should give you some understanding of how the CPUID instruction works, what kind of information can be retrieved, and how to match it with the cpuid command output.
The relevant bits
As you can see, there’s a lot of things going on in those lines of numbers. Luckily, it’s not needed to pay attention to all of them at once.
The fields containing feature bits are also called “leafs”.
The primary leafs of interest are these:
- in: EAX=1, out: EDX, ECX (“Processor Info and Feature Bits”)
- in: EAX=7, ECX=0, out: EBX, ECX (“Extended Features”)
In the cpuid output shown above, it’s these fields:
- 0x00000001 0x00: ecx=0x1fbee3ff edx=0xbfebfbff
- 0x00000007 0x00: ebx=0x00000000 ecx=0x00000000
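Picking these fields out of the raw output by hand gets tedious with several hosts. As a rough sketch (the function name parse_cpuid_raw and the sample string are my own, not part of the cpuid tool), the lines of cpuid -1r output in the format shown above could be parsed like this:

```python
import re

# Matches one line of `cpuid -1r` output in the format shown above, e.g.
#   0x00000001 0x00: eax=0x000206d7 ebx=0x26200800 ecx=0x1fbee3ff edx=0xbfebfbff
LINE_RE = re.compile(
    r"(0x[0-9a-fA-F]{8})\s+(0x[0-9a-fA-F]{2}):\s+"
    r"eax=(0x[0-9a-fA-F]{8})\s+ebx=(0x[0-9a-fA-F]{8})\s+"
    r"ecx=(0x[0-9a-fA-F]{8})\s+edx=(0x[0-9a-fA-F]{8})"
)

def parse_cpuid_raw(text):
    """Return {(leaf, subleaf): {"eax": int, "ebx": int, ...}}."""
    result = {}
    for m in LINE_RE.finditer(text):
        leaf, subleaf = int(m.group(1), 16), int(m.group(2), 16)
        values = (int(v, 16) for v in m.groups()[2:])
        result[(leaf, subleaf)] = dict(zip(("eax", "ebx", "ecx", "edx"), values))
    return result

# Two lines copied from the E5-2650 output above.
sample = """\
   0x00000001 0x00: eax=0x000206d7 ebx=0x26200800 ecx=0x1fbee3ff edx=0xbfebfbff
   0x00000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
"""
leaves = parse_cpuid_raw(sample)
print("1/ecx   =", hex(leaves[(1, 0)]["ecx"]))
print("1/edx   =", hex(leaves[(1, 0)]["edx"]))
print("7,0/ebx =", hex(leaves[(7, 0)]["ebx"]))
```

Keying the result on (leaf, subleaf) keeps leaves with several subleaves, like 0x00000004, apart.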
For example, the 1,0/ecx value 0x1fbee3ff converted to bits is shown here. The rightmost bit is the one with the lowest value (bit 0) and the leftmost one is bit 31.
     3         2         1         0
    10987654321098765432109876543210
    --------------------------------
    00011111101111101110001111111111
The bits show that this cpu does not yet have the rdrand (bit 30) or fma (bit 12) instruction.
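If you prefer not to convert hex to bits by hand, a few lines of Python do the same thing. This is only a small sketch; the helper names to_bits and has_bit are made up for this example.

```python
def to_bits(value):
    """Render a 32-bit register value as a bit string, bit 31 first."""
    return format(value & 0xFFFFFFFF, "032b")

def has_bit(value, bit):
    """True if the given bit number (0 = rightmost) is set."""
    return bool((value >> bit) & 1)

ecx = 0x1fbee3ff  # 1/ecx of the E5-2650 (Sandy Bridge) shown above
print(to_bits(ecx))
print("rdrand (bit 30):", has_bit(ecx, 30))
print("fma    (bit 12):", has_bit(ecx, 12))
print("avx    (bit 28):", has_bit(ecx, 28))
```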
Another way to view this information is the flags list in /proc/cpuinfo. Be aware, though, that any tool which interprets the raw bits will not show a feature it has no name for. So, I recommend using the raw numbers. The descriptions in the cpuid program output are convenient, but be sure to verify them against the raw bits before you actually use them.
Compare your CPU models, determine differences
When running cpuid -1r in the different Xen dom0s I have available here, this is part of the result:
    E5-2650 Sandy Bridge
       0x00000000 0x00: eax=0x0000000d ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
       0x00000001 0x00: eax=0x000206d7 ebx=0x2a200800 ecx=0x1fbee3ff edx=0xbfebfbff
       0x00000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000

    E5-2650 v2 Ivy Bridge
       0x00000000 0x00: eax=0x0000000d ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
       0x00000001 0x00: eax=0x000306e4 ebx=0x08200800 ecx=0x7fbee3ff edx=0xbfebfbff
       0x00000007 0x00: eax=0x00000000 ebx=0x00000281 ecx=0x00000000 edx=0x00000000

    E5-2650 v3 Haswell
       0x00000000 0x00: eax=0x0000000f ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
       0x00000001 0x00: eax=0x000306f2 ebx=0x38200800 ecx=0x7ffefbff edx=0xbfebfbff
       0x00000007 0x00: eax=0x00000000 ebx=0x000037ab ecx=0x00000000 edx=0x00000000

    E5-2660 v3 Haswell
       0x00000000 0x00: eax=0x0000000f ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
       0x00000001 0x00: eax=0x000306f2 ebx=0x34200800 ecx=0x7ffefbff edx=0xbfebfbff
       0x00000007 0x00: eax=0x00000000 ebx=0x000037ab ecx=0x00000000 edx=0x00000000
Using this information, I want to find out exactly which functionality is supported by some, but not all, of the cpu models I’m using.
In more detail, let’s compare the contents of different relevant fields:
    1/edx
                 3         2         1         0
                10987654321098765432109876543210
                --------------------------------
    E5-2650     10111111111010111111101111111111  0xbfebfbff
    E5-2650 v2  10111111111010111111101111111111  0xbfebfbff
    E5-2650 v3  10111111111010111111101111111111  0xbfebfbff
    E5-2660 v3  10111111111010111111101111111111  0xbfebfbff
In this case, no difference can be found. This means no overrides to this leaf would be needed to support live migration between all types of hardware.
    1/ecx
                 3         2         1         0
                10987654321098765432109876543210
                --------------------------------
    E5-2650     00011111101111101110001111111111  0x1fbee3ff
    E5-2650 v2  01111111101111101110001111111111  0x7fbee3ff
    E5-2650 v3  01111111111111101111101111111111  0x7ffefbff
    E5-2660 v3  01111111111111101111101111111111  0x7ffefbff
    mask        x00xxxxxx0xxxxxxxxx00xxxxxxxxxxx
The differences found are: sdbg(11), fma(12), movbe(22), f16c(29) and rdrand(30).
The last row is a mask that can be used in the xen configuration file. Every feature that is present on a subset of the cpus is marked as 0, to force disable it. All other bits are represented as an ‘x’, which means that xen can do whatever it likes to do with it. In some cases, this might mean setting a 1 value even when all of the observed values are 0, like with the hypervisor (1/ecx/31) bit.
    7,0/ebx
                 3         2         1         0
                10987654321098765432109876543210
                --------------------------------
    E5-2650     00000000000000000000000000000000  0x00000000
    E5-2650 v2  00000000000000000000001010000001  0x00000281
    E5-2650 v3  00000000000000000011011110101011  0x000037ab
    E5-2660 v3  00000000000000000011011110101011  0x000037ab
    mask        xxxxxxxxxxxxxxxxxx00x0000x0x0x00
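The rule used for the mask rows (0 for every bit that is set on some but not all hosts, x for everything else) is easy to express in code. The following is a minimal sketch of that rule, not a replacement for understanding it; the function name xl_mask is my own, and the input values are the raw 1/ecx, 7,0/ebx and 1/edx register values from the cpuid -1r listings of the four hosts.

```python
def xl_mask(values):
    """Build a 32-character mask: '0' where hosts disagree, 'x' elsewhere."""
    all_set = 0xFFFFFFFF  # bits set on every host
    any_set = 0           # bits set on at least one host
    for v in values:
        all_set &= v
        any_set |= v
    differing = any_set & ~all_set
    # Emit bit 31 first, so the string reads like the tables above.
    return "".join("0" if (differing >> bit) & 1 else "x"
                   for bit in range(31, -1, -1))

leaf1_ecx = [0x1fbee3ff, 0x7fbee3ff, 0x7ffefbff, 0x7ffefbff]
leaf7_ebx = [0x00000000, 0x00000281, 0x000037ab, 0x000037ab]
leaf1_edx = [0xbfebfbff] * 4

print("1/ecx  ", xl_mask(leaf1_ecx))
print("7,0/ebx", xl_mask(leaf7_ebx))
print("1/edx  ", xl_mask(leaf1_edx))
```

For 1/edx the result is all x, which matches the observation that no override is needed for that register.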
Add the mask to the xl config file
The relevant masks for 1/ecx and 7,0/ebx can be put into the guest configuration file like this:
    cpuid = [
        "0x00000001:ecx=x00xxxxxx0xxxxxxxxx00xxxxxxxxxxx",
        "0x00000007,0x00:ebx=xxxxxxxxxxxxxxxxxx00x0000x0x0x00",
    ]
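When the masks are regenerated regularly, typing the config lines by hand invites off-by-one mistakes in the 32-character strings. Here is a small sketch that formats entries in the same style as the example above. The helper name cpuid_entry is hypothetical, and it deliberately only accepts the two mask characters used in this post (0 and x).

```python
def cpuid_entry(leaf, register, mask, subleaf=None):
    """Format one entry of the cpuid list, in the style used in this post."""
    if len(mask) != 32 or set(mask) - {"0", "x"}:
        raise ValueError("mask must be exactly 32 characters of '0' or 'x'")
    key = f"0x{leaf:08x}"
    if subleaf is not None:
        key += f",0x{subleaf:02x}"
    return f"{key}:{register}={mask}"

entries = [
    cpuid_entry(1, "ecx", "x00xxxxxx0xxxxxxxxx00xxxxxxxxxxx"),
    cpuid_entry(7, "ebx", "xxxxxxxxxxxxxxxxxx00x0000x0x0x00", subleaf=0),
]

# Print a stanza that can be pasted into the domain config file.
print("cpuid = [")
for entry in entries:
    print(f'    "{entry}",')
print("]")
```

The length check is the useful part: a mask that is one character short shifts every bit and silently masks the wrong features.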
By running cpuid -1r inside the virtual machine, you can verify that the fields marked as 0 in the mask got cleared out. For some features differences should be visible in kernel startup logs, like “AVX version of gcm_enc/dec engaged.” instead of “AVX2 version of gcm_enc/dec engaged.”
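That check can also be scripted. The sketch below (the function name still_set is made up) takes a mask and a register value read inside the guest, and lists the bits that should have been cleared but are still set. The two values used here are the host values for 1/ecx from earlier in this post, standing in for what a guest might report with and without a working mask.

```python
def still_set(mask, guest_value):
    """Return the bit numbers that the mask forces to 0 but that are set."""
    return [31 - i for i, c in enumerate(mask)
            if c == "0" and (guest_value >> (31 - i)) & 1]

mask_1_ecx = "x00xxxxxx0xxxxxxxxx00xxxxxxxxxxx"

# Unmasked Haswell value: every levelled feature is still visible.
print(still_set(mask_1_ecx, 0x7ffefbff))
# Sandy Bridge value: none of the levelled features is visible.
print(still_set(mask_1_ecx, 0x1fbee3ff))
```

An empty list is what you want to see for every masked register after starting or migrating the guest.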
With these additional settings I’m able to start the virtual machine on any of the different types of hardware, and live migrate to all of them.
A helper program to quickly calculate the masks (added on May 29 2017)
It’s good to do the calculations by hand the first time, to be sure you understand how the bits need to end up in the right places. But, when doing it again and again when your collection of hardware changes, it quickly gets boring and error prone.
After reading this post, Armando Vega wrote a little program that you can feed a collection of files containing the cpuid -1r output of each host, and it will tell you the result that needs to go into the domain config file.
It’s the Xen Mask Calculator
It can also be used of course to double check your manual calculations. :)
It’s perfectly possible to put a configuration into the cpuid masks that does not make sense at all, which will probably cause interesting kernel crashes (oops, general protection fault, panic etc) during boot or shortly thereafter.
Running without any mask
Starting a virtual machine on the oldest type of hardware in the cluster without applying the mask does not guarantee success when migrating it to a newer type. Apparently the cpuid bits are not only read when starting the virtual machine. After the live migration, the cpuid output will show the newer features from the hardware the virtual machine is currently running on. I’ve seen spectacular stack traces when just migrating from the E5-2650 to the E5-2650 v2 (without mask) and then trying to migrate back immediately.
Adding new hardware
Since adding hardware of a newer type to this cluster will probably introduce new features, care has to be taken to adjust the masks before starting to use it. By providing the xl configuration file (using the -C option when live migrating) an updated cpu mask can be provided that gets applied when reconstructing the virtual machine on the target host. So, first figure out the new mask, then make sure your configuration files get updated, and then make sure you provide them again when restarting or live migrating.
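To see what a new hardware type changes, it helps to compare the set of differing bits before and after adding it. As an illustration only, the sketch below pretends the cluster first contained just the E5-2650 and E5-2650 v2, and that a Haswell host is then added, using the 1/ecx values from this post. The function name differing_bits is my own.

```python
def differing_bits(values):
    """Return the set of bit numbers that are set on some but not all hosts."""
    any_set = all_set = values[0]
    for v in values[1:]:
        any_set |= v
        all_set &= v
    diff = any_set ^ all_set
    return {bit for bit in range(32) if (diff >> bit) & 1}

# 1/ecx: Sandy Bridge and Ivy Bridge only, then with Haswell added.
before = differing_bits([0x1fbee3ff, 0x7fbee3ff])
after = differing_bits([0x1fbee3ff, 0x7fbee3ff, 0x7ffefbff])

print("already masked:     ", sorted(before))
print("additionally needed:", sorted(after - before))
```

Any bit in the second list has to be in the mask of every running guest before that guest is allowed to land on the new hardware.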
There’s a lot of other leafs with information in the cpuid output. I haven’t seen a particular reason yet to start mangling those, so I’m leaving them alone for now. While debugging kernel crashes related to RAPL I was digging around in the linux kernel source and git commits and encountered a story about a feature that didn’t get a feature flag, resulting in some ugly guessing going on based on cpu model and type. Anyway, reading the code that handles cpu differences and the history in the commits makes you appreciate the work that is being done to work around all the quirks, to let it just work (tm) for you.
When trying to construct the masks for your collection of hardware, I would advise keeping the changes as few as possible, as shown above. Start using it, and continue investigating when it results in later failures. The fewer changes are made, the higher the chance that related magic present in xen and in the linux kernel is not disturbed by it.
I wrote this blog post since I could not find any proper guide on how to correctly determine the mask settings. A few years ago I did spend some initial effort to create the masks, which have been working correctly for us at Mendix until recent kernels (4.5 and later) started showing weird crashes or did not even start at all. So, in the last few days I dived back in to improve the situation.
So, I hope this information will save the reader some hours. The trick is… if you exactly know what to do, it’s fairly simple. If you don’t… it can easily take a few days of headaches, reverse engineering, trying to find out the meaning of kernel crashes, and trying out a huge amount of different scenarios of start, live migrate, reboot etc etc over all types of hardware you have to finally get this figured out.
If you spot errors, wrong assumptions or have important information to add, let me know at email@example.com, so I can update the article.