Yes, that’s right… I just rebooted a whole rack of servers hosting customer applications…
You can’t just do that!
The best thing is… I didn’t have to call the Mendix CRT together. No RCA for the biggest outage ever in Mendix history needs to be written. The customers did not notice anything about it.
One of the best things about using centralized storage and a bunch of servers with virtualization software like Xen is that you can do live migration of virtual machines between separate physical servers. This is actually as crazy as it sounds. You have a Linux virtual machine running on physical server A, and it gets moved to physical server B, without any noticeable downtime.
This graph shows what doing a full reboot cycle of a bunch of servers running at full capacity might look like, as seen by our monitoring systems:
When doing live migration, Xen takes advantage of the fact that everything visible to a virtual machine is in fact virtualized (duh.. :] ). Any memory address in use secretly maps to another real memory address of the physical hardware. A network interface is presented as a real network card, but it’s actually just a fake network card that is connected to a virtual switch managed by the virtualization system. The CPU that is executing instructions inside the virtual machine is actually doing so on any of the real processors available in the physical hardware, depending on where the virtualization ‘hypervisor’ decides to run it right now.
By just copying all of the live memory pages, Xen is actually able to move a running virtual machine somewhere else. When starting the move, all contents of the operating system memory are copied over the network (which in our case is a normal gigabit network, transferring around 120MB per second) to a new instance of the virtual machine on the target physical server. After copying all memory pages, Xen just starts over again, doing the same thing for all memory pages that have been altered since they were copied. This is repeated a few times, until Xen determines that the amount of data left to transfer is small enough to completely pause the virtual machine, transfer the last few bits of memory, and resume the virtual machine on the new hardware.
All of this makes it possible to transfer the whole state of a virtual machine to other physical hardware, without noticeably interrupting its execution.
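To get a feeling for why those repeated copy rounds converge, here is a toy model of the loop described above. It is only a sketch, not what Xen actually does internally: the function name, the constant dirty rate, the 50MB stop threshold and the round limit are all assumptions made up for illustration. Only the roughly 120MB per second link speed comes from our own setup.

```python
def simulate_precopy(memory_mb, dirty_rate_mb_s, link_mb_s=120.0,
                     stop_threshold_mb=50.0, max_rounds=30):
    """Toy model of iterative memory copying during live migration.

    All parameters except the ~120 MB/s link speed are illustrative
    assumptions, not values taken from Xen.
    Returns (rounds, total_seconds, downtime_seconds).
    """
    to_send = float(memory_mb)  # the first round copies all memory
    elapsed = 0.0
    rounds = 0
    while to_send > stop_threshold_mb and rounds < max_rounds:
        round_time = to_send / link_mb_s
        elapsed += round_time
        rounds += 1
        # Memory altered while this round was in flight has to be sent again
        # (assumed here to grow at a constant rate, capped at total memory).
        to_send = min(float(memory_mb), dirty_rate_mb_s * round_time)
    # The virtual machine is paused only while the last remainder is copied.
    downtime = to_send / link_mb_s
    return rounds, elapsed + downtime, downtime


if __name__ == "__main__":
    rounds, total, downtime = simulate_precopy(memory_mb=4096, dirty_rate_mb_s=12)
    print(f"{rounds} rounds, {total:.1f}s in total, paused for {downtime:.2f}s")
```

With these made-up numbers (a 4GB virtual machine altering 12MB of memory per second), the model needs two copy rounds and about 38 seconds in total, while the virtual machine itself is paused for only about a third of a second. If memory is altered as fast as the network can carry it, the loop never converges, which is why the sketch has a round limit.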
And why would you do this?
At Mendix, we use the live migration functionality regularly: to re-balance the number of virtual machines running on each physical server for optimal performance, to work around imminent problems like a possible hardware failure, to move applications to a brand new type of hardware, or to do a full reboot cycle of all physical hardware, installing the latest security updates to Xen and the Linux kernel. The latter is exactly what I’ve been doing here.
Yesterday evening, and today, 11 physical servers which form an N+1 server cluster in one of the server racks at our colocation facility at XS4All in Amsterdam in the Netherlands were rebooted. This bunch of servers is actually located in the first full server rack that we started using at XS4All. The servers are used for hosting of customer applications, as well as parts of the Mendix platform itself, like home.mendix.com and even this blog!
Without being able to do this, it would be totally impossible for us to ever do any maintenance on the part of our infrastructure that runs the Linux virtual machines which run customer applications. Imagine having to call a few dozen customers and getting them to agree on rebooting all of their applications within a timespan of two evenings… That would be a major planning and communication disaster…
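For the curious, the full reboot cycle boils down to: empty one server by live migrating its virtual machines elsewhere, reboot it, and move on to the next one. The sketch below plans such a cycle. It is not our actual tooling: the function name, the per-server capacity counted in virtual machines and the "least loaded server first" choice are assumptions for illustration, and it only produces a list of steps instead of talking to Xen.

```python
def rolling_reboot_plan(hosts, capacity):
    """Plan a rolling reboot of a cluster with spare capacity (N+1).

    hosts maps a host name to the list of VM names it runs; capacity is
    the (assumed, simplified) maximum number of VMs per host.
    Returns (plan, final_placement), where plan is a list of
    ("migrate", vm, source, target) and ("reboot", host) tuples.
    """
    placement = {host: list(vms) for host, vms in hosts.items()}
    plan = []
    for host in list(placement):
        for vm in list(placement[host]):
            # Only hosts other than the one being emptied, with room left.
            candidates = [h for h in placement
                          if h != host and len(placement[h]) < capacity]
            if not candidates:
                raise RuntimeError(f"no spare capacity to empty {host}")
            # Illustrative policy: pick the least loaded candidate.
            target = min(candidates, key=lambda h: len(placement[h]))
            placement[host].remove(vm)
            placement[target].append(vm)
            plan.append(("migrate", vm, host, target))
        # The host is empty now, so rebooting it affects nobody.
        plan.append(("reboot", host))
    return plan, placement


if __name__ == "__main__":
    cluster = {"a": ["vm1", "vm2"], "b": ["vm3", "vm4"], "c": []}
    plan, _ = rolling_reboot_plan(cluster, capacity=2)
    for step in plan:
        print(step)
```

The spare server is what makes this work: there is always room somewhere to park the virtual machines of the server that is about to be rebooted. Without that spare capacity the sketch raises an error instead of planning an outage.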
Moar pics or it didn’t happen!
And here’s some of them… Nicely fit together in the server rack…
Green cables are iSCSI network storage, yellow cables carry network traffic between virtual machines, or to and from the interwebz. There are two green and two yellow cables for every server, which makes it possible to rewire or move stuff around, and provides redundancy if any of the cables or network interfaces fails. Red cables are management interfaces.
And if you wondered where all those cables go… The other end is just a switch port. :-) The yellow and green cables which go to the right of the server rack end up on a different switch than the ones that go via the left. And if you wonder what those cables just did… Well, they probably sent some electrical signals which carried the contents of this page, pushing it towards the network uplink to the internet and to the computer you’re looking at right now. :o)