Harmful bug in oVirt block storage

Kuko Armas <kuko@canarytek.com>

Or how an apparently simple and harmless change can wreak havoc on several critical production Virtual Machines

Introduction

Some days ago I wrote a post on how to manually recover from different disk errors in oVirt. In that post I said that I had been experiencing weird problems with disk operations on oVirt, and that I had lost some very critical virtual machines.

The symptoms

This problem was happening at a client with a rather complex oVirt infrastructure. They have a total of 4 iSCSI SANs, 16 blades and almost 70 virtual machines. It’s a VoIP provider, so most VMs are clients’ PBXs and are extremely critical.

We had been reorganizing storage for some months, moving virtual disks around the SANs with live storage migration. We also do nightly backups of the most critical VMs. These backups are done with this script, which creates a snapshot, clones it, and deletes it. One day we noticed that some of the automatic backup snapshots were not being deleted, and when we tried to delete them we got an error and some disks were marked as illegal. We didn’t stop the VMs to try to delete the snapshots offline because, as I said, most VMs were extremely critical.
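
As a rough illustration of the idea (this is just a sketch, not the actual script linked above; the engine URL, credentials and UUIDs are placeholders), the snapshot create and delete steps look something like this against the oVirt 3.x REST API:

    # Sketch only: create and later remove a backup snapshot through the REST API.
    # A real script waits for each asynchronous task to finish and clones the
    # snapshot to a new VM in between.
    ENGINE="https://engine.example.com/api"
    AUTH="admin@internal:password"

    # Create the snapshot
    curl -k -u "$AUTH" -H "Content-Type: application/xml" \
         -d "<snapshot><description>nightly-backup</description></snapshot>" \
         "$ENGINE/vms/<vm_uuid>/snapshots"

    # ... clone the snapshot to a new VM and export it (omitted) ...

    # Delete the snapshot
    curl -k -u "$AUTH" -X DELETE "$ENGINE/vms/<vm_uuid>/snapshots/<snapshot_uuid>"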

We also noticed errors with live disk migration: the live migration snapshot was not being deleted, and when we tried to delete the snapshot live, we got an error and the disk was marked as illegal.

We started to do some tests with non-critical VMs and noticed that almost all live operations on disks were failing. In the logs we could see many errors from LVM commands:

    LogicalVolumeExtendError: Logical Volume extend failed:
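
If you hit something similar, the quickest way to see what is going on is to grep vdsm’s log on the SPM host for the failing LVM operations, something like:

    # On the SPM host: context around the failing extend operations
    grep -B 2 -A 5 "LogicalVolumeExtendError" /var/log/vdsm/vdsm.log
    # The lvm command lines vdsm ran (and their output) are also in the log
    grep "lvextend" /var/log/vdsm/vdsm.log | tail -n 20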

As I understood it, oVirt tags a disk as illegal when it thinks there is something seriously wrong with that disk, and to avoid data loss it won’t boot any VM with illegal disks.

That was scary, because it meant that if some node failed or we stopped any VM with illegal disks, it wouldn’t boot any more.
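
A quick way to see how exposed you are is to ask the engine database which images it has flagged as illegal. This is a sketch assuming the default database name (engine); in my experience an imagestatus of 4 means ILLEGAL:

    # On the engine host: list the disk images the engine considers ILLEGAL
    su - postgres -c 'psql engine -c "SELECT image_guid, image_group_id, imagestatus FROM images WHERE imagestatus = 4;"'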

So we started doing disk cleanups late at night, when VoIP traffic was low. We had to manually fix a lot of different errors; these fixes are what I described in the previous post.

The problem

At that time we knew there was something seriously broken with this oVirt installation, but we didn’t know what it was. We tried a lot of things:

  • Updated the engine and all oVirt nodes to the latest 3.6 version
  • Tried moving the SPM role to different hosts
  • Created an NFS domain to test. We found out that the NFS domain worked flawlessly, but it was just a test; our SANs only work with iSCSI. At least it told us that the problem had something to do with block storage (LVM)

But nothing we tried solved the problem.

While googling around, I found some apparently similar problems. I saw a bug reported by a long-time friend that talked about a change in the LVM return code. It seems someone had the great idea to change the return code to something more “convenient”. And of course, a lot of tools that relied on LVM (like the oVirt agent vdsm) got confused by the new retcode, interpreting it as a failure when the operation had really succeeded.
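
To make the failure mode concrete, here is the kind of thing that trips up a caller (hypothetical volume names, and only my reading of the situation): an extend that is effectively a no-op is harmless, but if lvm starts returning a non-zero exit code for it, a tool that treats any non-zero code as a failure will flag a perfectly healthy operation as broken:

    # Hypothetical illustration: lv_disk01 is already 10G, so this changes nothing
    lvextend -L 10G vg_data/lv_disk01
    echo "lvextend exit code: $?"   # vdsm treated any non-zero code as a hard failure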

I guessed that what was probably happening was that all these operations succeeded, but ovirt-engine thought they had failed and didn’t update the metadata in the database. That’s why we had a lot of inconsistencies between what the engine database said and the real status of the disks: snapshots that didn’t exist, qcow2 layers pointing to base images that had already been moved to a different storage domain, etc.
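
One way to see those inconsistencies is to compare what the engine database claims with what is actually on the storage domain. A rough sketch (on block domains the VG is named after the storage domain UUID, and each volume is an LV whose tags encode its image group, parent volume and metadata slot):

    # On the SPM host: list the volumes that really exist on the block domain
    lvs -o lv_name,lv_size,lv_tags <storage_domain_uuid>

    # With the LV activated, check which backing file a qcow2 layer really points to
    qemu-img info /dev/<storage_domain_uuid>/<volume_uuid>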

The solution

I could probably have gone back to a previous version of LVM, but that could break a lot of dependencies. The obvious fix was to make vdsm understand the new LVM return codes, and I found out that there was already a fix in the vdsm project on GitHub.

I think I even saw an updated vdsm RPM for oVirt 4, but this client was using oVirt 3.6 and I didn’t find any fix for this version. So I decided to build an RPM based on the GitHub code. Fortunately, vdsm is written in Python and the RPM build process is very easy. This is what I did:

  • Clone the vdsm repo and check out the version with the fix applied
    git clone https://github.com/oVirt/vdsm.git
    cd vdsm
    git checkout v4.17.35
    
  • Install build dependencies
    yum install -y $(cat ./automation/build-artifacts.packages)
    
  • Build the RPMs
    ./automation/build-artifacts.sh
    
  • Change to the directory with the built RPMs and install the ones I needed
    cd /root/rpmbuild/RPMS/noarch/
    yum localinstall vdsm-4.17.35-1.el7.centos.noarch.rpm vdsm-gluster-4.17.35-1.el7.centos.noarch.rpm vdsm-xmlrpc-4.17.35-1.el7.centos.noarch.rpm vdsm-yajsonrpc-4.17.35-1.el7.centos.noarch.rpm vdsm-infra-4.17.35-1.el7.centos.noarch.rpm vdsm-cli-4.17.35-1.el7.centos.noarch.rpm vdsm-jsonrpc-4.17.35-1.el7.centos.noarch.rpm vdsm-hook-vmfex-dev-4.17.35-1.el7.centos.noarch.rpm vdsm-python-4.17.35-1.el7.centos.noarch.rpm
    

To check whether this fixed the problems before doing it on all nodes, I promoted the node where I had installed this version to SPM and tried the storage operations that used to fail… AND ALL OF THEM WORKED.
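
If you follow the same route, it is worth double-checking that the patched packages are really the ones loaded before testing; a minimal check (restarting vdsmd is disruptive, so put the host in maintenance first):

    # Confirm the patched vdsm is installed, then restart it (host in maintenance)
    rpm -q vdsm vdsm-python
    systemctl restart vdsmd
    systemctl status vdsmd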

The moral of the story is: no change is too small to break your systems. And of course… who the f*** are you, and why are you messing with my return codes???