Wednesday, July 20, 2016

The sales reps don't explain things well

We hired a new storage engineer the other month, and shortly thereafter lost him when he got an offer he couldn't refuse. In the meantime he helped commission an addition to our biggest lustre filesystem: a new Compellent system. Folklore says lustre likes its component filesystems to be roughly the same size, so this was divvied up into RAID6 virtual arrays. 160 disks * 4TB each * .9 to get TiB = 576TiB. 32TiB is a nice number, close to the size of the other arrays in the lustre system, so he configured 18 of them.

If you know storage, you just went "rookie mistake: he forgot about the parity disks(*); you can really only make 15 arrays of that size, not 18." Yep, and usually that kind of mistake doesn't matter--when you try to do something like that, the system does the arithmetic for you and you find out pretty quickly.
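For the arithmetic-inclined, here's a back-of-the-envelope sketch of those numbers in Python (my reconstruction, assuming the ten-disk 8-data + 2-parity RAID6 groups described in the footnote):

    disks        = 160
    tb_per_disk  = 4
    tb_to_tib    = 0.9       # the rough TB-to-TiB factor used above
    raid6_usable = 8 / 10    # 2 of every 10 disks go to parity

    raw_tib    = disks * tb_per_disk * tb_to_tib   # ~576 TiB "on the box"
    usable_tib = raw_tib * raid6_usable            # ~461 TiB you can actually write

    print(raw_tib / 32)          # 18    -- the 32TiB arrays that got configured
    print(usable_tib / 32)       # ~14.4 -- what the disks can really back
    print(raw_tib / usable_tib)  # ~1.25 -- the pretend space vs. the real space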

But the Compellents are clever. They were designed for a business model where you supply disk storage to 1000 machines, each with its own .5TB of storage. If you look at the machine you're browsing from, you'll notice there's a lot of "wasted" space; you don't use all .5TB. I certainly don't. So suppose your storage server pretended that it really had 800TB instead. That way your business can supply disk storage to 1500 machines. The space is over-committed, but just like airlines over-booking, most of the time it doesn't matter. If only a handful of users really need that full .5TB, they can get it without any intervention, and the rest don't know the difference. Nice and clever. (The Compellent is even more clever than that: it divides up space into virtual arrays that reflect your preferences for virtual disk sizes and does other cute tricks that we have absolutely no use for.)
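In numbers, the over-booking pitch looks something like this (the per-machine usage figure is a made-up illustration, not anything we measured):

    real_capacity_tb = 1000 * 0.5    # 1000 machines at .5TB each = 500TB of real disk
    advertised_tb    = 800           # what the storage server pretends to have
    promised_tb      = 1500 * 0.5    # 750TB now promised across 1500 machines

    typical_use_tb = 0.2             # hypothetical average; most desktops use far less than .5TB
    actually_used  = 1500 * typical_use_tb   # 300TB -- comfortably under the 500TB that exists

    print(promised_tb > real_capacity_tb)    # True: over-committed
    print(actually_used < real_capacity_tb)  # True: and usually nobody notices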

We, on the other hand, want to fill up all the space with data, in one giant chunk. So the upshot was that the system pretended to have 25% more space than it really did. We set about filling it all up.

You can see that this is not going to end well.

It got worse. The system supports RAID6 (higher density) virtual storage, RAID10-dual striped (high read/write performance), or a dynamic combination in which the system accepts writes as high-performance RAID10 and, at some configurable after-hours time, translates the fat RAID10 down to slow RAID6. That's 3 different modes. For the last mode (balance) you can either set a time or rely on a default that, in hindsight, looks to be once a week or so.

With that in mind, imagine that you use the defaults--balance (write fat, translate to thin later), with the translation running once a week.

Now start filling it up slowly. That's what we did: all looked OK, no stability problems, so we let her rip.

Um. With maybe 300TiB already aboard, we now loaded about 70TiB in a few days. The clever machine turned this 70TiB of input into 210TiB of high-performance dual-striped RAID10. Now it had no space left to do translations, and decided the safest thing to do was to become read-only.
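Roughly how the space ran out, assuming dual-striped RAID10 keeps three copies of every block (per the footnote) and, optimistically, that the 300TiB already aboard had all been translated down to RAID6:

    usable_tib   = 461    # roughly what 160 disks can back after RAID6 parity (see above)
    already_thin = 300    # TiB already aboard, counted as fully translated (optimistic)
    new_writes   = 70     # TiB loaded over a few days

    fat_footprint = new_writes * 3              # 210 TiB while still dual-striped RAID10
    committed     = already_thin + fat_footprint

    print(committed)               # ~510 TiB of real disk spoken for
    print(committed > usable_tib)  # True -- no room left to translate into, so read-only it goes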

Down went the lustre filesystem. The only guy left on the team who had been learning lustre is still in the north woods with no cell phone coverage. Away went the weekend. We kept 3 pairs of eyes on everything and learned as we went along.

We spent a lot of time with Dell engineers on Monday. One was able to cobble together an array out of the spare disks to start a very slow translation job, while we try to drain one of the logical arrays so we can delete it and free up space for faster translations. (Goosing that along is the reason I'm still awake right now. I spent a fair bit of today trying to clean up corrupted files.)

It was our configuration screw-up, and we're grateful to Dell for pulling our chestnuts out of the fire. But we're going to change some procedures...


(*) RAID5 uses N data and 1 parity disk. If a disk fails (believe me, they do), you haven't lost any data. You can pop a spare in and rebuild the array. Problem: the rebuilding is pretty I/O intensive, and disks from the same family often have similar lifetimes. If you lose another disk during the rebuilding process, you've lost all the data in the array. Hence RAID6: N data and 2 parity disks. So if you have 10 4TB disks, by the time you're done putting it together in a safe array you have only 8 effective disks: 32TB instead of 40TB.

RAID10 is nice and fast and robust: each disk has a duplicate. Dual striped RAID10 is even more robust and fast: each disk has 2 duplicates. But that means that 2/3 of your space is "wasted."
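Putting those overheads side by side (simple ratios only; real arrays also burn space on hot spares and metadata):

    raw_tb = 10 * 4   # the footnote's example: ten 4TB disks

    layouts = {
        "RAID5  (9 data + 1 parity)":      9 / 10,
        "RAID6  (8 data + 2 parity)":      8 / 10,
        "RAID10 (every block mirrored)":   1 / 2,
        "dual-striped RAID10 (3 copies)":  1 / 3,
    }
    for name, usable_fraction in layouts.items():
        print(f"{name}: {raw_tb * usable_fraction:.0f}TB usable of {raw_tb}TB raw")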
