Throughput correction

I got my math VERY wrong for the new LSI HBA card, because I thought it was based on an onboard port expander. It is not. The card has two four-lane 6G SAS connectors, for a total throughput of 48Gbit/sec. Divide that total by the 24 attached drives, and you get a maximum theoretical throughput of 2Gbit/sec (200Mbyte/sec) per attached drive.
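For anyone who wants to check the arithmetic, here is a minimal Python sketch of it. The function name and constants are mine, not anything from LSI, and the Gbit-to-Mbyte step assumes the 8b/10b line coding that 6G SAS uses (10 bits on the wire per byte of data), which is where the 200Mbyte/sec figure comes from.

```python
# Back-of-the-envelope SAS bandwidth per drive.
# Assumption: 6G SAS uses 8b/10b line coding, so each payload byte
# costs 10 bits on the wire.

LANE_GBIT = 6            # line rate of one 6G SAS lane, in Gbit/s
LINE_BITS_PER_BYTE = 10  # 8b/10b encoding


def per_drive_throughput(connectors, lanes_per_connector, drives):
    """Return (total Gbit/s, per-drive Gbit/s, per-drive Mbyte/s)."""
    total_gbit = connectors * lanes_per_connector * LANE_GBIT
    per_drive_gbit = total_gbit / drives
    per_drive_mbyte = per_drive_gbit * 1000 / LINE_BITS_PER_BYTE
    return total_gbit, per_drive_gbit, per_drive_mbyte


total, gbit, mbyte = per_drive_throughput(
    connectors=2, lanes_per_connector=4, drives=24
)
print(f"{total} Gbit/s total, {gbit:g} Gbit/s ({mbyte:g} Mbyte/s) per drive")
# prints: 48 Gbit/s total, 2 Gbit/s (200 Mbyte/s) per drive
```

These are theoretical link ceilings only; real drives and real workloads will land well below them.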


Nothing worth it is easy

The hardware in the SAN I've built is turning out to be a mixed bag of good inexpensive parts and disappointing expensive ones. While this may be contrary to the popular belief that expensive parts work best, experience has shown time and again that the price tag has little to do with performance or stability.


The motherboard's iKVM/BMC had to be completely reset, which the manufacturer couldn't help me with. A quick search of the chip manufacturer's site turned up a tool that would allow me to update the firmware and erase the chip in one process. This turned out to be exactly what I needed to do, and it worked perfectly. I've since notified Tyan of the utility, in the event this happens in the future to another customer.


The hard drives that I initially believed to be DOA actually seem to test out fine with the motherboard's onboard LSI SAS HBA/RAID chipset. In fact, the onboard chipset seems to behave better and more predictably in nearly every way! This is disappointing because the Areca RAID card cost over $1300, and the onboard LSI chipset is essentially a throw-away part. It's so cheap to put on the board that it's not worth removing in newer hardware versions!


The HBA/RAID driver story is more of the same. The LSI chipset uses the "mptsas" kernel driver, which has had many contributors offering fixes and enhancements. The Areca card suffers from the opposite problem. Areca themselves have published a newer driver than the one included in the 3.0.x Linux kernel, but it has some pretty big flaws that make it a bit wonky. The first and most noticeable flaw is that a SAS/SATA device that stops responding causes the driver to freak out. To me, this is totally unacceptable. I'd rather see the device drop from the bus and be considered "offline" than have the driver just freeze. A second big blemish is the remaining reference to the "Big Kernel Lock", which has been removable since something like 2.6.28 and is now off by default in many distributions shipping a 3.x kernel.


I've been in contact with Areca support, and the things they've had me try have only further proven that there is a problem with the driver. They have also indicated that there may be a pretty big problem with the card itself as well! For $1300, I'd expect an HBA/RAID card to undergo some serious QA process before shipping. But more hard drives are causing issues, so at this point I'm pretty sure the batch of disks I purchased is fine and the Areca card is the source of all my issues.


So, I've started the RMA process for the existing RAID card, and ordered an LSI 9211-8i card. This card has 2 SFF-8087 ports for a total of 8 SAS 6G lanes. I've also ordered an HP SAS expander card, which will give me 36 total ports and dual 4-lane SAS connectivity for a total theoretical throughput of 12Gbit/sec. This should work out to somewhere around 50Mbyte/sec per drive if all of them are active simultaneously. In reality, that rarely happens in a RAID system, so I'm confident this configuration will be what I need. From the reviews I've read, the card seems like a great solution and is really well liked on benchmarking forums. And it's rated for something like 290,000 IO/s max!


Assembled and running-ish

The new motherboard came, the remainder of the drives were mounted, and the final touches of assembly were completed. As part of the assembly, I tried to update the firmware on the Areca RAID card, the motherboard's BIOS, and the motherboard's built-in iKVM/BMC board. Unfortunately, this last component's update process failed in a pretty bad way. Even worse, the recovery process is undocumented and also failed.


While testing the hard disks themselves, I found a dead unit and some pretty bad behavior from the RAID card. When trying to write data to a dead drive, the RAID card's driver just keeps trying forever. This causes the driver to lock up completely without responding to the process that's performing the write, which causes the writing process to also lock up! I'm investigating an update to the card's driver, or a firmware setting that would modify this behavior. But the fact that this is the default way the system works is frightening, to say the least!


So, next week looks like it's going to be benchmarking time. From some of the preliminary tests I've performed while breaking the hardware in, the hard drives look like they'll run pretty fast. The SSDs are, simply put, stupid fast. They write data so fast that benchmarking them against the hard drives doesn't even make sense. ZFS itself looks like it works pretty well, too. I haven't seen many slowdowns from using it in a test environment, so that should bode very well for the new SAN.



Today I took delivery of the parts for the SAN build. I've started putting drives in trays and wiring the case itself, which I can tell is going to take a while. I expect putting drives in trays will take around an hour of time in total.

Almost immediately I realized I had ordered an Extended ATX motherboard, but the case only supports Micro ATX, ATX, CEB, and EEB. Naturally, I had also ordered only four SFF-8087 cables for the HBA-to-backplane connectivity. Each of these cables can handle 4 SAS/SATA lanes, which means I can run 16 out of my 24 drives. Oops. So, a new board and two more SFF-8087 cables are on their way tomorrow.
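The cable math is simple enough that I should have checked it before ordering. A quick sketch, assuming every drive bay needs its own lane back to the HBA (the helper names are just for illustration):

```python
import math

LANES_PER_CABLE = 4  # one SFF-8087 cable carries four SAS/SATA lanes


def drives_supported(cables):
    """How many drives a given number of SFF-8087 cables can connect."""
    return cables * LANES_PER_CABLE


def cables_needed(drives):
    """How many SFF-8087 cables it takes to give every drive a lane."""
    return math.ceil(drives / LANES_PER_CABLE)


ordered = 4
bays = 24
print(drives_supported(ordered))      # 16
print(cables_needed(bays) - ordered)  # 2
```

Four cables cover 16 bays, and 24 bays need six, hence the two extra cables.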


Here's the case unpacked from its double boxes. It's surprisingly light for how sturdy it feels! The manufacturer says it's around 40 lbs, but I would have pegged it somewhere in the low 30s. All of the edges are rolled, so you don't get the stamped steel "case cut" issue that has plagued cases in the past.

Case unpacked


The front of the case holds 24 hot-swap disk trays. These feel pretty cheap when you pull them out and push them back in, with no satisfying click to let you know you've pushed the drive in far enough. This is really nitpicking, though, as you really don't want to push against the SATA/SAS interfaces very hard. And from the feel of things, the interface is more likely to break than the trays.

Case front


This is the motherboard I (wrongly) ordered. It's pretty beefy to say the least, with more than enough RAM slots for my needs. The plan for now is to keep the second CPU and RAM banks empty until more performance is needed.

Wrong (but nice) board


I got the CPU in the socket, and the RAM in the proper bank immediately. Since the CPU uses triple-channel DDR3, DIMMs have to be installed in groups of 3. You'll notice I forgot to order a CPU heatsink and fan assembly, which turned out to be fortunate since the board has to come out.

CPU and RAM mounted


This is when it dawned on me that something might be wrong. Everything was lining up great until I looked at the void where the power supplies will be. Damn!

Not looking good


Yep. That's not going to fit at all. Well, RMA for this board I guess.

Not going to work


Most people don't get to see what OEM packaging for hard drives looks like. When you're a system builder, you get them in bulk packs like this. The only thing I hate is unwrapping a ton of them, though. But I guess I'll have 24 anti-static bags after this is all done.

Hard Drives


I'm using Western Digital 1TB drives with TLER, which makes them better suited to RAID arrays, even though I'm not building a traditional RAID array. These were cheaper than their Seagate counterparts and more readily available, so that made the decision pretty easy. The drives slip into their trays and can be screwed in from the side or the bottom. I've chosen to use the mounting holes on the bottom to keep from interfering with the sliding mechanism in the case.

1TB HDDs mounted


The half-way mark has been reached. These 12 drives account for more space than most of the people I know have used in total in their whole life. And I'm going to double it. Hopefully I can start benchmarking this thing next week to see how the HBA card performs with this many drives!

12 done, 12 to go


That's all for now. Tomorrow I should have the power supplies and all of the drives installed. Then it's a waiting game until Thursday when the new motherboard and CPU cooler should arrive. If everything continues as planned, Friday will be a great day at work!


The build begins

At work, I've recently been facing a two-pronged problem with our server infrastructure. It's been weighing on my mind for a while, and recently I decided that something had to be done. I wasn't sure what I would end up doing, but I needed to act carefully and quickly. These aren't usually two things that go hand-in-hand when planning out IT infrastructure whose life expectancy needs to be five or more years!

The first prong of our problem was storage space. To be more specific, we're running out of it. We have a SAN at our HQ location that we use to host two VMware vSphere 4.x instances, along with data for our network monitoring system (OpenNMS) and a few other odds and ends. I calculated that within six months we would be at 100% of the capacity of that SAN without adding any new projects that required storage space. The problem here is that there's a project currently in the works that requires not only the rest of the space in this SAN, but an additional 3x its original capacity! To make matters that much worse, when I projected storage needs out to 18 months, we were beyond the maximum capacity of the SAN in question. If I were to band-aid this problem, it would cost around $10,000 now, and be completely useless in just over a year. So, while $10,000 is an absolute steal when it comes to adding the capacity I wanted to a SAN, I felt like I couldn't justify a $10,000 hit for 12 months of useful service. There's so much more I'd rather do with that money!


The second piece of the problem was performance. We are already running up against the performance capabilities of the SAN we were looking to upgrade! So this meant that while spending the money to upgrade its data capacity would solve one problem, it would actually make the second problem worse! And the more we rely on this SAN, the worse the performance problem gets. In fact, when I charted out the projected performance of the existing SAN over the next year, it would be so over-utilized that everything we ask it to do would take more than 24 hours to complete. This is a pretty severe problem when your business hours are 7am to 7pm, needless to say.


So I went back to the drawing board. I puzzled over several possible solutions, but none of them really made me happy. I could spend $20,000 on a second SAN and use the two of them in parallel, I could steal some capacity from our backup SAN for our customer data, I could just add capacity to the existing SAN and hope the performance calculations were wrong, and of course I could do nothing. The last option was by FAR the best option on the table at the time.


Then it dawned on me. I had experimented with some different servers here, building a "home grown" SAN. Servers were cheap, and I could use any brand of hard drive I wanted. I wasn't locked into a single vendor, and the price tag was extremely appealing. For $10,000 you can build a Linux (or Windows) server with so many disks in it that you actually need special hardware to handle them all! I had already been mulling over the idea of building a small Linux SAN for work, so why not build a massive one instead? I mulled this over for months until the opportunity presented itself. And by that I mean my back was against the storage wall, and a decision had to be made in just a few weeks.


Well, that day has come! While ordering software for my company, I negotiated a deal that saved around $3,500. This afforded me the opportunity to build a Linux server with room for 42 disk drives, a second CPU, more RAM than you could shake a stick at, and all the bells and whistles I wanted. This new system will give me the ability to expand throughput by adding much faster network cards and disk controllers, while allowing me to add affordable storage any time I want! What's better is that I will be modeling my system after ones used at CERN and LLNL, two massive agencies that have some serious storage needs.


Finally, after months of planning and research, the order has been placed and the parts are on their way. I'll be using 20 1TB hard drives, two or four 60GB SSDs, 6GB of RAM (initially), a single Intel E5620 Xeon CPU, a 10Gbit Ethernet card, and an Areca 24-port 6G SAS card in JBOD mode. The plan is to build a ZFS pool out of the disks, tune it to use the SSDs for data access acceleration, turn on the data de-duplication features, and benchmark the HELL out of it. Once I'm happy with the results, I'll be carving storage out of the pool and migrating things off the old SAN. Once everything is moved off the old SAN, I will add it to the ZFS pool as slower storage used for less frequently accessed data. As more storage is needed, the system will expand with the addition of $1,000 external disk chassis that can hold up to 24 disks each. The total disk limit of a single Areca card is 128 disks, and the system will be physically limited to two cards. I think 256 3+TB drives would be more than enough space to hold us over for the foreseeable future.
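To make the pool plan a little more concrete, here is a rough Python sketch that only builds the `zpool`/`zfs` command lines and prints them; it doesn't run anything. Almost everything specific in it is an assumption for illustration: the pool name `tank`, the placeholder device names, the choice of two 10-disk `raidz2` vdevs, and using the SSDs as `cache` (L2ARC) devices. I haven't settled on the real layout yet.

```python
def zfs_setup_commands(pool, data_disks, cache_disks, vdev_width,
                       raid_level="raidz2"):
    """Build (but do not run) commands for a pool with SSD cache and dedup.

    raid_level and vdev_width are illustrative assumptions, not a
    decided layout.
    """
    if len(data_disks) % vdev_width != 0:
        raise ValueError("data disks must divide evenly into vdevs")

    create = ["zpool", "create", pool]
    # Group the data disks into equally sized vdevs.
    for start in range(0, len(data_disks), vdev_width):
        create += [raid_level] + data_disks[start:start + vdev_width]
    if cache_disks:
        # SSDs attached as read-cache (L2ARC) devices.
        create += ["cache"] + cache_disks

    # Turn on de-duplication for the whole pool.
    dedup = ["zfs", "set", "dedup=on", pool]
    return [create, dedup]


# Placeholder device names; real ones depend on the system.
hdds = [f"hdd{i:02d}" for i in range(20)]
ssds = ["ssd0", "ssd1"]

for cmd in zfs_setup_commands("tank", hdds, ssds, vdev_width=10):
    print(" ".join(cmd))
```

Printing the commands instead of executing them makes it easy to eyeball a layout before committing 20 disks to it.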


I plan to keep good notes during this process, and I'm hoping to update this blog with the details as I make progress. I'll also be using the Phoronix benchmark suite to test access to the ZFS pool on the server itself and on iSCSI, CIFS, NFS, and AoE clients, so be prepared for lots of boring numbers.