Wednesday, March 23, 2016

The Time Machine: Baremetal Management is the new Backup Tape Changing

Before graduating high school, I’d been a paperboy, a bagboy, a dishwasher, and a facilities engineer for a ski resort (garbageboy), and then I moved up to rentals at that same ski resort. One of the primary reasons I picked the college I did was that it offered a full-time job that I could take advantage of right away to earn some money and, more importantly, get experience in its computer department.

That first “professional” experience was helping connect the dorms to a brand-new Ethernet network, the first network connectivity the dorms had ever had. Before that, all they had were VAX OpenVMS terminals. Speaking of the VAX, the bulk of my time was spent managing the VAX and its backups, as well as two brand-new DEC Alphas. At the time, making sure those backups were legit was my most important job. Those backups went to a tape drive that was already an antique when I started watching over it. There were dailies, weeklies, and monthly fulls. 100% of this was manual, and I did it from a line printer, not a monitor.



I could get called at any time of any day, and I had to restore any files accidentally deleted or corrupted from the old Winchester hard drives attached to the VAX. Imagine an irate professor working late into the night on a weekend to finish research and losing a file for whatever reason. I was the person who needed to respond and fix it immediately, without complaint. On one occasion the files were corrupted, and I spent an entire weekend unable to recover the culmination of a professor’s semester of chemistry research. I almost lost my job, almost lost my income to pay for tuition, almost lost any credibility in the CompSci department, pretty much almost lost everything I had worked for up until that weekend. This was one of the lowest points of my life and one of the turning points of my career, even though the work was basically “undifferentiated heavy lifting” and I was still making minimum wage, punching a timecard, and sitting at the bottom of the ladder career-wise. I bet that professor still remembers me, if only to wonder whether I’ve been run over by a truck.

Luckily, I didn’t lose my job and got a second chance. I picked up an additional job at a thermography company, writing printer drivers for an AS/400. Again, this was basically the lowest rung of the ladder, taken to get experience, but this company’s core business was printing very elaborate wedding invitations, graduation announcements, etc. Again, this was a job that held the lowest pay and the highest responsibility, because if the AS/400 couldn’t print, the entire business was at a standstill. From those early days on, I knew, quite viscerally, that I could never get comfortable where I was.

Fast-forward to my career at Nutanix today. I talk to customers on a daily basis about running HPC, Big Data, and container workloads on baremetal. Don’t get me wrong, baremetal is worthy competition, as there is nothing so self-service as your own dedicated, brand-new hardware. Earlier in my career, I worked for Argonne National Laboratory, and no one in the world has a longer, more respectable track record for managing baremetal at scale than the Department of Energy labs. With a batch scheduler, or even a multi-framework distributed scheduling system like Mesos, the baremetal becomes a distributed pool of compute. With HDFS, Elasticsearch, or Cassandra, for example, baremetal becomes a distributed pool of persistent storage.

So why not just use baremetal for these workloads? Well, Hadoop, for example, is great at distributed resiliency; however, it does not manage the hardware for you. Sure, a drive can fail, nodes can fail, top-of-rack switches can fail, but does Hadoop recover failed hardware? Brand-new baremetal is great, but how long is that expected to last? What is the amortization and depreciation schedule? New hardware is like a new car driven off the lot: silicon and server vendors are still in an ever-escalating competition to innovate, so by the time that fancy brand-new hardware is installed, it’s already depreciating and may be rendered obsolete relatively quickly. The advent of “software-defined” has not slowed down that death march.

Many tools have been created to alleviate these concerns and make hardware management easier. Cobbler, Razor, and now RackHD, for example, are stabs in the right direction. Web-scale companies such as Facebook and Twitter, which maintain public clouds or just a ton of infrastructure and services, have built the necessary tooling to scale their own hardware management efforts, but how is this composable or consumable outside of their respective platforms? Not to mention, simplifying compatibility for a known set of hardware is one thing; trying to accommodate any hardware is another, because the rows and columns of the interoperability matrix represent an exponentially growing opportunity for issues. Where Nutanix really shines is the infrastructure, the tooling, and the team behind making this the platform for simplifying hardware management for myriad applications at scale.

These baremetal clusters running Hadoop or Mesos are truly responsible for the lifeblood of the business, from its data to its second-by-second operation. If you’re running on baremetal, then, to borrow from my early experiences changing tapes and tweaking printer drivers, you are still stuck spending time on the most menial part of the infrastructure. It should be no surprise that more value is derived from the systems and data built on the hardware than from the hardware itself, so why not spend more time on that? As in the H. G. Wells novel, you are dependent on the Morlocks, those tape-changing, baremetal-replacing denizens of the datacenter, to keep up their ceaseless yet thankless duties. Where I see customers able to take advantage of Nutanix is in shifting that time to more fruitful pursuits: expanding their intelligence and their careers. Besides full-time Morlocks, plenty of people get trapped into doing this part-time, beholden to esoteric troubleshooting of the nuances of hardware.

“There is no intelligence where there is no need of change.” - H.G. Wells, The Time Machine

I can easily imagine that, if I had not tried to advance my career beyond changing tapes, I would not be where I am today. If I had been content swapping tapes and performing on-demand restores 24x7, I would have been miserable until I was obsolete. If I had been content editing ‘bin’ files and configuring Symmetrix arrays on nights and weekends, I would have been miserable until I was obsolete. And so on with just building VMs and workflows for managing VMs and hypervisors.

“An animal perfectly in harmony with its environment is a perfect mechanism. Nature never appeals to intelligence until habit and instinct are useless. There is no intelligence where there is no change and no need of change. Only those animals partake of intelligence that have to meet a huge variety of needs and dangers.” - H.G. Wells, The Time Machine

Of course, this is nothing different from what AWS or other public clouds accomplish for their customers. How much hardware management do I have to do for my AWS usage? Absolutely zero. It has always been zero, and I expect it to always be zero. One of AWS’s secrets to success, in my opinion, is that it emulates the feeling of getting brand-new hardware all the time. If I want a new instance, it’s just like brand-new and only an API call away, cost permitting, of course.

Why turn very smart, very ambitious people into Morlocks by making your admins spend their critical career time on provisioning, managing, and troubleshooting hardware? Instead, help them focus on the next generation of applications, analytics, or programming frameworks that make them grow. Help them be heroes to their partners, to their teams, or, maybe most importantly, to themselves.

“We should strive to welcome change and challenges, because they are what help us grow. Without them we grow weak like the Eloi in comfort and security. We need to constantly be challenging ourselves in order to strengthen our character and increase our intelligence.” - H.G. Wells, The Time Machine