Wednesday, March 23, 2016

The Time Machine: Baremetal Management is the new Backup Tape Changing

Before graduating high school, I’d been a paperboy, a bagboy, a dishwasher, a facilities engineer for a ski resort (garbageboy), then moved up to rentals at that same resort. One of the primary reasons I picked the college I did was that it offered a full-time job I could start right away to earn some money and, more importantly, gain experience in its computer department.

That first “professional” experience was helping connect the dorms to a brand-new Ethernet network, the first dorm network connectivity. Before that, all they had were VAX OpenVMS terminals. Speaking of the VAX, the bulk of my time was spent managing it and its backups, along with two brand-new DEC Alphas. At the time, making sure those backups were legit was my most important job. They went to a tape drive that was already an antique when I started watching over it. There were dailies and weeklies and monthly fulls; 100% of this was manual, and I did it from a line-printer console, not a monitor.



I could get called at any time of any day, and I had to restore any files accidentally deleted or corrupted from the old Winchester hard drives attached to the VAX. Imagine an irate professor working late into the night on a weekend to finish research and losing a file for whatever reason. I was the person who needed to respond and fix it immediately, without complaint. On one occasion the files were corrupted, and I spent an entire weekend unable to recover a professor’s culmination of a semester’s worth of chemistry research. I almost lost my job, almost lost the income that paid my tuition, almost lost any credibility in the CompSci department; pretty much almost lost everything I had worked for up until that weekend. It was one of the lowest points of my life and one of the turning points of my career, despite the work basically being “undifferentiated heavy lifting”: I was still making minimum wage, had a timecard, and was at the bottom of the career ladder. I bet that professor still remembers me, if only to wonder whether I've been run over by a truck.

Luckily, I didn’t lose my job and got a second chance. I picked up an additional job at a thermography company writing printer drivers for an AS/400. It was another bottom-rung job taken for the experience, but this company’s core business was printing very elaborate wedding invitations, graduation announcements, and the like. Once again the job carried the lowest pay and the highest responsibility, because if the AS/400 couldn’t print, the entire business was at a standstill. From those early days, I knew, quite viscerally, that I could never get comfortable where I was.

Fast-forward to my career at Nutanix today. I talk to customers on a daily basis about running HPC, Big Data, and container workloads on baremetal. Don’t get me wrong: baremetal is worthy competition, as there is nothing so self-service as your own dedicated, brand-new hardware. Earlier in my career I worked for Argonne National Laboratory, and no one in the world has a longer, more respectable track record for managing baremetal at scale than the Department of Energy labs. With a batch scheduler, or even a multi-framework distributed scheduler like Mesos, baremetal becomes a distributed pool of compute. With HDFS, Elasticsearch, or Cassandra, for example, baremetal becomes a distributed pool of persistent storage.

So why not just use baremetal for these workloads? Well, Hadoop, for example, is great at distributed resiliency, but it does not manage the hardware for you. Sure, a drive can fail, nodes can fail, top-of-rack switches can fail, but does Hadoop recover failed hardware? Brand-new baremetal is great, but how long is that expected to last? What is the amortization and depreciation schedule? Just like driving a new car off the lot, hardware innovation driven by silicon and server vendors is an ever-escalating competition, so by the time that fancy brand-new hardware is installed, it is already depreciating and may be rendered obsolete relatively quickly. The advent of “software-defined” has not slowed down that death march.
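To make the resiliency-versus-hardware distinction concrete: HDFS handles failure at the data layer by re-replicating blocks from surviving nodes, governed by a couple of standard properties in hdfs-site.xml (the values shown are the common defaults, used here only for illustration):

```xml
<!-- hdfs-site.xml (sketch): data-level resiliency settings.
     HDFS re-replicates lost blocks automatically, but a dead
     disk or node still has to be physically replaced by someone. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- keep three copies of every block -->
  </property>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value> <!-- DataNode heartbeat interval, in seconds -->
  </property>
</configuration>
```

When a DataNode stops heartbeating, the NameNode eventually schedules re-replication of its blocks elsewhere; the failed hardware itself stays failed until a human, or a platform, deals with it.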

There have been many tools created to alleviate these concerns and make hardware management easier. Cobbler, Razor, and now RackHD, for example, are stabs in the right direction. Web-scale companies that maintain public clouds, or just a ton of infrastructure and services, like Facebook and Twitter, have built the tooling needed to scale their own hardware management efforts, but how is any of that composable or consumable outside their respective platforms? And there is a difference between simplifying hardware compatibility and trying to accommodate arbitrary hardware, where every added row and column of the interoperability matrix multiplies the opportunity for issues. Where Nutanix really shines is the infrastructure, the tooling, and the team behind making this the platform for simplifying hardware management for myriad applications at scale.

These baremetal clusters running Hadoop or Mesos are truly responsible for the lifeblood of the business, from its data to its second-by-second operation. If you’re running on baremetal, then, to borrow from my early experiences changing tapes and tweaking printer drivers, you are still stuck spending time on the most menial part of the infrastructure. More value is derived from the systems and data built on the hardware than from the hardware itself, which should be no surprise, so why not spend more time there? To borrow from the H. G. Wells novel, you are dependent on the Morlocks, those tape-changing, baremetal-replacing denizens of the datacenter, to keep up their ceaseless yet thankless duties. Where I see customers taking advantage of Nutanix is in shifting that time to more fruitful pursuits: expanding their intelligence and their careers. Besides the full-time Morlocks, plenty of people get trapped doing this part-time, beholden to esoteric troubleshooting of hardware nuances.

“There is no intelligence where there is no need of change.” - H.G. Wells, The Time Machine

I can imagine that if I had not tried to advance my career beyond changing tapes, I easily would not be where I am today. If I had been content swapping tapes and performing on-demand restores 24x7, I would have been miserable until I was obsolete. If I had been content editing ‘bin’ files and configuring Symmetrix arrays on nights and weekends, I would have been miserable until I was obsolete. And so on with just building VMs and workflows for managing VMs and hypervisors.

“An animal perfectly in harmony with its environment is a perfect mechanism. Nature never appeals to intelligence until habit and instinct are useless. There is no intelligence where there is no change and no need of change. Only those animals partake of intelligence that have a huge variety of needs and dangers.”  - H.G. Wells, The Time Machine

Of course, this is nothing new from what AWS or other public clouds accomplish for their customers. How much hardware management do I have to do for my AWS usage? Absolutely zero. It has always been zero, and I expect it to always be zero. One of AWS’s secrets to success, in my opinion, is that it emulates the feeling of getting brand-new hardware all the time. If I want a new instance, it’s just like brand-new and only an API call away, cost-permitting of course.

Why turn very smart, very ambitious people into Morlocks by making your admins spend their critical career time on provisioning, managing, and troubleshooting hardware? Instead, help them focus on the next generation of applications or analytics or programming frameworks that make them grow. Help them be heroes to their partners, their teams, or, maybe most importantly, themselves.

“We should strive to welcome change and challenges, because they are what help us grow. Without them we grow weak like the Eloi in comfort and security. We need to constantly be challenging ourselves in order to strengthen our character and increase our intelligence.” - H.G. Wells, The Time Machine



Monday, January 11, 2016

Stay out of my way, Nutanix

After six months as a specialist at Nutanix, the difference in the way I spend my time is significant. I spend more time focused on application platforms, and I can dive into a Hadoop distro or Elasticsearch and their associated tools with customers any time I want. I don’t spend much time in Prism or working with storage. In fact, I spend as little time as possible touching any Nutanix settings, except to spin up new batches of apps.

Prism and AHV stay out of my way. What little time I do spend in Prism goes to cloning from a couple of templates, and then I’m done. I spend a bit more time in Chef (it’s where I cut my teeth) and SaltStack (personal preference and speed). Having that time means I can deploy platforms fast and switch up my environment mix quickly and relatively easily. When it’s not easy, it means I’m learning something new about the application environment, which is great.
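The post-clone configuration step can be as small as a single Salt state. A minimal sketch, assuming a freshly cloned VM with a Salt minion and a package repository already set up (the state file name and package are illustrative; `pkg.installed` and `service.running` are standard Salt state modules):

```yaml
# elasticsearch/init.sls (sketch): turn a freshly cloned VM
# into an Elasticsearch node with one state run.
elasticsearch:
  pkg.installed: []        # install from the configured repository
  service.running:
    - enable: True         # start now and on every boot
    - require:
      - pkg: elasticsearch # don't start until the package is in place
```

Applied with something like `salt 'es-*' state.apply elasticsearch`, the same state file turns any number of clones into identical nodes, which is what makes switching the environment mix quick.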

Besides things like Hadoop and Elasticsearch, I spend time working on platforms like Cloud Foundry, Kubernetes, and Mesos. The nuanced differences, as well as the similarities, among these are fascinating to watch and to work through with customers. This is what I want to spend my time on. From all the time I spend speaking with customers, this is where customer admins would rather spend their time as well. These are the platforms their developers and line-of-business owners are staking their companies on when they say, “We need to do something with all of this data” or “We need to change our apps faster”. It’s integral to their jobs that they understand what’s going on here and how these platforms are evolving.

I don’t spend a lot of time worrying about VM-centric management interfaces. The granularity is wrong for what I work on now, and I don’t need to account for any features in a virtualization layer. Of course, if you string enough artificial management and automation layers together, you can build apps. I know that; I’ve done that, but it takes away time I could be spending directly in my platform workflow.

I don’t worry about storage provisioning or allocation, like LUNs or RAID groups, or arbitrary constructs like vSphere clusters or resource pools. I do, however, think about storage performance a lot differently. I can focus on scaling and sharding. I can focus on using intrinsic performance tools that let me have a performance dialogue with customers instead of just performance confrontations. I can ask more intelligent questions about the workload, the data, and how the data transformation pipeline works.

For example, one of the arguments I hear a lot is whether separating compute and storage so that they can be scaled independently is beneficial. One of the problems there is that it usually means one or the other is always the bottleneck, since it’s exceedingly rare that we can do capacity planning without any constraints. Also, when was the last time a workload didn’t need to actually pull or push any IO? The rate and variability of IO are key to differentiating how data flows through any useful system. I would not have been as familiar with this in my specific areas of focus had I not been working for Nutanix. I can spend time in the application stack, looking at scaling, looking at the working set, and better understanding where exactly any given IO is landing, because I have that time. I’m not worrying about the virtualization layer or virtual infrastructure management that doesn’t help me learn more about the app platforms customers really care about.


In all, I feel well rewarded working with Nutanix and the customers I speak with every day, around the world. I am reminded of working on AWS instances: AWS likewise couldn’t care less about forcing you to understand its virtualization layer, if you even know it has one. I use AWS or Nutanix and focus on what I need to build and what I want to learn about today, instead of saying, “I’d love to learn more about something like Spark machine learning or Kubernetes 1.1, but only after I am done getting all of this virtual infrastructure patched up properly.” I can also trace how this works and contrast something I did in AWS with a Nutanix cluster, because the management approach of both is very similar. Don’t bog me down with hardware management or virtual machine management. Let me build and learn and stay out of my way.