Thursday, May 29, 2014

Quick Random Question #2

In a high-latency, low-bandwidth ROBO deployment scenario, how can a customer optimize their ESXi image for deployment so that the remote site does not suffer through large VIB downloads, can function independently, and can tolerate unreliable connectivity?

For starters:
1. On a vanilla ESXi image, run "esxcli software vib list" to get the baseline of included VIBs.
2. On one of their prod ESXi hosts (clustered, patched, drivers installed, everything matching what a prod host will look like at the remote site), run "esxcli software vib list" to get all the VIBs they'll need.
3. Use the ESXi Image Builder CLI to build a software depot containing that full VIB set, then export an ISO that can bootstrap any remote host (a rough PowerCLI sketch follows below).
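
To make step 3 more concrete, here is a rough PowerCLI sketch using the Image Builder cmdlets. The depot paths, profile names, and VIB name are placeholders; substitute whatever the vib list comparison from steps 1 and 2 turns up in your environment.

# Add the VMware offline depot plus any vendor depots (e.g. async driver bundles).
Add-EsxSoftwareDepot C:\depots\VMware-ESXi-5.x-depot.zip
Add-EsxSoftwareDepot C:\depots\vendor-drivers-depot.zip

# Clone the standard image profile as the starting point for the ROBO image.
New-EsxImageProfile -CloneProfile "ESXi-5.x-standard" -Name "ESXi-ROBO" -Vendor "MyOrg"

# Add each VIB found on the prod host but missing from the vanilla baseline.
Add-EsxSoftwarePackage -ImageProfile "ESXi-ROBO" -SoftwarePackage "example-net-driver"

# Export a bootable ISO to ship to, or stage at, the remote site.
Export-EsxImageProfile -ImageProfile "ESXi-ROBO" -ExportToIso -FilePath C:\iso\ESXi-ROBO.iso

The exported ISO carries every required VIB, so the remote host never has to pull packages over the WAN during deployment.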

Additional links
Installing patches on an ESXi 5.x host from the command line
Using the vSphere ESXi Image Builder CLI:
http://blogs.vmware.com/vsphere/2012/04/using-the-vsphere-esxi-image-builder-cli.html

Hadoop Summit and Hadoop as a Service

At the beginning of this year, I mentioned working on Big Data and virtualization, and it has been a fruitful time.  Next week I will be co-presenting with Chris Mutchler from Adobe on "Hadoop-as-a-Service for Lifecycle Management Simplicity" at the Hadoop Summit conference in San Jose, CA.  Our session will be on Wednesday from 4:35pm-5:15pm.

I am humbled and excited to present alongside sessions from some of the most respected names in the industry, including Yahoo!, Google, Cloudera, Hortonworks, MapR, and Microsoft.  The growing depth, evolution, and community of the Big Data ecosystem is impressive, to say the least.  I hope to attend other Hadoop customer sessions as well as investigate what other large players are accomplishing with their respective stacks.  I see a lot of advanced sessions around new use-cases for Hadoop and research into adding additional layers and abstractions on top of it.  The Adobe session is focused on the usability of Hadoop from an IT operations perspective, with a few key points to make:
  • Explain why virtualizing Hadoop is good from a business, technical, and operational perspective
  • Accommodate the evolution and diversity of Big Data solutions
  • Simplify the lifecycle deployment of these layers for engineering and operations
  • Create Hadoop-as-a-Service with VMware vSphere, Big Data Extensions, and vCloud Automation Center
Hadoop is truly becoming a complicated stack at this point.  After I started getting hands-on and working with customers on Hadoop-specific projects in 2011, I found that calling this new technology "Hadoop" seemed a bit disingenuous.  There was really MapReduce and HDFS, a compute layer and a storage layer.  They were tightly coupled, but that coupling was enforced for very good and simple reasons.  Spending even more time on this has given me more perspective on the different layers and their corresponding workloads.  Unless you're only running one type of job on your compute layer and ingesting data from a static set of sources on your data layer, these workloads will vary, and they will vary independently.  However, in a physical world, with both layers rigidly coupled, how can they scale independently and flexibly?

Enter virtualization and everything I've been working on around virtualizing distributed systems, data analytics, Hadoop, and so forth.  Consider the layering of functionality across the different distributions and look for the similarities, whether you are looking at the Cloudera, Hortonworks, or Pivotal HD stack.

As a wise donkey once observed in a movie when describing onions and cakes, they all have layers, and so does any next-gen analytics platform.  Now we have our data layer, then a scheduling layer, and on top of that we can look at batch jobs, SQL jobs, streaming, machine learning, etc.  That is many moving parts, each with likely workload variability per application and per customer.  What abstraction layer helps pool resources, dynamically move components for elastic scale-in and scale-out, and allows for flexible deployment of these many moving parts?  Virtualization is a good answer, but one of the first questions I get is "How's performance?"  Well, I have seen vSphere scale and perform next to bare metal.  Listed below are links to the performance case study and whitepaper detailing recommendations that have been tested on large clusters.

Speaking of all these layers, this leads to complexity very quickly, so another angle specific to the Adobe Hadoop Summit presentation is hiding this complexity from the end-developers and making it easier and faster for them to develop, prototype, and release their analytics features into production.  Some sessions are exploring even deeper, more complex uses of Hadoop, and I am eager to see their results; however, enabling this lifecycle management for operations is essential to adoption of the latest functionality of any vendor's Big Data stack.  VMware's Big Data Extensions, in this case combined with vCloud Automation Center, allows for self-service consumption and easier experimentation.  There's a (disputed) quote attributed to Einstein that states, "Everything should be made as simple as possible, but not simpler."  There are a few vendors working on making Hadoop easier to consume, and I would argue simplifying consumption of this technology is a worthwhile goal for the community.  Dare I say even Microsoft's vision of allowing Big Data analysis via Excel is actually very intriguing if they can make it that simple to consume.
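
To give a flavor of what that self-service consumption looks like under the covers, here is a hedged sketch using the Serengeti CLI that ships with Big Data Extensions.  The management server address, cluster name, and node counts are illustrative, and exact command syntax may vary between BDE versions.

# Connect to the BDE/Serengeti management server (address is illustrative).
connect --host bde-mgmt.example.com:8443

# Stand up a Hadoop cluster from the default template.
cluster create --name demoAnalytics

# Elastically scale the worker node group out as the compute workload grows.
cluster resize --name demoAnalytics --nodeGroup worker --instanceNum 10

# Review cluster status and node placement.
cluster list --name demoAnalytics --detail

Fronting requests like these with vCloud Automation Center is what turns them into self-service catalog items for developers, as covered in the links below.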

Another common question I get is "Virtualization is fine for dev/test, but how is it for production?"  First, simplicity, elasticity, and flexibility are even more important in a production environment.  And maybe more importantly, let's not discount the value of experimentation to any software development lifecycle.  As much as Hadoop enterprise vendors would like to make Hadoop integration turnkey with any data source, any platform, and any application, I would argue we have a long way to go.  Innovation depends on experimentation and the ability to test out new algorithms, replace layers of the stack, and evaluate and isolate different variables in this distributed system.

One more assumption that keeps coming up is the perception that 100% utilization on a Hadoop cluster equals a high degree of efficiency.  I am not a Java guru or an expert Hadoop programmer by any means, but it would be very easy for me to write something that drives a Yahoo!-scale set of MapReduce nodes to 100% utilization while giving me no benefit whatsoever.  Take that a step further: a job can provide some benefit to the user and still be very resource-inefficient.  Quantifying that is worthy of more research, but for now, optimizing the efficiency of any type of job or application will deliver better business and operational intelligence to an organization and actually make their data lake (pond, ocean, deep murky loch?) worth the money.

Add to these business and operational justifications the improved security posture:
http://virtual-hiking.blogspot.com/2014/04/new-roles-and-security-in-virtualized.html
and now you should have a much better idea of the solutions that forward-thinking customers are adopting to weaponize their in-house and myriad vendor analytics platforms.

Really exciting tech and hope to see you next week in San Jose!


Additional links
Hadoop Summit:
http://hadoopsummit.org/san-jose/schedule/
http://hadoopsummit.org/san-jose/speakers/#andrew-nelson
http://hadoopsummit.org/san-jose/speakers/#chris-mutchler
Hadoop performance case study and recommendations for vSphere:
http://blogs.vmware.com/vsphere/2013/05/proving-performance-hadoop-on-vsphere-a-bare-metal-comparison.html
http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf
Open source Project Serengeti for Hadoop automated deployment on vSphere:
http://www.projectserengeti.org/
vSphere Big Data Extensions product page:
http://www.vmware.com/products/vsphere/features-big-data
How to set up Big Data Extensions workflows through vCloud Automation Center v6.0:
https://solutionexchange.vmware.com/store/products/hadoop-as-a-service-vmware-vcloud-automation-center-and-big-data-extension#.U4broC8Wetg
Big Data Extensions setup on vSphere:
https://www.youtube.com/watch?v=KMG1QlS6yag
HBase cluster setup with Big Data Extensions:
https://www.youtube.com/watch?v=LwcM5GQSFVY
Big Data Extensions with Isilon:
https://www.youtube.com/watch?v=FL_PXZJZUYg
Elastic Hadoop on vSphere:
https://www.youtube.com/watch?v=dh0rvwXZmJ0

Tuesday, May 6, 2014

Virtualized HPC and Customer Sessions For Your Consideration...VMworld 2014

Now that the floodgates are open for VMworld session voting, I wanted to let you know about some key sessions regarding virtualized high-performance computing applications, analytics, and Big Data.
https://vmworld2014.activeevents.com/scheduler/publicVoting.do

Session 1682: Agile HPC-as-a-Service with VMware vCloud Automation Center
What I am aiming at is bringing the simplicity of deployment and integration available with Big Data Extensions to HPC clusters.  A simple idea that has already gotten very complicated very quickly, but one worth the effort to research.

Session 1688: Is Someone Bitcoin Mining on my Cluster? High Performance Security for Virtualized High Performance Computing
If we have these targeted compute clusters, what are we doing to make sure they are being used appropriately?  This issue will only become more prevalent so why not learn how to get in front of it?

Session 1428: Hadoop as a Service: Utilizing VMware vCloud Automation Center and Big Data Extensions at Adobe
Real-world details from Chris Mutchler of Adobe, a customer I have worked with on virtualizing and automating Hadoop deployments.  This session will focus on the automation and self-service flexibility gained through virtualization and leveraging BDE.

Session 1424: Massively scaled VSAN Implementations
Another real customer implementation, detailed with Frans Van Rooyen, a Compute Platform Architect at Adobe, around their use-case for VSAN for large-scale analytics.

Session 1466: High-Performance Computing in the Virtualized Datacenter
Edmond DeMattia from the JHU Applied Physics Laboratory discusses his latest real-world experience pooling virtualized compute clusters at scale.  His session was a hit last year, and I hope he gets a chance to update everyone on his work this year.

Session 1856: How to Engage with Your Engineering, Science, and Research Groups About Virtualization and Cloud Computing
This session is being done by my two good friends: Matt Herreras, a systems engineering manager for SLED, and Josh Simons, who works in the Office of the CTO focused on HPC.  For virtualizing HPC to succeed, getting buy-in from the end users is definitely necessary.

Session 2508: Extreme Computing on vSphere
Josh Simons and Bhavesh Davda from the Office of the CTO at VMware present on virtualizing latency-sensitive workloads, using InfiniBand, GPGPUs, and Xeon Phi from virtual machines.

Session 1539: Why the Hypervisor isn't a Commodity. Performance Best Practices from 3 Tier to HPC workloads
Bhavesh Davda and Aaron Blasius, a Product Line Manager for vSphere, discuss performance tips and tricks from the VMkernel.

Session 1232: Reference Architectures and Best Practices for Hadoop on vSphere
Justin Murray from VMware Tech Marketing and Chris Greer from FedEx discuss their current architecture for Hadoop on vSphere, as well as how this is evolving for the next generation of Hadoop.

Session 1697: Hadoop on vSphere for the Enterprise
The first of two customer panels from Joe Russell, PM of Storage and Big Data for VMware, with Northrop Grumman, FedEx, and Adobe discussing their experiences virtualizing Hadoop.

Session 1807: Best Practices of Virtualizing Hadoop on vSphere - Customer Panel
Joe Russell's second customer panel including Adobe, Wells Fargo, and GE.

Customer Sessions:
Also, in addition to the customer panels and the Adobe and JHU APL sessions above, I would highly recommend voting for customer sessions on the technologies you want to see.  I always get asked for customer references for all different kinds of technologies, and now is the opportunity for you to invite customers who want to talk about their implementations and lessons learned.  Please take advantage and help validate the work that these customers have done to advance the community.  A few examples, in no particular order and certainly not a definitive list:
1400, Kroger and their ROBO implementation
2382, 2505, Symantec and their cloud implementation
2463, Francis Drilling Fluids and their SDDC
2770, University of Wisconsin and migrating to the vCenter Server Appliance
1526, Greenpages/LogisticsOne and VSAN
1635, MolsonCoors and their SDDC, including virtualizing SAP, using vCOps, VIN, and SRM
1897, Boeing and their ITaaS
2687, McAfee and Intel elastic cloud
2385, McKesson OneCloud
2285, Grizzly Oil and Horizon View

Thanks for taking the time to read and vote!