Monday, November 2, 2009

6 Months In: Fully Automated Provisioning Revisited

It's been about six months since I co-authored the "Web Ops 2.0: Achieving Fully Automated Provisioning" whitepaper along with the good folks at Reductive Labs (the team behind Puppet). While the paper was built on a case study about a joint user of ControlTier and Puppet (and a joint client of my employer, DTO Solutions, and Reductive Labs), the broader goal was to start a discussion around the concept of fully automated provisioning.

So far, so good. In addition to the feedback and lively discussion, we've just gotten word of the first independent presentation by a community member. Dan Nemec of Silverpop made a great presentation at AWSome Atlanta (a cloud computing technology focused meetup). John Willis was kind enough to record and post the video:

I'm currently working on an updated and expanded version of the whitepaper and am looking for any contributors who want to participate. Everything is being done under the Creative Commons (Attribution - Share Alike) license.

The core definition of "fully automated provisioning" hasn't changed: the ability to deploy, update, and repair your application infrastructure using only pre-defined automated procedures.

Nor have the criteria for achieving fully automated provisioning:

  1. Be able to automatically provision an entire environment -- from "bare-metal" to running business services -- completely from specification
  2. No direct management of individual boxes
  3. Be able to revert to a "previously known good" state at any time
  4. It’s easier to re-provision than it is to repair
  5. Anyone on your team with minimal domain-specific knowledge can deploy or update an environment
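To make the criteria concrete, here is a minimal sketch in Python of criteria 1, 3, and 4. Everything in it (the `SPEC` dict, `provision`, `reprovision`) is hypothetical and illustrative -- it is not taken from Puppet, ControlTier, or any real tool:

```python
# Hypothetical sketch: everything is built from a specification, and
# "repair" means rebuilding from that spec rather than fixing in place.

SPEC = {
    "os": "centos-5.3",
    "system_packages": ["httpd", "ntp"],
    "services": ["webapp-1.2.0"],
}

def provision(spec):
    """Criterion 1: build an entire environment purely from the spec."""
    return {
        "os": spec["os"],
        "packages": sorted(spec["system_packages"]),
        "services": sorted(spec["services"]),
    }

def reprovision(spec):
    """Criterion 4: don't repair a drifted box in place -- throw it
    away and rebuild from the known-good spec (criterion 3)."""
    return provision(spec)

good = provision(SPEC)
drifted = dict(good, services=[])   # simulate drift or failure
assert reprovision(SPEC) == good    # back to "previously known good"
```

The point of the sketch is only the shape: no step touches an individual box directly, and the specification is the single source of truth for the whole environment.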


The representation of the open source toolchain has been updated and currently looks like this:

The new column on the left was added to describe the kinds of actions that take place at the corresponding layer. The middle column shows each layer of the toolchain. In the right column are examples of existing tools.
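For readers who can't see the diagram, the three columns can be approximated as data. A sketch in Python -- the layer names follow the whitepaper, but the tools listed are just well-known examples, not an exhaustive or endorsed set:

```python
# Illustrative rendering of the toolchain diagram as a data structure.
# Tool names are examples only; any equivalent tool fits the layer.

TOOLCHAIN = [
    {"action": "install the OS and bring hosts onto the network",
     "layer": "OS install / provisioning",
     "example_tools": ["Kickstart", "Jumpstart", "Cobbler"]},
    {"action": "converge system packages, users, and config files",
     "layer": "system configuration management",
     "example_tools": ["Puppet", "cfengine", "Chef"]},
    {"action": "deploy applications and manage running services",
     "layer": "application service deployment",
     "example_tools": ["ControlTier", "Capistrano", "Fabric"]},
]

for step in TOOLCHAIN:
    print(f'{step["layer"]}: e.g. {", ".join(step["example_tools"])}')
```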

There are some other areas that are currently being discussed:

1. Where does application package management fall?
This is an interesting debate. Some people feel that all package distribution and management (system and application packages) should take place at the system configuration management layer. Others think that the system configuration management layer should handle system packages while the application service deployment layer handles application and content packages.


2. How important is consistency across lifecycle?
It's difficult to argue against consistency, but how far back into the lifecycle should the fully automated provisioning system reach? All Staging/QA environments? All integrated development environments? Individual developer's systems? It's a good rule of thumb to deal with non-functional requirements as early in the lifecycle as possible, but that imposes an overhead that must be dealt with.


 
3. Language debate
With a toolchain you are going to have different tools with varying methods of configuration. What kind of overhead are you adding because of differing languages or configuration syntax? Does individual bias towards a particular language or syntax come into play? Is it easier to bend (or, some would say, abuse) one tool to do most of everything rather than use a toolchain that lets each tool do what it's supposed to be good at?

4. New case study
I'm working on adding additional case studies. If anyone has a good example of any part of the toolchain in action, let me know.

Monday, September 28, 2009

Q&A: Lee Thompson, former Chief Technologist of E*TRADE Financial

I recently caught up with Lee Thompson to discuss a variety of Dev to Ops topics including deployment, testing, and the conflict between dev and ops.

 

Lee recently left E*TRADE Financial where he was VP & Chief Technologist. Lee's 13 years at E*TRADE saw two major boom and bust cycles and dramatic expansion in E*TRADE's business.

 

 

Damon:
You've had large scale ops and dev roles... what lessons have you learned that required the perspective of both roles?

Lee:
I was heavy into ops during the equity bubble of 1998 to 2000 and during that time we scaled E*TRADE from 40,000 trades a day to 350,000 trades a day. After going through that experience, all my software designs changed. Operability and deployability became large concerns for any kind of infrastructure I was designing. Your development and your architecture staff have to hand the application off to your production administrators so the architects can get some sleep. You don't want your developers involved in running the site. You want them building the next business function for the company. The only way that is going to happen is to have the non-functional requirements -- deployability, scalability, operability -- already built into your design. So that experience taught me the importance of non-functional requirements in the design process.

 

Damon:
You use the phrase "wall of confusion" a lot... can you explain the nature of the dev and ops conflict?

Lee:
When dealing with a large distributed compute infrastructure you are going to have applications that are difficult to run. The operations and systems engineering staff who is trying to keep the business functions running is going to say "oh these developers don't get it". And then back over on the developer side they are going to say "oh these ops guys don't get it". It's just a totally different mindset. The developers are all about changing things very quickly and the ops team is all about stability and reliability. One company, but two very different mindsets. Both want the company to succeed, but they just see different sides of the same coin. Change and stability are both essential to a company's success.

 

Damon:
How can we break down the wall of confusion and resolve the dev and ops conflicts?

Lee:
The first step is being clear about non-functional requirements and avoiding what I call "peek-a-boo" requirements.

Here's a common "peek-a-boo" scenario:
Development (glowing with pride in the business functions they've produced): "Here's the application"
Operations: "Hey, this doesn't work"
Development: "What do you mean? It works just fine"
Operations: "Well it doesn't run with our deployment stack"
Development: "What deployment stack?"
Operations: "The stuff we use to push all of the production infrastructure"

The non-functional requirements become late cycle "peek-a-boo" requirements when they aren't addressed early in development. Late cycle requirements violate continuous integration and agile development principles. The production tooling and requirements have to be accounted for in the development environment, but most enterprises don't do that. Since the deployment requirements aren't covered in dev, what ends up happening is that the operations staff receiving the application has to innovate out of necessity, and they end up writing a number of tools that over time become bigger and bigger and more of a problem -- which Alex described last year in the Stone Axe post. Deployability is a business requirement and it needs to be accounted for in the development environment just like any other business requirement.

 

Damon:
Deployment seems to be a topic of discussion that is increasing in popularity... why is that?

Lee:
Deployability and the other non-functional requirements have always been there, they were just often overlooked. You just made do. But a few things have happened.

1. Complexity and commodity computing, in both the hardware and the software, have meant that your deployment is getting to the point where automation is mandatory. When I started at E*TRADE there were 3 Solaris boxes. When I left, the number of servers was orders and orders of magnitude larger (the actual number is proprietary). Since the operations staff can't afford to log into every box, they end up writing tools because the developers didn't give them any.

2. Composite applications, where applications are composed of other applications, mean that every application has to be deployed and scaled independently.  Versioning further complicates matters. Before the Internet, in the PC and mainframe industries, you were used to delivering and maintaining multiple versions of software simultaneously. In the early days of the Internet, a lot of that went away and you only had two versions -- the one deployed and the one about to be deployed. Now with the componentization of services and the mixing and matching of components you'll find that typically you have several versions of a piece of infrastructure facing different business functions. So you might be running three or four independently deployed and managed versions of the same application component within your firewall.  

3. Cloud computing takes both complexity and the need for deployability up another notch. Now you are looking to remote a portion of your infrastructure into the Internet. Almost everyone who is starting up a company right now is not talking about building a datacenter, they are all talking about pushing to the cloud. So deployability is very much at the forefront of their thinking about how to deliver their business functions.  And the cloud story is only beginning. For example, what happens when you get a new generation of requirements like the ability to automate failover between cloud vendors?

 

Damon:
Testing is one of those things that everyone knows is good, but seems to rarely get adequately funded or properly executed. Why is that?

Lee:
Well, like many things it's often simply a poor understanding of what goes into doing it right and an oversimplification of what the business value really is.

Just like deployment infrastructure, proper testing infrastructure is a distributed application in and of itself. You have to coordinate the provisioning of a mocked up environment that mimics your production conditions and then boot up a distributed application that actually runs the tests. The level of thought and effort that has to go into properly doing that can't be overlooked. Well, not if you are serious about delivering on quality of service.

While integration testing should be a very important piece of your infrastructure, the importance of antagonistic testing also can't be overlooked. For example, the CEO is going to want to know what happens when you double the load on your business. The only way to really know that is to have a good facsimile of your business application mocked up and those exact scenarios tested. That is a large scale application in and of itself and takes investment.

Beyond service quality, there is business value in proper testing infrastructure that is often overlooked. When you start to build up a large quantity of high fidelity tests, those tests actually represent knowledge about your business. Knowledge is always an asset to your business. It's pretty clear that businesses that know a lot about themselves tend to do well and those that lack that knowledge tend not to be very durable.

 

Damon:
The culture of any large company is going to be restrictive. Large financial institutions are, by their very nature, going to be even more restrictive. Coming out of that culture, what are you most excited to do?

Lee:
Punditry, blogging and using social media to start! You really can't do that from behind the firewall in the FI world. There are legitimate reasons for the restrictions, and I understand that. Because you have to contend with a lot of regulatory concerns, you just aren't going to see a lot of financial institution technologists ranting online about what is going on behind the firewall. So I'm excited about becoming a producer of social media content rather than just a consumer.

I also find consulting exciting. It's been fun getting out there and seeing a variety of companies and how similar their problems are to each other and to what I worked on at E*TRADE. It reminds me how advanced E*TRADE really is and what we had to contend with. I enjoy applying the lessons I've learned over my career to helping other companies avoid pitfalls and helping them position their IT organization for success.

Friday, September 25, 2009

"Stability anti-patterns" highlight importance of tracking non-functional requirements

Michael Nygard, author of the influential Release It!, delivered a fantastic speech at QCon London where he dives into the various flavors of "stability anti-patterns" that can (and probably will) plague your web operations.

 


The explicit lessons alone are worth watching his presentation. However, there is also a more subtle lesson delivered over the course of the speech. Whether it was intentional or not, Michael illustrates the importance (and difficulty) of tracking non-functional requirements across the application lifecycle.

 

Some of the knowledge of where your next failure could come from lives in Development. Some of it lives in Operations. Sometimes it even lives in the well-intentioned plans of your marketing department.

What are you doing about sharing and tracking that knowledge? If you are like most organizations, you are probably relying on tribal knowledge to avoid these pitfalls. Development handles what they know about, Operations handles what they know about, and everyone crosses their fingers and hopes it all works out. Unlike the well understood processes and tooling used to track business requirements, non-functional requirements all too often fly under the radar or are afterthoughts to be handled during production deployment.

Monday, August 31, 2009

Are sys admins soon to be relics?

One of the ideas that can be extrapolated from the positions of the "infrastructure as code" crowd is that the future of systems administration will look dramatically different than it does today.*

The extreme view of the future is that you'll have a set of domain experts (application architects/developers, database architects, storage architects, performance management, platform security, etc.) who produce the infrastructure code and everything else happens automatically. The image of today's workhorse, pager wearing, fire extinguishing sys admin doesn't seem to have a role in that world.

Of course, the reality will be somewhere in the pragmatic middle. But a glimpse of that possible future should make sys admins question which direction they are taking their job skills.

I finally got around to digging into the conference wrap up report that O'Reilly publishes after its annual web operations conference, Velocity. Most of it was the standard self-serving kudos. However, the table below really caught my eye and inspired me to write this post.

Attendee Job Titles (multiple answers accepted)

  • Software Developer 60%
  • IT Management/Sys Admin 27%
  • Computer Programmer 20%
  • CXO/Business Strategist 19%
  • Web/UI Design 17%
  • Business Manager 16%
  • Product Manager 10%
  • Consultant 9%
  • Entrepreneur 8%
  • Business Development 4%
  • Community Activist 3%
  • Marketing Professional 2%
  • Other 5%

Now of course you have to look at this data with a cautious eye. People were asked to self-describe, you could select multiple titles, some people were attending to learn about design tricks for faster page load times, and most people blow through marketing surveys without much care. However, it did catch my eye that somewhere between 60% and 80% described themselves as having a development role. Only 27% described themselves as having a sys admin role.

Now is it a big leap to point to this data as an early warning signal of the demise of the traditional sys admin role? Probably... but it fully jibes with the anecdotal evidence we saw around the conference halls. From large .com employees (Facebook, Twitter, Google, Flickr, Shopzilla, etc..) to the open source tools developers, the thought (and action) leaders were developers who happened to focus on systems administration, not systems administrators who happened to have development skills.

* Disclosure: I'm a member of the infrastructure as code crowd

Friday, July 3, 2009

Tools are easy. Changing your operating culture is hard.

Did you ever notice that our first inclination is to reach for a tool when we want to change something? What we always seem to forget is that web operations, as a discipline, is only partially about technology.

The success of your web operations depends more on the processes and culture your people work within than it does on any specific tooling choices. Yet, as soon as you start talking about changing/improving operations the conversation quickly devolves into arguments about the merits of various tools.

We see this repeatedly in our consulting business. Time after time we are called in to do a specific automation project and wind up spending the bulk of the effort as counselors and coaches helping the organization make the cultural shift that was the real intention of the automation project.

This article from InfoQ on how difficult it is to get development organizations to adopt an agile culture is a superb encapsulation of the difficulty of cultural change. Switch the word "development" to "web operations" and switch "Agile" to any cultural change you want to make and the article still holds up.

This condition really shouldn't be a surprise to any of us. After all, how much time do we really spend, as an industry, discussing cultural, process, and organizational issues? Compare the number of books and articles written about the people part of web operations vs. the number written about the technology part. The ratio is abysmal, especially when you compare it to other types of business operations (manufacturing, finance, service industries, etc.)

UPDATE: The Agile Executive brings up the point that tools are valuable for enforcing new behavior. I definitely agree with that... but I still maintain that adopting new tools without a conscious decision to change behavior and culture is, more often than not, an approach that will fail.