
Tuesday
April 27, 2010

Q&A: Ernest Mueller on bringing Agile to operations

 

"I say this as somebody who about 15 years ago chose system administration over development.  But system administration and system administrators have allowed themselves to lag in maturity behind what the state of the art is. These new technologies are finally causing us to be held to account to modernize the way we do things.  And I think that’s a welcome and healthy challenge."

                                                                   

 -Ernest Mueller 

 

 

I met Ernest Mueller back at OpsCamp Austin and have been following his blog ever since. As a Web Systems Architect at National Instruments, Ernest has had some interesting experiences to share. Like so many of us, he's been "trying to solve DevOps problems long before the word DevOps came around"!

Ernest was kind enough to submit himself to an interview for our readers.  We talked at length about his experiences bringing Agile principles to the operations side of a traditional large enterprise IT environment. Below are the highlights of that conversation. I hope you get as much out of it as I did.  

 

Damon:
What are the circumstances that led you down the path of bringing Agile principles to your operations work?

Ernest:
I've been at National Instruments for, oh, seven years now.  Initially, I was working on and leading our Web systems team that handled the systems side of our Web site.  And over time, it became clear that we needed to have more of a hand in the genesis of the programs that were going out onto the Web site, to put in those sorts of operational concerns, like reliability and performance.

So we kind of turned into a team that was an advocate for that, and started working with the development teams pretty early in their lifecycle in order to make sure that performance and availability and security and all those concerns were being designed into the system as an overall whole.  And as those teams started to move towards using Agile methodologies more and more, there started to become a little bit of a disjoint.

Prior to that, when they had been using Waterfall, we aligned with them and developed what we called the Systems Development Framework, which was kind of the systems equivalent of a software development lifecycle, to help the developers understand what it was we needed to do along with them.  And so we got to a point where it seemed like that was going very well.  And then Agile started coming in, and it started throwing us off a little more.

And the infrastructure teams – ours and others - more and more started to become the bottleneck to getting systems out the door.  Some of that was natural because we have 40 developers and a team of five systems engineers, right?  But some of that was because, overall, the infrastructure teams’ cadence was built around very long-term Waterfall.

As technologies like virtualization and Cloud Computing started to make themselves available, we started to see more clearly why that was.  Once you're able to provision your machines in that sort of way, a huge long pole that was usually on the critical path of any project started falling out, because, best case, if you wanted a server procured and built and given to you, you were talking six weeks of lead time.

And so, to an extent there was always – I hate to say an excuse, but a somewhat meaningful reason for not being able to operate on that same sort of quick cycle cadence that Agile Development works along.  So once those technologies started coming down and we started to see, "Hey, we actually can start doing that," we started to try it out and saw the benefits of working with the developers, quote, "their way" along the same cadence.  

 

Damon:
When you first heard about these Agile ideas – that developers were moving towards these short sprints and iterative cycles – were you skeptical or concerned that there was going to be a mismatch?  What were your doubts?

Ernest:
Absolutely, it was very concerning because when folks started taking up Agile there was less upfront planning, design, and architecture being done.  So things they needed out of the systems team that did require a large amount of lead time wouldn't get done appropriately.  They often didn't figure out until their final iteration that, "Oh, we need some sort of major systems change."  We would always get into these crunches where they decided, "Oh, we need a Jabber server," two weeks before they were supposed to go into test with their final version. It was an unpleasant experience for us because we felt like we had built up a process that had gotten us very well aligned, from a relationship point of view, between development and operations under the previous model.  And this was coming in and "messing that all up."  

Initially there were just infrastructure realities that meant you couldn't keep pace with that.  Or, well, “couldn't” is a strong word.  Historically, automation has always been an afterthought for infrastructure people.  You build everything and you get it running.  And then, if you can, go back and retrofit automation onto it, in your copious spare time. Unless you're faced with some sort of huge thousand server scale deal because you're working for one of the massive shops.  

But everyplace else, where you're dealing with 5, 10, 20 servers at a time, automation was always seen as a luxury – to an extent because of the historical lack of automation tools, but also because purchasing and putting in hardware has a long lag time associated with it.  We initially weren't able to keep up with the Agile iterations, and not only the projects but also the relationships among the teams suffered somewhat.

Even once we started to try to get on the Agile path, it was very foreign to the other infrastructure teams; even things like using revision control and creating tests for your own systems were considered "apps things" and were odd and unfamiliar.

 

Damon:
So how did you approach and overcome that skepticism and unfamiliarity that the rest of the team had toward becoming more Agile?

Ernest:
There were two ways.  One was evangelism.  The other way – I mean, I [laughter] hesitate to trumpet this as the right way – but mostly it was to spin off a new team dedicated to the concept, and use technologies like Cloud and virtualization to take other teams out of the loop when we could.

These new products that we're working on right now, we've, essentially, made the strategic decision that since we're using Cloud Computing, that all the "hardware procurement and system administration stuff" is something that we can do in a smaller team on a product-by-product basis without relying on the traditional organization until they're able to move themselves to where they can take advantage of some of these new concepts.  

And they're working on that.  There are people internally eyeballing ITIL and, recently, a bunch of us bought the Visible Ops book and are having a little book club on it to kinda get people moving along that path.  But we had to incubate, essentially, a smaller effort that's a schism from the traditional organization in order to really implement it.

 

Damon:
You’ve mentioned Agile, ITIL, and Visible Ops.  How do you see those ideas aligning? Are they compatible?

Ernest:
I know some people decry ITIL and see process as a hindrance to agility instead of an asset. I think it's a problem very similar to the one development teams have faced.  We actually just went through one of those Microsoft Application Lifecycle Management Assessments, where they talk with all the developers about the build processes and all of that.  And it ends up being the same kind of discussion.  Things like using version control and having coordinated builds are all a hindrance to a single person's velocity, right?  Why do it if I don't need it?

You know if I'm just so cool that I never need revision control for whatever reason, [laughter] then having it doesn't benefit me specifically.  But I think developers have gotten more used to the fact that, "Hey, having these things actually does help us in the long term, from the group level, come out with a better product."

And operations folks have long seen the value of process because they value stability.  They're the ones that get dinged on audits and things like that, so operations folks have seen the benefits of process in general.  So when they talk about ITIL, there are occasional people that grump, "Well, that will just be more slow process," but realistically, they're all into process. [laughter] It all just depends on what sort and how mature it is.

As for how to bridge ITIL to Agile [pause] -- that's what the Visible Ops book has tried to do by cutting ITIL down to priorities.  What are the most important things that you need to do?  Do those first, and do more of those.  What are the should-do's after that?  What are the nice-to-haves after that?  A lot of times, in operations, we can map out this huge diagram of every safeguard that any system we've worked on has ever had, and we count that as being the blueprint for the next successful system.  But that's not realistic.

It's similar to when you develop features.  You have to be somewhat ruthless about prioritizing what's really important and deprioritizing the things that aren't important so that you can deliver on time.  And if you end up having enough time and effort to do all that stuff, that's great.  

But if you work under the iterative model and focus first on the important stuff and finish it, and then on the secondary stuff and finish it, and then on the tertiary stuff, then you get the same benefit the developers are getting out of Agile, which is the ability to put a fork in something and call it done, based on time constraints and knowing that you've done the best that you can with that time.

 

Damon:
You mentioned the idea of bringing testing to operations and how that's a bit of a culture shift away from how operations traditionally worked. How did you overcome that? What was the path you took to improve testing in operations? 

Ernest:
The first time it really made itself clear to me was during a project where we were conducting a core network upgrade on our Austin campus, and there were a lot of changes going along with that.  I got tapped to project-manage the release.

We had this long and complex plan where the network people would bring the network up, and then the storage team would bring the storage up, and the Unix administrators would bring all the core Unix servers up.  It became clear to me that nobody was actually planning on doing any verification of whether their things were working right.  

We'd done some dry runs and, for example, the Unix admins would boot all their servers and wander off, and half of their NFS mounts would hang.  And they would say, "Well, I'm sure, three hours later, once the applications start running, the developers testing those will see problems because of it, and then I'll find out that my NFS mounts are hanging, right?"  [laughter].

And since I'm always eager not to disrupt the people up the chain if at all possible, that answer aggrieved me. I started talking to them about it, and that's when it first became clear to me that the same unit test and integration test concerns are just as applicable to infrastructure folks as to application folks.  For that release, we quickly implemented a bunch of tests to give us some idea of what the state of the systems was – ping sweeps from all boxes to all boxes to determine whether systems or subnets couldn't see each other, and NFS mount check scripts distributed via cfengine to verify those mounts.  The resulting release process and tests have been reused for every network release since because of how well and quickly they detect problems.
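
To make that concrete, here's a minimal sketch of the kind of NFS mount check Ernest describes. This is illustrative modern Python, not his actual script (his were distributed via cfengine); the fstab parsing, the use of stat, and the timeout are all assumptions:

    #!/usr/bin/env python3
    # Hypothetical "infrastructure unit test": verify that every NFS mount
    # declared in /etc/fstab is actually responding, instead of waiting for
    # an application to hang on it hours later.
    import subprocess
    import sys

    def nfs_mounts(fstab="/etc/fstab"):
        # Yield mount points whose filesystem type is NFS (nfs, nfs4, ...).
        with open(fstab) as f:
            for line in f:
                fields = line.split()
                if (len(fields) >= 3 and not line.lstrip().startswith("#")
                        and fields[2].startswith("nfs")):
                    yield fields[1]

    def mount_alive(mountpoint, timeout=10):
        # A stat that hangs is exactly the failure mode we want to catch,
        # so run it in a child process we can kill on timeout.
        try:
            subprocess.run(["stat", "-t", mountpoint], check=True,
                           capture_output=True, timeout=timeout)
            return True
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            return False

    failed = [m for m in nfs_mounts() if not mount_alive(m)]
    for m in failed:
        print("FAIL: NFS mount %s is not responding" % m)
    sys.exit(1 if failed else 0)

Like a unit test, it either passes quietly or fails loudly, and its exit code makes it easy to chain into a larger release checklist.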

It's difficult when there's not a – or at least we didn't know of anybody else who had really done that.  If you go out there and Google, "What does an infrastructure unit test look like," you don't get a lot of answers.  [Laughter].  So we were, to an extent, experimenting, trying to figure out: what is a meaningful unit test?  If I build a Tomcat server, what does it mean to have a unit test that I can execute on it even before there are "business applications" deployed on it?

I would say we're still in the process of figuring that out.  We're trying to build our architecture so that there are places for those unit tests, ideally, both when we build servers, but we'd like them to be a reasonable part of ongoing monitoring as well.
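
For readers wondering what a unit test against a bare Tomcat server could even check: one minimal, hypothetical answer is that, before any business application is deployed, the container should accept connections and speak HTTP. The host and port below are the stock defaults, assumed purely for illustration:

    # Hypothetical unit test for a freshly built Tomcat server, runnable
    # before any application is deployed on it. Any well-formed HTTP
    # response (200, 404, ...) proves the JVM is up and the connector works.
    import socket
    import urllib.error
    import urllib.request

    def test_tomcat_answers(host="localhost", port=8080):
        # 1. The connector port should accept a TCP connection at all.
        with socket.create_connection((host, port), timeout=5):
            pass
        # 2. The container should answer HTTP, even with no apps on it.
        try:
            urllib.request.urlopen("http://%s:%d/" % (host, port), timeout=5)
        except urllib.error.HTTPError:
            pass  # an HTTP error status is still a live container

    if __name__ == "__main__":
        test_tomcat_answers()
        print("OK: Tomcat is up and answering HTTP")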

 

Damon:
How are you writing those "unit tests for operations"? Are they automated? If so, are you using scripts or some sort of test harness or monitoring tool?

Ernest:
So right now the tests are only scripted.  We would be very interested in finding a test harness that would allow us to do that.  You can kind of retrofit existing monitoring tools, like Nagios or whatever, to run scripts and kinda call that a test harness.  But of course it's not really purpose-built for that.
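
The retrofit Ernest mentions works because Nagios judges a check script by a very simple contract: exit code 0 is OK, 1 is WARNING, 2 is CRITICAL (3 is UNKNOWN), plus a single line of output. Any test you can phrase as a script fits that contract. A hypothetical disk-space check, with made-up thresholds:

    # A check script following the standard Nagios plugin convention:
    # exit 0 = OK, 1 = WARNING, 2 = CRITICAL; print one line of status.
    import shutil
    import sys

    OK, WARNING, CRITICAL = 0, 1, 2

    def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
        usage = shutil.disk_usage(path)
        used_pct = 100.0 * usage.used / usage.total
        if used_pct >= crit_pct:
            return CRITICAL, "CRITICAL: %s is %.0f%% full" % (path, used_pct)
        if used_pct >= warn_pct:
            return WARNING, "WARNING: %s is %.0f%% full" % (path, used_pct)
        return OK, "OK: %s is %.0f%% full" % (path, used_pct)

    status, message = check_disk()
    print(message)
    sys.exit(status)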

 

Damon:
Any tips for people just starting down the DevOps or Agile Operations path?

Ernest:
Well, I would say the first thing is try to understand what it is the developers do and understand why they do it.  Not just because they're your customer, but because a lot of those development best practices are things you need to be doing too.  The second thing I would say is try to find a small prototype or skunkworks project where you can implement those things to prove it out.

It's nearly impossible to get an entire IT department to suddenly change over to a new methodology just because they all see a PowerPoint presentation and think it's gonna be better, right?  That's just not the way the world works.  But you can take a separate initiative and try to run it according to those principles, and let people see the difference that it makes.  And then expand it back out from there.  I think that's the only way that we're finding that we could be successful at it here.

 

Damon:
Why are DevOps and Agile Operations becoming such hot topics now?

Ernest:
I would say that one of the reasons this is becoming a much more pervasively understood topic is virtualization and Cloud Computing.  Now that provisioning can happen much more quickly on the infrastructure side, it's serving as a wake-up call, and people are saying, "Well, why isn't it?"

When we implemented virtualization, we got a big VMware farm put in.  And one of the things that I had hoped was that that six-week lead time for getting a server was gonna go down because, of course, "in VMware you just click and you make yourself a new one," right?  Well, the reality was you would still put in a request for a new server, and it would still have to go through procurement because, you know, somebody needed to buy the Red Hat license or whatever.

And then you'd file a request, and the VMware team would get around to provisioning the VM, and then you'd file another request, and the Unix or the Windows administration team would get around to provisioning an OS on it.  And it still took about a month, right, for something that takes 15 minutes when the sales guys do the VMware demo.  And at that point, because the excuse of "we had to buy hardware" was gone, it became a lot clearer that no, the problem is our processes.  The problem is that we're not valuing agility over a lot of these other things.

And in general, we infrastructure teams specifically organized ourselves almost to be antithetical to agility.  It’s all about reliability and cost efficiency, which are also laudable goals, but you can’t sacrifice agility at their altar (and don’t have to).  And I think that's what a lot of people are starting to see.  They get this new technology in their hand, and they're like, "Oh, okay, if I'm really gonna dynamically scale servers, I can't do automation later.  I can't do things the manual way and then, eventually, get around to doing it the right way.  I have to consider doing it the right way out of the gate”.

I say this as somebody who about 15 years ago chose system administration over development.  But system administration and system administrators have allowed themselves to lag in maturity behind what the state of the art is. These new technologies are finally causing us to be held to account to modernize the way we do things.  And I think that’s a welcome and healthy challenge.

 

Monday
March 22, 2010

Videos: Jesse Robbins, Ezra Zygmuntowicz, Colleen Smith at Cloud Connect 2010

Here's another round of "3 Questions" interviews that I shot at Cloud Connect 2010 in San Jose, CA on March 17, 2010.
 
Jesse Robbins (Opscode / Chef), Ezra Zygmuntowicz (Engine Yard), and Colleen Smith (Symantec) were asked:
1. What brought you to Cloud Connect?
2. What aspect of the Cloud excites you the most these days?
3. Wildcard question!...
 
Jesse Robbins is the CEO of Opscode and one of the creators of Chef.
Wildcard question: How does "infrastructure as code" unlock the promise of the Cloud?

 

Ezra Zygmuntowicz is a Senior Fellow and Co-Founder of Engine Yard.
Wildcard question: What tooling changes are needed to make DevOps a reality?  

 

Colleen Smith is an Information Technology Architect at Symantec.
Wildcard question: How will Clouds impact the culture of internal enterprise IT?

 

Thanks to all for playing along!

Wednesday
February 24, 2010

Q&A: Erik Troan on the role of version control in Operations


"Just logging into a box and changing the software stack or its configuration is an idea whose time has passed."

                                                                  -Erik Troan 

 

I recently spoke with Erik Troan about a topic he's passionate about: bringing version control concepts and tools to IT Operations. Below is the lightly edited transcript of the highlights of the conversation. 

 

Erik Troan is the Founder and CTO of rPath. Erik previously served in various roles at Red Hat including Vice President of Product Engineering, Senior Director of Marketing, and chief developer for Red Hat Software. You might also know him as one of the two original engineers who wrote RPM.

 

Damon:
You were one of the original engineers who wrote RPM. Since your RPM days, how has your thinking about the role of version control in operations evolved?

Erik:
RPM was originally written to let two guys in an apartment build a Linux distribution without losing their minds. We really focused on the build side. Pristine source and patches were very important to us. The idea of sets of packages was not. Probably the biggest shift as I started working with large companies and large deployments -- originally at Red Hat and now with rPath -- was realizing that package management wasn't about moving around individual packages; it was about installing a consistent set of software on machines and making sure those machines stayed consistent. So that's where this idea of strong system version control came from.

It's interesting to have a version of a package, but it's more interesting to have a version of a deployed system. One helps the developers at a Linux distribution shop; the other helps systems administrators out in the wild. When you look at how RPM dependencies work, dependencies are solved against whatever packages happen to be newest (which can change tomorrow), while tools in the version control universe solve dependencies against a versioned repository, with richer and finer-grained dependency discovery and resolution. When you look at version control from that perspective, you get a sysadmin tool rather than a developer tool.

Version control for system definitions is one piece of the scheme -- you still have to have configuration, orchestration, and processes around all of that. But version control is a great underpinning for all of it, because your orchestration tools and your configuration tools get smarter when you have consistent sets of software on client boxes and a consistent way to maintain those over time. It's the difference between Puppet making sure Apache is installed and it making sure the right version of Apache is installed, along with the system components that Apache has been validated against. You put Puppet policies into a version control system; shouldn't configuration versioning be matched by a version control system for provisioning software?
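
A toy sketch of the distinction Erik is drawing -- "newest wins" dependency resolution versus resolution against a pinned, versioned manifest. The package names and version strings are invented, and this illustrates the idea, not rPath's implementation:

    # "Newest wins": the answer changes whenever a newer build lands in
    # the repository. Pinned manifest: the answer is fixed by the manifest
    # revision, today, tomorrow, and on the 1,000th machine.
    available = {"httpd": ["2.2.3", "2.2.9", "2.2.15"]}  # repo contents today
    manifest_v42 = {"httpd": "2.2.9"}                    # versioned system definition

    def resolve_newest(pkg):
        return available[pkg][-1]

    def resolve_pinned(pkg, manifest):
        return manifest[pkg]

    assert resolve_newest("httpd") == "2.2.15"               # can change tomorrow
    assert resolve_pinned("httpd", manifest_v42) == "2.2.9"  # cannot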

 

Damon: 
From a modern software development point of view, keeping everything in version control is almost a foregone conclusion. In IT Operations, however, version control is fairly new and remains an underused strategy. Why do you think that is?

Erik:
I think that version control is a response to complexity. If you go back 30 years, who used version control for software development? The guys who built UNIX used it. People at IBM used it a little bit. But most people had a directory full of source code and they just lived with it. In the 80's, source code became complicated. Projects became bigger. All of a sudden you had a whole bunch of fingers in the pot and things were changing all the time. You needed the ability to understand what was going on. To track what was going on. To do bisection when something broke: when did it break? Which patch broke it? All of this really just arose out of the complexity of source code development. 

I think you are seeing the same thing happen in the IT arena. Again, if you go back to the 80's, minicomputers had just come out and people only had three computers to maintain. That's just not that hard. Then the 90's come along and people had 25 or 30 Sun machines to maintain. In today's world, people have tens of thousands of machines to maintain. I don't talk to very many companies now who have fewer than 5,000 machines. They have them in cloud environments, datacenters, virtual environments -- just the sheer complexity and scale of that is making version control an important part of the process. It lets you ask questions like "how is this machine configured today?", "how is it going to be tomorrow?", "how is it different than it was yesterday?". It adds that time dimension to systems management so you understand where you are, how you got there, and where you are going next. That's really what version control is all about. And then of course the ability to go backwards -- if something breaks, how can I undo that and get back to something that worked?

Just logging into a box and changing the software stack or its configuration is an idea whose time has passed.

 

Damon: 
What is the value of using version control in an operational environment from an individual's point of view? What's the value from an organizational point of view?

Erik: 
One of the things I should emphasize is that when we talk about version control we are really talking about a version control system -- the whole system for how you do version control. You can think about it like CVS or Subversion. As an individual systems administrator, I can log into a box and see what version of what software is on the box, which you can do with something like RPM. But I should also be able to see what version of the entire manifest is on there. So instead of just versioning individual packages, you've versioned a set of packages. If box #1 and box #2 are running the same manifest version, then I know that the boxes are probably the same. Having the version numbers of 1,000 RPMs is information overload; reducing that to a single version number for the complete system manifest simplifies things and makes them comprehensible.

But you also want to do things like tell how a system differs from the version that is supposed to be on there. So you can go to the box and say: show me how you are different from the manifest that is supposed to be on this box. Or, there's a new manifest -- show me how this system differs from that new manifest. The other thing you can do is say this box is broken -- it was working yesterday but it is not working today -- let me look at my version history and know exactly what has changed. What packages have changed? What configuration files have changed? Who made the changes? This box needs to be online right now, so let me undo those changes -- go back in time and put the box back to how it was. So this idea of being able to examine forward, move backwards, and having all of your configurations and everything else under a version control system is really a strong tool in a sysadmin's quiver.
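
Here's a small sketch of that "diff the box against its manifest" idea, treating the whole package set as one versioned artifact. All of the data is invented for illustration:

    # Diff a machine's installed packages against its versioned manifest.
    manifest = {"openssl": "0.9.8e", "httpd": "2.2.3", "php": "5.1.6"}
    installed = {"openssl": "0.9.8k", "httpd": "2.2.3"}

    def diff_against_manifest(installed, manifest):
        drifted = {p: (installed[p], want) for p, want in manifest.items()
                   if p in installed and installed[p] != want}
        missing = sorted(set(manifest) - set(installed))
        extra = sorted(set(installed) - set(manifest))
        return drifted, missing, extra

    drifted, missing, extra = diff_against_manifest(installed, manifest)
    print("drifted:", drifted)  # {'openssl': ('0.9.8k', '0.9.8e')}
    print("missing:", missing)  # ['php']
    print("extra:", extra)      # []

Undoing a bad change then amounts to applying the previous manifest revision rather than reverse-engineering what happened on the box.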

On the enterprise or departmental level, this really becomes valuable by enabling automation. The Visible Ops handbook talks a lot about the idea of a Definitive Software Library, which is a versioned repository where all of your software comes from. The reason it does that is that, for all of the software artifacts going onto machines in your infrastructure, you want to know where they came from and how they got there. You want to be able to go back and say, "I need to know everywhere this version of SSL is running, because there is a security hole in the library." Version control systems let me ask that kind of question. For automation, version control becomes critical. If you are deploying 1,000 machines, you're not deploying 1,000 individual machines; you're deploying 1,000 cookie-cutter machines. If they are all the same, then that definition ought to be in a version control system so you can view it over time and understand how it has changed -- from yesterday to today to tomorrow. Not to mention deploy another 1,000 next week that are exact replicas of the last 1,000.

Version control systems also bring structure and predictability to the release lifecycle. Systems can be easily and consistently recreated as they move through the development, test, and production stages, and that eliminates the risk of configuration drift along the way.
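
The audit question above -- "where is this version of SSL running?" -- turns into a simple lookup once every machine's manifest lives in a versioned repository. A sketch with invented hostnames and versions:

    # Fleet-wide query over versioned manifests: which hosts run a
    # vulnerable package version?
    manifests = {
        "web01": {"openssl": "0.9.8e", "httpd": "2.2.3"},
        "web02": {"openssl": "0.9.8k", "httpd": "2.2.3"},
        "db01":  {"openssl": "0.9.8e"},
    }

    def hosts_running(package, bad_versions):
        return sorted(host for host, pkgs in manifests.items()
                      if pkgs.get(package) in bad_versions)

    print(hosts_running("openssl", {"0.9.8e"}))  # ['db01', 'web01']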

 

Damon:
Like all ideas in technology, I'm sure there are those who will come out on the other side of this issue. What are some of the arguments you hear against the idea of advanced usage of version control in operations or objections as to why it won't work in their specific situation? 

Erik: 
There are really two arguments.  One is "that's not the way we do it now... what we do now is OK... it's too hard to change and we just don't want to do it." I get that a lot -- just that inertia -- people don't want to change their systems. The other argument you get is "every machine we have is different, so we can't make our machines the same... every artifact is different, so why would you version control them?" My answer to that is: if you have 1,000 machines in your business and they are all different, then you are doing something extremely wrong. Your version control system can help you understand why they are different. It can represent why they are different. But it can also help you eliminate those differences.

And then there is also just the prioritization -- "how much of a priority is this?" and "do we have other things we need to solve today?" It's unusual to talk to anyone who says that using version control to deploy systems is a bad idea. It's just finding the people who feel like it's a problem they have to solve right now because they have a compliance issue, or a repeatability issue, or an automation issue. In IT, it's the problems that get fixed. You don't do a lot of things without a problem to solve.

 

Damon:
How would you go to a company's executive-level management and explain that they need to be dedicating resources to bringing strong version control practices to their operations?

Erik: 
A lot of this comes down to cost control or reduction. Everywhere I look the number of servers under management is growing. Virtualization is putting four managed instances on each physical box. Cloud technologies are provisioning new instances in thirty seconds; each of those needs to be managed. Most shops try to hold a constant ratio of machines per sysadmin. If you let the number of instances grow by five times thanks to virtualization and cloud, are you planning on growing your IT team by five times? If not, you better automate everything you can see. Complete system version control makes system administrators more efficient.

There are two external drivers as well. One is risk: when systems are being hand-assembled and configured -- you'll see a Puppet, a Chef, or a cfengine used in a pretty small percentage of enterprises -- how do you know they are going to work tomorrow? If you lose an employee, is the knowledge that person had about how a system was put together -- and why it was put together that way -- captured anywhere? Or is it just in their head? Version control systems capture that information.

Compliance, such as security compliance standards, is another great motivator. You can have external compliance requirements from a standard like PCI or Sarbanes-Oxley. Can you audit your systems to know that they are right? If all of your systems are hand-assembled, then you don't even have a definition of what correct looks like. So what are you measuring against if people are just allowed to log into a box and change things willy-nilly? For example, with PCI compliance -- the standard you have to adhere to if you are going to hold onto credit card numbers for subscription billing -- you are going to have auditors come in and ask, "Where did this package come from?", "Who put it there?", "Why is it there?", "How do you get updates on there?", "How do you know which boxes need updates?" Those are the kinds of questions that can be answered by a good version control system. But if you have hand-assembled boxes, then they are pretty much impossible to answer.

Friday
February 5, 2010

Videos: Michael Coté, Travis Campbell, Erica Brescia, Andrew Shafer at OpsCamp Austin 2010

Here's the second round of "3 Questions" style interviews I filmed at OpsCamp Austin 2010. For the first round, go here.

Michael Coté (Redmonk), Travis Campbell (University of Texas / LOPSA), Erica Brescia (BitRock / BitNami.org), and Andrew Shafer (Agile Infrastructure developer and speaker) were asked:

1. What brought you to OpsCamp?
2. What excites you in 2010?
3. Wildcard question!... 

 

Michael Coté is a noted analyst and blogger/podcaster at RedMonk, covering primarily enterprise software.
Wildcard question: How can open source projects better interface with analysts?

 

 

Travis Campbell is a Senior Systems Administrator at UT Austin and is on the board of directors of LOPSA.
Wildcard question: What does the HPC community think of "cloud computing"? 

 

 

Erica Brescia is the CEO of BitRock (sponsors of BitNami.org).
Wildcard question: How does BitRock's unique legacy set it up for success today? 

 

 

Andrew Shafer is a blogger, speaker, and developer currently focused on Agile Infrastructure. 
Wildcard question: What advice would you give people looking into the topic of Agile Operations? 

 

 

Wednesday
February 3, 2010

Videos: Luke Kanies, Bill Karpovich, Ernest Mueller at OpsCamp Austin 2010

Here's the first round of "3 Questions" style interviews I filmed at OpsCamp Austin 2010.

Luke Kanies (Reductive Labs / Puppet), Bill Karpovich (Zenoss), and Ernest Mueller (National Instruments) were asked:

1. What brought you to OpsCamp?
2. What excites you in 2010?
3. Wildcard question!...

 

Luke Kanies is the CEO of Reductive Labs and the original author of Puppet.
Wildcard question: How did you enable the Puppet Community's early success?

 

 

Bill Karpovich is the CEO & Co-Founder of Zenoss.
Wildcard question: Where is the innovation happening in monitoring?

 

 

Ernest Mueller is a Web Systems Architect at National Instruments.
Wildcard question: What advice do you have for open source toolmakers?

 
