Blog - dev2ops - Solving Large Scale Web Operations and DevOps Problems

Friday

Aug312007

Dev sees the world one way while Ops sees it a different way

Damon Edwards |

Friday, August 31, 2007 at 1:39AM

There is an old saying that if all you have is a hammer, everything begins to look like a nail.

Much of the dev2ops problem comes from Dev seeing the world one way while Ops sees it a different way. When it comes to developing, deploying, and supporting what the business owners would call an application, these two world views often spectacularly collide.

To the dev folks, their view of the world is all about the build (and its related dependencies) along with data/content requirements (schema, standing data, catalog/user data, web assets, etc...). They see the application from the inside looking out, fully aware of its internal components and inner workings. The lifecycle they care about begins with business requirements, continues through a set of related builds, and then on to an integrated running service. Their goal is to then promote this service from one dev and test environment to another until it becomes someone else's problem (that's usually the ill-fated handoff to ops). Supporting, and in many ways enforcing, this point of view are the day-to-day Dev tools: software version control systems, build tools, IDEs, requirements management tools, etc.

To the ops folks, their view of the world is all about the physical architecture, the network, and the box. Applications? Ops doesn't want to know about the the inner workings of the applications. They see applications from the outside looking in, as just another set of files that are part of the puzzle they manage. Environments? Sets of distinct boxes wherever possible (sometimes divided by datacenter, VLAN, racks, etc...). The lifecycle Ops cares about takes a much different form than the one that Dev cares about. Quality of service, capacity, and availability are the driving factors for Ops. Ops establishes the needed hardware and software platform profiles in a lab setting and then uses those profiles to build and refresh the different environments they control (generally staging, production, and disaster recovery). The preferred Ops toolsets, usually EMS systems (Tivoli, Opsware, OpenView, OpenNMS, BladeLogic, etc..), support and enforce this point of view.

Unfortunately, aligning these two world views is just as difficult as aligning their respective toolsets.

It is all too common that users (and by extension, vendors) from both groups want to force their tools and point of view on the other group. But no matter which direction the management mandate ultimately comes from, the truth lies just beneath the surface. Dev wants nothing to do with the Ops tools and Ops wants nothing to do with the Dev tools. They see little value in the other group's tooling of choice because they think it will add additional complexity to their lives and won't help them to complete their core jobs. Because of these dramatically differing points of view, the conversations in the trenches will often revolve around how it's "the the other group's fault" and why "they just don't get it".

Fundamentally, the Dev and Ops groups are at odds because there is a difference of primary concerns. Dev doesn't want to know about networks and boxes. In most situations Dev really don't even really care where something is deployed until they get a late night phone call from Ops to debug a mysterious problem. Ops, on the other hand, just wants "operations ready" application releases and are frustrated that they never seem to get that.

Sadly, too much corporate blood has been shed over a collision of differing points of view that are both equally valid.

2 Comments | |

Friday

Aug242007

dev2ops goes wrong: familiar story in the trenches

Alex Honor |

Friday, August 24, 2007 at 2:29PM

Being a consultant I get a wide exposure to real life dev2ops experiences at both large and small companies and usually get called in the aftermath of a bad deployment. Businesses of course accomplish dev2ops in their own ways but run into trouble as their applications and operations become more complicated. I always like to start by listening to the key dev2ops individuals to get the story from the person on the ground. These key dev2ops guys recount something like this:

The big application upgrade was scheduled for 8pm but ended up being at 11, due to some last minute bugs found in QA. After the fixes were committed to the version control system, new software builds were created. These builds were a bit different from what had been discussed in the pre-deployment planning meetings. They also included some new configuration rules and relied on some manual tweaks outside of the normal procedure.

Last minute changes were understandable, since the development team was rushing to meet a project deadline and had been working within an aggressive and compressed time frame. Unit development had been moving along steadily but the last stage integrations were problematic and relied on senior developers to work out kinks in order to get the necessary disparate builds working in QA.

By 12am, the production updates were well underway. There are a lot of machines and files to distribute to them, along with various system commands that need to be run. The normal update process is semi-automated but still very intensive. One must be very careful when performing these updates -- one misstep can slow down the whole upgrade process. After all the binary packages were installed, and SQL scripts executed, manual tweaks performed, all parts of the application were restarted to bring the whole site to use the new code and configuration. Subsequent testing of the application uncovered that some new features did not work. The cause was not apparent and suggested faulty application code. A key developer diagnosed the issue, finding the hastily written manual procedures missed a few configuration file changes. The missed edits were applied, the application was restarted again and this time testing confirmed all new features were working. By 2am, all user load was directed to the general production environment, and everybody involved with the upgrade went home. The night shift in operations would monitor the application and let everyone know if any issues arose.

At 7am, users started logging into the application and began conducting transactions. As more logged in, random errors began to occur. Users begun reporting error messages in their web browsers. Operations noticed load spiking on the production machines. The application's quality of service continued to deteriorate to the point where the problem was escalated to the key individuals involved during the night deployment.

By 7:30am, operations was troubleshooting the problem within the system and network layers while development was logged into several production machines looking at application log files. At 8am, the whole application appeared to degrade into a giant, unresponsive CPU and network consuming monster. System administrators observed full process tables. Network admins discovered unresponsive ports. Developers read off application stack traces. All able hands on deck were requested to join a bridge conference call and visibility of the problem had made its way to the "C" level management ranks. Everyone began asking each other what might be the root cause. Was it a change in the new software or some other possibly unrelated change?

The development group was not the only busy team during that past few weeks. The operations team was making various updates to the infrastructure, some in preparation for the big application update, others for good proactive maintenance. These changes -- new firewall rules, operating system updates, security patches, and web server configurations -- were scheduled and implemented at different times and none had a negative impact. But, each one would now be suspect.

At 9am, system administrators and developers were logged into production machines, furiously hacking configuration files, undoing bits and pieces of the new application, and restarting the application's server processes. Caution was thrown to the wind, as those attempting to resolve the problem made ever more daring steps to restore service. Application response sporadically improved but the whole system did not become stable.

Finally, at 9:30 management makes a dreaded yet inevitable call: back out the new application code and bring the site back to how it worked the day before. Naturally, everyone was reluctant to attempt reverting back to the old versions because it's a tricky and intensive procedure under normal circumstances. On the other hand, application quality of service was so bad, what would be the real user impact?

Around noon, after two and a half hours of mammoth effort, the whole production environment was running on the old versions and the site was operating normally. 5 hours of business service had been interrupted. Postmortem meetings held by management revealed that the planned upgrade was affected by an incompatibility between a newly patched system library applied to production machines that differed from what is used in development and QA machines. Further, during the troubleshooting, someone discovered that several machines appeared to have incorrect configuration files. Finally, the analysis highlighted that backing out an application change was extremely difficult, error prone and took much too long.

I have heard this kind of story at big and small companies by people that subscribe to formal methodologies and from those that eschew them. Sometimes the severity and scope of the problems vary, some cases there is little to no user impact while others it's a total site outage.

Anyone that has experienced a nightmare like this knows it is quite hair raising, a good cause of burnout and certainly tests the morale and team play in any IT organization. Reflecting on the chain of events, one realizes that the problem does not just boil down to a lack of project management, administrative policy, nor technology. But, there is an obvious lack of coordination, coordination in the most general sense. How an IT organization aligns itself to directly support the dev2ops process certainly is a factor in avoiding deployment nightmares.

1 Comment | |

Thursday

Aug232007

What makes dev2ops so hard anyway?

Alex Honor |

Thursday, August 23, 2007 at 3:38PM

Migrating application changes from development to production was simpler when systems were centralized, target environments homogeneous, and the whole process needed little or no automation. But today's business applications have evolved to run 24/7, use distributed architectures, are hosted in multiple environments and the process to change them has become increasingly more complicated and very hard to automate.

I have noticed several factors that add up to making dev2ops technically very challenging in both large and small organizations:

Rate of change: Many businesses depend on driving change from development, through test environments and on to production operation, on a daily basis (sometimes even more!). Worse yet, some things need to change faster than others. For example, content and data need to change constantly while core technologies like platforms less frequently. No matter what, the business desires the ability to update any and all parts of the application when needed. Can this be done predictably and reliably?
Nature of change: What comprises application change varies, too: data, code, configuration, content, platforms. Each kind of change, has a different impact on the overall stability of the application. Does the organization understand change is not created equally?
Complexity: Applications supported by distributed architectures and multiple hosting environments do not present a simple interface to widespread nor pinpoint modification. Understanding how to make changes at each point and then figuring out a sane way of coordinating change across the board is tricky. Who knows what connects to what and the best way to change the application over time?
Organizational boundaries: Who performs dev2ops? Developers, operations, both or other? The application development is carried out by developers so are they best suited to install it and set it up in a production environment? Operations understands how to maintain and scale infrastructure so should they take apart the application and deploy it the way they best see fit?

Reflecting on these challenges organizations face, it's amazing that the dev2ops process works at all! In the end, its accomplishment usually falls on a few dedicated, knowledgable individuals, working long, intense hours at the worst times (ie, late nights and weekends).

Does dev2ops have to be an "all hands on deck" scenario or can best practice and better tools bring efficiencies and reliability?

1 Comment | |