Thursday 28 February 2013

Sometimes it takes a chupito

Sometimes a bug goes to production, and it happened to us last week. When this happens, it is usually because of a chain of events and circumstances that coincide and allow this particular bug to reach production. As there is always human interaction involved, there is also always somebody to blame. Here is how we handle blame inside our team.

I was testing a new feature in my testing environment and I noticed that a button in one of the queues was not working at all: I clicked and nothing happened. No action, no error, no console message... nothing.

My Capybara tests were also having trouble with the button: they could not find it and were failing. As this button was not related to the feature I was testing, I decided to look at it later.

Once I got some time to look at it, I saw that three features had been merged into this environment, so one of them had to contain the bug. I did a full deploy of the first story, that is, the master branch plus the first feature deployed on the environment, and noticed that the button was not working.

I went to the developers and explained what I had just done. 'No way, this feature and the failing button are not related,' one of them said.

So I deployed the second story, and the button was still not working. The bug was not in any of the features: it was in the master branch, which we had deployed to prod one day earlier.

The second thing I thought was "How did this happen at all?" and just then, an operator came to tell us that a button was not working on a queue.

So, there was a bug in production, because of some JavaScript code.
And we don't write unit tests for all our JavaScript code. We should, but we don't.
Because this was a small release, I did not run all my integration tests. I just ran the smoke tests, because we wanted to deploy fast and those should have been enough. If I had run the complete suite, I would have caught this bug.

So, as shit happens from time to time, we have a protocol for fixing things.



First, both developer and tester take a chupito (a shot). Not too strong, 'cos we still need to fix things, but as a shared act of accepting responsibility: it is our debt to the product and the team, and we pay our debts. (Skol!)
While doing this, we talk with the team about what just happened, what the fix is and how to keep it from happening again.

Then the dev fixed the JS thing and I changed the smoke suite, so a bug like this won't make its way out to prod next time.

As a medium-term solution, we need to create unit tests in JS, separate from the integration tests. We also watched this talk from +Amy Phillips, and she shed some light on the way forward. Not that we didn't already have an idea, but her talk has helped us put our priorities in place. Speeding up the deploy process and decoupling testing from deploying are going to be our next goals.

Back to the blame: the chupito thing is our way to celebrate our failure, and to avoid blame wars and finger-pointing at anybody because a bug made it to production. We just isolate, celebrate, fix and deploy every bug we find. Some make their way through faster, others take some time, but we don't keep a bug count separate from our pending features. As +Antony Marcano pointed out, a bug tracking system is nothing but a hidden backlog.


