Thursday 28 February 2013

Sometimes it takes a chupito

Sometimes a bug goes to production, and it happened to us last week. When this happens, it is usually a chain of events, of situations, happening at the same time that allows this particular bug to reach production. As there is always human interaction, there is also somebody to blame. Here is how we handle blame inside our team.

I was testing a new feature in my testing environment and I noticed that a button in one of the queues was not working at all. I clicked and nothing happened. No action, no error, no console message... nothing.

My Capybara tests were also having trouble with the button: they could not find it and were failing. As this button was not related to the feature I was testing, I decided to look at it later.

Once I got some time to look at it, I saw that there were three features merged on this environment, so one of them should contain the bug. I did a full deploy of the first story, that is, the master branch plus the first feature, and noticed that the button was not working.

I went to the developers and explained what I had just done. 'No way, this feature and the failing button are not related,' one of them said.

So I deployed the second story and the button was still not working. The bug was not in any feature: it was on the master branch, which we had deployed to prod one day earlier.
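The isolation steps above can be sketched in a few lines of Ruby. This is only an illustration, not our real tooling: `locate_bug` and the `button_works` check are hypothetical names, and the check stands in for "deploy master plus these features and click the button". The sketch also checks master on its own first, which is a variation on what I actually did.

```ruby
# Hypothetical sketch: locate_bug and the button_works block are illustrative names.
# The block receives the list of features deployed on top of master and
# returns true when the button works in that deployment.
def locate_bug(features, &button_works)
  # If master alone is already broken, no feature is to blame.
  return :master unless button_works.call([])

  # Otherwise try each feature on top of master, one at a time.
  culprit = features.find { |feature| !button_works.call([feature]) }
  culprit || :unknown
end

features = %w[story-1 story-2 story-3]
# Simulates the situation in this post: the button is broken whatever we deploy.
always_broken = ->(_deployed) { false }
puts locate_bug(features, &always_broken) # prints "master"
```

Checking master first is cheap when the button fails everywhere, and it is the step that would have pointed at production sooner.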

The second thing I thought was "How did this happen at all?" and just then, an operator came to tell us that a button was not working on a queue.

So, there was a bug in production, caused by some JavaScript code.
And we don't do unit tests for all our JavaScript code. We should, but we don't.
Because this was a small release, I did not run all my integration tests. I just ran the smoke tests, because we wanted to deploy fast and those should have been enough. If I had run the complete suite, I would have caught this bug.

So, as shit happens from time to time, we have a protocol for fixing things.



First, both developer and tester take a chupito. Not too strong, 'cos we still need to fix things, but as a shared act of accepting responsibility: it is our debt to the product and the team, and we pay our debts. (Skol!)
While doing this, we talk with the team about what just happened, what the fix is and how to avoid this happening again.

Then the dev fixed the JS thing and I changed the smoke suite, so a bug like this won't make its way out to prod next time.

As a medium-term solution, we need to create unit tests for our JS and move those checks out of the integration tests. We also watched this talk from +Amy Phillips, and she shed some light on the way to follow. Not that we didn't already have an idea, but her talk helped us put priorities in place. Speeding up the deploy process and decoupling testing from deploying are going to be our next goals.

Back to the blame: the chupito thing is our way to celebrate our failure, to avoid blame wars and finger-pointing because a bug made it to production. We just isolate, celebrate, fix and deploy every bug we find. Some get through faster, others take some time, but we don't keep a bug count separate from our pending features. As +Antony Marcano pointed out, a bug tracking system is nothing but a hidden backlog.



Thursday 7 February 2013

The Deploy

As you might know, I am the tester guy at peerTransfer. I find myself embedded in the development team. This time I want to tell a story about a deploy, a nice one, a big one.

We usually write short stories about what needs to be done, in the usual form: as a biker riding my bike, I want my bike to go faster so I can reach my destination in less time.



We create a GitHub issue with such a story, and to give visibility to the whole company we use a Kanban board to write these stories down. Then we start a conversation about details: do we mean faster on a straight line or around corners, are we still going to stop at the traffic lights, do we ride by night when there is less traffic... details that give us context about what the problem is and how we might solve it.

This time, we needed a deep refactor of the operations queues. Our company backend has basically three steps: we collect money from our users, we move it from account to account and we pay our schools. This is our business, this is what we do.

The refactor was about taking logic out of the daily operations and creating a more complex setup to automate them, aiming to decrease the effort we spend performing such operations. As a biker, I want this bike to go faster.

It soon turned out that this story could not be split into several releases. As we needed to make core changes to the site, whenever we deployed we needed to do it all at once, or at least a big part of it the first time. As a biker, I need to use my bike on my daily commute.

So we created an attack team, that is, a team with people from the development, operations and product teams. This commando held the needed meetings to define what exactly we were about to deploy, and then the three devs defined a list of tasks that needed to be done and started coding.



Whenever one of the developers ran into trouble, they paired with another to solve the issue, and if any doubt came up, a chat room with the rest of the attack team helped to clarify how things were supposed to work. We built a new engine in the garage, without pulling anything out of the bike.

At some point, the development branch was ready to be deployed, so we reserved a testing environment for it. We deployed the issue and checked that the happy path was working as expected. Using a beer can as a fuel tank, we started the engine to check how well it was doing.

Time to test. We decided to split the effort: while I was updating the automated test suite with the new features, the ops team member was testing that it was working as expected. To do this, he created a set of tests with examples of usage and checked that all the results were the expected ones.

We found some bugs that were solved and deployed in no time, and we also found some improvements that would be nice to have in the next iterations. After all, this is the first one and for a limited amount of time it is going to be okay to have some rudimentary controls... as long as we build them later. New stories are quite a valid result of a testing session.

Then we met again. I explained what I had automated, he explained what he had tested and what was working. We came up with new questions, we found that another test case would be nice to automate, and then we had a conversation about how the feature could break. We designed new tests to learn what would happen if things were badly configured or what could go wrong. We found out that the move to production would be tricky. We needed to prepare the bike before we changed the engine. We decided the steps to take.

Our new tests found new bugs, so while we automated the last test, the bugs were solved, deployed and tested.

At that point the build was green and we had the definitive +1 from the operations team.



It was time to deploy. We pushed the button and waited for the scripts to run. We deployed to the staging environment... And the deploy failed.

Well, the deploy itself went well, but we needed to run a rake task to set things up and this was failing due to some conflicts in the last merge.

The unit tests were all green but the integration tests failed because of the failure on the deploy. Then we looked for the cause, fixed it and deployed again to staging.

This time the deploy succeeded. But the time window to deploy to production was over, the next day was Friday, and we don't deploy on Fridays.

Uh, well... at least we have a rule that says that we don't deploy on Fridays.

This rule is an agreement the team made at some point in the past. They all agreed that it was better not to deploy on Fridays, to avoid trouble during the weekend. But time has passed since, and now we have a more automated deploy process, with better tests and better monitoring. So now we are more confident about the deploy process. More confident about bending our own rules.

So we deployed on Friday morning, what the hell, this is why we test, we check and we monitor!

And the deploy went fine! Operators started using this feature and now they need less time to perform the same actions. We found a couple of minor bugs once in production, none critical enough to justify the time we would have needed to catch them before release.

We also sent an email to the whole company explaining the new feature, because we like to communicate when we manage to deploy a big story like this one. We like to tell everyone that our bike runs smoother and faster now!

As a tester, I took a look at the requirements, I automated some smoke tests, I helped design and perform tests, and I looked after the deploy process, asking when and how we should deploy without causing damage. All that is testing, and all that has an effect on the quality of the work that we deliver.

We did nice work, we delivered a nice feature called #482, it's time to celebrate!

Credits: 
I found the pic of the Vespa here
The idea of the bike analogy is from this book
The other pics are from our peertransfer office.
The team I work with is a great team!

Letting it go.

I dropped my Twitter account. And somehow, it did not make sense to write a tweet about it. Many years ago, when I was riding by bike as cou...