Friday, May 22, 2015

Teflon Man

He's everywhere. Everyone who's had a job has probably run into this guy, and maybe been as frustrated as I am with his amazing ability to deflect all responsibility.

Let me tell you about my latest run-in with this Super Non-Hero.

I work in tech. We have a platform on which we process log lines from a web server. They go through a product called syslog, which in turn writes those logs to a central storage unit. All of our environments (Development, Staging, Production) go through this syslog server and write to that storage. Teflon Man decided this was hurting our production environment, slowing things down because access to the shared storage was getting slow, so he had a plan to change the syslog service to send the log lines from every environment except Production to another storage server.

During the workday he changed the configuration file for the syslog server, essentially stopping it from processing anything except the production log lines. He did not, however, turn off the services in the other environments, so they were still sending log lines to syslog; syslog just wasn't processing them. This meant the syslog server was building up a queue of those files.

That evening he went in and created new storage areas for the other environments. He then turned syslog back on with a configuration change pointing those log lines to the new storage. Syslog started to process all of the backed-up log lines while also trying to process the current incoming production log lines. We started falling behind because syslog couldn't keep up. The thing is, no one knew he'd done this evening work. He didn't call anyone, didn't send out an email, nothing.
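If you want a feel for why that hurts, here's a back-of-the-envelope sketch. Every number in it is made up (I'm not quoting our actual rates), but the shape of the problem is real: once there's a backlog, the only thing draining it is whatever headroom the pipeline has over the live production traffic.

```python
# Back-of-the-envelope: how long a log backlog takes to drain.
# Every number here is hypothetical -- just to illustrate the shape of the problem.

backlog_lines = 50_000_000      # lines queued up while syslog wasn't processing
production_rate = 9_000         # live production lines/sec still arriving
processing_capacity = 10_000    # lines/sec syslog can actually push to storage

headroom = processing_capacity - production_rate    # spare capacity left for the backlog
drain_seconds = backlog_lines / headroom

print(f"Headroom: {headroom} lines/sec")
print(f"Time to drain the backlog: {drain_seconds / 3600:.1f} hours")
# With only 1,000 lines/sec of headroom, 50M backlogged lines take roughly 14 hours
# to clear -- and if capacity ever dips below the live rate, the lag only grows.
```

Pause the consumer for an afternoon and you've signed the evening up for that math.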

I was on-call, so I got an alert about the lag. I checked a bunch of system logs to see if I could figure out what the issue was. We were clearly having problems, but it was none of the usual suspects. Eventually I called Teflon Man and he said, "Oh, I did this thing." He explained what he did (see the previous paragraph) and I told him to go back and change it because it was fucking up production. He did so.

Here's where the accountability issue comes in. This morning I spoke with him and identified 3 things that I thought were problems:


  1. That we did not have monitoring that helped us quickly pinpoint the problem. This was a team issue, not his fault alone, but I thought it was something we should work on because our time to resolution was pretty long. This was a no-blame point; when monitoring fails, that's on the whole team.
  2. That a production change was made at night without a heads-up to anyone.
  3. That he didn't monitor after his change to verify that everything was working smoothly. 

He was able to agree with the first point because it was a team issue; he didn't have to take any personal responsibility for that failure. The other 2 points, though? Oh, he had answers for those, and at no point did he accept that he might have done something wrong. Regarding making a production change at night, he said, "Well, it wasn't a production change really. I was only changing Staging and Dev." The change was made on a production box; what part of that means it wasn't production?

On the third point he had the same argument. He had checked to make sure syslog came back up after he restarted it (which was necessary for the configuration change to take effect) and that was it because, again, he didn't think he was affecting production. Ummm...you had to restart a service that pulls production log lines. How is that not affecting production?! 

Ultimately he blamed it on the technology: syslog isn't robust; syslog couldn't keep up. It also turned out that after he restarted syslog, it was still backed up with the logs from the other environments because the permissions on the new mounts weren't correct, so it couldn't write to the new storage location. Who do you think created those new storage locations? He did. You'd think he'd have checked that the logs were actually landing where they were supposed to, but no.
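This is the part that kills me, because it's maybe a thirty-second check. Something like the sketch below (the mount paths are hypothetical, and you'd want to run it as whatever user syslog writes as, not as yourself) would have caught the bad permissions before the cutover:

```python
import tempfile

# Hypothetical mount points for the new non-production log storage.
NEW_MOUNTS = ["/mnt/logs-dev", "/mnt/logs-qa", "/mnt/logs-staging"]

def can_write(path: str) -> bool:
    """Try to create (and automatically delete) a temp file under `path`."""
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        return False

for mount in NEW_MOUNTS:
    if can_write(mount):
        print(f"{mount}: writable -- OK to point syslog here")
    else:
        print(f"{mount}: NOT writable -- fix ownership/permissions before the cutover")
```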

He had excuses for everything: excuses for why he hadn't monitored production after making the change to make sure we were still processing properly; excuses for why syslog wasn't working; excuses for why he hadn't checked the permissions. At no point did he accept responsibility for his part in the problem. I'm not asking for some kind of groveling; I mean a simple, "My bad." I make mistakes and I own up to them. This dude? Nope. The email he sent this morning went like this:

It appears that syslog was behind transferring production logs due to issues with processing the dev, qa and staging logs. After disabling the non production environments in syslog the production files started transferring normally and the pipeline is now catching up and lag is now decreasing.

Nice generic email that at no point explains that it was his fault. My emails, when I fuck something up, say things like, "I did x and this happened. I fixed it."

Lest you think I'm not a team player, let me assure you that's not the case. When I gave a post-mortem to my boss this morning, I didn't say, "Teflon Man did this." I said, "We made changes yesterday" and "last night the config was changed." I kept it vague, I kept it team. No one is trying to throw anyone under the bus. But within the team, when you've caused the on-call person grief, you say, "Whoops, my bad" and own up to your shit.

Not Teflon Man.

This is especially grating because I've had a shitty week, one that started off with a botched firewall deployment that was my fault, and I got grief about it. Direct grief, not vague team grief. I owned up to it with the boss and my team. But this freaking guy right here? Nope. And my boss won't hold him accountable in any way. He found out that Teflon Man made the changes, and his response was, "I'm concerned that we didn't troubleshoot it correctly, that it took us longer than I'd have liked to figure out what the problem was." Really? That's what you've got? Someone made a change to the production system sans warning, sans input from others, and you're focusing on how long it took us to figure it out? That's all you've got? Not that maybe, if we'd known about the change, it would have been the first thing we investigated?

Ugh.