• On Vanity Metrics Like Crash-Free Session Rate

    App metrics, especially performance metrics, are only useful if they are predictive or actionable.

    By predictive, I mean you can, with varying degrees of certainty, expect changes to other metrics or outcomes if that metric were to change, for better or for worse. If P95 request latency changes for some endpoint, does it matter?

    By actionable, I mean that observing a change in that metric means you can take direct countermeasures to either remedy the regression or somehow mitigate its impact. If your app gets rated by a set of unknown, unvetted, self-selected individuals who give it a star rating between 1 and 5 (*ahem*), would you know what to do if your rating drops from 3.8 to 3.6 month to month?

    While there can be degrees of predictiveness and actionability, if a metric doesn’t provide at least a modicum of value along at least one of these dimensions, you’re probably better off not using it, lest it lead you down the wrong path. These are what I call vanity metrics – they seem good on the surface, but if you drill in, they provide little actual value other than looking good in performance theatre, i.e. the act of working on performance just so you can say you work on performance rather than making actual, measurable impact.

    Sometimes, people measure things because they are easy and they *seem* to be useful. In fact, that’s the criterion behind many of the statistical categories common in sports. Pitcher Wins. Quarterback Wins. Shots on target. Game-winning goals. But a lot of these stats reflect the context around a player or team as much as the inherent performance or ability of said player or team, so assuming they are predictive of future results might lead you to bad personnel decisions.

    If all you have are vanity metrics, I can’t blame you for trying to squeeze some value out of them. That shade I threw at Play/App Store ratings earlier – if you’re an indie dev with no other means of getting user feedback, store ratings may be the only indication you have of whether folks are satisfied with your app. As limited as they are, beggars can’t be choosers. But often, there are better, more useful alternatives if you only looked deeper and didn’t just go with the status quo.

    In the world of mobile client performance, a popular metric is “Crash-Free Session Rate”. But what does that really tell you, in all but extreme cases? If that rate drops for your app from 99% to 98% month to month, what does it actually mean for your users and how they use your app? You can perhaps use correlational analysis to find out if other metrics changed along with the rate drop, providing some clues as to a possible causal relationship between it and those other metrics. Maybe you can even run an A/B test where you induce a change for users in an experiment bucket whereby they crash more often, and observe the difference between it and the control bucket in terms of other metrics.

    But real talk: how many people who pay attention to crash-free session rates have actually applied that level of rigour when trying to assess their impact on users?

    Further, if all you know is the rate, what can you do about it? You need to know the details of the crashes that are causing the rate to go down before you can look to improve it. And if you have the total number of crashes broken down by source, what value would knowing the crash-free session rate buy you?

    Sure, the advantage of normalizing against usage is that you can make more apples-to-apples comparisons between time periods. But is the 99% of last month the same as the 99% of this month, given that not all crashes have the same impact on user experience? If your app crashed at startup and prevented a user from even launching it, that is probably more important than a crash that happens while your app is backgrounded. Again, to know what that 99% means, you’ll need a breakdown of what specific crashes are happening, to figure out the composition of issues that led to 1% of sessions ending with a crash.
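    To make the distinction concrete, here’s a minimal sketch in Python (the session records and crash signatures are made up for illustration) showing how the headline rate and the per-signature breakdown come from the same data, but only one of them tells you what to fix:

    ```python
    from collections import Counter

    # Hypothetical session records: each session either ended cleanly
    # (crash_signature is None) or ended in a crash with a given signature.
    sessions = [
        {"crash_signature": None},
        {"crash_signature": "SIGSEGV @ FeedRenderer.render"},
        {"crash_signature": None},
        {"crash_signature": "OOM @ app startup"},
        {"crash_signature": None},
    ]

    # The headline number: one rate, with no hint of what went wrong or where.
    crashed = sum(1 for s in sessions if s["crash_signature"] is not None)
    crash_free_rate = 1 - crashed / len(sessions)

    # The actionable view: which specific crashes make up that rate.
    breakdown = Counter(
        s["crash_signature"] for s in sessions if s["crash_signature"] is not None
    )

    print(f"crash-free session rate: {crash_free_rate:.1%}")
    for signature, count in breakdown.most_common():
        print(f"{count}x  {signature}")
    ```

    Both views are derived from the exact same sessions; the rate just throws the composition away. Note too that the startup OOM and the feed-rendering crash count identically toward the rate, despite very different user impact.
    
    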

    This is why it’s hard to find a direct relationship between crash rate and other metrics – not all crashes are created equal. Certainly at Twitter, even with the usage that app had, we weren’t able to find even a correlation between crashes and core metrics.

    That’s not to say you shouldn’t work on fixing crashes just because you can’t find a direct impact on other metrics. The inability to find a statistical relationship doesn’t mean they don’t cause real user dissatisfaction. It’s just that the rate at which they happen is often so small that you may not be able to find a statistically significant relationship to other metrics, due to the small sample size.

    Instead, when working to reduce crashes, use more actionable metrics – like the rates for specific crashes that point to the cause. Or look for new crashes introduced by a new app version. Neither of these are predictive, probably, but both are actionable, so they provide value.
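    The second idea – flagging crashes first seen in a new version – can be as simple as a set difference over crash signatures. A tiny sketch, with hypothetical signatures and versions:

    ```python
    # Hypothetical sets of crash signatures observed per app version.
    crashes_v1 = {"SIGSEGV @ FeedRenderer.render", "OOM @ app startup"}
    crashes_v2 = {"SIGSEGV @ FeedRenderer.render", "NPE @ ComposeViewModel.send"}

    # Signatures first seen in the new version are the most actionable signal:
    # they almost certainly point at something shipped in that release.
    new_in_v2 = crashes_v2 - crashes_v1
    print(sorted(new_in_v2))
    ```

    Each signature that surfaces here is something you can triage directly against that release’s changelog – which is exactly the actionability the aggregate rate lacks.
    
    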

    Crash-free session rate, though? For most, that’s a vanity metric, one that is easy to measure but doesn’t provide much value. Use something better when monitoring crashes. As with all vanity metrics, find alternatives that are more predictive and/or actionable.

  • Decoding Gino, 1/?

    Gino Pozzo has been in charge of Watford FC for over a decade. Success came relatively early, culminating in the FA Cup final in 2019, but the last few years have been lean to say the least. His methods have always been curious and foreign to the fanbase, but when they worked on the pitch, we mostly didn’t care and would in fact defend them when they were routinely criticized by outsiders, gleefully doing so especially when the criticism came from reactionary British pundits.

    But since the relegation in 2020, we haven’t needed Martin Samuel to talk shit about Gino and his ways: we have an increasingly vocal group of fans willing to carry the torch. Much of the criticism is merited, if a bit too steeped in recency bias. His seeming stubbornness in sticking to what he knows in a rapidly changing football landscape is perhaps the overarching theme that is hard for even the most ardent Pozzo-Ins to fully refute.

    Having been pretty much silent since he took over, Gino finally spoke to fans directly last month at a fans’ forum that was much more controversial than it really should have been. Coming out of it, the consensus was that he and Scott Duxbury said nothing of much interest, which didn’t surprise me. In a Q&A format like that, where question-askers couldn’t drill in with follow-ups, one that was live-blogged to the rest of the Watford world, how deep could we really get, especially with a man not known to be super open about his philosophy and the reasons behind it?

    That said, when I finally listened to the audio, I was quite surprised at how much I actually did learn. Perhaps not directly from the things that he said, but from the way he said them, the points that he emphasized – and ones he dismissed or glossed over. Basically, it was more informative than I had first thought.

    So what did I learn? I don’t think I can cover it all in one post, but what struck me the most is his conviction, stubbornness if you will, in the process by which he sees his football philosophy getting turned into reality. Let me unpack that a little.

    By process, I mean the methods he puts in place, repeatable elements that he relies on to accomplish short-term goals, that when combined allow him and the club to achieve longer-term goals. To him, the rightness of that process supersedes whether the results achieved actually meet his and the fans’ expectations. While randomness and swings in luck can alter on-pitch results dramatically one way or another, the methods he installs and their application are a lot less flaky if the right checks and balances are put in place.

    This is why, despite many fans’ insistence that he ought to have learned something from the results of the past two relegations, this was never going to happen. At least that’s the impression I got after listening to him speak. The aforementioned appearance of stubbornness, the lack of contrition, comes from rejecting the notion that on-pitch failures are necessarily caused by the mistakes of his process.

    Some would take this as arrogance, that he thinks his process is perfect and that the two relegations in three years are not on him. I actually don’t think he feels that way. While I don’t think he believes that his process is at fault, I think he thinks that mistakes were made in the application of it.

    Finding the right head coach that aligns with his vision of what successful football management looks like is part of the process. He felt Vladimir Ivic and Rob Edwards could bring that to the table, but for different reasons, neither met his expectations when they came in, so he dispensed with them quickly. I don’t think it’s the results that doomed them, but rather what Gino saw behind the scenes: how they ran training, managed the individuals, etc. didn’t meet his expectations. And given the results were middling, sacking them fast instead of waiting around for what was to him the inevitable was a no-brainer.

    The process was not at fault. But hiring Ivic and Edwards was. At least that’s how I think he feels.

    Now, this may sound like I’m splitting hairs, but in fact I think this is precisely the kind of nuance you have to parse in order to understand why Gino does what he does. Am I reading the tea leaves a bit? Sure, maybe. But he’s never going to tell us directly (a point I’ll elaborate on in a later post), so I’m just going to have to do my best to decode him.

  • Profit vs quality

    Software engineering is all about trade-offs. Almost anything within the laws of physics and computers is possible if you pay for it. The currency is not only dollars or engineer-days, but also degraded performance in other aspects of the system. CAP theorem, etc. You also pay in increased complexity and reduced maintainability to handle rare-but-possible edge cases well.

    But I guess that can all be mostly reduced to dollars and engineers.

    Often, certain levels of performance guarantees are not impossible, just impractical. You want your benchmark requests to return in 50% of the current duration? Sure – but we’ll have to rearchitect everything from scratch and increase the server costs by 4x, as well as double the SRE team to handle all the spinning plates. Give us 2 years and 20 new engineers hired and ready to go in 3 months and you can have it. You want this in 1 year? Uh… 50 engineers, 10x server costs, and triple the SRE headcount?

    It’s not that Twitter needed the ~3000 or whatever engineers and other tech and design folks it had before November of 2022 to maintain the service, but that many were in place to not only keep it running as it is, but to grow it in a way that would please shareholders without blowing things up. Yay capitalism! There are a lot of spinning plates at the Tweet factory and adding more without things falling over is… not easy.

    This is because 98% isn’t good enough. For a service as well respected as Twitter (for its backend tech, at least), things have to work virtually flawlessly. “Eventually consistent” timelines with SLAs of more than a few seconds can breed conspiracy – why am I not seeing what buddy over here is seeing? See the timestamped screenshot? Shadowbannnnnnnnnn!

    That level of performance at that scale is very expensive. To build, to maintain, to improve. Remember the Fail Whale? We were able to eradicate that thanks to many, many, MANY hours of smart people building a system that is fast, resilient, and maintainable. It’s where a lot of the money from poorly targeted ads went: just making sure everybody can tweet lots and everyone else can see it if they want. This is on top of a flawed but well-intentioned moderation system that folks just shat on without giving a thought to how hard it is to get right. Content moderation at scale is an unsolved problem, but Old Twitter was at least trying to do it – and devoting resources to do so.

    What is unfolding today (July 1, 2023) at New Twitter is the result of callousness and a disrespect for how much work it takes to get from 98% to as close to 100% as possible. The cost-benefit may not be there if you want to maximize profits, but that’s what you have to do to make the product as good as possible. Folks at Old Twitter were trying to do that, but it was hard in the context of the shareholder-value-maximizing environment of a public company.

    Jack Dorsey is not wrong that Twitter needs to be a private company for it to achieve its self-stated role as the public town square. But that’s not what the current administration is trying to do, even though they still use that as a slogan. They want us to pay to see more tweets. It’s a town square, but designed for profit maximization, one that will gladly cut costs to make the product worse if it makes financial sense. Let’s call a spade a spade.

    If it wasn’t obvious to you before that New Twitter isn’t Old Twitter… well, it’s probably still not obvious, since this is just the same shit happening over and over again. But this is one more brick in the wall of an increasingly fenced-off garden that we the users have helped build – one that probably cost more money than folks on the outside think it should have, and one that isn’t about maximizing product quality and user safety if that comes at the cost of profits.

    I guess it’s just capitalism at work.