The granddaddy of Android performance metrics is probably application cold start. Setting aside the complications of how to precisely measure it (PY breaks it down really well in this post), let's think about what it represents. Beginning with a user-triggered action, the OS creates the app process by forking the zygote, which leads to the creation of the Application object, after which the Activity lifecycle kicks in (usually), finishing with the UI being rendered, ready to be interacted with.
This is usually the first mobile performance metric app devs look at when they want to understand how well their app is performing in production. Google treats it as one of the golden perf metrics, important enough to be one of the few that are collected automatically and displayed on the Google Play Console. You’ll even find a litany of blog posts and conference talks about how to improve it. I’ve even spoken at droidcon SF about how improving it at Twitter helped us grow DAU tremendously.
But do users actually care about this?
Focus On App Launch
I mean of course they care about how long they have to wait for an app to become usable when they tap the app icon. But what they care about is the app launch time as they perceive it, not what is happening under the hood.
The only time users care about cold start duration is when they are in the middle of one, and only in the context of it delaying the app launch they’ve started. And you know what is faster than a cold start? A warm or hot one. Google explains the difference between the types here, but the point is that the fastest cold start is the one that never happens. So if you want to improve a user’s experience, don’t only focus on how to make cold starts fast – also think about how you can minimize them.
Ultimately, what users care about is having the shortest app launch time possible. Cold, warm, or hot, they just want whatever’s the fastest. If they had a choice, they’d probably want the device to read their mind, launch the app, and have it be ready when they look down at their phone. If they don’t mind having their minds read, that is. While we can’t quite do that, we can minimize the time it takes for an app to launch – and that goes beyond improving cold start duration.
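To make the cold/warm/hot distinction concrete, here’s a minimal sketch of what tracking each launch type separately could look like. Everything here (the `LaunchType` enum, the `LaunchTracker` class) is hypothetical illustration, not an Android API; a real implementation would determine the type and duration by hooking into something like `ActivityLifecycleCallbacks` or a perf SDK.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical taxonomy mirroring Google's cold/warm/hot launch types.
enum LaunchType { COLD, WARM, HOT }

// Hypothetical tracker that keeps each launch type as its own metric,
// rather than munging them into a single "app launch time" number.
class LaunchTracker {
    private final Map<LaunchType, Long> counts = new EnumMap<>(LaunchType.class);
    private final Map<LaunchType, Long> totalMillis = new EnumMap<>(LaunchType.class);

    void record(LaunchType type, long durationMillis) {
        counts.merge(type, 1L, Long::sum);
        totalMillis.merge(type, durationMillis, Long::sum);
    }

    long count(LaunchType type) {
        return counts.getOrDefault(type, 0L);
    }

    // Average duration for one launch type; 0 if none recorded.
    double averageMillis(LaunchType type) {
        long n = count(type);
        return n == 0 ? 0.0 : (double) totalMillis.get(type) / n;
    }
}
```

Keeping the per-type counts around also turns out to be useful later: they are exactly what you’d need to reason about how often each kind of launch happens, not just how long it takes.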
One Metric or Many?
So you might be asking yourself: Is this guy advocating that we collect all types of app starts under one app launch metric, irrespective of, um, temperature? Oh HELL no. A million times no. Don’t you DARE put that on me!
Munging different operations together into one metric means we can’t understand what actually happens if that number changes. Which workflows got faster or slower? A composite number like that is much less predictive and actionable than its constituent parts, making it more of a vanity metric than if we were to track them separately. So in order for us to measure precise changes to each workflow, we cannot collect one “app launch time” metric that includes all three. These must be different metrics.
That’s not to say we can’t remix the three app startup times into a singular metric. In fact, if you do it right, it can be very powerful. Like the idea of OPS and OPS+ in baseball, there could be further insights gleaned if all the app startups are combined in the right way. But the constituent parts have to be measured separately so that you can determine the underlying changes and take appropriate action if warranted.
So what’s a good way of doing this kind of metrics remixing?
A Dead-End (For Me)
One possible way is to simply smush the datasets for all three types of app startups together – basically the munging I so objected to earlier, except this time each type is also tracked separately. The biggest advantage of this is that it’s easy. However, its usefulness may be limited given that the slower startup times dominate the formula; depending on the distribution of your particular app, the trend for this combined metric may not be far off from the cold start trend alone.
You can also do a weighted average to dampen the effects of the dominating component – but how will you choose the weighting factors? I suppose you could play around with different ones, see if you can find correlations with other metrics, then run experiments to validate the relationships. But this process depends heavily on the underlying user base and the distributions of data it produces, so not only will the results not generalize, they may change without you even knowing.
Perhaps better folks than me can use this methodology to derive sustainable value from this type of Voltron-ing of the various app startup times, but it may be beyond my capabilities at this point. (If you’re able to do this, please blog about it or just ping me because I’m dying to know!)
A Possible Way Forward?
The one combination of startup metrics that I’ve been noodling on (one I’ve yet to prove useful with real data, or to convince myself is too difficult to make work) is a ratio of cold starts over total app launches, tracked as a metric representing how often a full cold start is required when a user launches the app. We can call it… the cold start rate? The higher this number, the worse it is for the user. So the idea is to keep it low (or reduce it if an improvement is sought), and to treat any material increase as a regression, much like how folks treat cold startup time regressions right now.
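The cold start rate itself is just a ratio of two counts you’d already have if you track launch types separately. A tiny sketch (the names are mine; this is not an established metric):

```java
// Hypothetical "cold start rate": the fraction of all app launches
// (cold + warm + hot) that required a full cold start. Lower is better.
class ColdStartRate {
    static double of(long coldStarts, long totalLaunches) {
        if (totalLaunches == 0) return 0.0; // no launches, no rate
        return (double) coldStarts / totalLaunches;
    }
}
```

So 25 cold starts out of 100 total launches gives a rate of 0.25, and a release that pushes that number up materially would be treated as a regression even if cold start *duration* stayed flat.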
Will cold start rate be predictive? It’s hard to say without data. Will it be actionable? Maybe – if you can see why the ratio has changed, and have other metrics that would give you clues as to why that may be the case.
Perhaps you have a memory leak, and lmkd is more aggressive about killing your app when it’s in the background because of the process’s higher oom_score_adj value. Seeing a slight increase in your OutOfMemoryError metric combined with a higher rate of cold starts might lead you to fire up LeakCanary to find and fix the leak. In that hypothetical scenario, the increase in cold start rate gave you proof that users were impacted by the leak, and that fixing it also fixed the regression. Simply tracking how long cold starts take will not give you that insight.
Anyway, contrived example aside, the point is not whether this cold start rate metric turns out to be something generally useful. The issue to highlight is that the way many of us monitor app launch times is kind of incomplete, focusing on the absolute value of the slowest kind of app launch, rather than optimizing for the general case. We absolutely can and should do better.