• On Grief and Art

    I picked up the 10th anniversary deluxe edition of Sufjan Stevens’ Carrie and Lowell today. It is objectively a terrific album full of beautiful songs about loss and grief, but it is special to me because of a morbid connection: it was the soundtrack to the last few months of my mother’s life. It came out a few months before we found out her cancer had come back and metastasized, and I listened to it non-stop throughout the year as I moved back home to take care of her, all the while watching her fade away. Brain tumours can fuck right off.

    Catharsis or masochism, who even knows. What I DO know is that the record is inextricably connected to that period of my life. Thematically, but also emotionally. Music has a way of cutting through everything and connecting at a level so deep that you might not even have realized it was there. To me, Carrie and Lowell is a shortcut to a well of not-quite-processed grief about the person I felt unquestionably bonded to, a love that would be gauche to even verbalize (because Chinese).

    Sufjan did not have that kind of relationship with his mom, the titular Carrie, whose death resulted in the songs on the album. She left him at a young age, and they did not reconnect until much later in his life. The narrative of the record was based on a fictional past he conjured up as he tried to work out his very-real grief towards a person whom he barely knew.

    Without the backstory, what resulted can be characterized as a manifestation of raw emotion reified as pretty songs. Or something a lot more self-centred, if you see it the way Sufjan sees it in 2025. But if you just listen to the record, it’ll sound like a tribute to a loved one lost, a set of songs that someone in my position, already a huge fan of his, could easily connect with and then some.

    And that I did. No matter how Sufjan himself feels about it, it is what it is to me. We speak about separating the art from the artist these days, usually in the context of shitty people making great art and whether you can still enjoy it. But to me, it’s also about separating the intention of the artist, or how they feel about the work, versus how it makes ME feel.

    You can think of it as an extremely self-centred way of going about life, but what is art and art enjoyment if it isn’t about how you as a person relate to it and how it makes you feel? Objectivity in art and art appreciation is always superseded by how it makes people feel.

    Sure, you can break it down mechanically, evaluate it using a rubric that is as bias-free as you can make it. But at the end of the day, who gives a shit? If it makes you cry or dance, and that’s what you want from it, it has done its job.

    Listening to those songs today brought out those emotions again, but if the music, lyrics, performance, and production weren’t as fantastic as they are, I don’t think it would have. A lot of people have written songs about dead parents, some probably even objectively better, but none of those records are Carrie and Lowell to me.

    While I don’t think any of the songs on it even cracks my top ten favourite Sufjan tracks, as a record, its importance to me is unparalleled. It has given me a safe space to grieve, to dive back into that pool of feelings through the lens of melody and poetry. And for that, I will forever be grateful to Sufjan, even if he’s embarrassed by what he made.


  • On the Utility of a (User) Session on Mobile

    On mobile devices, the lifetime of a process that backs an instance of an app isn’t necessarily mapped to its usage lifecycle. Even if you ignore implementation details like Android’s process forking and specialization via zygotes, the app process is often already created when a user taps on the app icon to launch the app, and it often stays active in some respect even after the user puts their phone back in their pocket.

    Many developers who collect production telemetry from mobile apps for monitoring and observability are only interested in knowing what’s going on when the user is using the app. If there’s an error and there is no user to see it, does it really matter? Sometimes it does, but more often than not, it doesn’t. Or at least it matters a whole lot less. Mobile telemetry is most useful when it’s user-centric, after all.

    This is why mobile folks have loosely converged on a concept often referred to as a “Session”, which represents a contiguous chunk of time when the app is in use. Data collected on a device during that chunk of time are usually grouped together for the purposes of visualization (e.g. “Session Replay”) and rate-based analysis (e.g. a daily-sessions-per-user metric), among others. Such a grouping has some nice properties when you dig into the data. It is often a more interesting unit of measurement for usage than counting raw time in app.
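
    To make this concrete, here’s a minimal Kotlin sketch of the core idea: every signal recorded while a session is live gets stamped with that session’s ID, so it can be grouped with its siblings later. The types and names are made up for illustration, not any particular SDK’s API, though the session.id attribute key is borrowed from OTel’s semantic conventions.

        import java.util.UUID

        // Made-up telemetry type for illustration.
        data class Signal(val name: String, val attrs: Map<String, String>)

        object SessionContext {
            var currentSessionId: String? = null
                private set

            fun startSession() { currentSessionId = UUID.randomUUID().toString() }
            fun endSession() { currentSessionId = null }
        }

        // Stamp every outgoing signal with the live session's ID (if any),
        // so all of a session's telemetry can be grouped together downstream.
        fun tagWithSession(signal: Signal): Signal =
            signal.copy(attrs = signal.attrs +
                ("session.id" to (SessionContext.currentSessionId ?: "none")))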

    Grouping telemetry together as belonging to a session also allows you to better correlate app and user actions. It tells you that the things happening within it share the same usage context. The sequence of app and user actions flows from one to the other, and they’re not only connected by temporal proximity, but are also connected in a user’s brain. Frustration built up at the beginning of a session will often manifest itself as less patience towards the end. Errors that impede progress early on affect what can be done later on.

    If you are wondering why conversion rate has dropped, it may be useful to look at what happened throughout the session. Did the performance of UI loads drop, leading to greater user abandonment? Were there malfunctions in the widget for adding a credit card, preventing users from proceeding to the checkout page? Potential causation like that is difficult to tease out if you don’t directly link telemetry together in such a way that you can partially reconstruct a user’s head space; sessions allow you to do that in a crude way. (There are other means, but that’s for a different post.)

    Another way a session can be useful is that it will often give insight into the context of usage that can’t be fully baked into each signal. Device metadata that is too expensive or impossible to acquire or encode when telemetry is being recorded can be applied retroactively. If the OS was throttling the CPU during the execution of some workflow, unbeknownst to the instrumentation, the association with a session allows the trace to be indirectly linked to the reason that could explain its slowness.
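
    Sketching that out, continuing with the made-up Signal type from above: record the session-level fact once, keyed by session ID, then join it onto the session’s telemetry after the fact.

        // A fact about a whole session, learned at any point (or after the fact).
        data class SessionFact(val sessionId: String, val key: String, val value: String)

        // Enrich signals retroactively with facts about their session, e.g.
        // SessionFact(sid, "cpu.throttled", "true") recorded by a slow poller.
        fun enrich(signals: List<Signal>, facts: List<SessionFact>): List<Signal> {
            val bySession = facts.groupBy { it.sessionId }
            return signals.map { signal ->
                val sid = signal.attrs["session.id"] ?: return@map signal
                val extras = bySession[sid].orEmpty().associate { it.key to it.value }
                signal.copy(attrs = signal.attrs + extras)
            }
        }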

    Having a session also allows you to buffer telemetry so that you can send its data only when you know there is no more coming. The atomicity of delivery offers guarantees that could simplify your backend when processing the data, allowing you to skip tedious back-filling that might be needed otherwise.
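
    A minimal sketch of that buffering, again reusing the made-up Signal type; a real SDK would also persist the buffer to disk in case the process dies mid-session, but the shape is the same.

        // Buffer a session's signals and deliver them as one batch at session end,
        // so the backend never has to back-fill a session with stragglers.
        class SessionBuffer(private val send: (List<Signal>) -> Unit) {
            private val buffer = mutableListOf<Signal>()

            @Synchronized
            fun add(signal: Signal) { buffer.add(signal) }

            // Call when the session ends: one atomic delivery per session.
            @Synchronized
            fun flush() {
                if (buffer.isNotEmpty()) send(buffer.toList())
                buffer.clear()
            }
        }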

    But when does a session start and when does a session end? This is where it gets interesting. For me, the answer to that is basically: “when you want it to, so long as you’re consistent”.

    There are a few reasonable ways to define the boundaries of a session. At Embrace, we define it as the time from when an app foregrounds to when it backgrounds. The OTel Android Agent ends a session after some period of inactivity. You can also use a strict time-based start/end scheme if that’s more conducive to how your app is used. The key is predictability, so your analysis comes closer to comparing apples to apples.
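
    Here’s roughly what the first two schemes look like, sketched against the hypothetical SessionContext from earlier. The lifecycle hookup and the timeout value are mine, not Embrace’s or the OTel agent’s actual implementations.

        // Scheme 1: a session spans app foreground to app background.
        class ForegroundSessions {
            fun onAppForeground() = SessionContext.startSession()
            fun onAppBackground() = SessionContext.endSession()
        }

        // Scheme 2: a session rolls over after a period of inactivity.
        class InactivitySessions(private val timeoutMs: Long = 15 * 60 * 1000L) {
            private var lastActivityMs = 0L

            fun onUserActivity(nowMs: Long) {
                if (nowMs - lastActivityMs > timeoutMs) {
                    SessionContext.endSession()
                    SessionContext.startSession()
                }
                lastActivityMs = nowMs
            }
        }

    Either way, everything recorded between start and end gets the same session.id, which is the part that actually matters.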

    At the end of the day, it’s a matter of taste. Or it is the result of implementation details that may not apply to everyone. There’s no right answer to this. Dealer’s choice. What matters is that there ARE sessions. Even if you define it as the lifetime of a process, you need something to tie together related telemetry from the same device. Why? If it’s not obvious to you, maybe I’ll write about it some more later.


  • On the Real Subject of Observability on Mobile

    Let’s get something straight. Observability isn’t the same as monitoring. Yes, both are powered by telemetry collected from production that is crunched down into metrics, shown as some sort of time-series, graph, or table of numbers. But the key difference is that o11y requires flexibility in aggregation, and on mobile in particular, o11y should give you a reasonably accurate estimation of how perf changes directly impact user behaviour and business KPIs.

    The flexibility of aggregation bit is relatively well understood at this point by folks who pay attention to o11y: if you have to specify the dimensions of aggregation ahead of time, either at collection time or when the telemetry is processed, you’ll only be able to cut the data in pre-defined ways. That’s OK if all you want is to be alerted when your SLOs are violated (i.e. monitoring), but if you can’t do ad hoc aggregation of your data to answer questions about WHY your dashboards are all red, that ain’t o11y in the most meaningful definition of the term.
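
    A toy Kotlin illustration of the difference, with made-up shapes: the first approach can only ever answer questions along the dimension it baked in at collection time, while the second keeps the raw, wide event around so you can pick the dimension when the question arrives.

        // Monitoring-style: the dimension is fixed at collection time, so
        // "crashes by device model" is a question this data can never answer.
        val crashesByAppVersion = mutableMapOf<String, Int>()
        fun recordCrash(appVersion: String) {
            crashesByAppVersion.merge(appVersion, 1, Int::plus)
        }

        // O11y-style: keep the wide event, slice by whatever dimension you want later.
        data class CrashEvent(val attrs: Map<String, String>) // version, model, OS, network...

        fun crashesBy(events: List<CrashEvent>, dimension: String): Map<String, Int> =
            events.groupingBy { it.attrs[dimension] ?: "unknown" }.eachCount()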

    The point about the ability to estimate user behaviour and business KPI impact is what I’m going to expound on in greater detail here. This is precisely the reason why mobile o11y can be so transformative.

    Since my experience with o11y in the large distributed system context is somewhere between non-existent and limited, I’m not going to claim that it’s a necessary part of a system being observable. After all, the focus for backend o11y is understanding the internals of a complex system, so if the slicing and dicing can tell you why parts of it are breaking down for non-obvious reasons, that’s a job well done and worthy of being called o11y.

    Why? Because that understanding will allow you to take action to remedy any problems you can see or anticipate. It does what it says on the box: through it, you can learn about your system and solve problems using just the data you collect.

    But on mobile, the app you are observing isn’t a big-ass, complex system: it’s a bunch of little instances running on a heterogeneous mix of devices, in unpredictable proportions, under vastly different environments. Aggregates of lower-level metrics like heap size on foreground or even p75 of app startup would, at best, only tell you what is happening in the app; they don’t tell you how the varying levels are impacting your users and their usage of your app. I mean, knowing the percentage of users who experienced a crash in the last 24 hours tells you what, exactly, other than the literal thing it’s tracking?

    No, the target of observation for observable mobile apps isn’t just the app itself: it’s the users too. Every individual person trying to order food on your app, but also the entire population in aggregate. How your users are using your app and how perf issues impact that usage is what you want to understand, so you need to collect both. What’s more, if data and metadata don’t give you an indication of user experience, the perf dimensions that affect it, or the factors that explain the two, why are you collecting that stuff at all?

    If your telemetry doesn’t tell you if your users are able to load a page or order their Pad Thai, or explain directly why they aren’t getting the value they expect to get from your app, it’s little more than trivia.

    I’m not saying understanding the inner workings of a mobile app through production telemetry isn’t useful. You can find regressions or exemplars of hard-to-reproduce bugs using aggregate metrics and session replay timelines, respectively. Getting some slicing and dicing in there can even reveal or explain hard-to-isolate cohorts that face unique perf challenges due to unexpected factors. All that is extremely useful and nothing to sneeze at.

    But we can do so much more on mobile because not only do we have the ability to understand how the app is working — we have direct access to individual users and their actions, so we are able to find relationships between the former and the latter. Correlation at the very least, but causation too if you play your cards right with A/B testing.

    We can measure the rate of abandonment for page loads and see the correlation between it and the time it takes for that page to load. Already built into this is the user’s expectation of perf, not in aggregate like some arbitrary line we draw above which perf is unacceptable. No, individual perf expectations are built in by virtue of whether users stayed long enough to allow the workflow to complete. Success rate is magical like that.
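
    If I were to sketch that instrumentation in Kotlin (hypothetical names; a real version would hook into lifecycle and navigation callbacks to catch the bail-outs), it’d look something like this:

        // Record every page load attempt with its outcome and duration; abandonment
        // rate as a function of load latency falls straight out of this data.
        data class PageLoadEvent(val outcome: String, val durationMs: Long)

        class PageLoadTracker(private val emit: (PageLoadEvent) -> Unit) {
            private var startedAtMs = 0L

            fun onLoadStart(nowMs: Long) { startedAtMs = nowMs }

            fun onLoadComplete(nowMs: Long) = finish("completed", nowMs)

            // Call when the user navigates away or backgrounds the app
            // before the load finishes.
            fun onAbandon(nowMs: Long) = finish("abandoned", nowMs)

            private fun finish(outcome: String, nowMs: Long) =
                emit(PageLoadEvent(outcome, nowMs - startedAtMs))
        }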

    So that’s why I set a higher standard for mobile observability. Here, I think it’s imperative that we be user-centric in our o11y practice. Let’s observe not only the app, but the users as well, so we can explain user behaviour changes through performance. We’d be letting ourselves down if we didn’t do that.


  • A Call to ARMs: Bringing Observability 2.0 to Mobile (Part 1)

    The Observability-Free Zone

    For most companies, “mobile observability” is a misnomer. I say this because the data that most mobile devs have to “observe” their app in production is laughably simplistic. The vast majority make do with basic crash tracking and precomputed metric aggregates as the only lens into how their app is performing in the wild.

    Those who want more than the bare minimum, like actually-useful data to hunt down ANRs or network request latency from the client’s perspective, can pay vendors that provide SDKs and UIs that will track user and app events in greater detail. Some of these products are better than others, though very few provide the capabilities that backend SREs are used to when they are troubleshooting production issues.

    For a long time, this kind of basic production monitoring was good enough for most companies who ship mobile apps. Simply knowing whether a new version caused crashes to spike, or whether the P50 of cold app startup was below some arbitrary value, was seen as enough. If those numbers looked good – by some definition of good – the app was considered stable.

    But increasingly, people are becoming unconvinced, as users complain about app quality issues that simply don’t show up in those wonderful dashboards that supposedly tell them how their app is performing in production. Because if all you have are aggregate crash counts and a handful of percentiles, your app isn’t really observable.

    So when I flippantly say that mobile observability is a misnomer, that’s what I really mean: observability is much more than just having graphs of a few key metrics. It is, as Hazel Weakly puts it, about having the ability to ask meaningful questions and get useful answers that you can then act upon. 

    If your tooling can only tell you that some amount of people are experiencing slow app startups but doesn’t give you the ability to figure out who they are or why just them, that’s not observability to me – that’s just monitoring.

    Some folks who are hip to this turn to vendors like my current employer to provide mobile app performance data in production that is actually actionable. Others achieve similar results by building or assembling all the pieces on their own, like my previous employer.

    Still, even as commercial and homegrown mobile observability solutions become more prevalent, most mobile devs continue to be stuck in the dark ages, even as their SRE colleagues rack up eye-watering logging bills for traditional backend observability.

    A large part of what has caused the arrested development of mobile observability is the lack of demand. Small, already overworked mobile teams aren’t out here demanding more problems to solve, no matter how passionate they are about performance.

    But I think things are about to change: true observability of the 2.0 variety is coming to the mobile world en masse for all who want it. Not only is demand increasing at breakneck speed, the ecosystem of tooling is mature enough for mobile solutions to not only work within their own silos, but also enhance existing observability data collected for the backend.

    To me, mobile is the final Infinity Stone of the Observability Gauntlet.

    So how did we get here? It’s simple, really: good old supply and demand.

    The Demand Problem

    Mobile teams are chronically under-staffed. By that, I don’t mean they are always small – they are just always asked to do more than their staffing levels can actually support.

    To some degree, this predicament is understandable: mobile platforms and ecosystems these days are so sophisticated that an outsider might think you can ship, maintain, and add features to an app with just a small team of relatively junior devs.

    And they’re not totally wrong. Between powerful platforms and tooling, automated functional and performance testing frameworks, robust CI/CD pipelines, and hands-off distribution channels like the Play and App Stores, shipping a v1 of an app has never been easier. However, shipping v1 is just the start.

    The work that consumes most mobile teams after v1 is maintaining a stable user experience as the app and the world change around it. Adding new features, sure, that takes time, but supporting new devices and OS versions without regressions, all the while features are being added, sometimes haphazardly to meet deadlines, can be deceptively tricky, and it’s not always accounted for when calculating staffing needs.

    This is because the execution environment of mobile apps is so unpredictable that it takes an outsized effort to properly plan, create, and maintain the battery of automated tests that are necessary to ensure that most – not even all – code paths and workflows are properly covered.

    Even when you leave out the different combinations of hardware and software an app has to run on, factors like network connection status, battery level, and available system resources (CPU, memory, disk, etc.) mean that unit, integration, and performance tests have a lot of combinations to cover.

    And that’s before you introduce the chaos that an end user can inject into how an app runs. Or the seemingly arbitrary decisions that mobile OSes make that further add to the entropy.

    “Oh, you think your background threads are going to finish running when the user takes a phone call? Sorry, there isn’t enough free memory, so I’m just going to kill your app. I sure hope you handled the case where your serialization to disk gets interrupted mid-stream!”
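
    (If you haven’t: one classic defense is the write-to-temp-then-rename dance. Here’s a minimal Kotlin sketch; the names are mine, but the pattern is standard. A mid-write kill leaves either the old complete file or the new complete file on disk, never a torn one.)

        import java.io.File

        fun writeAtomically(target: File, bytes: ByteArray) {
            val tmp = File(target.parentFile, target.name + ".tmp")
            tmp.outputStream().use { out ->
                out.write(bytes)
                out.fd.sync() // force bytes to disk before the rename makes them visible
            }
            // rename is atomic when source and target are on the same filesystem,
            // which a sibling temp file guarantees here.
            if (!tmp.renameTo(target)) error("rename failed: ${tmp.path} -> ${target.path}")
        }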

    Simply put, if your mobile test suite does its job well, maintaining it as your code base changes is going to take up a lot of your time. If it doesn’t, production bugs and regressions are going to keep you even busier.

    The hamster wheel can feel draining for mobile teams. Complaints of features taking too long to ship will inevitably lead to a Sophie’s Choice between not writing enough tests and having to fix production bugs later. The last thing teams like this need is tooling that tells them their apps have more issues than just what their Crashlytics dashboard shows.

    Even those who crave more production data to help them debug issues tend to look for tactical solutions for specific problems, to help them fix the bugs that they already know about. For that, the status quo is perfectly fine.

    So what’s changed? Why is there suddenly demand for real mobile observability?

    First of all, the demand has always been there. It’s just been… siloed. Some folks in the industry have realized that it’s essential for mobile apps to be truly observable – SLOs involving user workflows don’t make sense when you don’t include data from the apps themselves. Latency on the client is measured in seconds – shaving a couple hundred milliseconds in the backend will barely register if the request is running on a 2G network.

    Big Tech, my friends, has understood the importance of client-side performance for years – they’ve just built all the tech in-house rather than use vendors. How do I know this? Because that’s what I spent a couple years doing back at Ye Olde Hell Site, before, you know…

    This is where, in a previous draft, I spent a thousand words or so talking about the cross-platform, OpenTelemetry-esque production client tracing framework that I helped conceptualize, build, and roll out, but that’s a tangent I’m skipping for now. Suffice it to say that companies with mobile performance specialists have been all over this. Slack even blogged about it.

    And now, what motivated the early adopters will begin to motivate others: upper management being made to understand its importance.

    You see, busy mobile teams no longer need to be internally motivated to better understand how their apps are performing in production – they’ll be explicitly told to do so by their Directors and VPs, by folks who want to know how mobile apps directly contribute to company KPIs so that they can more optimally allocate their engineering budget.

    All this, because money is no longer cheap, and engineering orgs need to justify their existence and prove their value to the bottom line. The end of ZIRP is the catalyst to the beginning of real mobile observability.

    The End Is the Beginning Is the End

    For the uninitiated, ZIRP stands for “zero interest-rate policy”, and during that period (which ended around 2022), the cost to borrow money was very low. The effect it had on tech was that VC investment became abundant, as rich people wanted a better return than traditional vehicles offered. This led to big funds looking to put money into tech startups at a rapidly increasing rate, and buoyed by the successes of the parade of unicorns that made so many people wealthy, the money kept on coming.

    In those halcyon days, VC money flowed freely, especially for darling tech companies on the come up. R&D had free rein to spend as long as the company or engineering were perceived to be heading in the right direction. Whether this meant staffing new teams to stand up new products or signing large vendor contracts that provided very specific services, you didn’t have to go that high up in the org chart to get approval for projects with significant financial commitments.

    But coming out of COVID, with a macroeconomic climate that featured rising interest rates, investing got a lot more expensive for VCs, so they became more discerning. With the taps turned off, tech opulence turned into austerity, and that started a domino effect of budget cuts and layoffs. The industry bled, and the new normal is that if your project or team can’t justify its existence or provide a high enough ROI, you may not be around for long.

    While that may seem antithetical to the addition of a new line item in the budget for mobile observability, it is actually quite the opposite. The reason is that mobile performance has always affected app usage; it’s just never been a very exciting thing to back.

    When competing for attention and money with sexier initiatives like new products and features, whose importance and hype are tied directly to the clout of their pushers and their fancy slide decks, it’s really hard for something so relatively uninteresting to be prioritized – if anyone was even pushing for it in the first place.

    In a vibes-based stack-ranking exercise, something boring like “slow apps make people use them less” doesn’t tend to end up near the top. But in a world where prioritization is actually data-driven, where you are asked to show your receipts, initiatives that can demonstrably affect the bottom line tend to win out. In that environment, people will go out of their way to find proof of ROI and efficacy for their projects.

    And where would you find better ROI than in a key part of your customer’s journey, a part where you have limited performance data, if any, where a whole class of issues creating friction for your users is invisible to you? Mobile performance is low-hanging fruit galore, and adding observability to your app will bring you truckloads.

    When you ship a regression that slows down certain workflows in your mobile app, it will not be directly reflected in your dashboards if they are based solely on telemetry generated from your servers. Those well-calibrated SLO alerts that your SREs rely on to detect emerging incidents? They won’t fire.

    If the only telemetry you have for your apps in production are crashes and pre-aggregated metrics, you will have so many blind spots where app performance regressions could be killing you on the margins. Even if you see your KPIs drop, you won’t know that the drop was caused by your app being materially slower for some percentage of your users in production, because you lack the data to diagnose that.

    So yeah. Every mobile team should want this. Every SRE team should demand that their mobile team use this. The question is: how? If you’re not a big tech company who can throw people and money at the problem, how can you ease yourself into mobile observability without being tied down to a specific vendor’s solution?

    I’ll discuss this further in Part 2. Hint: it starts with O and ends with penTelemetry.


  • On Marvin the Album

    I came here to bang out 2,500 words on one of the best birthday presents I’ve ever received: the 30th Anniversary release of Frente’s Marvin the Album on vinyl. I was going to talk about how it was the first album I ever bought myself, on cassette, from the long-defunct Music World in Coquitlam Centre.

    But then I realized I would never finish it because I would have so much to say about how important it was to me, how well I know the songs, and how they would be the band I want to see the most, out of every band in my 1200+ CD collection. I would probably bawl if I were ever in the same room as Angie Hart as she sings the first line of Girl.

    I would also have to talk about how they’ve more or less disappeared from my life, as their second and last album was released in 1996. Even though I was scouring the early internet long after that for singles and EPs, b-sides and covers, I don’t think I’ve actively thought about them much over the last 20-odd years.

    Of course, Angie’s band Splendid appearing on Buffy the Vampire Slayer was a huge deal for me. I also liked that record, plus all of Angie’s solo work that I’ve listened to. But Frente as a thing was in the past for me, and not actively on my mind as a band that I missed. It’s just been so long since they were an active band.

    When I heard about their reunion shows in Australia when Marvin turned 20 (?), it had already happened. At that point, I thought I had missed my chance to see them. I was disappointed, but as I said, having not been actively thinking about them as an ongoing concern, it garnered little more than a shrug and a “too bad” from me.

    So earlier this year, when I mentioned to Danica how I would like Marvin on vinyl, being on my nostalgia binge brought on by the 30th anniversary tour that Sarah McLachlan did for Fumbling Towards Ecstasy (which is the other album that holds this level of importance to me), I didn’t even bother looking to see if it existed. 14-year-old me would’ve been disgusted. 34-year-old me would’ve understood.

    I soon found out it existed though: on my birthday this year, when I opened up Danica’s present to me. There it was. And IT WAS SIGNED! Splendid. Newly 44-year-old me was absolutely delighted.

    But even then, I didn’t listen to it right away. I had birthday drinks to go to that night after work, and seeing a small group of my friends in person, something that is vanishingly rare these days, was still how I was going to spend that evening. Between that, the resulting hangover (well-made cocktails still slap), a pub quiz (more friends!), and some leftover work stuff I was itching to get done, I didn’t really have a chance until last night to take a trip down memory lane.

    When I finished up my part of Eliot’s bed time routine, I went into my office and ripped open the plastic. The cover and liner notes were familiar, but not exactly the same as what I was used to on the cassette and CD. And I was really familiar with the liner notes on those formats for this record. Really familiar. I didn’t have much recorded music of my own when I was in high school, so the albums I had, I knew every part of them.

    I dropped the needle on the record, switched the input on my amp, and put on the connected headphones. The crackle comes through. Just in time. And then Angie: A girl is the word / that she hasn’t heard. Oh I was so back.

    It’s not that I haven’t listened to that song and album on and off over the last decades — I just haven’t listened to it with such deliberateness in a long time. That’s what this whole vinyl thing is for me, to listen to music intentionally as the only thing I’m doing, to literally put the needle on wax and wait as the right groove is hit and sound is produced. Latency is a gift.

    The first time through was jarring. Girl sounded exactly as I remembered, with a bit more crispness on the vocals due to the listening format/environment. But when I expected the end of that song to go into Labour of Love, it went to Accidently Kelly Street instead.

    You see, this version was effectively the original Australian release, with the original running order, and the Bizarre Love Triangle cover tacked on the end. The US/Canadian version I was used to had a different running order, as well as a couple of songs that were swapped out. I was familiar with 1-9-0 and Out of Sight because I had heard them on other releases, but they weren’t part of the Marvin experience for me, so they felt out of context when they came on.

    But a different set of songs and running order didn’t materially alter my experience. I was still listening to Marvin the Album with intentionality for the first time in years and years. And everything I loved about it and them came flooding back.

    It was magical. Transcendent. The nostalgia sweet spot was worked with precision.

    So of course I googled them afterwards to see if they did a tour when this was released a couple years ago. And of course they did. Last year. Australia. I had missed the chance again. But this time, I was bummed. Really bummed.

    The google tunnel led to their socials, and then a glimmer of hope: they were working on new music! So maybe, in 2025, a reunited Frente (probably just Angie and Simon Austin?) might need to do some promo for a new record, perhaps some small shows in their native land?

    The dream is not dead yet.

    Anyway, where was I? Yeah. I don’t have time to write that post. Instead, I’ll just say this:

    I love music and Marvin the Album fucking slaps.