Benjamin's Stream

tbd


GitHub-Style Calendar Heatmaps in Google Sheets

I recently came across Leila Clark's "focusd" page, which uses a GitHub-style heatmap to track deep work time. I really like the approach, because I think habits like "spend more time focused" shouldn't be dichotomized.

I wanted to add this to my current habit-tracking Google Sheet. I made a sheet for the data, and one for the heatmaps. For the latter, I set up the basic grid, added conditional formatting for the heatmap, and a formula that sums up the relevant values from the data sheet.

ChatGPT almost got it right (I only needed to add sheet references and fix one or two wrong refs). Here's the full formula:

    =SUMIFS(TimeLog!$B$2:$B, TimeLog!$C$2:$C, $A$11, ARRAYFORMULA(TEXT(TimeLog!$A$2:$A, "ddd")), $A12, ARRAYFORMULA(WEEKNUM(TimeLog!$A$2:$A, 2)), B$11)
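
For sanity-checking the numbers outside of Sheets, here's a minimal Python sketch of the same aggregation. It assumes a CSV export of the TimeLog sheet with hypothetical column names date, minutes, and habit, standing in for columns A, B, and C in the formula above:

    # Minimal sketch: replicate the SUMIFS aggregation on a CSV export of the
    # TimeLog sheet. The column names ("date", "minutes", "habit") are assumptions.
    import pandas as pd

    log = pd.read_csv("timelog.csv", parse_dates=["date"])
    habit = "deep work"  # placeholder for the habit this heatmap tracks ($A$11 above)

    rows = log[log["habit"] == habit].copy()
    rows["weekday"] = rows["date"].dt.strftime("%a")               # like TEXT(date, "ddd")
    rows["week"] = rows["date"].dt.isocalendar().week.astype(int)  # roughly WEEKNUM(date, 2)

    # Pivot into the heatmap grid: weekdays as rows, week numbers as columns.
    grid = rows.pivot_table(index="weekday", columns="week",
                            values="minutes", aggfunc="sum", fill_value=0)
    print(grid)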

This will break in 2025 (the formula matches on week number but not on year, so weeks from different years would get summed together), but by then I will hopefully either have stopped using it or come up with a fancier solution.

Error Page Redirects Should Not Lose Information

Many web applications implement some kind of redirect on error responses in order to improve how the error is handled, either for the user (e.g. by displaying a friendly error page with some further information and a way of contacting support), for the business running the website (e.g. redirecting obsolete product links in a shop to some related category page with other products the user might be interested in buying), or both.

This is nicer than just leaving the user stranded with some hard-to-understand error code, but it often has the downside of dropping information. For example, server error pages on Twitter/X will sometimes trap you by redirecting to a static error page path, so hitting refresh never does anything other than re-loading the error page you're on, even if the underlying error was temporary. In the case of online shops, redirects are sometimes done without even showing you a message that the product you were trying to reach is no longer available, which is bad communication. Another example I just encountered was loading an invalid DOI: the error page drops any mention of the DOI I was trying to resolve, leaving me with no way of checking whether I made a typo or something else went wrong without "retracing my steps" manually.

If you're implementing friendly errors for a website, you should ensure the user has as good a path forward as possible, but also that no information provided by the user is entirely dropped.
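
As a sketch of what that can look like in practice, here's a minimal, hypothetical Flask example (the route-free error handlers and the template names are made up): rather than redirecting to a generic static error path, the handlers render the friendly page in place and pass the requested path through, so the user can see what failed, correct a typo, or simply refresh:

    # Minimal sketch (hypothetical Flask app): render the friendly error page in
    # place and keep the user's original request visible instead of dropping it.
    from flask import Flask, render_template, request

    app = Flask(__name__)

    @app.errorhandler(404)
    def not_found(error):
        # No redirect to a generic /error path; show what the user actually asked
        # for so they can spot a typo or follow a suggested alternative.
        return render_template("not_found.html", requested=request.full_path), 404

    @app.errorhandler(500)
    def server_error(error):
        # The URL stays unchanged, so a plain refresh works once a temporary
        # backend problem has gone away.
        return render_template("server_error.html", requested=request.full_path), 500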

Playing With The Basics

A recurring thing I've seen is that people at the top of their field often continue "playing with the basics" in a surprisingly simple manner. People doing political data science make their own spreadsheets, painstakingly collating polling data. People working in macroeconomics keep a separate copy of the official statistics on which they run some basic analysis of their own. Finance people pull together quarterly reports of public companies and do some rather straightforward modelling on them.

The common pattern is that it takes some amount of effort, it exposes you to your field (and gets you to interact with the bits and pieces), and it is far simpler, methodologically, than what you'd expect of people at the cutting edge. Here you have a bunch of people coming up with entirely new approaches and models for things, yet this stuff looks like you could start doing it yourself after seeing someone walk through it once or twice. You may not get the same insight from it as an expert when you start out, but the knowledge you need to get started is right there.

It reminds me a bit of how Feynman got over a period of burnout by taking a more playful approach to physics again. Or of the conversations about deliberate practice in knowledge work.

Digital Nicotine Patches

People who try to stop smoking will sometimes use nicotine patches or lozenges to slowly taper the addictive substance while stopping the harmful habit, i.e. smoking cigarettes, instantly. Some try finding another habit like chewing (non-nicotine) gum every time they crave a cigarette.

I'm currently taking a break from Twitter after frittering away an ungodly amount of time on the site recently. My brain is still wired for my Twitter habit. About twenty times a day, I see or read something or get a thought and get the urge to tweet it out.

I decided to try a weird substitution approach: I just started an empty Google Doc. I literally called it "instead of Twitter". Every time I get the urge to tweet something, I just put a new bullet point in the doc and move on. I still get to act on the urge a little, but I don't get sucked into a feed of interesting content and people I want to talk to each time I do so. And my brain doesn't quite as compulsively search for things that would be fun to put in a Google Doc all day.

After a few days I'm up to a little over two pages. But it seems to be working as intended.

Algorithmic Nuggets in Content Delivery

I might do a better write-up of this at some point, but for now here's a quick note about a fun paper I came across recently. It's by Bruce Maggs & Ramesh Sitaraman and titled Algorithmic Nuggets in Content Delivery (2016).

In it, the authors look at a range of problems that arise over the lifecycle of a single request to a content delivery network (CDN), specifically Akamai. For each one, they formalize the problem, outline the theory of the algorithm used to solve it, and point to some practical considerations beyond the theory that arise in the real world.

First up is load balancing between clusters of CDN servers and clusters of end users. Unexpectedly (to me), this is solved using an extension of the Gale-Shapley (deferred acceptance) algorithm, which comes from economics/market design. Both server clusters and user clusters have (partial) preferences over which counterparties they'd "like" to be matched with, so this is a nice fit. Capacity constraints and the partialness of the preferences necessitate the extension.
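
For reference, here's a minimal Python sketch of plain one-to-one deferred acceptance with made-up cluster names; the paper's variant additionally handles cluster capacities and partial preference lists:

    # Textbook one-to-one Gale-Shapley (deferred acceptance), illustrative only.
    # Assumes complete preference lists on both sides.

    def gale_shapley(proposer_prefs, acceptor_prefs):
        """Both arguments map a name to an ordered list of preferred counterparties."""
        rank = {a: {p: i for i, p in enumerate(prefs)} for a, prefs in acceptor_prefs.items()}
        free = list(proposer_prefs)           # proposers currently without a match
        next_choice = {p: 0 for p in proposer_prefs}
        match = {}                            # acceptor -> proposer

        while free:
            p = free.pop()
            a = proposer_prefs[p][next_choice[p]]
            next_choice[p] += 1
            if a not in match:
                match[a] = p                  # tentatively accepted
            elif rank[a][p] < rank[a][match[a]]:
                free.append(match[a])         # acceptor trades up, old proposer is free again
                match[a] = p
            else:
                free.append(p)                # rejected, will try the next choice
        return match

    # Toy example: user clusters propose to server clusters.
    users = {"u1": ["s1", "s2"], "u2": ["s1", "s2"]}
    servers = {"s1": ["u2", "u1"], "s2": ["u1", "u2"]}
    print(gale_shapley(users, servers))       # {'s1': 'u2', 's2': 'u1'}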

Next is within-cluster load balancing, achieved by allocating served objects to servers. For this, they introduce consistent hashing, which reduces how many objects get remapped as servers are added to or removed from a cluster.
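
A toy version of the idea, a hash ring with virtual nodes, might look like this (illustrative only, not the exact construction used in the paper):

    # Toy consistent-hash ring with virtual nodes.
    import bisect
    import hashlib

    def _h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, servers, vnodes=100):
            # Each server gets many points ("virtual nodes") on the ring.
            self.ring = sorted((_h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
            self.points = [h for h, _ in self.ring]

        def server_for(self, obj):
            # Walk clockwise to the first virtual node at or after the object's hash.
            i = bisect.bisect_left(self.points, _h(obj)) % len(self.points)
            return self.ring[i][1]

    ring = HashRing(["srv-a", "srv-b", "srv-c"])
    print(ring.server_for("/images/logo.png"))
    # Adding or removing one server only remaps objects that hashed to its virtual
    # nodes; all other objects keep their assignment.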

Deciding which objects to cache in the first place (cache filtering) is another interesting problem. It turns out around three quarters of objects are only ever accessed once, so caching them adds extra disk I/O load on the servers and takes up storage space without providing any subsequent benefit. Akamai uses Bloom filters to keep track of which objects have previously been requested, and only caches an object on its second request. In practice this about doubles hit rates (because there's more free storage for popular content) and halves disk I/O load (which also speeds up reads for previously cached content). Aside: The authors show a CDF of how many objects are requested less than a given number of times, but I wonder how the probability of a subsequent request differs between the first and the second request.
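
The cache-on-second-request rule itself is simple enough to sketch; the following toy Python version uses a small Bloom filter with arbitrary sizing (a real deployment would size and periodically reset the filter carefully):

    # Illustrative cache-on-second-request filter: an object is only cached once the
    # Bloom filter says we have (probably) seen it before.
    import hashlib

    class BloomFilter:
        def __init__(self, bits=1 << 20, hashes=4):
            self.bits, self.hashes = bits, hashes
            self.array = bytearray(bits // 8)

        def _positions(self, key):
            for i in range(self.hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.bits

        def add(self, key):
            for pos in self._positions(key):
                self.array[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.array[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    seen = BloomFilter()
    cache = {}

    def handle_request(url, fetch_from_origin):
        if url in cache:
            return cache[url]          # cache hit
        body = fetch_from_origin(url)
        if url in seen:
            cache[url] = body          # second (or later) request: now worth caching
        else:
            seen.add(url)              # first request: remember it, don't cache yet
        return body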

Overlay routing, that is, routing within the CDN between content origins and the edge nodes serving clients, is modelled as a multi-commodity flow problem: the content objects are commodities that have to be transported from upstream origins to edge servers. Interestingly, the problem is solved differently for different content types. Dynamic websites are more sensitive to latency, so the algorithms search for shortest paths. Live video, on the other hand, is more sensitive to throughput, so capacity constraints become more important and there are benefits to having multiple parallel routes to recover lost packets. An interesting detail is that this leads to a mixed-integer program, which is hard to solve, but can be "relaxed" to a linear program, whose solution then gets rounded to the nearest integer solution. This may not be optimal, but often gets close at a substantially lower computational cost.
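
To make the latency-sensitive case concrete, here's a toy shortest-path search over a made-up overlay graph of per-link latencies; the full formulation with capacities and multiple commodities is the LP/MIP relaxation described above:

    # Illustrative only: lowest-latency overlay path from an origin to an edge
    # server, over a made-up graph of measured link latencies in milliseconds.
    import heapq

    def shortest_path(graph, src, dst):
        dist, prev = {src: 0}, {}
        queue = [(0, src)]
        while queue:
            d, node = heapq.heappop(queue)
            if node == dst:
                break
            if d > dist.get(node, float("inf")):
                continue               # stale queue entry
            for nbr, latency in graph[node].items():
                nd = d + latency
                if nd < dist.get(nbr, float("inf")):
                    dist[nbr], prev[nbr] = nd, node
                    heapq.heappush(queue, (nd, nbr))
        path, node = [dst], dst
        while node != src:
            node = prev[node]
            path.append(node)
        return list(reversed(path)), dist[dst]

    overlay = {
        "origin":  {"relay-1": 40, "relay-2": 55},
        "relay-1": {"edge": 35, "relay-2": 10},
        "relay-2": {"edge": 20},
        "edge":    {},
    }
    print(shortest_path(overlay, "origin", "edge"))  # (['origin', 'relay-1', 'relay-2', 'edge'], 70)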

The final problem considered is leader election, i.e. how to pick a master node within a cluster from which to replicate state to the other nodes. Some constraints are common across use cases, e.g. the assumption that no server acts maliciously (Byzantine), and that only servers in good health may be elected as new leaders. Others differ: in some contexts it is desirable to end up with at least one new leader, in others with at most one. Akamai has its own library of protocols for these cases. Where strict consensus is required, such as when replicating user session data, established but more complex algorithms like Paxos are used.
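
Akamai's protocol library isn't spelled out in the paper, but the flavour of the simplest case (at most one leader, no Byzantine faults, reliable health checks) can be illustrated with a toy election that deterministically picks the highest-ID healthy node:

    # Toy illustration only (not Akamai's protocols): a bully-style election in which
    # the healthy node with the highest ID becomes leader. Assumes no Byzantine
    # behaviour and that every node sees the same health-check results.

    def elect_leader(nodes, is_healthy):
        candidates = [n for n in nodes if is_healthy(n)]
        if not candidates:
            return None         # no healthy node -> no leader this round ("at most one")
        return max(candidates)  # deterministic, so all nodes agree on the same leader

    cluster = ["node-1", "node-2", "node-3"]
    down = {"node-3"}
    print(elect_leader(cluster, lambda n: n not in down))  # node-2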

All in all, the paper gives a nice high-level overview of where algorithmic theory and practical considerations meet, and of how a CDN functions. The algorithms and especially their extensions are usually not covered in full, but for most of them references with more details are provided.