Why Overthinking Is A Requirement
Overthinking is a requirement at global scale. This case study shows how a tiny optimization, left unmeasured, became a worldwide outage, and why the principle stuck.
Introduction: Overthink by Default
“Overthink by default” is not a personality quirk. It is a requirement. It is the discipline of exhaustively exploring corner cases and edge conditions before you make decisions that will be amplified by scale. At that scale, the cost of underthinking is not a small bug. It is an outage with real business impact.
For better or worse, I happen to be the all-time global outage leader at Morgan Stanley. That’s not a brag. It’s more like being the all-time strikeout leader as a hitter. We were building one of the earliest truly global Unix-based distributed systems in the 1990s, moving fast, and pushing into territory no one had mapped. When you go first, you walk out onto thin ice and sometimes you fall through. The point is not that we failed. The point is that we learned the most when we failed, and we got very good at turning failures into durable principles.
This is also why I lean on a biology analogy more than a physics one. Keeping infrastructure alive is less about solving a static math problem and more about managing a living system where every layer evolves at a different rate. Networks get faster, latency patterns shift, operating systems change, and protocol stacks improve unevenly. Nothing stands still, and the system keeps evolving under you.
This article tells the story of a very serious AFS outage that forced a global reboot. That outage and the root cause analysis were the genesis of the principle this article discusses.
The Case Study (late 1990s): AFS at Morgan Stanley
The environment
In the late 1990s, I was responsible for deploying and running one of the largest, if not the largest, AFS installations on the planet [1]. AFS was our global WAN-based file system and a key component of the Aurora system [1]. We pushed it beyond anything the vendor had seen attempted and were one of the first large-scale commercial enterprises to make it a core dependency [4][5]. Our business literally depended on its stability.
The environment was multi-continent and multi-data-center. Bandwidth between sites varied wildly. Some sites were connected by high-bandwidth metro fiber. Some smaller offices were on 56K lines. Earlier, we even had 9600-baud modem links. Those links matter, because in a global system the smallest pipe is often where your corner case lives.
AFS itself is organized around cells. Every client belongs to a home cell. You have global visibility into read-write data across cells and read-only data served from your local cell. We added another conceptual layer by treating multiple data centers in a city as “local” via our tooling, because AFS did not have a native way to group cells into a metropolitan cluster. That was a key architectural decision for this story.
The optimization that looked obvious
An AFS client can reach any of its replica servers over different network paths. Some might be on the same subnet, others a few hops away. We looked at that and said, “This is inefficient. Wouldn’t it be better to always go to the closest server?”
We accepted that at face value. We did not quantify whether it mattered. That was mistake number one.
We used AFS server preferences (serverprefs) to bias clients to the closest server [2]. At boot time, our script ran traceroute to each server, sorted by hop count, and set serverprefs so the Cache Manager would prefer the closest servers.
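The core logic of that boot-time script can be sketched roughly as follows. This is a reconstruction in Python rather than the original shell; the rank formula, the server names, and the helper names are illustrative assumptions, not the original code. The one real detail it encodes is that the AFS Cache Manager prefers servers with lower ranks:

```python
# Hypothetical reconstruction of the boot-time serverprefs logic.
# Rank formula and names are illustrative, not the original script.
import subprocess

def hop_count(traceroute_output: str) -> int:
    """Count hops in raw traceroute output (one numbered line per hop)."""
    return sum(
        1
        for line in traceroute_output.splitlines()
        if line.strip() and line.strip().split()[0].isdigit()
    )

def rank_servers(hops: dict[str, int], base: int = 20000) -> dict[str, int]:
    """Map each server to a serverprefs rank. The Cache Manager prefers
    servers with LOWER ranks, so fewer hops gets a lower rank."""
    return {
        srv: base + n_hops * 100
        for srv, n_hops in sorted(hops.items(), key=lambda kv: kv[1])
    }

def apply_prefs(ranks: dict[str, int]) -> None:
    """Shell out to 'fs setserverprefs -servers <host> <rank> ...'."""
    args = ["fs", "setserverprefs", "-servers"]
    for srv, rank in ranks.items():
        args += [srv, str(rank)]
    subprocess.run(args, check=True)
```

The failure described below did not live in this script's mechanics, which worked fine; it lived in the decision about which servers to feed it.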
Then we escalated the optimization. We wanted the same behavior not just for the local data center, but for all the data centers in the metropolitan network. That was mistake number two: we never measured whether it was worth doing.
The last mistake was the worst one. We did not have a clean way, at boot time, to identify which servers were “in the MAN” because we did not yet have access to our full configuration metadata. So we made a heuristic call and applied serverprefs to the file servers for all cells globally. That meant every client learned about every server in the environment.
That was the time bomb.
Why it was a time bomb
AFS clients maintain a server table. As soon as a client contacts a server, that server is added to the table. The client also probes servers for liveness [3]. The point of this is good: if one replica fails, the client already knows which replicas are up and can fail over quickly.
But the side effect is that the more servers you know about, the more background probe traffic you generate. Before our change, most clients only knew about their local cell and whatever remote servers a user or application had touched. After our change, server tables exploded in size.
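A back-of-envelope model makes the blow-up concrete. Every number here (client population, probe size, probe interval) is an illustrative assumption, not a measurement from the incident; the point is the multiplicative structure, not the constants:

```python
# Back-of-envelope model of aggregate background probe load.
# All constants are illustrative assumptions, not measured values.

def probe_bps(clients: int, servers_known: int,
              probe_bytes: int = 100, interval_s: int = 180) -> float:
    """Aggregate probe traffic in bits/sec for a population of clients,
    each probing every server it knows about once per interval."""
    return clients * servers_known * probe_bytes * 8 / interval_s

# Before the change: a client knows ~10 local-cell servers.
# After: every client knows every server globally (say 300).
before = probe_bps(clients=20_000, servers_known=10)
after = probe_bps(clients=20_000, servers_known=300)
print(f"before: {before / 1e6:.2f} Mbps, after: {after / 1e6:.2f} Mbps")
```

The load scales linearly in the size of the server table, so multiplying every client's table by 30x multiplies the global background noise by 30x, spread across every link in the enterprise.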
We created a global background noise pattern that we did not measure and did not understand.
The outage: the world brings down a small European office
Two or three weeks after the change, the world had rebooted enough times that the new serverprefs behavior was everywhere. We thought everything was fine.
Then a small European office on a 56K link went down. It was a high-profile business site, so we had a lot of attention on it. We initially assumed it was a local network issue. It was not.
Here is the moment from the leadership meeting that still makes me laugh today (I most definitely was not laughing in the meeting):
Them: Do we know why the site is down?
Us: Yes, we do.
Them: Do we know how to bring it back?
Us: Yes, we do.
Them: OK, how do we do that?
Us: We reboot the entire planet.
Them: (several moments of wide-eyed glaring)
Us: (several moments of sweaty anxiety)
Them: You can’t be serious
Us: Well, actually…
That was the moment I stopped thinking of overthinking as optional.
The network team told us the 56K link was being flooded by AFS traffic. When we asked for the source IPs, the answer was "everywhere." We did not believe it until we saw the data. It was uniform and global.
We decoded the RX protocol traffic and saw probe RPCs. We called TransArc. They were baffled that every server was talking to every other server. On a few servers we pulled kernel data and saw massive server tables. The AFS engineer asked, “Why does everything talk to everything?” That is when the chills started. “Because we set serverprefs.” That was not a smart thing to do.
At this point, we learned the exact probe behavior. My recollection is that once a server is marked down, the client probes it more frequently in order to detect recovery quickly. The effect is obvious: the moment the site's servers were marked down, probe traffic spiked by an order of magnitude. On a 56K link, that is fatal.
When the site was up, the background probe traffic was already consuming a nontrivial portion of the link (my recollection is on the order of 10-15%). With a downed server, the probe traffic surged, saturated the line, and prevented recovery. The entire site was stuck in a loop: the traffic that should have helped recovery was the thing preventing recovery.
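The arithmetic of that death spiral is worth writing down. The percentages come from the article's recollection, not saved data, and the 10x surge is the order-of-magnitude figure from above:

```python
# Rough arithmetic for the 56K link, using the recollected figures
# (12% steady-state load, ~10x surge). Not measured values.
link_bps = 56_000
background = 0.12 * link_bps   # ~10-15% of the link consumed at steady state
spike = background * 10        # order-of-magnitude surge once the site's
                               # servers are marked down and probed harder
print(f"steady: {background:.0f} bps, surge: {spike:.0f} bps, "
      f"link: {link_bps} bps, saturated: {spike > link_bps}")
```

Even at the low end of the recollected range, the surge exceeds the entire link capacity, which is exactly the loop described: the recovery traffic itself guarantees the site cannot recover.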
We asked the question that led to the most painful management conversation of my career: how do we clear the server tables?
The answer was brutal. At the time, the only reliable way was to reboot the clients. There was no easy, clean way to purge the server table. That meant the only way to bring the site back was to reboot the world, time zone by time zone, in the middle of the week.
We did it. It was not a popular week. We were not fired because we very quickly demonstrated that we understood the root cause, had a remediation plan, and could prove it would not happen again.
We killed the evil script. We kept serverprefs for local cells only, for a short time. Then we did the analysis we should have done first and realized the improvement was basically noise. We had spent effort optimizing a rounding error and paid for it with a global outage.
Analysis: Where we failed to overthink
This outage has a clean chain of decisions, and each is a failure to overthink by default:
- We accepted “closest server is better” without quantifying the payoff.
- We assumed MAN-level optimization mattered without proof.
- We applied serverprefs globally without understanding the protocol side effects.
- We never measured the before/after background traffic.
- We ignored the combination of extreme scale and low-bandwidth links that creates nonlinear failure modes.
The lesson is not “never optimize” or “never automate.” The lesson is that optimization without measurement is guesswork, and guesswork at global scale is reckless.
How You Know You Have Overthought (enough)
You do not overthink by guessing. You overthink by pushing each axis of scalability and configuration until you can prove where the risk lives.
Here is the operational version of that rule:
- Identify each scalability axis: number of clients, number of servers, bandwidth variance, failure modes, and protocol background traffic.
- Push each axis to the extreme in analysis, even if it feels absurd.
- Quantify the higher-order terms until they are clearly noise or clearly dominant.
- Stop only when you can prove the marginal effect is negligible.
Here are a few analogies for higher-order terms: effects that are negligible at one scale and dominant at another.
Air resistance is a rounding error at 25 mph and the whole bill at 75 mph. At low speed, you can ignore drag; at highway speed, drag is what you are paying for.
Checkout lines behave the same way. At 50% utilization, wait times feel fine. At 95%, the wait time explodes even if the line only gets a little busier.
Network transfers flip depending on size. For big files, bandwidth dominates. For tiny RPCs, latency dominates. The part you ignored is now the only thing that matters.
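The three analogies above can each be written as a one-line formula. These are standard textbook forms (drag power scaling, the M/M/1 queue, latency-plus-serialization transfer time); the specific speeds, sizes, and link parameters are illustrative:

```python
# Worked versions of the three higher-order-term analogies.
# Constants are illustrative; the formulas are standard textbook forms.

def drag_power_ratio(v1: float, v2: float) -> float:
    """Aerodynamic drag force scales as v**2, so the power to overcome
    it scales as v**3; the ratio between two speeds is (v2/v1)**3."""
    return (v2 / v1) ** 3

def mm1_wait(rho: float) -> float:
    """M/M/1 queue: mean wait in units of service time is rho/(1-rho).
    It explodes as utilization rho approaches 1."""
    return rho / (1 - rho)

def transfer_s(size_bytes: float, bw_bps: float, rtt_s: float) -> float:
    """One-shot transfer time: latency plus serialization delay."""
    return rtt_s + size_bytes * 8 / bw_bps

print(drag_power_ratio(25, 75))          # 27x the drag power at 3x the speed
print(mm1_wait(0.50), mm1_wait(0.95))    # ~1 vs ~19 service times of waiting
# 1 GB file vs a 100-byte RPC on a 100 Mbit/s path with 50 ms RTT:
print(transfer_s(1e9, 100e6, 0.05))      # ~80 s: bandwidth dominates
print(transfer_s(100, 100e6, 0.05))      # ~0.05 s: latency dominates
```

In each case the "small" term is invisible at one operating point and the entire cost at another, which is exactly what the probe traffic did between a metro fiber link and a 56K line.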
When we fail to overthink, we underthink. And underthinking is how you end up rebooting the planet.
References
Note: The AFS operational details cited here use OpenAFS documentation as a proxy for the older TransArc AFS 3.x behavior. The TransArc 3.3/3.4 manuals are not readily available, and for this historical piece the OpenAFS references are sufficient to explain the concepts.
1. Xev Gittler, W. Phillip Moore, and J. Rambhaskar. “Morgan Stanley’s Aurora System: Designing a Next Generation Global Production Unix Environment.” USENIX LISA ’95, September 1995. Public copy: https://www.usenix.org/legacy/publications/library/proceedings/lisa95/full_papers/gittler.pdf Ghost attachment: https://the-infrastructure-mindset.ghost.io/content/files/2026/02/lisa-1995-aurora-system.pdf
2. OpenAFS Project. “fs_setserverprefs(1): Sets the preference ranks for file servers or VL servers.” OpenAFS Reference Manual. IBM Corporation, 2000. https://docs.openafs.org/Reference/1/fs_setserverprefs.html
3. OpenAFS Project. “fs_checkservers(1): Checks which file servers are up and down.” OpenAFS Reference Manual. IBM Corporation, 2000. https://docs.openafs.org/Reference/1/fs_checkservers.html
4. W. Phillip Moore. “When Your Business Depends On It: The Evolution of a Global File System for a Global Enterprise.” AFS Best Practices Workshop, March 25, 2005, Stanford, CA. Ghost attachment: https://the-infrastructure-mindset.ghost.io/content/files/2026/02/afs-best-practices-2004-when-your-business-depends-on-it.pdf
5. W. Phillip Moore. “When Your Business Depends On It: Development and Deployment of a Global File System for a Global Enterprise.” Decorum ’97, March 6, 1997, Long Beach, CA. Ghost attachment: https://the-infrastructure-mindset.ghost.io/content/files/2026/02/decorum-1997-when-your-business-depends-on-it.pdf