A Tale of High-Stakes Debugging: The Upgrade of Hashnode's Domain Management Service

I own the Domain Management service at Hashnode. This is the story behind a crucial upgrade of that service, which includes a surprise last-minute hiccup and a race against time to debug, patch, and upgrade while sticking to the maintenance schedule. Thousands of blogs with custom domains were at stake, with zero room for risk.

https://twitter.com/Sandeepg33k/status/1606302073464594432?s=20

The journey toward the decision to upgrade

Hashnode uses Vercel for domain management. One fine day, Steven Tey from Vercel notified us about enhancements on Vercel's side and room for an upgrade to our setup that wasn't possible earlier due to some limitations. This news triggered a couple of weeks of experimentation to ensure we had everything factored in and working together before initiating the migration.

The migration affected thousands of blogs that have a custom domain mapped to Hashnode, leaving zero room for unexpected behavior or surprises. Steven Tey was a great help, answering all my queries and offering a few suggestions on the design of the upgrade plan. With all the checks ticked, we were ready for the big day.

The Big Day

It's finally here, a couple of days later than initially planned (we preferred a short delay over rushing and risking users' blogs - ensuring zero downtime for users is our top priority).

✅ The maintenance window is scheduled with a status update 24hrs prior.

✅ Migration scripts are written, reviewed, tested with multiple trial runs on sample test accounts, and ready to go.

✅ Possible error cases are identified and patched.

✅ Prepared to do a trial run on a sample of production data before actual migration.

For the production trial run, I collected the Hashnode team's blogs (~12-13) that have custom domains mapped. The script ran successfully, the domains were upgraded to our new setup, and for a moment everything looked perfect. With ~15 min to go before the maintenance window started, we were confident and chill.

Last-minute hiccup 😅

It's too good to be true for a large-scale migration to go super smoothly with no surprises, right? Yeah, I thought so too. Although I was confident about the effort that went into this, a part of me was still expecting something to come up. It never hurts to be prepared for unexpected behavior, especially with something at this scale.

And it did come up. After I shared a screenshot of the trial run results with the team, our in-house AWS expert Sandro Volpicella noticed that his blog (sandro.volpee.de) had been missed by the migration even though it was part of the run. There were just ~5 min before the maintenance window started, and changing it at the last moment might not make users happy.

One more thing: Sandro also volunteered his AWS Fundamentals Blog for the trial run. Thank you, Sandro 🙌. If you are looking to learn AWS for the real world, AWS Fundamentals is your final stop. Go grab your copy, psst... if you are a visual learner, they've got you covered with infographics too.

Race begins 🕰️

Thus began the race to debug why 1 of the 13 blogs in the trial run was missed. The initial suspicion: is it the domain's TLD (sandro.volpee.de)? But that's a valid one and a manual migration worked, so that's not it.

The script was doing what it's supposed to do, no surprises there. So, what else could cause this? We couldn't take the risk of treating this as a one-time fluke: 1/13 at the same rate means 1k/13k or 10k/130k. There was no way to dismiss this as a temporary blip, and I was dead set on figuring it out. Remember, the maintenance window had already started 10 min earlier 😅.

Time for a fresh set of eyes. I looped in Rajat Kapoor, and we jumped on a call and started thinking through every possible scenario.

Tip for newbies: Never underestimate the power of asking for help and pair-debugging sessions.

Root Cause - The Unexpected Surprise - A Stroke of Luck 🤯

The first step was to fetch Sandro's blog and verify that the domain was indeed mapped to the correct blog. It was. We discussed all possible causes, and everything checked out and behaved as expected. Okay, this was getting crazy now. Then, maybe by a stroke of luck or my overthinking or whatever it was, it hit me: does the order of properties in the MongoDB $match matter?

Is the following $match query,

{
  $match: {
    domain: {
      certificate: true,
      ready: true
    }
  }
}

the same as the one below? Notice the reversed order of properties in the domain object (this was the case for Sandro's blog).

{
  $match: {
    domain: {
      ready: true,
      certificate: true
    }
  }
}

Ideally, the order of properties shouldn't matter for an object; after all, a MongoDB document is a set of key-value pairs. We were not sure, and we could not find any mention of it in the docs. On any other day, we would have had time for trial and error to confirm this, but we were running against time. We couldn't think of any other possibility, so we decided to give it a whirl, fast (this is where pairing sessions prove fruitful). We switched the order in $match and BINGO!!!! It returned the blog this time.

To be sure, we verified the DB records again, and the order was indeed playing spoilsport. And there was no telling how many blogs had the properties in reversed order.
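To make the behavior concrete, here is a minimal sketch in plain JavaScript that imitates an order-sensitive comparison of an embedded object. It is a simplified stand-in, not MongoDB's actual implementation: the function name `embeddedEquals` and the sample records are hypothetical, and it assumes flat sub-documents with primitive values. For what it's worth, it mirrors how MongoDB is documented to treat an equality match on a whole embedded document, where the field order must match as well.

```javascript
// Simplified stand-in for an equality match on an embedded document.
// Assumption: flat sub-documents with primitive values only.
function embeddedEquals(stored, queried) {
  const storedKeys = Object.keys(stored);
  const queriedKeys = Object.keys(queried);
  if (storedKeys.length !== queriedKeys.length) return false;
  // Compare position by position, so both key order and values must agree.
  return storedKeys.every(
    (key, i) => key === queriedKeys[i] && stored[key] === queried[key]
  );
}

// Hypothetical records; only the key order of `domain` differs.
const blogA = { domain: { certificate: true, ready: true } };
const blogB = { domain: { ready: true, certificate: true } };

const query = { certificate: true, ready: true };

console.log(embeddedEquals(blogA.domain, query)); // true
console.log(embeddedEquals(blogB.domain, query)); // false
```

The second record holds exactly the same values, yet it is skipped because its keys were written in a different order, which is the situation Sandro's blog was in.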

Patching the script - the safe option

It's settled, right? Yeah, it appears so, but querying in object format still felt risky, as the blogs matched by the original query might now get missed. We decided to use dot notation (quoted field paths) instead, so that the query checks the values and does not care about the object's structure or the order of its properties.

{
  $match: {
    'domain.certificate': true,
    'domain.ready': true,
  }
}

A quick count check on the results of all three queries showed roughly 3k additional blogs in the final result for the last one, which uses the quoted field paths.
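The count check can be imitated on a handful of made-up records. The sketch below is plain JavaScript rather than a real MongoDB query: `getPath`, `matchesPaths`, and `matchesWholeObject` are hypothetical helpers, the sample blogs are invented, and the path lookup ignores MongoDB details such as array traversal. It only illustrates why a path-based match catches records that either whole-object query misses.

```javascript
// Resolve a dotted path such as "domain.ready" against a plain object.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

// Stand-in for a dot-notation match: every listed path must equal its value.
function matchesPaths(doc, conditions) {
  return Object.entries(conditions).every(
    ([path, expected]) => getPath(doc, path) === expected
  );
}

// Order-sensitive stand-in for matching the whole `domain` object.
function matchesWholeObject(doc, expectedDomain) {
  return JSON.stringify(doc.domain) === JSON.stringify(expectedDomain);
}

// Hypothetical sample records.
const blogs = [
  { id: 1, domain: { certificate: true, ready: true } },
  { id: 2, domain: { ready: true, certificate: true } },
  { id: 3, domain: { certificate: true, ready: true, host: "blog.example.com" } },
  { id: 4, domain: { certificate: false, ready: true } },
  { id: 5 },
];

const certificateFirst = blogs.filter((b) =>
  matchesWholeObject(b, { certificate: true, ready: true })
).length;
const readyFirst = blogs.filter((b) =>
  matchesWholeObject(b, { ready: true, certificate: true })
).length;
const byPath = blogs.filter((b) =>
  matchesPaths(b, { "domain.certificate": true, "domain.ready": true })
).length;

console.log({ certificateFirst, readyFirst, byPath });
// { certificateFirst: 1, readyFirst: 1, byPath: 3 }
```

Note that the path-based match also picks up record 3, whose `domain` object carries an extra field, so it can return more than the two object-style queries combined.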

Had we not figured it out, it would have directly affected ~3k users, and that's the last thing we want. I admit a bit of luck struck us with the thought about ordering. But we'll take it.

The important lesson while debugging: even if something feels too obvious, do not dismiss it out of hand; put it into action to confirm your theory. Sometimes, the solution is staring you right in the face.

The story doesn't end here

The migration script started, but an hour into it my power went out, and I lost the live logs I had been monitoring; eventually, the remote instance shut down too. Thanks to the status and progress logs we had added at Rajat's suggestion, we knew where to restart. Fortunately, the script and APIs are idempotent, so adding a domain again results in a silent error. After all, an error saying "already done" is a blessing too.

I restarted the script and continued to monitor it (we were also writing the successful and failed domains to separate CSV files for a situation exactly like this). The lesson is, no matter how confident we are, it's always wise to be prepared for the worst, so that you can focus on the solution rather than hitting panic mode.
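The restart story can be sketched roughly as follows. This is not the actual migration script: `createFakeProvider`, `migrate`, the `ALREADY_EXISTS` error code, and the domain names are all hypothetical, and the provider call is synchronous here only to keep the example short (a real API call would be asynchronous). It shows the three safety nets mentioned above: progress logs that tell you where to resume, an "already exists" error treated as success, and separate CSV-style records for successes and failures.

```javascript
// Hypothetical stand-in for the provider API: re-adding a known domain throws.
function createFakeProvider() {
  const registered = new Set();
  return {
    addDomain(name) {
      if (registered.has(name)) {
        const error = new Error("domain already exists");
        error.code = "ALREADY_EXISTS"; // assumed error code, not a real API value
        throw error;
      }
      if (name.startsWith("broken.")) throw new Error("invalid configuration");
      registered.add(name);
    },
  };
}

// Migrate each domain, collecting CSV-style rows for successes and failures.
function migrate(domains, provider, startIndex = 0) {
  const succeeded = ["domain,status"];
  const failed = ["domain,reason"];
  for (let i = startIndex; i < domains.length; i++) {
    const name = domains[i];
    try {
      provider.addDomain(name);
      succeeded.push(`${name},migrated`);
    } catch (error) {
      if (error.code === "ALREADY_EXISTS") {
        // Idempotent case: an earlier run already handled this domain.
        succeeded.push(`${name},already-migrated`);
      } else {
        failed.push(`${name},${error.message}`);
      }
    }
    // Progress log: the last index printed tells us where to restart.
    console.log(`processed ${i + 1}/${domains.length}: ${name}`);
  }
  return { succeeded, failed };
}

const provider = createFakeProvider();
const domains = [
  "a.example.com",
  "b.example.com",
  "broken.example.com",
  "c.example.com",
];

// The first run is cut short after two domains (simulating the outage).
migrate(domains.slice(0, 2), provider);

// Restart one step early on purpose: re-adding b.example.com is harmless.
const rerun = migrate(domains, provider, 1);
console.log(rerun.succeeded.join("\n"));
console.log(rerun.failed.join("\n"));
```

Restarting slightly before the last logged position is a deliberate choice in this sketch: because a repeated add is harmless, overlapping a little is cheaper than risking a skipped domain.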

We did it 🎉 🎊

Finally, the script finished and the migration was completed a little beyond the maintenance window, but that wouldn't affect any new users, so nothing to worry about there.

After a week of closely monitoring the custom domain management service and receiving 0 support tickets, Ayodele Samuel Adebayo and I are finally at peace, free from manually fixing things.

This migration also meant that we sunset 2 of our background services that were responsible for domain management, bringing us closer to our serverless architecture. Also, the wait time for a domain to be ready went from minutes/hours to ~10 sec if everything is configured correctly on the DNS side.

Huge thanks to Sandeep for trusting me with this crucial piece of our infrastructure 🙌

Key Takeaways

Here is a summary of the important lessons I learned/re-learned from this upgrade:

Never underestimate the power of asking for help and pair-debugging sessions. Sometimes all we need is a fresh set of eyes.
Even if something feels too obvious, do not dismiss it out of hand; put it into action to confirm the theory. Sometimes, the solution is staring you right in the face.
No matter how skillful and confident we are, it's always wise to be prepared for the worst, so that you can focus on the solution rather than hitting panic mode when it matters the most.

Thank you for reading, hope you find the lessons helpful and the story to be fun 😉. Until next time, Ciao 👋

Note: This article is NOT eligible for the #DebuggingFeb Hashnode Writeathon as I am part of the Hashnode team.

Sai Krishna Prasad Kandula's Blog