A straightforward dialogue with Tung Phan (Site Reliability Engineering Manager) and Dong Phan (Head of Engineering), where we talk about money and liken system monitoring to stock trading.
Let’s skip the niceties and address the biggest elephant in the room: Money is our major concern.
Dong: That’s true. At the moment, return-on-investment is the key concern for most tech companies globally, and certainly in Vietnam we’re also feeling intense pressure. That’s why cost optimization is our current focus.
Tung: Actually, cost optimization isn’t a new directive, at least from the Site Reliability Engineering (SRE) point of view. It’s something that we’re constantly mindful of. For example, if two systems perform similarly but one costs twice as much to run, you would say that the costlier one is not efficient.
Just as a symptom signals an illness, cost efficiency alerts us to the health of our system. It’s also a common requirement to build and run a product at the lowest cost possible.
Glad that we see eye to eye right off the bat. So how are we doing on that front?
Dong: We’ve been optimizing costs gradually, in phases. After several months of experiments and fine-tuning, we’re happy to say that there have been visible improvements in Phase 1, and there will be even more in the future.
Could you elaborate?
Dong: Phase 1 is mostly about computing resources. We use Google Cloud Platform (GCP) for data storage and analysis, Amazon Web Services (AWS) for infrastructure, and some other services. We succeeded in slashing both GCP and AWS costs with 3 key initiatives:
- Automatic scaling of resources to match the incoming traffic;
- Auditing and removing obsolete resources;
- Improving the evaluation and estimation process of capacity planning.
Apart from optimizing cloud computing budgets, there are other ways to bring down our spending that we’re tackling in Phase 2. They are all quite big projects and will yield results later this year.
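As a rough illustration of the first initiative, scaling resources to match incoming traffic, the sketch below sizes a fleet so that average utilization stays near a target. It is a minimal Python sketch, not the team's actual configuration: the function name, the per-instance capacity, the 60% target, and the instance limits are all assumptions made for the example.

```python
import math


def desired_instances(requests_per_second: float,
                      capacity_per_instance: float,
                      target_utilization_pct: int = 60,
                      min_instances: int = 2,
                      max_instances: int = 50) -> int:
    """Return how many instances keep average utilization near the target.

    All parameters are hypothetical. capacity_per_instance is the request
    rate a single instance can serve when it is 100% utilized.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if not 0 < target_utilization_pct <= 100:
        raise ValueError("target_utilization_pct must be in (0, 100]")

    # Instances needed so that each one runs at roughly the target utilization.
    needed = math.ceil(
        requests_per_second * 100 / (capacity_per_instance * target_utilization_pct)
    )
    # Never drop below the safety floor or exceed the budget ceiling.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    for rps in (30, 1200, 100_000):
        print(rps, "req/s ->", desired_instances(rps, capacity_per_instance=100))
```

The floor keeps a safety margin during quiet hours, while the ceiling caps spending during an unexpected spike.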
Alright, so let’s talk first about what we already achieved in Phase 1. How did we cut costs without hurting the performance of our products and services?
Tung: Ah, let me explain. Cloud computing spending can be classified into fixed and variable costs. Fixed costs, such as databases, cannot be scaled quickly when the traffic fluctuates. Variable costs, on the other hand, can.
For fixed costs, we review and analyze the traffic trends every 2 months, then adjust the database management accordingly. That’s why you’ve seen that the database cost has steadily declined. We do it every 2 months because that is long enough for a meaningful analysis. A shorter window would capture fewer of the fluctuations in traffic, which makes the evaluation riskier.
What about variable costs?
Tung: Variable costs largely depend on the day-to-day traffic, and because the traffic fluctuates by a great margin, the SRE team has to monitor the system closely. It involves looking at a lot of charts, anticipating how traffic will shift and by how much, and then adjusting various resources to accommodate it.
So to summarize, our goal is to minimize the gap between our expectation and the actual traffic. Funnily enough, that is strangely similar to trading on the stock market: you also look at huge charts, predict how a stock will react to a socioeconomic event, and then buy or sell your assets accordingly.
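One simple way to put a number on "the gap between our expectation and the actual traffic" is the mean absolute percentage error between forecast and observed traffic. The Python sketch below is only an illustration under that assumption; the interview does not say which metric the team uses, and the function name and sample figures are hypothetical.

```python
from typing import Sequence


def forecast_gap(forecast: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute percentage error between forecast and actual traffic.

    Returns a fraction (0.1 means the forecast was off by 10% on average).
    """
    if len(forecast) != len(actual) or not actual:
        raise ValueError("forecast and actual must be non-empty and equal in length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual traffic values must be positive")

    errors = [abs(f - a) / a for f, a in zip(forecast, actual)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical requests per minute for three time slots.
    expected = [1000, 2000, 1500]
    observed = [1000, 2500, 1500]
    print(f"average gap: {forecast_gap(expected, observed):.1%}")
```

A smaller gap means less capacity paid for but unused, and less risk of being caught short.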
Oh, I’ve seen some so-called recipes to predict stock market trends circulated on the internet. Do you by any chance follow some best practices like that?
Tung: That’s another funny similarity between system monitoring and stock trading. We have our own “rules of thumb”, but they’re technical of course. These best practices are collectively developed from the way our infrastructure is engineered, combined with the actual experiences of the team in charge of system monitoring.
For example, for a task with a boot time of 5 minutes: at what percentage should auto scaling kick in, how many seconds should the health check take, and so on.
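To show why boot time matters for such a rule of thumb, here is a minimal Python sketch that derives a scale-out threshold from it. It assumes, purely for illustration, that traffic grows linearly during the boot window; the function name and the 10%-per-minute growth rate are hypothetical and are not the team's actual figures.

```python
def scale_out_threshold(boot_minutes: float, growth_per_minute: float) -> float:
    """Highest utilization (0..1) at which scale-out can still finish in time.

    Assumption: load grows linearly by `growth_per_minute` (a fraction of the
    current load) for each minute a new task needs to boot. Existing capacity
    must absorb that growth until the new task is ready.
    """
    if boot_minutes < 0 or growth_per_minute < 0:
        raise ValueError("boot_minutes and growth_per_minute must be non-negative")
    # Load at the end of the boot window, relative to load when scaling starts.
    load_multiplier = 1.0 + growth_per_minute * boot_minutes
    return 1.0 / load_multiplier


if __name__ == "__main__":
    # A 5-minute boot time with traffic assumed to grow 10% per minute.
    threshold = scale_out_threshold(boot_minutes=5, growth_per_minute=0.10)
    print(f"start scaling out at about {threshold:.0%} utilization")
```

The longer a task takes to boot, or the faster traffic can climb, the earlier scaling has to start, which is why these values have to be tuned per workload.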
Wow, even when equipped with best practices, system monitoring does sound like hard and delicate work!
Tung: The monitoring and tuning of our system is truly an art. Not only does it depend on the team’s hands-on experience, but it is also strongly affected by the actual user traffic - something that we can only anticipate to some extent and with a safety margin, because there are simply too many factors affecting a campaign’s traffic that nobody has full control over.
I’m assuming that each team anticipates a campaign differently.
Tung: That’s correct. For example, the Sales team may focus on the number of orders. From the Tech perspective, however, traffic volume is our topmost concern because we have to get the infrastructure ready to handle that traffic.
There was once a campaign where the traffic increased by 1,800% in one minute. It was a field trip for our team.
Let’s cut to the chase: How much did we save exactly?
Dong: As usual, we can’t disclose the exact figures, but I can say that for GCP, as of last December we were spending only one third of what we spent in May 2022.

Tung: For AWS, we were able to bring the cost down quite significantly, to 70% of what it was last June.

That’s amazing. Did the effort to cut costs hurt our capacity for product development?
Dong: As Tung mentioned in the beginning, cost efficiency is actually tied to SRE’s SLAs, so the bulk of these tasks falls within their scope. As a result, our capacity for product development hasn’t been affected. The Tech Division has kept up the release schedule to meet BAEMIN’s business requirements.
How about the product's performance?
Tung: The performance of our products and services has been stable despite the decreasing cost. Performance is tracked by many metrics, so let me single one out as an example.
The following chart shows the 90th and 99th percentiles, or P90 and P99 respectively, of Elastic Load Balancing’s response time over the last 6 months.

Simply put, P90 and P99 are the response times within which 90% and 99% of requests are served. As you can see, both P90 and P99 have stayed pretty much in the same range throughout the last 6 months. So we can safely say that for 99% of requests, users don’t experience any change in app performance, in spite of all the cost optimization we’ve pulled off behind the scenes.
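For readers who want to see how such percentiles are computed, the following is a minimal Python sketch using the nearest-rank method. It is an illustration only: the load balancer computes its own statistics, and the sample response times below are made up.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")

    ordered = sorted(samples)
    # 1-based rank of the sample that covers p percent of the data.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    response_times_ms = [42, 45, 47, 50, 51, 53, 58, 60, 75, 310]
    print("P90:", percentile(response_times_ms, 90), "ms")
    print("P99:", percentile(response_times_ms, 99), "ms")
```

Because a single slow outlier barely moves P90 but dominates P99, tracking both gives a view of the typical experience as well as the worst-case tail.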
Dong: I’d like to chime in that many components collectively contribute to the app performance. There’s our codebase of course, but there are also multiple third-party services and the client device itself to factor in. And with Phase 2 this year, I believe that our products and services will be able to run more efficiently and at even lower costs.
Can you tease a little about what we’ll do in the next Phase?
Dong: There are actually several initiatives, such as database refinement, third-party service usage, and codebase optimization. Personally, I’m very enthusiastic about transitioning our architecture to a microservice design. Although it’s a formidable project, it will let us scale the service in a more flexible and economical manner.
Tung: From the SRE side, we’ve recently seen a new trend of applying AI. Right now we rely on human experience, but that experience, together with historical Marketing data, could be collected and learned by AI models. With enough data, a model could detect patterns and suggest the best course of action to handle a similar situation. This could be a complementary solution for efficient monitoring - but we still have a long way to go.
This has been an enlightening talk. Thanks for sharing, and we look forward to the next exciting updates from you!