Explain Like I'm Five: What is IaC?


More often than not, technical terms sound cryptic and unfriendly. So let’s have a nice #ExplainLikeImFive session with Thinh from our SRE team to disperse the mist surrounding IaC, one of the hottest trends in the field.
 

Hi Thinh! It’s been a while since we last talked to SRE. What have you been up to nowadays?
 

Our transformation to Infrastructure as Code (IaC) has been going on for a while. IaC is one of the hottest trends in our field now, and for good reasons too.
 

Let's back up a bit. Before Infrastructure as Code, tell us first, what is infrastructure?
 

I’m oversimplifying things here, but let’s say that infrastructure consists of the servers and routing networks. Even though it’s not as flashy as other aspects such as marketing, if it’s not working properly, its impact on our products is immediate and significant.

Simply put, the SRE team's job is to work out what capacity we need, set things up, and make sure that our infrastructure is working normally.
 

So once they’re all set up, you can just leave everything the way it is?
 

Unfortunately that’s not the case. The SRE team actually monitors our infrastructure almost constantly because changes happen every day. 

  • They can come from inside, such as a change in the source code.
  • They can also come from outside, such as fluctuating traffic or a cyberattack.
  • Moreover, a consumer product like BAEMIN is always running some kind of promotional campaign, and each campaign has different system requirements.

Each change affects the system, so we must deal with every one of them properly.

 

How do you configure the infrastructure?
 

For the physical hardware, we used either on-premise servers in Vietnam or virtual machines, and both had to be configured manually. No amount of care can make the infrastructure foolproof against human error, so you can imagine the kind of headache we used to have back in the day.

 

With the rise of cloud technology, we have the option of cloud infrastructure that is billed according to the computing resources we use. We don’t need to set up the servers from the ground up like before. We can opt for a preset or a template. Not everything is automated though: a bit of manual configuration is still necessary, and the APIs are mostly for individual or team use, but it’s still a big step up from the days of physical hardware.

 

And now, we are in the age of Infrastructure as Code, or IaC.

Now I think we’re ready for the big gun. What is IaC?
 

IaC means our computing system is not put together by hand, piece by piece, but is described in a machine-readable file. An automation tool then reads that file and builds the system according to our specifications. In our case, this tool is Terraform by HashiCorp.

Simply put, the Terraform workflow consists of the following steps:

  1. The SRE team defines the infrastructure in configuration files. The files are written in HashiCorp Configuration Language (HCL), which is the basis of Terraform's configuration language. If learning a new language seems troublesome, worry not: you can consider using the Cloud Development Kit for Terraform (CDKTF). CDKTF supports popular programming languages such as TypeScript, Python, Java, C#, and Go.
  2. Terraform creates an execution plan (Terraform Plan) describing what it will create or update, based on the existing infrastructure and our configuration.
  3. Upon approval, Terraform provisions our infrastructure following the plan.
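
The three steps above can be sketched as a toy model in Python. This is only an illustration of the define, plan, and apply idea, not Terraform's real algorithm or API; the function names `make_plan` and `apply_plan` and the sample resources are hypothetical.

```python
# Toy model of a define -> plan -> apply cycle.
# Not Terraform's real algorithm or API; every name here is hypothetical.

def make_plan(desired, current):
    """Compare the desired configuration with what exists and list the actions needed."""
    plan = []
    for name, spec in sorted(desired.items()):
        if name not in current:
            plan.append(("create", name, spec))
        elif current[name] != spec:
            plan.append(("update", name, spec))
    for name in sorted(current):
        if name not in desired:
            plan.append(("delete", name, None))
    return plan


def apply_plan(plan, current, approved):
    """Carry out the plan only after approval and return the resulting infrastructure."""
    if not approved:
        return dict(current)  # nothing changes without approval
    result = dict(current)
    for action, name, spec in plan:
        if action == "delete":
            del result[name]
        else:
            result[name] = spec
    return result


# Step 1: the configuration describes what we want.
desired = {"web": {"size": "large"}, "db": {"size": "medium"}}
# What already exists (assumed sample data).
current = {"web": {"size": "small"}, "cache": {"size": "small"}}

# Step 2: the plan previews the changes without making them.
plan = make_plan(desired, current)
for action, name, _ in plan:
    print(action, name)

# Step 3: after approval, the plan is applied.
new_infrastructure = apply_plan(plan, current, approved=True)
```

In this sketch the plan contains one create, one update, and one delete, and applying it makes the infrastructure match the configuration exactly.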

 

Can you provide an example of how we apply Terraform? 
 

Sure thing. Here's a chart illustrating the workflow when the SRE team introduces a new module and provisions a certain infrastructure.

 

 

The steps above outline the process of releasing and versioning a Terraform module. When we manage infrastructure for various services, some of them share similar infrastructure components. These shared components are packaged into a module and versioned for efficient tracking and reuse across different services. The process follows the standards set by our Security and SRE teams.
 

The steps below describe the provisioning process for an infrastructure. When provisioning an infrastructure, the latest version of the Terraform module is used. The code files describing the infrastructure are stored in Git. Upon a Merge event into the Master branch on Git, the CI/CD process is triggered. First, Terraform Plan is executed as a dry run, previewing which components will be changed or created. After thorough review and confirmation, we hit that Approve button. Then, Terraform Apply provisions new infrastructure components or modifies the current ones. The final step is to merge into the Master branch on Git to update the infrastructure code.
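
The provisioning flow described here (pick the latest module version, dry run, approval, apply) can be sketched in Python as follows. This is a minimal sketch under assumptions: the `vMAJOR.MINOR.PATCH` tag format, the function names, and the stand-in plan and apply steps are all hypothetical and do not call Terraform or CircleCI.

```python
# Sketch of an approval-gated provisioning pipeline.
# The tag format and all function names are assumptions for illustration.

def latest_version(tags):
    """Pick the highest 'vMAJOR.MINOR.PATCH' tag, comparing the numbers, not the text."""
    def key(tag):
        return tuple(int(part) for part in tag.lstrip("v").split("."))
    return max(tags, key=key)


def run_pipeline(module_tags, plan_fn, apply_fn, approve_fn):
    """Plan with the latest module version, then apply only if reviewers approve."""
    version = latest_version(module_tags)
    plan = plan_fn(version)        # dry run: nothing is changed yet
    if not approve_fn(plan):       # human review gate
        return ("rejected", version, plan)
    apply_fn(plan)                 # the only step that changes anything
    return ("applied", version, plan)


applied = []  # records what the stand-in apply step was asked to do

outcome = run_pipeline(
    module_tags=["v1.2.0", "v1.10.0", "v1.9.3"],
    plan_fn=lambda version: [f"create server using module {version}"],
    apply_fn=applied.extend,
    approve_fn=lambda plan: len(plan) > 0,  # stand-in for the Approve button
)
print(outcome[0], outcome[1])  # applied v1.10.0
```

Comparing the version parts as numbers matters: a plain text comparison would rank `v1.9.3` above `v1.10.0`.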
 

Oh, and usually, the review process involves many SRE team members. Terraform makes this convenient: everyone stays informed and can easily track changes, which fosters effective teamwork.

 

Besides effective teamwork, how is it better than the traditional method?
 

Compared to the traditional method of manually configuring physical hardware, IaC is the superior solution, if I do say so myself. Let’s look at cost, deployment speed, and risk mitigation.

  • Regarding cost, it’s practically free of charge because we’re using the open-source version of Terraform.
  • Regarding time and effort, before IaC it took us 4 to 8 hours to provision a service. Now it takes only half an hour because the code is largely reusable.
  • Deployment always carries risk, and here too IaC has the advantage. Traditional manual provisioning is prone to human error, whereas code is testable and versioned. In other words, we can test it in advance and trace every change when necessary. Risks are therefore greatly reduced, and the module can be standardized by the cybersecurity team and reused consistently across our platform.
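
To make "code is testable" concrete, here is a small Python sketch of checking infrastructure definitions before anything is provisioned. The two rules (encrypted disks and an owner tag) and all names are hypothetical examples, not the team's actual security standards.

```python
# Sketch: infrastructure described as data can be checked automatically.
# The rules below are hypothetical examples, not real security standards.

def check_server(name, spec):
    """Return a list of rule violations for one server definition."""
    problems = []
    if not spec.get("encrypted_disk", False):
        problems.append(f"{name}: disk must be encrypted")
    if "owner" not in spec.get("tags", {}):
        problems.append(f"{name}: missing 'owner' tag")
    return problems


def check_all(servers):
    """Check every server definition and collect all violations."""
    problems = []
    for name, spec in sorted(servers.items()):
        problems.extend(check_server(name, spec))
    return problems


servers = {
    "api": {"encrypted_disk": True, "tags": {"owner": "sre"}},
    "batch": {"encrypted_disk": False, "tags": {}},
}

for problem in check_all(servers):
    print(problem)
```

A check like this can run in the CI/CD pipeline on every change, so a mistake is caught during review instead of in production.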
     

I’d say that IaC is a pretty well-rounded solution, no matter which way you look at it.

 

How many methods are in use?

 

There are two approaches: declarative and imperative. Both terms describe how we tell the tool to reach the desired specifications.

 

So, what's the difference between these two methods? Let's explore with an example of creating two servers.

  • With the Declarative approach, you simply define the expected outcome without delving into the detailed step-by-step execution. For instance, when creating two servers, you only need to declare "create 2 servers" and you're done.
  • On the other hand, with the Imperative approach, you list out the commands for each specific step. To create two servers, you need to specify individual steps like "create 1 server" -> "add 1 server" -> “confirm execution.”
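
The two-server example can be sketched in Python. `FakeCloud` is a hypothetical in-memory stand-in for a cloud provider, used only to contrast the two styles; it is not a real SDK.

```python
# Sketch contrasting imperative and declarative styles.
# FakeCloud is a hypothetical in-memory stand-in, not a real cloud SDK.

class FakeCloud:
    def __init__(self):
        self.servers = []

    def create_server(self):
        self.servers.append(f"server-{len(self.servers) + 1}")

    def delete_server(self):
        self.servers.pop()


def imperative_two_servers(cloud):
    # Imperative: spell out each step. Running it again adds two more servers.
    cloud.create_server()
    cloud.create_server()


def ensure_server_count(cloud, count):
    # Declarative: state the outcome; the tool works out the steps needed.
    while len(cloud.servers) < count:
        cloud.create_server()
    while len(cloud.servers) > count:
        cloud.delete_server()


imperative_cloud = FakeCloud()
imperative_two_servers(imperative_cloud)
imperative_two_servers(imperative_cloud)
print(len(imperative_cloud.servers))  # 4

declarative_cloud = FakeCloud()
ensure_server_count(declarative_cloud, 2)
ensure_server_count(declarative_cloud, 2)
print(len(declarative_cloud.servers))  # 2
```

Running the declarative version twice still leaves exactly two servers, which is one reason this style is easier to reason about at scale.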
     

Declarative tools are often more popular because you focus solely on the end result without getting bogged down by extensive operations. This is especially useful when building large-scale infrastructures. At BAEMIN, we’re using Terraform, a declarative tool.

 

Why Terraform?
 

From a technical standpoint, the team looked at how well the tools in our current process integrate with Terraform. In addition to Git, we use Terragrunt as a wrapper to keep the Terraform code streamlined, and CircleCI for our CI/CD needs. Terraform's management of the infrastructure state is another advantage.
 

Let me explain the concept of state a bit. Imagine you're building a house from a blueprint. In Terraform, the configuration files are the blueprint, and the state is the record of what has actually been built so far. Just as a builder needs to know what is already standing before making changes, Terraform needs the state to manage the resources it controls. The state file maps our configuration to the resources currently deployed, so Terraform can compare the two and accurately determine the changes needed to reach the desired infrastructure. For example, if you remove something from the configuration, Terraform still ensures the correct outcome because it tracks resource dependencies within the state and automatically makes the necessary changes.

These state files are stored remotely in the cloud (referred to as remote state) for convenient version management, enhanced security, easy sharing among team members, and effective collaboration. Currently, we store them in an AWS S3 bucket.
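
The role of the state when something is removed from the configuration can be sketched in Python. This is a simplified toy: the JSON layout, the `depends_on` field, and the `plan_destroy` function are assumptions for illustration and do not match Terraform's real state file format.

```python
# Toy state file: which resources exist and what each one depends on.
# The layout is a simplification, not Terraform's real state format.
import json
from graphlib import TopologicalSorter

state_json = json.dumps({
    "resources": {
        "network": {"depends_on": []},
        "server": {"depends_on": ["network"]},
        "dns_record": {"depends_on": ["server"]},
    }
})


def plan_destroy(state, config_names):
    """List resources present in the state but missing from the configuration,
    ordered so that dependents are removed before the things they depend on."""
    resources = state["resources"]
    removed = {name for name in resources if name not in config_names}
    # Keep only dependencies between resources that are being removed.
    graph = {
        name: [dep for dep in resources[name]["depends_on"] if dep in removed]
        for name in removed
    }
    creation_order = list(TopologicalSorter(graph).static_order())
    return list(reversed(creation_order))  # destroy in reverse creation order


state = json.loads(state_json)

# The configuration now only mentions the network.
print(plan_destroy(state, config_names={"network"}))  # ['dns_record', 'server']
```

Without the recorded state, the tool would have no way to know that the server and DNS record ever existed, let alone which one has to be removed first.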
 

Besides that, I believe community is also a big factor that teams often consider when evaluating a new tool. Terraform has the advantage of being increasingly popular, so there are naturally many free resources we can leverage. Not to mention the big user community that we can rely on in case we bump into any odd issue.
 

In summary, it’s pretty accessible and customizable, and it has a large ecosystem, which is very convenient.

 

So how have we benefited from IaC?
 

I’ve mentioned some of IaC’s benefits before, but to make them concrete let’s boil them down to hard numbers. At the cost of $0, we cut the deployment time from 3 days down to 1 day per service.

When you take into consideration the increased go-to-market speed of our products and the risks from human error that IaC helps us to prevent, its benefits are manifold.
 

That indeed sounds magnificent. Is the project completed?
 

The transition to IaC has been under way for a while now, but we aren’t there yet. In the near future, SRE will complete the CI/CD pipelines and IaC for all of our services.

 

So, in case our audience wants to implement IaC, is there anything to be mindful of?
 

A small tip from our team is to modularize infrastructure components whenever possible. Modularization helps to minimize possible conflicts when we make changes or add new components to the infrastructure in the future.
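
As a rough Python sketch of what modularizing means here: a reusable "module" is a parameterized definition that each service instantiates with its own values. The function `service_module`, its parameters, and the service names are hypothetical, not our actual modules.

```python
# Sketch of a reusable infrastructure "module" as a parameterized function.
# All names and parameters are hypothetical.

def service_module(service, instance_count=2, size="small"):
    """Return the resources one service needs, namespaced by the service name
    so that two services can never collide on a resource name."""
    resources = {f"{service}-lb": {"type": "load_balancer"}}
    for i in range(1, instance_count + 1):
        resources[f"{service}-server-{i}"] = {"type": "server", "size": size}
    return resources


infrastructure = {}
infrastructure.update(service_module("orders"))
infrastructure.update(service_module("payments", instance_count=3, size="large"))

print(len(infrastructure))  # 7
```

Because each service only touches its own namespaced resources, adding or changing one service does not conflict with the others.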
 

It’s always a pleasure talking to SRE. Thanks for the explanations, and we’re looking forward to more of your great work!
