Engineering

From Java Microservices to Lambda functions – a journey

You may be one of many organisations (or an engineer in one) that operates Java microservices in the cloud and would like to move towards a serverless architecture, but can’t justify the steep migration path (e.g. decomposing your services into functions, rewriting them in a more suitable language, etc.) from those microservices to the likes of AWS Lambda.

But fear not! Because with the help of spring-cloud-function you can repurpose your existing microservices into serverless functions in a gradual and controlled manner, with minimal effort or interruption of service.

In this article I’ll explain how you can achieve this utilising the Strangler Fig Pattern to quickly prove out the approach and see if it fits your needs. I’ve used AWS CDK, ECS, ALB and Lambda to demonstrate how you can move traffic from a Java microservice to multiple Lambda functions.

I’ve built a sample codebase and accompanying CDK code. I’ve used git branches to show how you go about transitioning over to Lambda, which I’ll talk through in this post:

https://github.com/foyst/spring-petclinic-rest-serverless

(The above repo is freely available for people to base prototypes on, to quickly experiment with this approach in their organisations)

It’s based upon the Spring Petclinic REST Application example. I wanted to use something that was well known and understood, and representative of a real application scenario that demonstrates the potential of spring-cloud-function.

Note that the API and documentation for spring-cloud-function has changed over time, and I personally found it difficult to understand what the recommended approach is. So I hope this article also captures some of those pain points and provides others with a guide on how to implement it.

Along the journey I battled against the above and other gotchas that can distract you and take considerable time to overcome. So as not to derail the main narrative I’ve moved these to the end of this post, but I list them here for completeness and signpost to them at the points where they cropped up in my explorations.

Setting up AWS CDK

If you’re following along with the GitHub repo above, make sure you’re on the master branch for the first part of this blog.

I’ll not digress into how to set up CDK as there are plenty of resources out there. These are the ones which I found particularly useful:

https://cdkworkshop.com/15-prerequisites/500-toolkit.html

https://cdkworkshop.com/20-typescript/20-create-project/100-cdk-init.html

https://docs.aws.amazon.com/cdk/api/latest/docs/aws-construct-library.html

https://gitter.im/awslabs/aws-cdk

Using the above references, I created my CDK solution with cdk init sample-app --language typescript as a starting point.

I’ve put together a simple AWS architecture using CDK to demonstrate how to do this. I’ve kept to the sensible defaults CDK prescribes; its default configuration creates a VPC with 2 public and 2 private subnets. This diagram shows what’s deployed by my GitHub repo:

I have an RDS MySQL instance, which is used by both the ECS and Lambda applications. ECS is running the Petclinic Java microservice and Lambda is running its serverless counterparts. An Application Load Balancer is used to balance requests between the Fargate container and Lambda functions.

I used Secrets Manager to handle the generation of the RDS password, as this also allows you to pass the secret securely through to the ECS Container. For info on how to set up Secrets Manager secrets with RDS in CDK so you can do credentials rotation, I used https://dev.to/michaelfecher/i-tell-you-a-secret-provide-database-credentials-to-an-ecs-fargate-task-in-aws-cdk-5f4.

Initially I tried to deploy a VPC without NAT gateways, to keep resources isolated and unnecessary costs down. But this is where I encountered my first Gotcha, due to changes in the way Fargate networking works as of version 1.4.0.

Stage One – All requests to Java Container

So in the first instance, all of the requests are routed by default to the Petclinic Fargate Service:

Later on, I’ll use weighted target groups and path-based routing to demonstrate how you can use the Strangler Fig Pattern to gradually migrate from microservices to Lambda functions in a controlled fashion.

To deploy the initial infrastructure with the RDS and ECS Services running, cd into the cdk folder and run the following command:

cdk deploy --require-approval=never --all

This will take some time (~30 mins), mainly due to the RDS Instance spinning up. Go put the kettle on…

Transforming a Spring Boot Microservice to Serverless

Now for the meaty part. In my GitHub repo you can switch over to the 1-spring-cloud-function branch which includes additional CDK and Java config for writing Spring Cloud Functions.

There are a number of good articles out there that demonstrate how to create serverless functions using Java and Spring, such as Baeldung’s https://www.baeldung.com/spring-cloud-function. Where this blog hopefully differs is by showing you a worked example of how to decompose an existing Java microservice written in Spring Boot into Lambda functions.

Importantly, make sure you use the latest versions and documentation – a lot has changed, and there are so many search results pointing to outdated articles and docs that it can be confusing to work out which are current. The latest version of spring-cloud-function at the time of writing is 3.2.0-M1, and the documentation for it can be found here:

https://docs.spring.io/spring-cloud-function/docs/3.2.0-M1/reference/html/spring-cloud-function.html#_introduction

This too: https://cloud.spring.io/spring-cloud-function/reference/html/

And example functions can be found in https://github.com/spring-cloud/spring-cloud-function/tree/main/spring-cloud-function-samples

So, can you do dual-track development of the existing application alongside splitting out into Lambda functions? Yes, by pushing all application logic out of the REST and Lambda classes (delegating to a Service or similar if you don’t already have one) and having a separate Maven profile for lambda development. By comparing the master and 1-spring-cloud-function branches you can see the additional changes made to the pom.xml, which include this new “lambda” profile:
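
A hedged sketch of what that profile looks like (not the exact pom.xml from the repo – the dependency list is illustrative, and the plugin configuration it carries is shown later in this post):

    <profile>
        <id>lambda</id>
        <dependencies>
            <!-- AWS adapter for spring-cloud-function, only pulled in for Lambda builds -->
            <dependency>
                <groupId>org.springframework.cloud</groupId>
                <artifactId>spring-cloud-function-adapter-aws</artifactId>
            </dependency>
            <!-- AWS Lambda event types (APIGatewayProxyRequestEvent etc.) -->
            <dependency>
                <groupId>com.amazonaws</groupId>
                <artifactId>aws-lambda-java-events</artifactId>
            </dependency>
        </dependencies>
        <build>
            <plugins>
                <!-- spring-boot-thin-layout and maven-shade-plugin configuration, shown further down -->
            </plugins>
        </build>
    </profile>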

The Maven lambda profile is aimed at developing the lambdas. It has the spring-cloud-function specific dependencies and plugins connected to it, which ensures none of those bleed into the existing core Java application. When you do development and want to build the original microservice jar, you can use existing Maven build commands as before. Whenever you want to build the Lambda jar just add the lambda profile, e.g. ./mvnw package -P lambda

For this example I’ve created a couple of functions, to demonstrate both how to define multiple functions within one jar, and how to isolate them in separate AWS Lambda functions. I’ve called them getAllOwners and getOwnerById which can be found in src/main/java/org/springframework/samples/petclinic/lambda/LambdaConfig.java:

    @Bean
    public Supplier<Collection<Owner>> getAllOwners() {
        return () -> {
            LOG.info("Lambda Request for all Owners");
            return this.clinicService.findAllOwners();
        };
    }

    @Bean
    public Function<Integer, Owner> getOwnerById() {
        return (ownerId) -> {
            LOG.info("Lambda Request for Owner with id: " + ownerId);
            final Owner owner = this.clinicService.findOwnerById(ownerId);
            return owner;
        };
    }

This is where I experienced my second gotcha! Spring Cloud Function aspires to provide a cloud-agnostic interface that gives you all the flexibility you may need, but sometimes you want control of the platform internals. In the example above, you can’t return a 404 when a resource is not found, because you don’t have access to the payload that’s returned to the ALB/API Gateway.

Thankfully, after posting a Stack Overflow question and a GitHub issue, promptly followed by a swift solution and release (many thanks to Oleg Zhurakousky for the speedy turnaround!) you can now bypass the cloud-agnostic abstractions by returning an APIGatewayProxyResponseEvent, which gets returned to the ALB/API GW unmodified:

    @Bean
    public Function<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> getOwnerById() {
        return (requestEvent) -> {
            LOG.info("Lambda Request for Owner");
            final Matcher matcher = ownerByIdPattern.matcher(requestEvent.getPath());
            if (matcher.matches()) {
                final Integer ownerId = Integer.valueOf(matcher.group(1));
                final Owner owner = this.clinicService.findOwnerById(ownerId);
                if (owner != null) {
                    return buildOwnerMessage(owner);
                } else return ownerNotFound();
            }
            else return ownerNotFound();
        };
    }

    private APIGatewayProxyResponseEvent buildOwnerMessage(Owner owner) {
        final Map<String, String> headers = buildDefaultHeaders();
        try {
            APIGatewayProxyResponseEvent responseEvent = new APIGatewayProxyResponseEvent()
                .withIsBase64Encoded(false)
                .withBody(objectMapper.writeValueAsString(owner))
                .withHeaders(headers)
                .withStatusCode(200);
            return responseEvent;
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }

    private Map<String, String> buildDefaultHeaders() {
        final Map<String, String> headers = new HashMap<>();
        headers.put("Content-Type", "application/json");
        return headers;
    }

    private APIGatewayProxyResponseEvent ownerNotFound() {
        final Map<String, String> headers = buildDefaultHeaders();
        APIGatewayProxyResponseEvent responseEvent = new APIGatewayProxyResponseEvent()
            .withIsBase64Encoded(false)
            .withHeaders(headers)
            .withBody("")
            .withStatusCode(404);
        return responseEvent;
    }

Using the APIGatewayProxyRequestEvent as a function input type does give you access to the full request, which you’ll probably need when extracting resource paths from a request, accessing specific HTTP headers, or needing fine-grained control on handling request payloads.

For local testing of the functions, there are a few ways you can do it. Firstly, you can use the spring-cloud-starter-function-web dependency, which allows you to test the functions by calling an HTTP endpoint with the same name as the lambda @Bean method. For example, you can curl localhost:8080/getOwnerById/3 to invoke the getOwnerById function.

Secondly, if you want to debug the full integration path that AWS Lambda hooks into, you can invoke it the same way Lambda does by creating a new instance of the FunctionInvoker class, and passing it the args you’d call it with in AWS. I’ve left an example of how to do this in src/test/java/org/springframework/samples/petclinic/lambda/LambdaConfigTests.java, which is how the function variants are tested within the spring-cloud-function library itself.
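
In essence it boils down to something like this (a hedged sketch based on how the adapter’s own tests drive it, assuming the 3.2.x adapter and Java 11 – the request JSON is illustrative):

    // Select which function bean to invoke, then stream an ALB/API GW-style event
    // through FunctionInvoker, exactly as the AWS Lambda runtime would
    System.setProperty("spring.cloud.function.definition", "getOwnerById");
    InputStream input = new ByteArrayInputStream(
        "{\"path\":\"/owners/1\",\"httpMethod\":\"GET\"}".getBytes(StandardCharsets.UTF_8));
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    new FunctionInvoker().handleRequest(input, output, null);
    // output now holds the JSON-serialised response returned by the function
    String response = output.toString(StandardCharsets.UTF_8);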

When you’ve developed your Lambda functions, tested them, and are ready to build a jar to serve in AWS Lambda, you can run the following command:

./mvnw clean package -P lambda

You’ll see in the next stage that this is run as part of the CDK deployment.

In the target folder, alongside the existing jar that’s used in our Docker microservice container, you’ll see a new jar with an -aws suffix. What’s the difference between this jar and the original, and why do we need a separate variant? Because AWS Lambda doesn’t support uber-jars, where jars are nested inside each other. To work in Lambda you have to generate a “shaded” jar, where all of the dependency classes are flattened into a single level within the jar archive. Additionally, by using the spring-boot-thin-layout plugin, you can reduce the size of the jar by removing unnecessary dependencies not required in Lambda, which can bring a small cold start performance improvement (the bigger the jar, the longer it takes to load into the Lambda runtime):

    <plugin>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-maven-plugin</artifactId>
        <dependencies>
            <dependency>
                <groupId>org.springframework.boot.experimental</groupId>
                <artifactId>spring-boot-thin-layout</artifactId>
                <version>1.0.10.RELEASE</version>
            </dependency>
        </dependencies>
    </plugin>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <configuration>
            <createDependencyReducedPom>false</createDependencyReducedPom>
            <shadedArtifactAttached>true</shadedArtifactAttached>
            <shadedClassifierName>aws</shadedClassifierName>
        </configuration>
    </plugin>

Stage Two – Balancing requests between Lambda and ECS

Once you’re at the point where you have Lambda functions in your Java app ready to handle requests, we can configure the ALB to route requests between two target groups – one targeted at the incumbent Petclinic container, and the other at our new functions.

At this point, switch over to the 1-spring-cloud-function branch.

In here you’ll see an additional lambda-stack.ts that contains the AWS Lambda configuration, and additional changes in lb-assoc-stack.ts, which creates a target group per Lambda function and uses weighted rules to balance traffic between Lambda and ECS:
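
I won’t paste the whole stack here, but the weighted routing boils down to something like this (a hedged sketch using the CDK v1 elbv2 constructs – the construct IDs, path patterns and weights are illustrative, and getOwnerByIdFunction, fargateTargetGroup and listener are assumed to be defined elsewhere in the stack):

    // import * as elbv2 from '@aws-cdk/aws-elasticloadbalancingv2';
    // import * as elbv2targets from '@aws-cdk/aws-elasticloadbalancingv2-targets';

    // Each Lambda function gets its own target group...
    const getOwnerByIdTargetGroup = new elbv2.ApplicationTargetGroup(this, 'GetOwnerByIdTG', {
        targets: [new elbv2targets.LambdaTarget(getOwnerByIdFunction)],
    });

    // ...and a listener rule splits matching requests between Lambda and the Fargate service
    listener.addAction('OwnerById', {
        priority: 10,
        conditions: [elbv2.ListenerCondition.pathPatterns(['/petclinic/api/owners/*'])],
        action: elbv2.ListenerAction.weightedForward([
            { targetGroup: getOwnerByIdTargetGroup, weight: 50 },
            { targetGroup: fargateTargetGroup, weight: 50 },
        ]),
    });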

In this scenario I’m using the ALB integration with Lambda, to demonstrate that the approach is compatible with both ALB and API GW, and that both methods use the same approach from a code implementation perspective.

In lambda-stack.ts, everything is included to build the Lambda functions. In CDK you can incorporate the building of your code into the deployment of your resources, so you can ensure you’re working with the latest version of all your functions and it can all be managed within the CDK ecosystem.

I followed these two articles to set up a Java Maven application build that delegates the building of the jar files to a Docker container:

https://aws.amazon.com/blogs/devops/building-apps-with-aws-cdk/

https://aws.amazon.com/blogs/opensource/packaging-and-deploying-aws-lambda-functions-written-in-java-with-aws-cloud-development-kit/

The Docker image you use and the commands you run to build your app are configurable so it’s very flexible. AWS provide some Docker images, which ensures that artefacts that are built are compatible with the AWS Lambda runtimes provided.
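
The Lambda code asset in lambda-stack.ts looks roughly like this (a hedged sketch – the jar name and exact commands in the repo may differ slightly):

    // import * as path from 'path';
    // import * as lambda from '@aws-cdk/aws-lambda';
    const lambdaCode = lambda.Code.fromAsset(path.join(__dirname, '../..'), {
        bundling: {
            // AWS-provided build image matching the Java 11 Lambda runtime
            image: lambda.Runtime.JAVA_11.bundlingImage,
            user: 'root',
            command: [
                '/bin/sh', '-c',
                './mvnw clean package -P lambda -DskipTests && ' +
                'cp /asset-input/target/spring-petclinic-rest-*-aws.jar /asset-output/',
            ],
        },
    });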

The folder specified by the Code.fromAsset command is mounted at /asset-input, and AWS expects to extract a single archive file from the /asset-output/ folder. What you do in between is up to you. In the code above I trigger a Maven package, using the lambda profile I declared earlier (skipTests on your project at your own risk – it’s purely for demonstration purposes here!).

This is where I encountered the third gotcha… when you see CDK TypeScript compilation issues, double-check your CDK versions are aligned between CDK modules.

Now on the 1-spring-cloud-function branch, rerun the CDK deploy command:

cdk deploy --require-approval=never --all

Rerunning this command will deploy the new Lambda functions, and you should see the ALB listener rules change from a single default rule forwarding everything to the Fargate target group, to weighted rules that split the owners endpoints between the Fargate and Lambda target groups.

Another gotcha to be aware of with Lambda – at the time of writing it doesn’t natively integrate with Secrets Manager, which means your secrets are set statically as environment variables, and are visible through the Lambda console. Ouch.

So at this point, we have an ALB configured to balance requests to 2 owners endpoints between the Petclinic container and new Lambda functions. Let’s test this with a GET request for Owner information:

In doing this we’re presented with a 502 Bad Gateway error. Not ideal, but digging into the Lambda CloudWatch logs we can see the first challenge of deploying this Lambda:

Further up the call stack we see this issue is affecting the creation of a rootRestController, which is a REST Controller within the Petclinic application.

The cause behind this with the Petclinic app is that there’s a RootRestController bean that configures the REST capabilities in Spring, which requires Servlet-related beans that aren’t available when you start up using the FunctionInvoker entrypoint.

To avoid this issue, we can conditionally omit classes relating to the REST endpoints from being packaged in the jar within the lambda Maven profile:

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
            <excludes>
                <exclude>org/springframework/samples/petclinic/rest/**</exclude>
                <exclude>org/springframework/samples/petclinic/security/**</exclude>
            </excludes>
            <testExcludes>
                <exclude>org/springframework/samples/petclinic/rest/**</exclude>
                <exclude>org/springframework/samples/petclinic/security/**</exclude>
            </testExcludes>
        </configuration>
    </plugin>

I also needed to exclude classes from both the rest and security packages, as the security packages tried configuring REST security which relied on components no longer being initialised by the RootRestController behind the scenes.

This brings me on to my fifth gotcha – the Petclinic app uses custom serialisers that weren’t being picked up. I scratched my head with this one as I wasn’t able to override the ObjectMapper that was being autoconfigured by spring-cloud-function. However, the following changes fixed the automatic resolution of these serialisers:

  • Upgrade spring-cloud-function-dependencies to 3.2.0-M1 (Bugs in previous versions prevented this from working correctly)
  • Remove spring-cloud-function-compiler dependency (no longer a thing)
  • No longer need to explicitly create a handler class for lambda (i.e. class that extends SpringBootRequestHandler), just use the org.springframework.cloud.function.adapter.aws.FunctionInvoker as the handler
  • If you have more than one function (which you will do as you begin to migrate more actions/endpoints to serverless), use the SPRING_CLOUD_FUNCTION_DEFINITION environment variable to specify the bean that contains the specific function you want to invoke.
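
Pulling those last two points together, each AWS Lambda function ends up defined with the generic FunctionInvoker handler and an environment variable selecting the bean to run – roughly like this in CDK (memory and timeout values are illustrative):

    // import * as cdk from '@aws-cdk/core';
    // import * as lambda from '@aws-cdk/aws-lambda';
    const getOwnerById = new lambda.Function(this, 'GetOwnerByIdFunction', {
        runtime: lambda.Runtime.JAVA_11,
        code: lambdaCode, // the bundled -aws jar from earlier
        handler: 'org.springframework.cloud.function.adapter.aws.FunctionInvoker',
        memorySize: 1024,
        timeout: cdk.Duration.seconds(30),
        environment: {
            SPRING_CLOUD_FUNCTION_DEFINITION: 'getOwnerById',
        },
    });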

Redeploying with all of the above changes resulted in working Lambda functions. At this point we’re able to send requests and they’re picked up by either ECS or Lambda with the same result. Nice!

Stage Three – Strangulating requests from ECS

At this point, you can start to gradually phase out the use of your long-running Java microservices by porting functionality across to Lambdas. This approach allows you to build confidence in the Lambda operations by gradually weighting endpoints in favour of your Lambda function target groups.

You’ll probably have noticed by this point that Lambda variants of Spring Boot applications are very slow to start up, which brings me to my sixth gotcha. This may be a dealbreaker in some situations, but I’d encourage you to explore the points in my conclusion below before deciding whether to adopt (or partially adopt) this approach.

When porting subsequent features or endpoints to Lambda functions, I’d suggest routing a small percentage of your traffic to the Lambda function as a canary test. So long as the error rate is within known tolerances you can gradually route more traffic to the function, until the Lambda is serving 100% of requests. At this point you can deregister the Fargate target group from that particular ALB path condition and repeat the process for other endpoints you want to migrate.

Conclusion and next steps

This blog article aims to give you a guided walkthrough of taking an existing Spring Boot Java microservice and, by applying the Strangler Fig Pattern, transitioning your workloads into Lambda functions in a gradual and controlled fashion. I hope you find this useful, and all feedback is greatly welcomed!

There are further considerations required to make this production-ready, such as performance tuning the Java Lambdas to reduce the cold start time, and removing DB connection pools that are superfluous in Lambdas and will cause additional load on your database(s). Here are some suggestions for these in the meantime, but I’m aiming to follow this up with an in-depth post that covers them in more detail:

  • Analysing the impact of cold-starts on user experience
    • When your Java Lambda initially spins up, it takes a similar amount of time to start up as its microservice counterpart. Given the lower memory and cpu typically allocated to Lambdas, this can result in a cold boot taking anything up to 60 seconds, depending on how bulky your application is.
    • However, considering your particular load profiles, so long as there’s a regular stream of requests that keep your Lambdas warm, it may be that cold-starts are rarely experienced and may be within acceptable tolerances
  • Tweaking the resource allocations of your Lambdas
    • Allocating more memory (and cpu) to your functions can significantly improve the cold start up time of Spring (see the graph below), but with a cost tradeoff. Each subsequent request becomes more expensive to serve but no faster, just to compensate for a slow cold start up. If you have a function that’s infrequently used (e.g. an internal bulk admin operation) this may be fine – for a function that’s used repeatedly throughout the day the cost can quickly be prohibitive.
  • AWS Lambda Provisioned Concurrency
    • The pricing of this can be prohibitive, but depending on the amount of concurrency required and when it’s required (e.g. 2 concurrent lambdas in business hours for a back-office API) this may be suitable
      • Continue to assess this compared with just running ECS containers instead, and weigh it up against the other benefits of Lambda (e.g. less infrastructure to manage & secure) to ensure it’s good value
  • Using Function declarations instead of @Bean definitions (which can help to improve cold start time)
  • Replacing DB Connection pooling (i.e. Hikari, Tomcat etc.) with a simple connection (that closes with the function invocation)
    • And combining this with AWS RDS Proxy to manage connection pooling at an infrastructure level
  • Disabling Hibernate schema validation (move schema divergence checking out of your lambdas)
  • Experimenting with Lambda resource allocation, to find the right balance between cold start time and cost
    • See AWS Lambda Power Tuning – an open source library which provides an automated way to profile and analyse various resource configurations of your lambda

Gotchas!

First Gotcha – Fargate 1.4.0 network changes

As of Fargate 1.4.0, Fargate communications to other AWS Services such as ECR, S3 and Secrets Manager use the same network interface that hooks into your VPC. Because of this you have to ensure Fargate has routes available to AWS Services, otherwise you’ll find that ECS is unable to pull container images or acquire secrets. You can either give the containers a public IP, route traffic to a NAT running in a public subnet, or create private VPC endpoints to each of the AWS services you require: https://stackoverflow.com/questions/61265108/aws-ecs-fargate-resourceinitializationerror-unable-to-pull-secrets-or-registry
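
For reference, the VPC endpoint route looks something like this in CDK (a hedged sketch, not taken from the repo – the set of endpoints you need depends on what your tasks actually talk to):

    // import * as ec2 from '@aws-cdk/aws-ec2';

    // Gateway endpoint for S3 (ECR image layers are pulled from S3)
    vpc.addGatewayEndpoint('S3Endpoint', {
        service: ec2.GatewayVpcEndpointAwsService.S3,
    });

    // Interface endpoints for the services Fargate 1.4.0 now reaches over the VPC
    vpc.addInterfaceEndpoint('EcrEndpoint', { service: ec2.InterfaceVpcEndpointAwsService.ECR });
    vpc.addInterfaceEndpoint('EcrDockerEndpoint', { service: ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER });
    vpc.addInterfaceEndpoint('SecretsManagerEndpoint', { service: ec2.InterfaceVpcEndpointAwsService.SECRETS_MANAGER });
    vpc.addInterfaceEndpoint('CloudWatchLogsEndpoint', { service: ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH_LOGS });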

Second Gotcha – Spring Cloud Function signature challenges

Update: following Oleg’s quick turnaround and release of spring-cloud-function 3.2.0-M1 a lot of what’s below is moot, but I’ve kept it here for reference.

There have been developments to abstract the AWS specifics away from function definitions, and have the translation happen behind the scenes by the function adapters (AWS, GCP, Azure, etc.) into a more generic Map payload, rather than to specific AWS types (APIGatewayProxyResponseEvent, for example).

I was confused reading the Spring Cloud Function documentation and didn’t find it clear how to make the transition from specific handlers to a generic one (FunctionInvoker). In fact, if you try to follow one of the many guides online (including the latest Spring docs) and use Function<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> the adapter ends up wrapping it in another APIGatewayProxyResponseEvent which is malformed. AWSLambdaUtils tries to be helpful here but with the confusing documentation and behaviour, it just got in my way.

It feels like the approach is to abstract the cloud implementation details away from the spring-cloud-function core and push all of that into the adapters. The problem with that is the adapters then have to map from the generic Message interface into an AWS response (APIGatewayProxyResponseEvent). That’s great as it abstracts the cloud-platform implementation detail from you, but if you want that level of control there was no way to override it.

The way I got clarity on the recommended approach was to ignore all the docs and examples, and go with the unit tests in the spring-cloud-function-adapter-aws repo. These demo the latest compatible ways of declaring functions.

I experimented with a few styles of function signatures to see which works…

Doesn’t work:

@Bean
public Supplier<Message<APIGatewayProxyResponseEvent>> getAllOwners() {

Doesn’t work (didn’t correctly set the Content-Type header in the HTTP response):

@Bean
public Function<APIGatewayProxyRequestEvent, Collection<Owner>> getAllOwners() {

You have to use the Message construct if you want control over the Content-Type header. Otherwise the framework sets a “contentType” message header instead, which the AWS adapter doesn’t translate – resulting in a malformed response with a Content-Type of “application/octet-stream” and a separate “contentType” header of “application/json”, which doesn’t get picked up.

Works:

@Bean
public Supplier<Message<Collection<Owner>>> getAllOwners() {
    final Map<String, Object> headers = new HashMap<>();
    headers.put("Content-Type", "application/json");
    ...
    return new GenericMessage<>(allOwners, headers);

Also works:

@Bean
public Function<APIGatewayProxyRequestEvent, Message<Collection<Owner>>> getAllOwners() {
    final Map<String, Object> headers = new HashMap<>();
    headers.put("Content-Type", "application/json");
    ...
    return new GenericMessage<>(allOwners, headers);

There are some limitations with using the Message construct – you’re not allowed to use null payloads when using GenericMessage, which makes it difficult to handle 404 situations.

Closest I could get was returning a Message<String> response, serialising the Owner object myself into a string using the ObjectMapper before returning it wrapped in a Message. That way when I needed to handle a 404 I just returned an empty string. Not pretty (or consistent with the ECS service) though:
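
Something along these lines (a reconstructed sketch rather than the original code – extractOwnerId is a hypothetical helper that pulls the id out of the request path):

@Bean
public Function<APIGatewayProxyRequestEvent, Message<String>> getOwnerById() {
    return requestEvent -> {
        final Map<String, Object> headers = new HashMap<>();
        headers.put("Content-Type", "application/json");
        final Owner owner = this.clinicService.findOwnerById(extractOwnerId(requestEvent));
        try {
            // An empty body has to stand in for "not found" – there's no way to set a 404 here
            final String body = owner == null ? "" : objectMapper.writeValueAsString(owner);
            return new GenericMessage<>(body, headers);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    };
}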

All of the above, however, is no longer an issue – as explained in the main article thread, you can now return an APIGatewayProxyResponseEvent and have full control over the payload that’s returned to API GW/ALB.

Third Gotcha – CDK library module version mismatches

NPM dependencies are fun… between working on the core infrastructure code and adding the Lambda functions, CDK had jumped from 1.109.0 to 1.114.0. When I introduced a new component of CDK, it installed the latest version, which then gave me confusing type incompatibility errors between LambdaTarget and IApplicationLoadBalancerTarget. Aligning all my CDK dependencies in package.json (i.e. "^1.109.0") and running an npm update brought everything back in sync.

Fourth Gotcha – AWS Lambda lacking integration with Secrets Manager

There’s a gotcha with Lambda and secrets currently – there is no native way to pass secrets into a Lambda like you can with ECS (see this CDK issue). You have to use dynamic references in CloudFormation, which are injected into your Lambda configuration at deploy time. This, however, is far from ideal, as it exposes the secret values within the Lambda console:

One way you can mitigate this is to make the Lambda responsible for obtaining its own secrets, using its IAM role to restrict which secrets it’s allowed to acquire. You can do this by hooking into Spring’s startup lifecycle and adding an event handler that acquires the secrets and sets them in the application environment before the database initialisation occurs.
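
As a rough illustration of that idea (a hedged sketch, not the repo’s code – it assumes the AWS SDK v2 Secrets Manager client is on the classpath, and DB_SECRET_ARN and extractPassword are hypothetical):

public class SecretsManagerListener implements ApplicationListener<ApplicationEnvironmentPreparedEvent> {

    @Override
    public void onApplicationEvent(ApplicationEnvironmentPreparedEvent event) {
        ConfigurableEnvironment environment = event.getEnvironment();
        String secretArn = environment.getProperty("DB_SECRET_ARN"); // hypothetical env var naming the secret
        try (SecretsManagerClient client = SecretsManagerClient.create()) {
            // The secret value is a JSON document containing the DB credentials
            String secretJson = client.getSecretValue(r -> r.secretId(secretArn)).secretString();
            Map<String, Object> properties = new HashMap<>();
            properties.put("spring.datasource.password", extractPassword(secretJson));
            // Register the credentials before the datasource is initialised
            environment.getPropertySources().addFirst(new MapPropertySource("aws-secrets", properties));
        }
    }
}

// Registered in META-INF/spring.factories so it fires early in the startup lifecycle:
// org.springframework.context.ApplicationListener=org.springframework.samples.petclinic.lambda.SecretsManagerListener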

Fifth Gotcha – Custom Serialisers not loaded in function

The curious among you will spot that if you explore the shaded Lambda jar that’s built, although the *Controller classes have been omitted, the (De)serialisers are still there. That’s because the compiler plugin has detected compile-time dependencies on them in other classes (as opposed to the Controllers, which are runtime dependencies and so don’t get picked up). This is fine…

Next challenge… the REST endpoints are using these custom (De)serialisers to massage the payloads and avoid a recursion loop between the entities in the object graph:

At this point I realised that Baeldung’s post is a little out of date (even though I’d bumped up to the latest versions of the spring-cloud-function dependencies), so I went looking for more recent guides to port my code changes to. What most of them lack, though, is an example of how to decompose an existing microservice into one or more serverless functions.

In the end, by following the Spring documentation and upgrading to the latest version, I was able to resolve the issue of the custom serialisers not being picked up out of the box.

Sixth Gotcha – Cold start ups are sloooooooooo…w

Spring Boot can be notoriously slow at starting up, depending on the size of your applications and the amount of autoconfiguring it has to do. When running in AWS Lambda it still has to go through the same bootstrapping process, but given the typically lower resources allocated to functions (MBs as opposed to GBs) this can draw out the startup of a lambda to upwards of a minute.

This doesn’t affect every request – once a Lambda has warmed up, its instance is cached and stays in a warm (but paused) state until the next request comes in, or until AWS Lambda cleans up infrequently used functions. This means subsequent function requests typically respond as quickly as their microservice counterparts. However, any request that doesn’t hit a warm function, or any concurrent request that triggers a new Lambda instance, will incur this cold start penalty.

This may be acceptable for some workloads, for example requests that are triggered in a fire-and-forget fashion, infrequent (but time-consuming) operations, or endpoints that have a steady stream of requests with little or no concurrency. As part of prototyping this approach I’d advise you to look at the recommendations in the Conclusion section, and to use the Strangler Fig Pattern approach documented above to trial it on a subset of requests and understand how it would affect your workloads.

My expedition from AWS to GCP (with Terraform)

Follow along with my GitHub repo for this blog: https://github.com/foyst/gcp-terraform-quickstart/

TL;DR – Here’s the headline differences that might be useful for those adopting GCP from an AWS background:

  • Difference #1 – In AWS, projects or systems are separated using AWS Accounts, whereas GCP has the built-in concept of “Projects”.
  • Difference #2 – The GCP Console is always at global level – no need to switch between regions.
  • Difference #3 – Set auto_create_subnetworks = false, otherwise you get a subnet created in every region by default.
  • Difference #4 – You have to enable services before you can use them.
  • Difference #5 – Firewall rules apply globally to the VPC, not specific resources.
  • Difference #6 – You don’t define a subnet as public.
  • Difference #7 – An Internet Gateway exists by default, but you have to explicitly create a NAT gateway and router.
  • Difference #8 – To define a Load Balancer in GCP there’s a number of concepts that fit together: Front-end config, Backend Services, Instance Groups, Health Checks and Firewall config.
  • Difference #9 – Load Balancer resources can be regional or global interchangeably.

As someone who has spent a few years working with AWS I’ve decided to dip my toe into GCP, and I thought it’d be useful to chronicle my experiences getting started with it so others may benefit.

I prefer to learn by doing, so I set myself the task of creating a simple reference architecture for load balancing web applications. The diagram below outlines this architecture:

High-level diagram of demo architecture

The goal is to deploy a load balancer that can route traffic to a number of instances running an nginx server. The nginx server itself is the nginxdemos/hello Docker image, which displays a simple webpage with a few details about the instance:

Overview page presented by the Nginx hello Docker image

We’ll use the status page above to demonstrate load balancing, as the server address and name change when the load balancer round-robins our requests.

Creating a Google Cloud Account

Creating an account is relatively straightforward (if you already have a Google Account – if not you’ll need to set one up). Head over to https://cloud.google.com/ and click on the button in the top right of the page.

Once you’ve got setup with a Google Cloud Account, you’ll be presented with a screen similar to this:

Google Cloud Platform Overview Page

From here, you can start creating resources.

Creating a Project

Difference #1 – In AWS, projects or systems are separated using AWS Accounts, whereas GCP has the built-in concept of “Projects”. This allows you to stay logged in as a user and easily switch between projects. Having this logical grouping also makes it easy to terminate everything running under it when you’re done by simply deleting the project.

To create a new project, simply navigate to the main burger menu on the left-hand side, select “IAM & Admin” and then create a new project either through the “Manage Resources” or “Create a Project” options. Your Project consists of a name, an internal identifier (which can be overridden from what’s generated but can’t be changed later) and a location. Create a project with a meaningful name (I went with “Quickstart”) and choose “No Organisation” as the location.

Difference #2 – The GCP Console is always at global level – no need to switch between regions.

Setting up Billing

In order to create any resources in GCP you need to set up a Billing Account, which is as simple as providing the usual debit/credit card details such as your name, address, and card numbers. You can do this by opening the burger menu and selecting “Billing”. You’ll also be prompted to do this anytime you try to create a resource until Billing is configured.

Like AWS, GCP offers a free tier which you can use to follow this tutorial. During my experiments building a solution and then translating it into Terraform, it cost me a grand total of £1. However, this was with me tearing down all resources between sessions, which I’d strongly encourage you to do too (simple enough if you’re following along with Terraform, by running a terraform destroy).

To avoid any horror stories at the end of the month, create a budget and set an alert on it to warn you when resources are bleeding you dry. Because I’m only spinning up a few resources (which will be short-lived) I’ve set my budget to £5.

To create a budget, open the burger menu and select “Billing”, then select “Budgets & alerts” from the left hand menu.

You can apply a budget to either a single project or across your whole estate, and you can also account for discounts and promotions too. With a budget you’re able to set the following:

  • Scope – What projects and resources you’d like to track in your budget
  • Amount – The spending cap you want to apply to your budget. This can be actual or forecasted (a newer option that predicts whether your budget is likely to be exceeded in the budget period, given your resource utilisation to date)
  • Actions – You can trigger email alerts for billing admins when your budget reaches certain thresholds (i.e. 50%, 70%, 100%). You can also publish events using a pub/sub topic to facilitate automated responses to notifications too.

Overview of Budget and Alerting

Setting up GCloud CLI

https://cloud.google.com/sdk/docs/quickstart

Run the gcloud init command to configure the CLI to connect to your specific Google Account and Project. If you’re like me, you may have both business and personal accounts and projects on the same machine. If so you may need to run gcloud auth login first to switch between your accounts.

The GCloud CLI stores multiple accounts and projects as “configurations”, similar to AWS’ concept of “credentials” and “profiles”.
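
For example (the configuration name here is purely illustrative):

gcloud auth login                                  # authenticate with the right Google account
gcloud init                                        # create or reuse a configuration interactively
gcloud config configurations list                  # see which configurations exist on this machine
gcloud config configurations activate quickstart   # switch to the one for this project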

Setting up Terraform

I chose Terraform because it can be used with both AWS and GCP (and Azure for that matter) reducing the learning curve involved in adopting GCP.

For production use cases, just like with AWS, you’re best off using a service account that adheres to security best practices (such as the principle of least privilege and temporary privilege escalation) for executing automated deployments. But for getting started we’ll use the gcloud auth approach.

We’ll be using the TF ‘local’ backend for storing state – when working in teams you can store state in cloud storage to manage concurrent changes where multiple users or tools may interact with a set of resources.
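
The local backend needs no configuration at all; for reference, switching to a remote backend later is only a small block like this (a hedged sketch – the bucket name is hypothetical and the bucket must already exist):

terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket" # hypothetical bucket holding the state files
    prefix = "quickstart"
  }
}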

To get started, create a *.tf file in an empty directory (can be called anything, I went with main.tf) and add the following snippets to it:

variable "project_id" {
    type = string
}

locals {
  project_id = var.project_id
}

provider "google" {
  project = local.project_id
  region  = "us-central1"
  zone    = "us-central1-b"
}

The above sets up variables via the terminal (the project_id) and constants inline (the locals declaration), while the provider block configures Terraform with details about which cloud platform you’re using and details about the Project. In my sample code I went with us-central1 region – no reason behind this other than it’s the default.

At this point you’re ready to run terraform init. When you do this it validates and downloads the Google provider libraries needed to compile and interact with the GCP APIs. If you run terraform init before you add the above you’ll see a message warning you that you’ve initialised an empty directory, which means it won’t do anything.

From here, you can start adding resources to the file. Below I’ve linked to some getting started tutorials and useful references for working with the Google provider SDK which I found useful:

Useful links

https://learn.hashicorp.com/tutorials/terraform/install-cli?in=terraform/aws-get-started

https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/getting_started

https://www.terraform.io/docs/cli/init/index.html

https://registry.terraform.io/providers/hashicorp/google/latest/docs

https://stackoverflow.com/a/59056580 – automatically enable APIs (e.g. Compute Engine)

https://github.com/danielpoonwj/gcp-iac-demo

Creating a Network and VM

The first resource we’ll create is a google_compute_network, which is where our compute resources will be deployed.

resource "google_compute_network" "vpc_network" {
  name                    = "terraform-network"
  auto_create_subnetworks = false
  delete_default_routes_on_create = true
  depends_on = [
    google_project_service.compute_service
  ]
}

resource "google_compute_subnetwork" "private_network" {
  name          = "private-network"
  ip_cidr_range = "10.2.0.0/16"
  region        = "us-central1"
  network       = google_compute_network.vpc_network.self_link
}

resource "google_compute_route" "private_network_internet_route" {
  name             = "private-network-internet"
  dest_range       = "0.0.0.0/0"
  network          = google_compute_network.vpc_network.self_link
  next_hop_gateway = "default-internet-gateway"
  priority    = 100
}

There’s a few points worth mentioning about the config above:

  • Difference #3 – auto_create_subnetworks = false – In the GCP Console you don’t have separate consoles for each of the regions; resources from all regions are displayed on the same page. The important part here though: if you don’t explicitly override this flag then GCP will create a subnetwork in every region, which results in 20+ subnets being created. This was a little OTT for my needs, and I also wanted to keep the deployment as similar to the AWS approach as possible, although this may be counter to GCP idioms (one for me to learn more about…)
  • delete_default_routes_on_create = true – By default GCP will create a default route to 0.0.0.0/0 on your network, effectively providing internet routing for all VMs. This may not be preferable as you may want more control over how this is configured, so I disabled this.
  • depends_on – Most of the time Terraform can identify the dependencies between resources, and initialise them in that order. Sometimes it needs a little guidance, and in this situation it was trying to create a network before the Compute Service (mentioned later…) was fully initialised. Adding this attribute prevents race conditions between resource creation.

(Later on I also had to apply the depends_on block to my google_compute_health_check, as TF also attempted to create this in parallel, causing more race conditions.)

Once you’ve got a network and subnetwork created, we can go ahead and configure a VM:

resource "google_compute_instance" "vm_instance" {
  name         = "nginx-instance"
  machine_type = "f1-micro"

  tags = ["nginx-instance"]

  boot_disk {
    initialize_params {
      image = "centos-7-v20210420"
    }
  }

  metadata_startup_script = <<EOT
curl -fsSL https://get.docker.com -o get-docker.sh && 
sudo sh get-docker.sh && 
sudo service docker start && 
docker run -p 8080:80 -d nginxdemos/hello
EOT

  network_interface {
    network = google_compute_network.vpc_network.self_link
    subnetwork = google_compute_subnetwork.private_network.self_link

    access_config {
      network_tier = "STANDARD"
    }
  }
}

Hopefully most of the above looks fairly similar to its AWS counterpart. The access_config.network_tier property is the main difference – Google has two different tiers (STANDARD and PREMIUM), of which the latter provides performance benefits by routing traffic through Google networks (instead of public internet) whenever it can at an additional cost.

The metadata_startup_script key is a shortcut TF provides to configure scripts to execute when a VM starts up (similar to the UserData key in AWS). In this case, I used it to install Docker and start an instance of the nginxdemos/hello Docker image (albeit in a slightly crude manner).

Deploying the Resources

At this point, we’re able to run our first terraform apply.

Difference #4 – You have to enable services before you can use them.

When you run a terraform apply, you may find you get an error stating that the Compute Engine API has not been used in the project before or has been disabled. When you use a service for the first time in a project you have to enable it. This can be done by clicking the “Enable” button found in the service’s landing page in the GCP web console, or you can enable it in Terraform like so:

resource "google_project_service" "compute_service" {
  project = local.project_id
  service = "compute.googleapis.com"
}

Once Terraform has successfully applied your infrastructure, you’ll have a newly created VPC and a VM running within it. The first thing you might want to try is SSHing into it; however, you’ll probably find that the connection hangs and you aren’t able to connect.

You can triage the issue by opening Compute Engine -> VM instances, clicking the kebab icon (TIL this is called a kebab icon!) and selecting “view network details”. Under “Ingress Analysis” in “Network Analysis” you can see that there are no firewall rules configured, and the default is to implicitly deny traffic:

So next up, I created a firewall rule to allow inbound SSH traffic into my instance. The following rule allows connections from anywhere to instances tagged with nginx-instance:

resource "google_compute_firewall" "public_ssh" {
  name    = "public-ssh"
  network = google_compute_network.vpc_network.self_link

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  direction = "INGRESS"
  source_ranges = ["0.0.0.0/0"]
  target_tags = ["nginx-instance"]
}

Difference #5 – Firewall rules apply globally to the VPC, not specific resources.

In GCP, there isn’t an equivalent of a Security Group, which in AWS-world controls access to the resources it’s associated with. In GCP, firewall rules are associated to the network, and are applied on resources by making use of network tags.

Similar to AWS, you can tag your resources to help you manage and organise them with the use of labels. Separately though, the network tag mechanism is what’s used to apply firewall rules. In the code snippet above, you specify the rules you wish to apply, and also which network tags (or ranges if you’d prefer) to apply the rule to.

Difference #6 – You don’t define a subnet as public.

I’ll leave this article here, which for me nicely summarised the differences between AWS and GCP networking:

https://codeburst.io/vpc-networking-gcp-v-s-aws-77a80bc7cfe2

The key takeaways for me are:

  • Networking in GCP is flat, compared to the hierarchical approach taken by AWS.
  • Routing tables and firewall rules are associated directly with the VPC, not subnets or resources
  • VPCs are global concepts (not regional) and traffic automatically flows across regions
  • Subnets are regional concepts (not AZ-bound) and traffic automatically flows across AZs also
  • “GCP: Instances are made public by specifically enabling them with an external IP address; the ‘Default route to the internet‘ automatically routes Internet-bound traffic to the Internet gateway or NAT gateway (if it exists) based on the existence of the external IP address”

For example, you don’t directly assign a firewall rule to instance(s), but use network tags to apply a firewall rule to them. Similarly for routes you don’t have routing tables that you assign to subnets – you simply define a VPC-level route, the next hop traffic should take, and optionally network tags to specify which resources to apply the route to.
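
For example, the internet route created earlier could be scoped to just the Nginx instances by adding their network tag to it (an optional variation on the route defined above):

resource "google_compute_route" "private_network_internet_route" {
  name             = "private-network-internet"
  dest_range       = "0.0.0.0/0"
  network          = google_compute_network.vpc_network.self_link
  next_hop_gateway = "default-internet-gateway"
  priority         = 100

  # Only instances carrying this network tag will use the route
  tags = ["nginx-instance"]
}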

Network analysis showing SSH public access enabled

Ok great, so public access is now permitted and we’ve got an instance that we can SSH to and see an Nginx container running. But going forwards I want to secure this instance behind a load balancer, with no public access.

Nginx container running in VM

So how do we make instances private in TF? Simply omit the access_config element from your google_compute_instance resource if you don’t want a public IP to be assigned.
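
In other words, the private variant of the instance’s network_interface is simply this (no access_config, so no external IP is assigned):

  network_interface {
    network    = google_compute_network.vpc_network.self_link
    subnetwork = google_compute_subnetwork.private_network.self_link
  }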

There appears to be some confusion online on what the “Private Google Access” feature does, specifically its influence on whether an instance is private or public-facing. According to the docs, instances without a public IP can only communicate with other instances in the network. This toggle allows these private instances to communicate with Google APIs whilst remaining private. Some articles allege that it’s this toggle which makes your instance public or private, although from what I’ve read I think that’s inaccurate.

Now, when I made my instance private it introduced a new problem: It broke my Docker bootstrapping, because the instance no longer has a route to the internet. Time to introduce a NAT gateway…

Difference #7 – An Internet Gateway exists by default, but you have to explicitly create a NAT gateway and router.

Some areas of the GCP documentation state that traffic will automatically flow to either the default internet gateway or a NAT gateway based on the presence of an external IP address attached to an instance. This led me to believe that a NAT gateway was also provided by default, although this turned out not to be the case when I removed the external IPs from my Nginx instances. When I did this the instances were unable to connect out to download Docker or the Nginx Docker image.

I added the following to my Terraform which re-enabled outbound connectivity, whilst keeping the instances private:

resource "google_compute_router" "router" {
  name    = "quickstart-router"
  network = google_compute_network.vpc_network.self_link
}

resource "google_compute_router_nat" "nat" {
  name                               = "quickstart-router-nat"
  router                             = google_compute_router.router.name
  region                             = google_compute_router.router.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

Creating a Load Balancer

Creating a Load Balancer is interesting. Google manages a lot of this at a global level and provides different “flavours” of load balancer applicable for different use cases:

Table representing different types of Load Balancer available in GCP

Difference #8 – To define a Load Balancer in GCP there’s a number of concepts that fit together: Front-end config, Backend Services (or Backend Buckets), Instance Groups, Health Checks and Firewall config.

For my HTTP-based service, the flow of incoming requests looks like:

Diagram depicting flow of requests through Load Balancer resources

Load Balancers aren’t a tangible resource in TF, rather the result of configuring and connecting the previously mentioned resource types. The entry point for a Load Balancer appears to be the Forwarding Rule, specifically google_compute_forwarding_rule.

To create a Regional-level Load Balancer using the Standard networking tier, I used the following TF:

resource "google_compute_instance_group" "webservers" {
  name        = "terraform-webservers"
  description = "Terraform test instance group"

  instances = [
    google_compute_instance.vm_instance.self_link,
    google_compute_instance.vm_instance_2.self_link
  ]

  named_port {
    name = "http"
    port = "8080"
  }
}

# Global health check
resource "google_compute_health_check" "webservers-health-check" {
  name        = "webservers-health-check"
  description = "Health check via tcp"

  timeout_sec         = 5
  check_interval_sec  = 10
  healthy_threshold   = 3
  unhealthy_threshold = 2

  tcp_health_check {
    port_name          = "http"
  }

  depends_on = [
    google_project_service.compute_service
  ]
}

# Global backend service
resource "google_compute_backend_service" "webservers-backend-service" {

  name                            = "webservers-backend-service"
  timeout_sec                     = 30
  connection_draining_timeout_sec = 10
  load_balancing_scheme = "EXTERNAL"
  protocol = "HTTP"
  port_name = "http"
  health_checks = [google_compute_health_check.webservers-health-check.self_link]

  backend {
    group = google_compute_instance_group.webservers.self_link
    balancing_mode = "UTILIZATION"
  }
}

resource "google_compute_url_map" "default" {

  name            = "website-map"
  default_service = google_compute_backend_service.webservers-backend-service.self_link
}

# Global http proxy
resource "google_compute_target_http_proxy" "default" {

  name    = "website-proxy"
  url_map = google_compute_url_map.default.id
}

# Regional forwarding rule
resource "google_compute_forwarding_rule" "webservers-loadbalancer" {
  name                  = "website-forwarding-rule"
  ip_protocol           = "TCP"
  port_range            = 80
  load_balancing_scheme = "EXTERNAL"
  network_tier          = "STANDARD"
  target                = google_compute_target_http_proxy.default.id
}

resource "google_compute_firewall" "load_balancer_inbound" {
  name    = "nginx-load-balancer"
  network = google_compute_network.vpc_network.self_link

  allow {
    protocol = "tcp"
    ports    = ["8080"]
  }

  direction = "INGRESS"
  source_ranges = ["130.211.0.0/22", "35.191.0.0/16"]
  target_tags = ["nginx-instance"]
}

Difference #9 – Load Balancer resources can be regional or global interchangeably

Depending on the network tier and level of availability you’re architecting for, you can have regional or global Load Balancers – the latter deploys your Load Balancer across all regions and utilises Google networks as much as possible to improve throughput.

However, this confused me when deciding that I only wanted a regional Load Balancer utilising the Standard network tier. According to the GCP docs, Backend Services used by HTTP(S) Load Balancing are always global, but to use the Standard network tier you have to create a regional Forwarding Rule.

This was made more confusing by the inconsistent use of global and regional discriminators in the TF resource types, which made it a struggle to hook up the resources required to create a Load Balancer. The fact that you create a normal url map and http target proxy, but then attach them to a google_compute_global_forwarding_rule, confused me somewhat!

The name of the Load Balancer appears to come from the google_compute_url_map resource… I’m not quite sure why that is? Maybe because it’s the first LB-related resource that’s created in the chain?

The GCP Console for Load Balancers can be confusing, because when you first open it after deploying the Terraform only a subset of the resources we define are visible:

GCP Console showing Load Balancer Basic view

However, by selecting the “advanced menu” link at the bottom of the page, you get an exploded view of the Load Balancer configuration:

GCP Console showing Load Balancer Advanced view

Even in the Advanced view however, you can’t view URL maps directly (referenced by target proxies). URL maps are what glue the HTTP Target Proxy and Backend Service(s) together, and it’s here where you specify any HTTP routing you’d like to apply (similar to AWS ALB Listener Rules, that map a Rule to a Target Group). You can view existing and attached URL maps by opening the target proxy they’re attached to and following the link that way.

An Instance Group is similar to an Auto Scaling Group in AWS, except you can also have Unmanaged Instance Groups which are a manually maintained group of potentially heterogeneous instances.

I used an Unmanaged Instance Group in this scenario, which combined with the Backend Service is similar to an unmanaged/manually maintained Target Group in AWS terms.

Although Health Checks are related to Instance Groups within the GCP console, they’re not directly linked. This means the service that uses the Instance Group (in our case our Load Balancer) can separately choose which Health Check is most appropriate for its use case.

External HTTP Load Balancers provided by GCP don’t run within your VPC – they’re provided as part of a managed service. Because of this and as per the Load Balancing documentation, you have to create a firewall rule that allows traffic from the following source IP ranges (managed Load Balancer) to your private VMs:

130.211.0.0/22
35.191.0.0/16

You can use the instance network tags we set up earlier to restrict where traffic from the load balancer is allowed to go to.

An awkward limitation I found with the advanced section of the Load Balancer web console is that you can’t create all of the configuration from there – you have to create a Load Balancer first using the basic wizard, and only then can you edit the advanced elements.

Scaling out

So at this point I have traffic flowing via a Load Balancer to my single instance which is pretty neat, but how can I demonstrate balancing traffic between two instances? Add another instance in TF and hook it up to our webservers instance group:

resource "google_compute_instance" "vm_instance_2" {
  name         = "nginx-instance-2"
  machine_type = "f1-micro"

  # ...copy/paste of existing instance config
}

resource "google_compute_instance_group" "webservers" {
  name        = "terraform-webservers"
  description = "Terraform test instance group"

  instances = [
    google_compute_instance.vm_instance.self_link,
    google_compute_instance.vm_instance_2.self_link
  ]

  named_port {
    name = "http"
    port = "8080"
  }
}

And voila! We have a working example that can load balance requests between two instances… result!

Gotchas

Working with GCP and Terraform, a few gotchas caught me out.

  • Terraform defaults a lot of resource parameters if you don’t specify them. Many of these defaults are sensible (and I suspect Terraform takes a similar approach with AWS), but if you’re not aware of what they default to they can quickly conflict with the settings you do specify, and it took me a while to identify which parameters were clashing. GCP also wasn’t overly helpful in guiding me towards the conflicts it reported back through Terraform.
  • It also appears that some parameters have different defaults between their regional and global resource counterparts, so when switching between the two beware that you don’t unintentionally introduce config conflicts.
  • The field names aren’t always consistent between the web console and Terraform, which is something to watch out for. For example, in a Backend Service the console refers to “Named port”, whereas in Terraform it’s port_name.
  • The last one is for me to work on (and to find the right tooling for): the lack of compile-time checking (compared to something like CDK) slowed me down. I had to deploy to find out whether I was incorrectly mixing regional resources with global ones, which made for a longer feedback loop.

Final Thoughts

In conclusion, my first impression of GCP is that it’s not too dissimilar to the offerings provided by AWS, once you understand the subtle differences in behaviour and terminology. The UI feels more responsive and looks slicker in my opinion, especially compared to the current mixture of old and new UIs strewn across the AWS services.

Creating resources in GCP with Terraform was straightforward enough. The fact that VPCs are global and resources are displayed at a global level lets you see your whole estate in one view, which I like. You just need to be mindful of regional vs global resources – specifically which permutations of the two you can use, and the pros and cons of each.

How to improve this setup

  • Replace the Unmanaged Instance Group with a Managed one. This would be similar to using an Auto Scaling Group in AWS, which can elastically scale instances and replace them in the event of failures. For the purposes of this post I wanted to understand all of the pieces and how they fit together, but it wouldn’t be too difficult to convert what’s here to use a Managed Instance Group instead.
  • In GCP you can use “Container-Optimized” OS images that start a specified Docker image when the VM boots. This would remove the need for the metadata_startup_script, saving a good few minutes when provisioning new VMs. However, for managing containerised applications I’d recommend something more comprehensive, such as Google Kubernetes Engine (GKE).
  • If the containerisation route isn’t an option, you could consider ways to provision your VMs in a repeatable and idempotent way – for example, employing the likes of Ansible or Chef to do the provisioning at boot time, or building an OS image with something like Packer to speed up deployment.

Learning next steps

Now that I’ve gained a basic understanding of the GCP platform and how to deploy resources with Terraform, my next explorations will be into:

  • GKE – how to automate the provisioning of Docker containers, using a combination of Terraform to provision GKE and Kubernetes to run a fleet of containers. This would be similar to using AWS ECS or EKS.
  • Serverless services – now that I understand more about the lower-level networking concepts, I’ll look to explore and compare GCP’s offerings to the likes of AWS Lambda, Step Functions, SNS, SQS etc.

Performance Tuning Next.js

TL;DR: Next.js 9.3 introduces getStaticPaths, which allows you to generate a data-driven list of pages to render at build time, potentially allowing you to bypass server-side rendering for some use cases. You can now also use the fallback property to dynamically build pages on request, and serve the generated html instead.

On a recent project we built a website for a client using a combination of Next.js and Contentful headless CMS. The goal of the website was to offer a responsive experience across all devices whilst keeping load times to a minimum and supporting SEO.

I rather like Next.js – it combines the benefits of React with Server Side Rendering (SSR) and static html builds, enabling caching for quick initial page loads and SEO support. Once the cached SSR page has been downloaded, Next.js “hydrates” the page with React and all of the page components, completely seamlessly to the user.

The website is deployed to AWS using CloudFront and Lambda@Edge as our CDN and SSR platform. It works by executing a lambda for Origin Requests and caching the results in CloudFront. Regardless of where the page is rendered (client or server), Next.js runs the same code – which in our case queries Contentful for the content to display – so one code path neatly handles both scenarios.

During testing, we noticed that page requests that weren’t cached in CloudFront could take up to 10 seconds to render. Although this only affects requests that miss the cache, it wasn’t acceptable to us, as it impacts every page that needs to be server-side rendered and is replicated for every edge location in CloudFront. It only affects the first page load of a visitor’s session, however, as subsequent requests are handled client-side and only the new page content and assets are downloaded.

Whilst investigating the issue we spotted that the majority of processing time was spent in the lambda. We added extra logging to output the elapsed time at various points in the lambda, and then created custom CloudWatch metrics from these to identify where most of the time was incurred.

We identified that the additional overhead came from requiring the requested page’s JavaScript file, which is embedded within the lambda and loaded dynamically. Pages are loaded dynamically to avoid loading every page’s assets when only a single page needs rendering, which would add considerable and unnecessary startup time to the lambda.

The lambda we used was based on the Next.js plugin available for the Serverless Framework, but as we were using Terraform we took the bits we needed from it to make it work: https://github.com/serverless-nextjs/serverless-next.js/blob/master/README.md

Due to the overhead of the require statement, we experimented with the resource allocation given to the lambda. It was initially set to 128MB, so we tried various configurations and applied load against the website using JMeter to see whether extra resources improved responsiveness.

We found that by tweaking the memory allocation of the lambda we could improve the average startup time from ~10 seconds to ~2 seconds. The sweet spot was 368MB, just as the curve begins to flatten out. On the surface, increasing from 128MB to 368MB roughly triples our lambda costs, but these are negligible as the lambda only runs on cache misses, with most of our requests served from the CloudFront cache. That said, adding yet more resources for the sake of milliseconds would be superfluous and more expensive.
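For reference, if your function is managed in Terraform then the change is a single attribute on the function resource. This is only a sketch – the names, handler, runtime and role below are illustrative rather than our actual configuration:

resource "aws_lambda_function" "ssr_origin_request" {
  function_name = "nextjs-ssr-origin-request" # illustrative name
  filename      = "lambda.zip"
  handler       = "index.handler"
  runtime       = "nodejs12.x"
  role          = aws_iam_role.ssr_lambda.arn # assumes an existing execution role

  memory_size = 368  # the sweet spot we settled on, up from the default of 128
  publish     = true # Lambda@Edge requires a published function version
}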

This improvement in speed was good enough for us, considering it impacted only a small percentage of visitors. A colleague afterwards suggested a couple of further refinements that would reduce the impact even more. These would have required additional development effort, which wasn’t possible for us at the time, but they would make the website really responsive for all visitors.

Other strategies for mitigating the cold start issue

Multiple cache behaviours for different paths

By identifying which areas of your website are updated more often than others, you can mitigate the cold start issue by tweaking the cache expiries associated with those paths in CloudFront. For example, your homepage may change several times a day, whereas news articles, once published, might stay fairly static. In this case you could apply a short cache expiry to the root of your website / and a longer one to /news/*.
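If your distribution is managed in Terraform (as ours was), that looks roughly like the sketch below – the origin details and TTL values are placeholders, and the Lambda@Edge association is omitted for brevity:

resource "aws_cloudfront_distribution" "website" {
  enabled = true

  origin {
    domain_name = "origin.example.com" # placeholder for your real origin
    origin_id   = "nextjs-origin"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  # Homepage (and anything unmatched): short expiry, it changes often
  default_cache_behavior {
    target_origin_id       = "nextjs-origin"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    default_ttl            = 300

    forwarded_values {
      query_string = true
      cookies {
        forward = "none"
      }
    }
  }

  # News articles: rarely change once published, so cache them for longer
  ordered_cache_behavior {
    path_pattern           = "/news/*"
    target_origin_id       = "nextjs-origin"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    default_ttl            = 86400

    forwarded_values {
      query_string = true
      cookies {
        forward = "none"
      }
    }
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}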

Invalidating CloudFront caches proactively

You could also proactively invalidate CloudFront caches whenever content on your website changes. CloudFront allows you to specify a path to evict, so you can be very specific about what you invalidate. In our scenario, we could use Contentful webhooks to be notified when a piece of content is updated or removed, and use a lambda to trigger a cache invalidation for that path.

Generating dynamic pages at build time

As of Next.js 9.3 there is a getStaticPaths function, which allows you to generate dynamic pages (those that use placeholders, e.g. /news/[article-uri]) at build time. This can significantly reduce the need for SSR, depending on your use case.

Initially you had to generate all of these pages as part of your build, which could be quite inefficient (e.g. rebuilding a website with thousands of blog articles every time a new one is published). However, as of Next.js 9.3 you can also generate static pages on demand, using the fallback key on getStaticPaths.

In our project, we could use Contentful webhooks to trigger website builds, passing the URI of the new page into the build pipeline to specify which part of the website to rebuild. If you have a page template for /news/*, for example, you might have to trigger a rebuild of all news articles.

Doing this would negate a lot of the above: we could build most of the website upfront, and new blog articles would then be built on demand when visitors first access them. Next.js’ fallback functionality tells you when a page is being built for the first time, allowing you to present an intermediary “page loading” screen to the first visitor who triggers the build, giving them visual feedback and keeping them engaged whilst the page is generated behind the scenes.

Hopefully this overview gives you some understanding of the potential performance issues faced when using SSR with Next.js, and also the variety of options available to you when tuning your application.

More details of Next.js’ Server Side Rendering and Static Generation capabilities can be found here:

https://nextjs.org/blog/next-9-4

https://nextjs.org/docs/basic-features/data-fetching

https://nextjs.org/docs/basic-features/pages