How It's Made

Creating a Logical Replica from a Snapshot in RDS Postgres

Christos Christoudias

Christos Christoudias

Feb 5, 2020

In today’s world of data breaches and privacy concerns, every measure you can take to protect your data is another brick in the wall defending you. At Instacart, no matter how confident we feel about our security, we are always evaluating ways to make it even better.

A few years ago, Amazon Web Services (AWS) introduced encryption-at-rest for their Relational Database Service (RDS) for Postgres. It is a simple feature to include for new databases, but adding it to existing instances is a challenge. Following AWS’ recommended procedure, we would have to:

  1. Take our database offline
  2. Take a snapshot
  3. Encrypt the snapshot
  4. Restore the snapshot

For a database with terabytes of storage, following this procedure would mean many hours of downtime. We only have a two hour window between our last deliveries in Hawaii and our first delivery on the East Coast. Absorbing 3 hours of downtime isn’t feasible for a company of Instacart’s size.

A feature like encryption-at-rest is a requirement. Our shoppers and customers deserve it.

As we explored our options, we took a long look at logical replication which is available with Postgres 10+. In logical replication, row changes are sent to a replica instead of physical pages. This means that the replica can be a different engine version, on a different filesystem or have a different schema altogether.

Creating a logical replica from scratch traditionally means copying all of the data from the primary to the replica. But for an instance that’s a few terabytes large, it would take over a week for the replica to be caught up after being created. Although that’s feasible to work with, that would mean running all schema changes on both instances and adding lots of time to running simple tests and simulations.

We wondered if we could use the RDS snapshot/restore process that was already in place and have the restored snapshot resume replication from its restore point. We weren’t able to find any kind of documentation saying it was possible. When we sought feedback from experts at AWS, they told us it was not possible.

However one of our core values at Instacart is “Of course, but maybe”, meaning we should never assume anything is impossible or until we prove it is or isn’t ourselves.

True to our values, we set out to prove it was possible… or prove it was not.

It was.

Here is how you do it.

Step 1: Create a Subscription and Replication Slot

Image for post

On the origin, create a publication and a corresponding replication slot for the logical replica to connect to.

database_name=> CREATE PUBLICATION encrypted_replica1 FOR ALL TABLES;
CREATE PUBLICATION
database_name=> SELECT * FROM pg_create_logical_replication_slot('encrypted_replica1', 'pgoutput');
     slot_name     |    lsn
-------------------+----------
encrypted_replica1 | C4A1/715F3088

Now there is a replication slot with no subscriber. Every change that is happening on your database will queue up in the slot until a replica connects to apply and drain the changes.

Step 2: Create Snapshot and Encrypt

Image for post

While the replication slot stores your changes, take a snapshot:

1. Open the AWS console and browse to your instance
2. Select “Take Snapshot” and enter a name for your snapshot
3. Wait a few minutes or hours, depending the size of the changes since the last snapshot

In our case, the primary purpose was to create an encrypted instance so we’ll do that here. Do this by simply copying the snapshot and clicking the box that says “Enable Encryption”

Step 3: Restore Snapshot

Image for post

Once the snapshot is encrypted (if you so desire), restore the snapshot to an instance. Select the basic settings you want like the instance size, IOPS, etc…

Image for post

Once the instance is restored, navigate to it in the AWS console and go to the Logs and Events tab. Scroll down to the logs and page all the way to the end where the most recent log file is. View the most recent log file, and look for a line that includes “redo done at”.

Image for post

019-06-13 03:40:28 UTC::@:[7899]:LOG: redo done at C4A1/7C021F48

The C4A1/7C021F48 represents something called the Log Sequence Number (LSN). This represents the last entry in the Write-Ahead-Log (WAL) that Postgres uses to write transactions. This is the point in time that the Snapshot was restored at.

You can confirm that this is the proper current point in time by looking up the current LSN on the restored snapshot with SELECT pg_current_wal_lsn();

You then create a subscription from the restored snapshot to the original database, but don’t enable it yet.

CREATE SUBSCRIPTION logical_replica1
CONNECTION 'postgres://adminusername:password@<origin-db URL>/dbname'
PUBLICATION logical_replica1 --note that this is the same as the publication name on origin
WITH (
  copy_data = false,
  create_slot = false,
  enabled = false,
  synchronous_commit = false,
  connect = true,
  slot_name = 'logical_replica1'
);

Step 4: Advance Replication Slot in Sync with Snapshot

Image for post

So now you have a replication slot that was created at one LSN, and an instance from a snapshot that was taken at a later LSN. You now need to advance the replication slot so its LSN lines up with the restored snapshot.

From the destination database, start by getting the “roname”.

database_name=> SELECT * FROM pg_replication_origin;
roident  | roname
---------+---------------
2        | pg_3474457851
(1 row)

Use the “roname” and the LSN from the AWS logs to advance the replication slot.

-- SELECT pg_replication_origin_advance('<roname>', '<LSN from the console>');SELECT pg_replication_origin_advance('pg_3474457851', 'C4A1/7C021F48');

You have now advanced the replication slot to line up with the restored snapshot’s current LSN.

Step 5: Enable Subscription and Drain Replication Slot

Image for post

The final step is to enable the subscription on the destination database.

ALTER SUBSCRIPTION logical_replica1 ENABLE;

You’ll be able to see the status of the replication slot by querying it from the origin.

database_name=> SELECT slot_name, confirmed_flush_lsn as flushed, pg_current_wal_lsn(), (pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance FROM pg_catalog.pg_replication_slots WHERE slot_type = 'logical';slot_name        |    flushed    | pg_current_wal_lsn | lsn_distance
-----------------+---------------+--------------------+------------
logical_replica1 | 47D97/CF32980 | 47D97/CF3BAC8      | 37192(1 row)

The confirmed_flush_lsn represents the last LSN that was sent to the replica. The pg_current_wall_lsn represents where the database is now. A lsn_distance of 0 means the replica is caught up.

When restoring from a snapshot, the new RDS instance will start all of its reads by hitting S3 so it’ll be slow. This means the LSN distance might actually grow even after you start “draining” the slot. As long as the flushed LSN keeps moving, it’ll eventually catch up.

If there are any problems, you’ll see them in the RDS logs of the logical replica.

Promoting the Logical Replica

In most cases, you are creating a logical replica to replace your primary database. Promoting it is a matter of a few simple steps:

  1. Pause writes on the primary
  2. Terminate the subscription from the destination to the origin
  3. Reset all sequences in the destination
  4. Point all traffic to the new primary

The very first time we successfully stood up the logical replica, we felt like it was too good to be true. The first thing we did was make sure there wasn’t any data missing. We double checked it. We triple checked it. We checked it a few other ways. It was all there.

We let it run for a few days to see if replication would run into issues. It kept humming along with no issues at all.

We repeated the process a 2nd and 3rd time. Each time, it connected flawlessly.

Almost a year later, we have successfully stood up over two dozen logical replicas from RDS snapshots. This method is extremely versatile, enabling us to perform multiple upgrades to our databases, including migrating to encryption-at-rest, performing major schema changes, and creating logical shards.

Although each promotion of a logical replica feels more and more comfortable, the feeling of elation that it actually worked never gets old🍾.

Want to work on challenges like these? Instacart Engineering and Product is hiring! Check out our current openings.

Special thanks and credit to Marco Montagna for providing most of the creativity behind this technique.

Christos Christoudias

Christos Christoudias

Christos Christoudias is a member of the Instacart team. To read more of Christos Christoudias's posts, you can browse the company blog or search by keyword using the search bar at the top of the page.

Most Recent in How It's Made

One Model to Serve Them All: How Instacart deployed a single Deep Learning pCTR model for multiple surfaces with improved operations and performance along the way

How It's Made

One Model to Serve Them All: How Instacart deployed a single Deep Learning pCTR model for multiple surfaces with improved operations and performance along the way

Authors: Cheng Jia, Peng Qi, Joseph Haraldson, Adway Dhillon, Qiao Jiang, Sharath Rao Introduction Instacart Ads and Ranking Models At Instacart Ads, our focus lies in delivering the utmost relevance in advertisements to our customers, facilitating novel product discovery and enhancing…...

Dec 19, 2023
Monte Carlo, Puppetry and Laughter: The Unexpected Joys of Prompt Engineering

How It's Made

Monte Carlo, Puppetry and Laughter: The Unexpected Joys of Prompt Engineering

Author: Ben Bader The universe of the current Large Language Models (LLMs) engineering is electrifying, to say the least. The industry has been on fire with change since the launch of ChatGPT in November of…...

Dec 19, 2023
Unveiling the Core of Instacart’s Griffin 2.0: A Deep Dive into the Machine Learning Training Platform

How It's Made

Unveiling the Core of Instacart’s Griffin 2.0: A Deep Dive into the Machine Learning Training Platform

Authors: Han Li, Sahil Khanna, Jocelyn De La Rosa, Moping Dou, Sharad Gupta, Chenyang Yu and Rajpal Paryani Background About a year ago, we introduced the first version of Griffin, Instacart’s first ML Platform, detailing its development and support for end-to-end ML in…...

Nov 22, 2023