Replicate RDS PostgreSQL to BigQuery using Datastream CDC

RK Kuppala
Published in The Cloudside View
Sep 21, 2022 · 4 min read

Continuing the series of posts we started with Replicate RDS PostgreSQL to BigQuery using Airbyte CDC, in this post, we will cover the all-new Datastream for BigQuery.

GCP has recently released Datastream for BigQuery, which lets you stream data from MySQL, PostgreSQL, AlloyDB, and Oracle databases directly into BigQuery without needing an additional pipeline like Dataflow. This is pretty cool, and we took it for a spin with an RDS PostgreSQL source and BigQuery as the target. So let’s get started!

How does it work?

Datastream is a CDC tool that taps into PostgreSQL’s native logical replication (via the pgoutput plugin) for a PostgreSQL source. All you need to do is set the replica identity on the tables you are interested in replicating, create a publication and a replication slot, and off you go!
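Before anything else, logical replication must be enabled on the source. On RDS, this means setting rds.logical_replication = 1 in the instance’s parameter group (a reboot is required for it to take effect). A quick way to verify, assuming you can connect with psql:

SHOW wal_level;
-- Expected output: logical
-- If it still says "replica", the parameter group change has not taken effect yet.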

Prepare RDS PostgreSQL

We assume that the AWS and GCP VPCs have already been connected with a site-to-site VPN. Let’s create a dedicated user for Datastream replication with the minimum necessary privileges (read-only).
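Here is a minimal sketch of what that could look like; the user name, password, and schema are placeholders, so adjust them for your environment. Note that on RDS you grant the rds_replication role rather than using ALTER USER ... REPLICATION:

-- Hypothetical user name and password; replace with your own.
CREATE USER datastream_user WITH PASSWORD 'change-me';

-- RDS-managed role that allows the user to consume replication slots.
GRANT rds_replication TO datastream_user;

-- Read-only access to the schema being replicated (myschema, as used below).
GRANT USAGE ON SCHEMA myschema TO datastream_user;
GRANT SELECT ON ALL TABLES IN SCHEMA myschema TO datastream_user;

-- Make sure tables created later are readable too.
ALTER DEFAULT PRIVILEGES IN SCHEMA myschema GRANT SELECT ON TABLES TO datastream_user;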

Publication, Replication slot, and Datastream

We will now create a publication with all the tables we are interested in replicating to BigQuery. Before that, set the replica identity to DEFAULT. Repeat this for all the tables you intend to sync:

ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;
...
ALTER TABLE tbl_n REPLICA IDENTITY DEFAULT;
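DEFAULT records the old values of the primary key columns on updates and deletes, so tables without a primary key may need REPLICA IDENTITY FULL instead. To confirm the setting took effect, you can check pg_class ('d' means DEFAULT, 'f' means FULL):

SELECT relname, relreplident
FROM pg_class
WHERE relname IN ('table01', 'table02', 'table03', 'table04', 'table05');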

Create a publication with all the tables you intend to sync:

CREATE PUBLICATION pub_datastream 
FOR TABLE myschema.table01
,myschema.table02
,myschema.table03
,myschema.table04
,myschema.table05;
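To double-check which tables made it into the publication, query the catalog:

SELECT * FROM pg_publication_tables WHERE pubname = 'pub_datastream';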

Create a replication slot:

SELECT pg_create_logical_replication_slot('datastream', 'pgoutput');
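You can confirm the slot exists and uses the pgoutput plugin; "active" will stay false until Datastream connects:

SELECT slot_name, plugin, slot_type, active
FROM pg_replication_slots
WHERE slot_name = 'datastream';

One thing to keep in mind: an unconsumed slot prevents PostgreSQL from recycling WAL, so if you abandon the setup, drop the slot with pg_drop_replication_slot('datastream').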

Go to Datastream and start configuring the stream. The options are self-explanatory.

In the next step, you will be asked to configure a connection profile, which is basically how Datastream connects to RDS. In our case, we already had the AWS-to-GCP VPN set up, so we used the internal IP address. You may also use the public endpoint with firewall rules allowing access.

Next, fill in the replication slot name and the publication name we created above.

Choose the tables you want to sync and, similar to the source connection profile, configure a destination connection profile, which is BigQuery in this case.

Once you are done, you should see a summary screen like this:

Click on “START” on the stream page. Sit back and relax, because Datastream will first backfill all existing data.

Monitoring

The great thing about the Datastream to BigQuery integration is its simplicity. While being extremely simple to set up, it also has comprehensive monitoring to track unsupported events, sync status, the number of rows processed, and so on.

You can also go to the stream dashboard to see all the tables that are currently part of the sync and check their individual status. If needed, you can initiate a manual backfill (full load) at the individual table level.

It also has a useful data freshness tracker.
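Each replicated table also carries a datastream_metadata column with the UUID and source timestamp of the latest change for each row, so you can sanity-check freshness straight from BigQuery. A minimal sketch, assuming your destination dataset is named mydataset (the table name comes from our publication):

-- mydataset is a placeholder; datastream_metadata is the column Datastream
-- adds to every replicated table.
SELECT MAX(datastream_metadata.source_timestamp) AS latest_source_change
FROM `mydataset.table01`;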

And best of all, the log viewer!

Hope you found this useful! Happy data ingestion :)
