Postgres replication server in GCP out of sync

Keywords: PostgreSQL - Google Cloud Platform - Technical issue - Other
bnsupport ID: d82a3970-0a9b-b004-13c8-0a27eb980c76
Description:
Hello,

I’ve inherited a production and test environment that both have broken replication, unsure of what steps I can take to remedy this.

I didn’t initially set up the servers, but I believe we’re using out-of-the-box
PostgreSQL with Replication from GCP, though it looks like we’ve changed the configuration of the multi-tiered solution so there is one master node and only one slave node instead of two (which I thought I read was the default?).

The issue is that probably months ago at this point the master and slave got out of sync and a WAL segment got deleted from the master that the slave needs. Ergo, logs are almost entirely:

requested WAL segment 000000010000000B00000085 has already been removed

I’m pretty new to working on the cloud and in node clusters. Is the only solution at this point to create a new slave? If so, how would I do that?

I believe that we’re using the default configuration and here’s some other info:

postgres=# SELECT version();
                                                 version                                     
            
---------------------------------------------------------------------------------------------
------------
 PostgreSQL 10.7 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170
516, 64-bit

Please note pg_stat_replication is currently empty when run on the master node:

postgres=# SELECT client_addr, state FROM pg_stat_replication;
 client_addr | state 
-------------+-------
(0 rows)

Hi @helpmeexitvim,

Could you confirm if you are using out-of-the-box and could you check you are running only two VM, you can check in that console of google?

Regards,
Ibone.

Ibone,

I’m not sure how to confirm if our solution is 100% out of the box. From the deployment manager this looks to be the template properties that we set up our VMs with:

type: postgresql.jinja
applicationDatabase: dbName-test
dataDiskSizeGb: 10.0
dataDiskType: pd-ssd
machineType": n1-standard-1
network: default
numSlaves: 1.0
subnetwork: default
zone: us-west1-a

This is happening on two different deployments, the above is for our test environment but our production environment is identical (with the same replication issue).

Does that answer your question? Is this issue because we’re only using one slave node instead of 2?

Hi @helpmeexitvim,

Could you check these links?
https://stackoverflow.com/questions/58771648/streaming-replication-is-failing-with-wal-segment-has-already-been-moved
https://github.com/zalando/patroni/issues/1318

I hope that helps you,
Ibone,

I had not tried those, because they aren’t bitnami-specific. I created a ticket because I’m unfamiliar with Bitnami and since out-of-the-box GCP Postgres with Replication is automatically configured, so I was unsure if there was something extra/special/different I needed to do.

I’ll check out those tickets but it sounds like there’s no difference or special considerations for Bitnami set up.

Hi @helpmeexitvim,

Our guides there about the basic configuration of the application.
If there is a problem with the application iit usually passes anyway, no matter if it’s Bitnami or other.
Like your problem it’s about the application, I refer you to that’s links.

Regards,
Ibone.

Hi Ibone,

Just wanted to update that these solutions ultimately worked.

We already had the configuration backbone, so all that was necessary was:

  1. rm -rf /bitnami/postgresql/data
  2. pg_basebackup -h 10.0.0.126 -U replica -Xs -P -R -D /bitnami/postgresql/data
  3. chown -R postgres.postgres /bitnami/postgresql/data
  4. chmod 700 /opt/bitnami/postgresql/data
  5. sudo -u postgres pg_ctl start -D /opt/bitnami/postgresql/data

Thanks for your assistance, my issue is resolved.

Hi @helpmeexitvim,

Thanks for sharing your solution.
I close the ticket, do not hesitate to write us back if you have any other questions.

Regards,
Ibone