dimitri/pgcopydb

By dimitri

Updated 7 months ago

Copy a Postgres database to a target Postgres server (pg_dump | pg_restore on steroids)

Image

6.4K

pgcopydb

pgcopydb is a tool that automates running pg_dump | pg_restore between two running Postgres servers. To make a copy of a database to another server as quickly as possible, one would like to use the parallel options of pg_dump and still be able to stream the data to as many pg_restore jobs.

The idea would be to use pg_dump --jobs=N --format=directory postgres://user@source/dbname | pg_restore --jobs=N --format=directory -d postgres://user@target/dbname in a way. This command line can't be made to work, unfortunately, because pg_dump --format=directory writes to local files and directories first, and then later pg_restore --format=directory can be used to read from those files again.

When using pgcopydb it is possible to achieve the result outlined before with this simple command line:

$ export PGCOPYDB_SOURCE_PGURI="postgres://user@source.host.dev/dbname"
$ export PGCOPYDB_TARGET_PGURI="postgres://role@target.host.dev/dbname"

$ pgcopydb copy-db --table-jobs 8 --index-jobs 2

Then pgcopydb implements the following steps:

  1. pgcopydb produces pre-data section and the post-data sections of the dump using Postgres custom format.

  2. The pre-data section of the dump is restored on the target database, creating all the Postgres objects from the source database into the target database.

  3. pgcopydb gets the list of ordinary and partitioned tables and for each of them runs COPY the data from the source to the target in a dedicated sub-process, and starts and control the sub-processes until all the data has been copied over.

    Postgres catalog table pg_class is used to get the list of tables with data to copy around, and the reltuples is used to start with the tables with the greatest number of rows first, as an attempt to minimize the copy time.

  4. In each copy table sub-process, as soon as the data copying is done, then pgcopydb gets the list of index definitions attached to the current target table and creates them in parallel.

    The primary indexes are created as UNIQUE indexes at this stage.

    Then the PRIMARY KEY constraints are created USING the just built indexes. This two-steps approach allows the primary key index itself to be created in parallel with other indexes on the same table, avoiding an EXCLUSIVE LOCK while creating the index.

  5. Then VACUUM ANALYZE is run on each target table as soon as the data and indexes are all created.

  6. The final stage consists now of running the rest of the post-data section script for the whole database, and that's where the foreign key constraints and other elements are created.

    The post-data script is filtered out using the pg_restore --use-list option so that indexes and primary key constraints already created in step 4. are properly skipped now.

    This is done by the per-table sub-processes sharing the dump IDs of the post-data items they have created with the main process, which can then filter out the pg_restore --list output and comment the already created objects from there, by dump ID.

Using the docker container

Here is a docker image definition that uses and extend this image:

FROM pgcopydb

USER root
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    git \
	&& rm -rf /var/lib/apt/lists/*

WORKDIR /usr/src/
RUN git clone --depth 1 https://github.com/devrimgunduz/pagila.git

WORKDIR /usr/src/pgcopydb
COPY ./copydb.sh copydb.sh

USER docker
CMD /usr/src/pgcopydb/copydb.sh

Here is a docker-compose file as found in the tests/pagila directory of the project:

version: "3.9"  # optional since v1.27.0
services:
  source:
    image: postgres:13-bullseye
    expose:
      - 5432
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: h4ckm3
      POSTGRES_HOST_AUTH_METHOD: trust
  target:
    image: postgres:13-bullseye
    expose:
      - 5432
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: h4ckm3
      POSTGRES_HOST_AUTH_METHOD: trust
  pgcopydb:
    build: .
    environment:
      PGCOPYDB_SOURCE_PGURI: postgres://postgres:h4ckm3@source/postgres
      PGCOPYDB_TARGET_PGURI: postgres://postgres:h4ckm3@target/postgres
      PGCOPYDB_TARGET_TABLE_JOBS: 4
      PGCOPYDB_TARGET_INDEX_JOBS: 2

Authors

License

Copyright (c) The PostgreSQL Global Development Group.

This project is licensed under the PostgreSQL License, see LICENSE file for details.

This project includes bundled third-party dependencies, see NOTICE file for details.

Docker Pull Command

docker pull dimitri/pgcopydb