pg_parquet extension EARLY ACCESS
The pg_parquet extension allows you to read and write Parquet files located in S3, Azure Blob Storage, Google Cloud Storage, HTTP(S) endpoints, or the local file system, from PostgreSQL via the COPY TO/FROM commands. It depends on the Apache Arrow project to read and write Parquet files, and on the pgrx project to extend PostgreSQL's COPY command.
Enable pg_parquet
To enable the pg_parquet extension:
- Add pg_parquet to shared_preload_libraries in the PostgreSQL server configuration parameters using the YB-TServer --ysql_pg_conf_csv flag:

  --ysql_pg_conf_csv="shared_preload_libraries='pg_parquet'"

  Note that modifying shared_preload_libraries requires restarting the YB-TServer.

- Enable the extension:
CREATE EXTENSION pg_parquet;
Use pg_parquet
You can use pg_parquet to do the following:
- Export tables or queries to Parquet files.
- Ingest data from Parquet files to tables.
- Inspect the schema and metadata of Parquet files.
COPY to/from Parquet files from/to YSQL tables
You can use the COPY command to read from and write to Parquet files. The following example writes a YSQL table with complex types to a Parquet file, and then reads the Parquet file content back into the same table:
-- create composite types
CREATE TYPE product_item AS (id INT, name TEXT, price float4);
CREATE TYPE product AS (id INT, name TEXT, items product_item[]);
-- create a table with complex types
CREATE TABLE product_example (
id int,
product product,
products product[],
created_at TIMESTAMP,
updated_at TIMESTAMPTZ
);
-- insert some rows into the table
INSERT INTO product_example VALUES (
1,
ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
ARRAY[ROW(1, NULL, NULL)::product, NULL],
now(),
'2022-05-01 12:00:00-04'
);
-- copy the table to a parquet file
COPY product_example TO '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');
-- show table
SELECT * FROM product_example;
-- copy the parquet file to the table
COPY product_example FROM '/tmp/product_example.parquet';
-- show table
SELECT * FROM product_example;
Inspect Parquet schema
Use the following SELECT query to discover the schema of the Parquet file at a given URI:
SELECT * FROM parquet.schema('/tmp/product_example.parquet') LIMIT 10;
uri | name | type_name | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type
------------------------------+--------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------
/tmp/product_example.parquet | arrow_schema | | | | 5 | | | | |
/tmp/product_example.parquet | id | INT32 | | OPTIONAL | | | | | 0 |
/tmp/product_example.parquet | product | | | OPTIONAL | 3 | | | | 1 |
/tmp/product_example.parquet | id | INT32 | | OPTIONAL | | | | | 2 |
/tmp/product_example.parquet | name | BYTE_ARRAY | | OPTIONAL | | UTF8 | | | 3 | STRING
/tmp/product_example.parquet | items | | | OPTIONAL | 1 | LIST | | | 4 |
/tmp/product_example.parquet | list | | | REPEATED | 1 | | | | |
/tmp/product_example.parquet | element | | | OPTIONAL | 3 | | | | 5 |
/tmp/product_example.parquet | id | INT32 | | OPTIONAL | | | | | 6 |
/tmp/product_example.parquet | name | BYTE_ARRAY | | OPTIONAL | | UTF8 | | | 7 | STRING
(10 rows)
Inspect Parquet metadata
Use the following SELECT query to discover the detailed metadata of the Parquet file, such as column statistics, at a given URI:
SELECT uri, row_group_id, row_group_num_rows, row_group_num_columns, row_group_bytes, column_id, file_offset, num_values, path_in_schema, type_name FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;
uri | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type_name
------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+-----------
/tmp/product_example.parquet | 0 | 1 | 13 | 842 | 0 | 0 | 1 | id | INT32
(1 row)
SELECT stats_null_count, stats_distinct_count, stats_min, stats_max, compression, encodings, index_page_offset, dictionary_page_offset, data_page_offset, total_compressed_size, total_uncompressed_size FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;
stats_null_count | stats_distinct_count | stats_min | stats_max | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size
------------------+----------------------+-----------+-----------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------
0 | | 1 | 1 | GZIP(GzipLevel(6)) | PLAIN,RLE,RLE_DICTIONARY | | 4 | 42 | 101 | 61
(1 row)
Use the following SELECT query to discover file level metadata of the Parquet file, such as format version, at a given URI:
SELECT * FROM parquet.file_metadata('/tmp/product_example.parquet');
uri | created_by | num_rows | num_row_groups | format_version
------------------------------+------------+----------+----------------+----------------
/tmp/product_example.parquet | pg_parquet | 1 | 1 | 1
(1 row)
Use the following SELECT query to get custom key-value metadata of the Parquet file at a given URI:
SELECT uri, encode(key, 'escape') as key, encode(value, 'escape') as value FROM parquet.kv_metadata('/tmp/product_example.parquet');
uri | key | value
------------------------------+--------------+---------------------
/tmp/product_example.parquet | ARROW:schema | /////5gIAAAQAAAA ...
(1 row)
Inspect Parquet column statistics
Use the following SELECT query to discover the column statistics of the Parquet file, such as min and max value for the column, at a given URI:
SELECT * FROM parquet.column_stats('/tmp/product_example.parquet');
column_id | field_id | stats_min | stats_max | stats_null_count | stats_distinct_count
-----------+----------+----------------------------+----------------------------+------------------+----------------------
4 | 7 | item 1 | item 2 | 1 |
6 | 11 | 1 | 1 | 1 |
7 | 12 | | | 2 |
10 | 17 | | | 2 |
0 | 0 | 1 | 1 | 0 |
11 | 18 | 2025-03-11 14:01:22.045739 | 2025-03-11 14:01:22.045739 | 0 |
3 | 6 | 1 | 2 | 1 |
12 | 19 | 2022-05-01 19:00:00+03 | 2022-05-01 19:00:00+03 | 0 |
8 | 15 | | | 2 |
5 | 8 | 1 | 2 | 1 |
9 | 16 | | | 2 |
1 | 2 | 1 | 1 | 0 |
2 | 3 | product 1 | product 1 | 0 |
(13 rows)
Object Store support
pg_parquet supports reading and writing Parquet files from and to S3, Azure Blob Storage, HTTP(S), and Google Cloud Storage object stores.
Required roles
To write to an object store location, you need to grant the parquet_object_store_write role to your current postgres user. Similarly, to read from an object store location, you need to grant the parquet_object_store_read role to your current postgres user.
For information on how to grant roles, refer to GRANT.
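For example, assuming a role named app_user (a hypothetical name; substitute your own user), you could grant both roles as follows:

-- app_user is a hypothetical role name
GRANT parquet_object_store_read TO app_user;
GRANT parquet_object_store_write TO app_user;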
S3 Storage
The basic way to configure object storage is by creating the standard ~/.aws/credentials and ~/.aws/config files:
$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
$ cat ~/.aws/config
[default]
region = eu-central-1
Alternatively, you can use environment variables when starting postgres to configure the S3 client as described in the following table:
| Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | Access key ID of the AWS account. |
| AWS_SECRET_ACCESS_KEY | Secret access key of the AWS account. |
| AWS_SESSION_TOKEN | Session token for the AWS account. |
| AWS_REGION | Default region of the AWS account. |
| AWS_ENDPOINT_URL | The endpoint. |
| AWS_SHARED_CREDENTIALS_FILE | An alternative location for the credentials file (only via environment variables). |
| AWS_CONFIG_FILE | An alternative location for the configuration file (only via environment variables). |
| AWS_PROFILE | Name of the profile from the credentials and configuration file (the default profile name is default) (only via environment variables). |
| AWS_ALLOW_HTTP | Allows HTTP endpoints (only via environment variables). |
When more than one configuration source or authorization method is available, the S3 client resolves them in a fixed priority order.
The supported S3 URI formats are as follows:
- s3://<bucket>/<path>
- https://<bucket>.s3.amazonaws.com/<path>
- https://s3.amazonaws.com/<bucket>/<path>
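For example, assuming a bucket named my-bucket (a hypothetical name) and that the required role has been granted, you can export the earlier example table directly to S3:

-- my-bucket and the key prefix are hypothetical; requires the parquet_object_store_write role
COPY product_example TO 's3://my-bucket/exports/product_example.parquet' (format 'parquet');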
Azure Blob Storage
The basic way to configure object storage is by creating the standard ~/.azure/config file:
$ cat ~/.azure/config
[storage]
account = devstoreaccount1
key = Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==
Alternatively, you can use environment variables when starting postgres to configure the Azure Blob Storage client as described in the following table:
| Variable | Description |
|---|---|
| AZURE_STORAGE_ACCOUNT | Storage account name of the Azure Blob. |
| AZURE_STORAGE_KEY | Storage key of the Azure Blob. |
| AZURE_STORAGE_CONNECTION_STRING | Connection string for the Azure Blob (overrides any other configuration). |
| AZURE_STORAGE_SAS_TOKEN | Storage SAS token for the Azure Blob. |
| AZURE_TENANT_ID | Tenant ID for client secret authentication (only via environment variables). |
| AZURE_CLIENT_ID | Client ID for client secret authentication (only via environment variables). |
| AZURE_CLIENT_SECRET | Client secret for client secret authentication (only via environment variables). |
| AZURE_STORAGE_ENDPOINT | The endpoint (only via environment variables). |
| AZURE_CONFIG_FILE | An alternative location for the configuration file (only via environment variables). |
| AZURE_ALLOW_HTTP | Allows HTTP endpoints (only via environment variables). |
When more than one configuration source or authorization method is available, the Azure Blob Storage client resolves them in a fixed priority order.
The supported Azure Blob Storage URI formats are as follows:
- az://<container>/<path>
- azure://<container>/<path>
- https://<account>.blob.core.windows.net/<container>
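For example, assuming a container named my-container (a hypothetical name) and that the required role has been granted, you can read a Parquet file directly from Azure Blob Storage:

-- my-container and the path are hypothetical; requires the parquet_object_store_read role
COPY product_example FROM 'az://my-container/exports/product_example.parquet' (format 'parquet');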
HTTP(S) Storage
HTTPS URIs are supported by default. You can set the ALLOW_HTTP environment variable to allow HTTP URIs.
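For example, the following reads a Parquet file served over HTTPS (the URL is hypothetical):

-- The URL is hypothetical; the endpoint must serve a readable Parquet file
COPY product_example FROM 'https://example.com/exports/product_example.parquet' (format 'parquet');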
Google Cloud Storage
The basic way to configure object storage is by creating a JSON configuration file such as ~/.config/gcloud/application_default_credentials.json (which can be generated by gcloud auth application-default login):
$ cat ~/.config/gcloud/application_default_credentials.json
{
"gcs_base_url": "http://localhost:4443",
"disable_oauth": true,
"client_email": "",
"private_key_id": "",
"private_key": ""
}
Alternatively, you can use the following environment variables when starting postgres to configure Google Cloud Storage as described in the following table:
| Variable | Description |
|---|---|
| GOOGLE_SERVICE_ACCOUNT_KEY | JSON serialized service account key (only via environment variables). |
| GOOGLE_SERVICE_ACCOUNT_PATH | An alternative location for the configuration file (only via environment variables). |
The supported Google Cloud Storage URI format is gs://<bucket>/<path>.
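For example, assuming a bucket named my-bucket (a hypothetical name) and that the required role has been granted, you can export directly to Google Cloud Storage:

-- my-bucket and the path are hypothetical; requires the parquet_object_store_write role
COPY product_example TO 'gs://my-bucket/exports/product_example.parquet' (format 'parquet');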
Copy options
The COPY TO command options that pg_parquet supports are described in the following table:
| Option | Description |
|---|---|
| format parquet | Specify this option to read or write Parquet files that do not end with the .parquet[.<compression>] extension. |
| file_size_bytes <string> | Total file size per Parquet file. When set, Parquet files with the target size are created under a parent directory (named the same as the file name). By default, when not specified, a single file is generated without a parent folder. You can specify total bytes without a unit (for example, file_size_bytes 2000000), or with a unit (KB, MB, or GB; for example, file_size_bytes '1MB'). |
| field_ids <string> | Field IDs that are assigned to the fields in the Parquet file schema. By default, no field IDs are assigned. Pass auto to let pg_parquet generate field IDs. You can pass a JSON string to explicitly provide the field IDs. |
| row_group_size <int64> | Number of rows in each row group while writing Parquet files. Default is 122880. |
| row_group_size_bytes <int64> | Total byte size of rows in each row group while writing Parquet files. Default is row_group_size * 1024. |
| compression <string> | The compression format to use while writing Parquet files. Supported formats are uncompressed, snappy (default), gzip, brotli, lz4, lz4raw, and zstd. If not specified, the format is determined by the file extension. |
| compression_level <int> | The compression level to use while writing Parquet files. This is only supported for gzip, zstd, and brotli. The default is 6 for gzip (0-10), 1 for zstd (1-22), and 1 for brotli (0-11). |
| parquet_version <string> | The writer version of the Parquet file. By default, it is set to v1 for better interoperability. You can set it to v2 to unlock new encodings. |
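For example, the following sketch reuses the earlier example table and a hypothetical output path to write zstd-compressed Parquet with smaller row groups:

-- The output path is hypothetical; the options used are from the table above
COPY product_example TO '/tmp/product_example_zstd.parquet' (format 'parquet', compression 'zstd', compression_level 3, row_group_size 10000);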
The COPY FROM command options that pg_parquet supports are described in the following table:
| Option | Description |
|---|---|
| format parquet | Specify this option to read Parquet files that do not end with the .parquet[.<compression>] extension. |
| match_by <string> | Method to match Parquet file fields to PostgreSQL table columns. Available methods are position (default) and name. Set it to name to match columns by their name rather than by their position in the schema. Matching by name is useful when the field order differs between the Parquet file and the table, but the field names match. |
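For example, the following reuses the file written earlier and matches Parquet fields to table columns by name rather than by position:

-- Matches fields to columns by name; useful when field order differs from the table
COPY product_example FROM '/tmp/product_example.parquet' (format 'parquet', match_by 'name');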