S3 (Official, Premium)

The CloudQuery S3 source plugin reads parquet files and loads them into any supported CloudQuery destination (e.g. PostgreSQL, BigQuery, Snowflake, and more).

Publisher: cloudquery
Latest version: v1.3.23
Type: Source
Price per 1M rows: Starting from $15
Monthly free quota: 1M rows

Set up process #

1. Download the CloudQuery CLI and log in (see the installation options for other platforms):

brew install cloudquery/tap/cloudquery

2. Create the source and destination configuration files (s3.yml and postgresql.yml; see the Configuration section below).

3. Run the sync:

cloudquery sync s3.yml postgresql.yml

Overview #

The CloudQuery S3 source plugin reads parquet files and loads them into any supported CloudQuery destination (e.g. PostgreSQL, BigQuery, Snowflake, and more). The plugin treats each unique prefix in the S3 bucket as a separate table; for objects in the root of the bucket, the table name is derived from the object name. For example, if your S3 bucket contains the following objects:
s3://<bucket>/datafile_0.parquet
s3://<bucket>/datafile_1.parquet
s3://<bucket>/data/2024/datafile_1.parquet
s3://<bucket>/data/2024/02/14/14/15/datafile_3.parquet
s3://<bucket>/data/2024/02/14/14/15/datafile_4.parquet
CloudQuery will sync the following tables:
datafile_0.parquet --> datafile_0
datafile_1.parquet --> datafile_1
data/2024/datafile_1.parquet --> data_2024
data/2024/02/14/14/15/datafile_3.parquet --> data_2024_02_14_14_15
data/2024/02/14/14/15/datafile_4.parquet --> data_2024_02_14_14_15
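
To preview which tables a sync will create, you can list the objects in the bucket first. The command below is a minimal sketch, assuming the AWS CLI is installed and configured with read access to the bucket; replace <bucket> with your bucket name.

# List every object in the bucket; each unique prefix becomes one table during the sync
aws s3 ls s3://<bucket>/ --recursive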

Configuration #

kind: source
spec:
  name: s3

  path: cloudquery/s3
  registry: cloudquery
  version: "v1.3.23"
  tables: ["*"]
  destinations: ["postgresql"]

  # Learn more about the configuration options at https://cql.ink/s3_source
  spec:
    # Required parameters
    bucket: "<BUCKET_NAME>"
    region: "<REGION>"
    # Optional parameters
    # path_prefix: ""
    # rows_per_record: 500
    # concurrency: 50
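
The sync command in the setup steps also expects a destination configuration (postgresql.yml). The snippet below is a minimal sketch assuming the PostgreSQL destination plugin; the version and connection string are placeholders, so check the destination plugin's documentation for the current values.

kind: destination
spec:
  name: postgresql
  path: cloudquery/postgresql
  registry: cloudquery
  version: "<LATEST_POSTGRESQL_DESTINATION_VERSION>" # placeholder: use the latest published version
  spec:
    connection_string: "${POSTGRESQL_CONNECTION_STRING}" # placeholder: read from an environment variable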

S3 spec #

This is the (nested) spec used by the S3 source plugin. A filled-in example follows the list below.
  • bucket (string) (required)
    The name of the S3 bucket that will be synced.
  • region (string) (required)
    The AWS region of the S3 bucket.
  • path_prefix (string) (optional) (default: "")
    The path prefix that will limit the files to sync.
  • filetype (string) (optional) (default: parquet)
    Type of file that will be synced. Currently only parquet is supported.
  • rows_per_record (integer) (optional) (default: 500)
    Number of rows packed into a single Apache Arrow record sent over the wire during the sync.
  • concurrency (integer) (optional) (default: 50)
    Number of objects to sync in parallel. Negative values mean no limit.
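
For illustration, here is a filled-in nested spec that limits the sync to a single prefix and tunes batching and concurrency. The values are examples only; any optional field you omit falls back to its default.

  spec:
    bucket: "<BUCKET_NAME>"
    region: "us-east-1"        # example region
    path_prefix: "data/2024/"  # only sync objects under this prefix
    rows_per_record: 1000      # pack more rows into each Arrow record
    concurrency: 10            # sync at most 10 objects in parallel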

