S3 (Official, Premium)
Publisher: cloudquery
Latest version: v1.8.5
Type: Source
Overview #
The CloudQuery S3 source plugin reads parquet files and loads them into any supported CloudQuery destination (e.g. PostgreSQL, BigQuery, Snowflake, and more). The plugin treats each unique prefix in the S3 bucket as a separate table; for objects in the root of the bucket, the table name is the name of the object. For example, if you have the following objects in your S3 bucket:
s3://<bucket>/datafile_0.parquet
s3://<bucket>/datafile_1.parquet
s3://<bucket>/data/2024/datafile_1.parquet
s3://<bucket>/data/2024/02/14/14/15/datafile_3.parquet
s3://<bucket>/data/2024/02/14/14/15/datafile_4.parquet
CloudQuery will sync the following tables:
datafile_0.parquet --> datafile_0
datafile_1.parquet --> datafile_1
data/2024/datafile_1.parquet --> data_2024
data/2024/02/14/14/15/datafile_3.parquet --> data_2024_02_14_14_15
data/2024/02/14/14/15/datafile_4.parquet --> data_2024_02_14_14_15
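In other words, the table name is derived from the object's key prefix, with path separators replaced by underscores; root-level objects fall back to the file name without its extension. A minimal sketch of this naming rule, written in Python for illustration (not the plugin's actual implementation):
import posixpath

def table_name(key: str) -> str:
    """Derive a destination table name from an S3 object key.

    Objects that share a prefix map to the same table; objects at the
    bucket root are named after the file itself (extension stripped).
    """
    prefix, filename = posixpath.split(key)
    if prefix:  # nested object: the prefix becomes the table name
        return prefix.replace("/", "_")
    # root object: use the file name without its extension
    return posixpath.splitext(filename)[0]

assert table_name("datafile_0.parquet") == "datafile_0"
assert table_name("data/2024/datafile_1.parquet") == "data_2024"
assert table_name("data/2024/02/14/14/15/datafile_3.parquet") == "data_2024_02_14_14_15"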
Authentication #
The plugin needs to be authenticated with your account(s) in order to read from your S3 bucket.
The plugin requires s3:GetObject and s3:ListBucket permissions on the bucket and objects that you are trying to sync (a sample policy is shown after this list). There are multiple ways to authenticate with AWS, and the plugin respects the AWS credential provider chain. This means that CloudQuery will attempt to authenticate in the following order of priority:
- The AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables.
- The credentials and config files in ~/.aws (the credentials file takes priority).
- aws sso: you can also use aws sso to authenticate CloudQuery; you can read more about it here.
- IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers).
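For reference, a minimal IAM policy granting these permissions might look like the following (replace the bucket-name placeholder with your own):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::<BUCKET_NAME>"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::<BUCKET_NAME>/*"
    }
  ]
}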
Environment Variables #
CloudQuery can use the credentials from the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables (AWS_SESSION_TOKEN can be optional for some accounts).
For information on obtaining credentials, see the AWS guide. To export the environment variables (on Linux/Mac; similar for Windows):
export AWS_ACCESS_KEY_ID='{Your AWS Access Key ID}'
export AWS_SECRET_ACCESS_KEY='{Your AWS secret access key}'
export AWS_SESSION_TOKEN='{Your AWS session token}'
Shared Configuration files #
The plugin can use credentials from your credentials and config files in the .aws directory in your home folder.
The contents of these files are practically interchangeable, but CloudQuery will prioritize credentials in the credentials file. For information about obtaining credentials, see the AWS guide.
Here are example contents for a credentials file:
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
You can also specify credentials for a different profile, and instruct CloudQuery to use the credentials from this profile instead of the default one.
For example:
[myprofile]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
Then, you can either export the AWS_PROFILE environment variable or set the local_profile option in the plugin spec (see the S3 spec section below). To export it on Linux/Mac:
export AWS_PROFILE=myprofile
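On Windows, the PowerShell equivalent would be:
$env:AWS_PROFILE = 'myprofile'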
IAM Roles for AWS Compute Resources #
The plugin can use IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers).
If you configured your AWS compute resources with IAM, the plugin will use these roles automatically.
For more information on configuring IAM, see the AWS docs here and here.
User Credentials with MFA #
To use IAM user credentials with MFA, run the STS get-session-token command with the IAM user's long-term security credentials (access key and secret access key). For more information, see here.
aws sts get-session-token --serial-number <YOUR_MFA_SERIAL_NUMBER> --token-code <YOUR_MFA_TOKEN_CODE> --duration-seconds 3600
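The command returns temporary credentials as JSON; the values below are placeholders illustrating the standard STS output shape:
{
    "Credentials": {
        "AccessKeyId": "<YOUR_ACCESS_KEY_ID>",
        "SecretAccessKey": "<YOUR_SECRET_ACCESS_KEY>",
        "SessionToken": "<YOUR_SESSION_TOKEN>",
        "Expiration": "<EXPIRATION_TIMESTAMP>"
    }
}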
Then export the temporary credentials to your environment variables.
export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
export AWS_SESSION_TOKEN=<YOUR_SESSION_TOKEN>
Incremental Syncing #
The S3 plugin supports incremental syncing. This means that only new files will be fetched from S3 and loaded into your destination. This is done by keeping track of the time of the last sync and comparing it against the last modified date of each file to only fetch new files. This assumes that S3 files are immutable.
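Conceptually, the check is a timestamp comparison against each object's LastModified field. A rough sketch of the idea in Python (illustrative only; this is not the plugin's actual implementation):
import boto3
from datetime import datetime

def new_objects(bucket: str, last_sync: datetime):
    """Yield keys of objects modified after the previous sync."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            # S3's LastModified is timezone-aware; last_sync must be too
            if obj["LastModified"] > last_sync:
                yield obj["Key"]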
To enable this, backend_options must be set in the spec (as shown below). This is documented in the Managing Incremental Tables section.
Configuration #
kind: source
spec:
  name: s3
  path: cloudquery/s3
  registry: cloudquery
  version: "v1.8.5"
  tables: ["*"]
  destinations: ["postgresql"]
  backend_options:
    table_name: "cq_state_s3"
    connection: "@@plugins.postgresql.connection"
  # Learn more about the configuration options at https://cql.ink/s3_source
  spec:
    # TODO: Update it with the actual spec
    bucket: "<BUCKET_NAME>"
    region: "<REGION>"
    # Optional parameters
    # path_prefix: ""
    # rows_per_record: 500
    # concurrency: 50
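Once the spec is saved to a file (the file name here is just an example), run the sync with the CloudQuery CLI:
cloudquery sync s3.yaml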
S3 spec #
This is the (nested) spec used by the S3 source plugin.
bucket (string) (required)
The name of the S3 bucket that will be synced.

region (string) (required)
The AWS region of the S3 bucket.

local_profile (string) (optional) (default: will use current credentials)
Local profile to use to authenticate this account with. Please note this should be set to the name of the profile. For example, with the following credentials file:

[default]
aws_access_key_id=xxxx
aws_secret_access_key=xxxx

[user1]
aws_access_key_id=xxxx
aws_secret_access_key=xxxx

local_profile should be set to either default or user1.

path_prefix (string) (optional) (default: "")
The path prefix that will limit the files to sync.

filetype (string) (optional) (default: parquet)
Type of file that will be synced. Currently only parquet is supported.

rows_per_record (integer) (optional) (default: 500)
Amount of rows to be packed into a single Apache Arrow record to be sent over the wire during sync.

concurrency (integer) (optional) (default: 50)
Number of objects to sync in parallel. Negative values mean no limit.
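For example, a nested spec that authenticates with the user1 profile from the credentials file above and limits the sync to a single prefix (the prefix is an illustrative placeholder) might look like this:
bucket: "<BUCKET_NAME>"
region: "<REGION>"
local_profile: "user1"
path_prefix: "data/2024/"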