S3 destination integration documentation

Publisher: cloudquery
Repository: github.com
Latest version: v7.9.8
Type: Destination

Overview #

This destination plugin lets you sync data from a CloudQuery source to remote S3 storage in various formats such as CSV, JSON and Parquet.

This is useful in many use cases, especially data lakes, where you can query the data directly from Athena or load it into data warehouses such as BigQuery, Redshift, Snowflake and others.

Example #

This example uses the parquet format to create Parquet files in s3://bucket_name/path/to/files, with each table placed in its own directory.

The (top level) spec section is described in the Destination Spec Reference.

kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  registry: "cloudquery"
  version: "v7.9.8"
  write_mode: "append"
  send_sync_summary: true
  # Learn more about the configuration options at https://cql.ink/s3_destination
  spec:
    bucket: "bucket_name"
    region: "region-name" # Example: us-east-1
    path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
    format: "parquet" # options: parquet, json, csv
    format_spec:
      # CSV specific parameters:
      # delimiter: ","
      # skip_header: false
      # Parquet specific parameters:
      # version: "v2Latest"
      # root_repetition: "repeated"
      # max_row_group_length: 134217728 # 128 * 1024 * 1024

    # Optional parameters
    # compression: "" # options: gzip
    # no_rotate: false
    # athena: false # <- set this to true for Athena compatibility
    # write_empty_objects_for_empty_tables: false # <- set this to true if using with the CloudQuery Compliance policies
    # test_write: true # tests the ability to write to the bucket before processing the data
    # endpoint: "" # Endpoint to use for S3 API calls.
    # endpoint_skip_tls_verify: false # Disable TLS verification if using an untrusted certificate
    # use_path_style: false
    # batch_size: 10000 # 10K entries
    # batch_size_bytes: 52428800 # 50 MiB
    # batch_timeout: 30s # 30 seconds
    # max_retries: 3 # 3 retries
    # max_backoff: 30 # 30 seconds
    # part_size: 5242880 # 5 MiB
    # aws_debug: true
    # credentials: # <- Use this to specify non-default role assumption parameters
    #   local_profile: "s3-profile" # Use a local profile instead of the default one
    #   role_arn: "arn:aws:iam::123456789012:role/role_name" # Specify the role to assume
    #   external_id: "external_id" # Used when assuming a role
    #   role_session_name: "session_name" # Used when assuming a role

It is also possible to use {{YEAR}}, {{MONTH}}, {{DAY}} and {{HOUR}} in the path to create a directory structure based on the current time. For example:

path: "path/to/files/{{TABLE}}/dt={{YEAR}}-{{MONTH}}-{{DAY}}/{{UUID}}.parquet"

Other supported formats are json and csv.

Note that the S3 plugin supports only the append write_mode.

The S3 destination uses batching and supports the batch_size, batch_size_bytes and batch_timeout options (see below).
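
As a rough sketch of how the batching options fit into the nested spec, the fragment below lowers all three limits so that objects are written more frequently. The values are illustrative assumptions, not recommendations; the option names and defaults are those documented in the S3 Spec section.

```yaml
# Nested spec fragment (sits under the destination's top-level spec).
# Values are illustrative; the documented defaults are 10000, 52428800 and 30s.
spec:
  bucket: "bucket_name"
  region: "us-east-1"
  path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
  format: "parquet"
  batch_size: 5000            # records to write before starting a new object
  batch_size_bytes: 10485760  # bytes (10 * 1024 * 1024 = 10 MiB) before starting a new object
  batch_timeout: 10s          # maximum interval between batch writes
```

Smaller batches generally mean more, smaller objects in the bucket, so these limits are worth tuning against how the files will be queried later.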

S3 Spec #

This is the (nested) spec used by the S3 destination plugin.

bucket (string) (required)
Bucket to sync the files to.
region (string) (required)
Region where the bucket is located.
credentials (credentials) (optional)
Optional parameters to enable non-default credentials for authenticating with the S3 API.
path (string) (required)
Path within the above bucket where the files will be uploaded, for example path/to/files/{{TABLE}}/{{UUID}}.parquet.
The path supports the following placeholder variables:
- {{TABLE}} will be replaced with the table name.
- {{TABLE_HYPHEN}} will be replaced with the table name with hyphens instead of underscores.
- {{SYNC_ID}} will be replaced with the unique identifier of the sync. This value is a UUID and is randomly generated for each sync.
- {{FORMAT}} will be replaced with the file format, such as csv, json or parquet. If compression is enabled, the format will be csv.gz, json.gz, etc.
- {{UUID}} will be replaced with a random UUID to uniquely identify each file.
- {{YEAR}} will be replaced with the current year in YYYY format.
- {{MONTH}} will be replaced with the current month in MM format.
- {{DAY}} will be replaced with the current day in DD format.
- {{HOUR}} will be replaced with the current hour in HH format.
- {{MINUTE}} will be replaced with the current minute in mm format.
Note that timestamps are in UTC and reflect the time at which the file is written, not the time the sync started.
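
To illustrate how the placeholders combine, here is a sketch of a nested spec that partitions objects by table and date and relies on {{FORMAT}} to pick up the compression suffix. The bucket name, region and directory layout are assumptions chosen for illustration.

```yaml
# Nested spec fragment; bucket, region and layout are illustrative.
spec:
  bucket: "bucket_name"
  region: "us-east-1"
  # {{FORMAT}} expands to json.gz here because compression is enabled.
  path: "cloudquery/{{TABLE_HYPHEN}}/year={{YEAR}}/month={{MONTH}}/day={{DAY}}/{{SYNC_ID}}-{{UUID}}.{{FORMAT}}"
  format: "json"
  compression: "gzip"
```

With this layout, a table named aws_ec2_instances (used here only as an example) would be written under cloudquery/aws-ec2-instances/, with the date parts taken from the UTC time at which each file is written.
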
format (string) (required)
Format of the output file. Supported values are csv, json and parquet.
format_spec (format_spec) (optional)
Optional parameters to change the format of the file.
server_side_encryption_configuration (server_side_encryption_configuration) (optional)
Optional parameters to enable server-side encryption.
compression (string) (optional) (default: "")
Compression algorithm to use. Supported values are "" or gzip. Not supported for parquet format.
no_rotate (boolean) (optional) (default: false)
If set to true, the plugin will write to one file per table. Otherwise, for every batch a new file will be created with a different .<UUID> suffix.
athena (boolean) (optional) (default: false)
When athena is set to true, the S3 plugin will sanitize keys in JSON columns to be compatible with the Hive Metastore / Athena. This allows tables to be created with a Glue Crawler and then queried via Athena, without changes to the table schema.
write_empty_objects_for_empty_tables (boolean) (optional) (default: false)
By default, only tables with resources are persisted to objects during the sync. Enable this option to also persist empty objects for empty tables. This is useful when using CloudQuery Compliance policies, to ensure that all tables have their schema populated by a query engine such as Athena.
test_write (boolean) (optional) (default: true)
Ensure write access to the given bucket and path by writing a test object on each sync. If you are sure that the bucket and path are writable, you can set this to false to skip the test.
endpoint (string) (optional) (default: "")
Endpoint to use for S3 API calls. This is useful for S3-compatible storage services such as MinIO. Note: if you want to use path-style addressing, i.e., https://s3.amazonaws.com/BUCKET/KEY, use_path_style should be enabled, too.
acl (string) (optional) (default: "")
Canned ACL to apply to the object. Supported values are private, public-read, public-read-write, authenticated-read, aws-exec-read, bucket-owner-read, bucket-owner-full-control.
endpoint_skip_tls_verify (boolean) (optional) (default: false)
Disable TLS verification for requests to your S3 endpoint.
This option is intended for use with a custom endpoint set via the endpoint option.
use_path_style (boolean) (optional) (default: false)
Enables path-style addressing for the endpoint option, i.e., https://s3.amazonaws.com/BUCKET/KEY. By default, the S3 client will use virtual hosted bucket addressing when possible (https://BUCKET.s3.amazonaws.com/KEY).
batch_size (integer) (optional) (default: 10000)
Number of records to write before starting a new object.
batch_size_bytes (integer) (optional) (default: 52428800 (= 50 MiB))
Number of bytes (as Arrow buffer size) to write before starting a new object.
batch_timeout (duration) (optional) (default: 30s (30 seconds))
Maximum interval between batch writes.
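
The endpoint-related options above are typically used together when targeting an S3-compatible service. The following is a minimal sketch, assuming a hypothetical MinIO server at https://minio.example.internal:9000 that presents a self-signed certificate; the URL and bucket name are made up for illustration.

```yaml
# Nested spec fragment for an S3-compatible service (hypothetical endpoint).
spec:
  bucket: "cloudquery-data"
  region: "us-east-1"             # region is still a required option
  path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
  format: "csv"
  endpoint: "https://minio.example.internal:9000"
  use_path_style: true            # address objects as ENDPOINT/BUCKET/KEY
  endpoint_skip_tls_verify: true  # only needed for an untrusted certificate
```

Leave endpoint_skip_tls_verify at its default of false whenever the endpoint presents a trusted certificate.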

format_spec #

CSV

delimiter (string) (optional) (default: ,)
Delimiter to use in the CSV file.
skip_header (boolean) (optional) (default: false)
If set to true, the CSV file will not contain a header row as the first row.
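
As a small sketch of where these options live, the fragment below writes semicolon-delimited CSV files without a header row. The delimiter choice is an illustrative assumption; the other values are the placeholders used in the main example.

```yaml
# Nested spec fragment; CSV options go under format_spec.
spec:
  bucket: "bucket_name"
  region: "us-east-1"
  path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
  format: "csv"
  format_spec:
    delimiter: ";"     # illustrative; the default is ","
    skip_header: true  # do not write a header row as the first row
```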

JSON

Reserved for future use.

Parquet

version (string) (optional) (default: v2Latest)
Parquet format version to use. Supported values are v1.0, v2.4, v2.6 and v2Latest. v2Latest is an alias for the latest version available in the Parquet library which is currently v2.6.
Useful when the reader consuming the Parquet files does not support the latest version.
root_repetition (string) (optional) (default: repeated)
Repetition option to use for the root node. Supported values are undefined, required, optional and repeated.
Some Parquet readers require a specific root repetition option to be able to read the file. For example, importing Parquet files into Snowflake requires the root repetition to be undefined.
max_row_group_length (integer) (optional) (default: 134217728 (= 128 * 1024 * 1024))
The maximum number of rows in a single row group. Use a lower number to reduce memory usage when reading the Parquet files, and a higher number to increase the efficiency of reading the Parquet files.
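
The Snowflake case mentioned above could be configured roughly as follows. This is a sketch only: pinning version to v2.4 is an assumed example of targeting a reader that does not support the latest format version, and can be dropped if it does not apply.

```yaml
# Nested spec fragment; bucket, region and path omitted for brevity.
spec:
  format: "parquet"
  format_spec:
    version: "v2.4"               # assumed example of pinning an older format version
    root_repetition: "undefined"  # root repetition needed for Snowflake imports
```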

server_side_encryption_configuration #

sse_kms_key_id (string) (required)
KMS Key ID appended to S3 API calls header. Used in conjunction with server_side_encryption.
server_side_encryption (string) (required)
The server-side encryption algorithm used when storing the object in S3. Supported values are AES256, aws:kms and aws:kms:dsse.
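
A minimal sketch of enabling KMS-based server-side encryption in the nested spec is shown below. The key ID is a placeholder to be replaced with your own value.

```yaml
# Nested spec fragment; bucket, region, path and format omitted for brevity.
spec:
  server_side_encryption_configuration:
    server_side_encryption: "aws:kms"      # one of AES256, aws:kms, aws:kms:dsse
    sse_kms_key_id: "<YOUR_KMS_KEY_ID>"    # placeholder, not a real key
```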

credentials #

local_profile (string) (default: will use current credentials)
Local profile to authenticate with. Note that this should be set to the name of the profile.
For example, with the following credentials file:
```
[default]
aws_access_key_id=xxxx
aws_secret_access_key=xxxx

[user1]
aws_access_key_id=xxxx
aws_secret_access_key=xxxx
```
local_profile should be set to either default or user1.
role_arn (string)
If specified, the plugin will assume this role.
role_session_name (string)
If specified, this session name will be used when assuming the role given in role_arn.
external_id (string)
If specified, this external ID will be used when assuming the role given in role_arn.

Authentication #

Quickstart #

The plugin needs to be authenticated with your AWS account in order to write the synced data to your S3 bucket.

The plugin requires only the PutObject permission (it does not otherwise modify your cloud setup), so, following the principle of least privilege, it is recommended to grant it only that permission.
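
One way to express this least-privilege grant is an IAM managed policy. The sketch below uses AWS CloudFormation YAML purely as an illustration; the resource name, bucket name and key prefix are assumptions, and your setup may need additional permissions (for example, if you set acl or use server-side encryption with a KMS key).

```yaml
# Illustrative CloudFormation fragment; all names are hypothetical.
Resources:
  CloudQueryS3WritePolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: Allow the CloudQuery S3 destination to write objects
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - s3:PutObject
            # Limit writes to the prefix used in the destination's path option.
            Resource: "arn:aws:s3:::bucket_name/path/to/files/*"
```

Scoping Resource to the prefix used in the path option keeps the grant as narrow as the destination's configuration allows.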

If you are running CloudQuery CLI locally, and have AWS CLI installed, sign in with AWS CLI.

Test that your AWS CLI is working:

aws account list-regions

Non-interactive Authentication #

There are multiple ways to authenticate with AWS, and the plugin respects the AWS credential provider chain. This means that the plugin tries the following sources, in order of priority, when attempting to authenticate:

- The AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables (see the AWS guide).
- The credentials and config files in ~/.aws (the credentials file takes priority).
- A session created using aws sso (see Configuring IAM Identity Center authentication with the AWS CLI).
- IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers).

For details about configuring the individual authentication options, see Advanced Authentication Configuration.

Auth Details #

Required Permissions #

Environment Variables #

CloudQuery can use the credentials from the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables (AWS_SESSION_TOKEN can be optional for some accounts). For information on obtaining credentials, see the AWS guide. To export the environment variables (on Linux/macOS; similar on Windows):

export AWS_ACCESS_KEY_ID='{Your AWS Access Key ID}'
export AWS_SECRET_ACCESS_KEY='{Your AWS secret access key}'
export AWS_SESSION_TOKEN='{Your AWS session token}'

Shared Configuration files #

The plugin can use credentials from your credentials and config files in the .aws directory in your home folder. The contents of these files are practically interchangeable, but CloudQuery will prioritize credentials in the credentials file. For information about obtaining credentials, see the AWS guide. Here are example contents for a credentials file:

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

You can also specify credentials for a different profile, and instruct CloudQuery to use the credentials from this profile instead of the default one. For example:

[myprofile]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

Then, you can either export the AWS_PROFILE environment variable (On Linux/Mac, similar for Windows):

export AWS_PROFILE=myprofile

or configure the desired profile in the local_profile field of the credentials section of the S3 spec:

credentials:
  local_profile: myprofile

IAM Roles for AWS Compute Resources #

The plugin can use IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers). If you configured your AWS compute resources with IAM, the plugin will use these roles automatically. For more information on configuring IAM, see the AWS docs here and here.

User Credentials with MFA #

To use IAM user credentials with MFA, run the STS get-session-token command with the IAM user's long-term security credentials (access key and secret access key). For more information, see here.

aws sts get-session-token --serial-number <YOUR_MFA_SERIAL_NUMBER> --token-code <YOUR_MFA_TOKEN_CODE> --duration-seconds 3600

Then export the temporary credentials to your environment variables.

export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
export AWS_SESSION_TOKEN=<YOUR_SESSION_TOKEN>

Licenses #

The following tools / packages are used in this plugin:

Name	License
github.com/JohnCGriffin/overflow	MIT
github.com/adrg/xdg	MIT
github.com/andybalholm/brotli	MIT
github.com/apache/arrow/go/v13	Apache-2.0
github.com/apache/arrow-go/v18	Apache-2.0
github.com/apache/thrift/lib/go/thrift	Apache-2.0
github.com/apapsch/go-jsonmerge/v2	MIT
github.com/aws/aws-sdk-go-v2	Apache-2.0
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream	Apache-2.0
github.com/aws/aws-sdk-go-v2/config	Apache-2.0
github.com/aws/aws-sdk-go-v2/credentials	Apache-2.0
github.com/aws/aws-sdk-go-v2/feature/ec2/imds	Apache-2.0
github.com/aws/aws-sdk-go-v2/feature/s3/manager	Apache-2.0
github.com/aws/aws-sdk-go-v2/internal/configsources	Apache-2.0
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2	Apache-2.0
github.com/aws/aws-sdk-go-v2/internal/ini	Apache-2.0
github.com/aws/aws-sdk-go-v2/internal/sync/singleflight	BSD-3-Clause
github.com/aws/aws-sdk-go-v2/internal/v4a	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/internal/checksum	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/internal/s3shared	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/licensemanager	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/marketplacemetering	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/s3	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/sso	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/ssooidc	Apache-2.0
github.com/aws/aws-sdk-go-v2/service/sts	Apache-2.0
github.com/aws/smithy-go	Apache-2.0
github.com/aws/smithy-go/internal/sync/singleflight	BSD-3-Clause
github.com/bahlo/generic-list-go	BSD-3-Clause
github.com/buger/jsonparser	MIT
github.com/cenkalti/backoff/v4	MIT
github.com/cloudquery/cloudquery-api-go	MPL-2.0
github.com/cloudquery/codegen/jsonschema	MPL-2.0
github.com/cloudquery/plugin-pb-go	MPL-2.0
github.com/cloudquery/plugin-sdk/v2/internal/glob	MIT
github.com/cloudquery/plugin-sdk/v2/schema	MIT
github.com/cloudquery/plugin-sdk/v2/types	MPL-2.0
github.com/cloudquery/plugin-sdk/v4	MPL-2.0
github.com/cloudquery/plugin-sdk/v4/glob	MIT
github.com/cloudquery/plugin-sdk/v4/scalar	MIT
github.com/davecgh/go-spew/spew	ISC
github.com/ghodss/yaml	MIT
github.com/go-logr/logr	Apache-2.0
github.com/go-logr/stdr	Apache-2.0
github.com/goccy/go-json	MIT
github.com/golang/snappy	BSD-3-Clause
github.com/google/flatbuffers/go	Apache-2.0
github.com/google/uuid	BSD-3-Clause
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors	Apache-2.0
github.com/grpc-ecosystem/grpc-gateway/v2	BSD-3-Clause
github.com/hashicorp/go-cleanhttp	MPL-2.0
github.com/hashicorp/go-retryablehttp	MPL-2.0
github.com/huandu/xstrings	MIT
github.com/invopop/jsonschema	MIT
github.com/klauspost/compress	Apache-2.0
github.com/klauspost/compress/internal/snapref	BSD-3-Clause
github.com/klauspost/compress/zstd/internal/xxhash	MIT
github.com/klauspost/cpuid/v2	MIT
github.com/mailru/easyjson	MIT
github.com/mattn/go-colorable	MIT
github.com/mattn/go-isatty	MIT
github.com/oapi-codegen/runtime	Apache-2.0
github.com/pierrec/lz4/v4	BSD-3-Clause
github.com/pmezard/go-difflib/difflib	BSD-3-Clause
github.com/rs/zerolog	MIT
github.com/santhosh-tekuri/jsonschema/v6	Apache-2.0
github.com/spf13/cobra	Apache-2.0
github.com/spf13/pflag	BSD-3-Clause
github.com/stretchr/testify	MIT
github.com/thoas/go-funk	MIT
github.com/wk8/go-ordered-map/v2	Apache-2.0
github.com/zeebo/xxh3	BSD-2-Clause
go.opentelemetry.io/otel	Apache-2.0
go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploghttp	Apache-2.0
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp	Apache-2.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace	Apache-2.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp	Apache-2.0
go.opentelemetry.io/otel/log	Apache-2.0
go.opentelemetry.io/otel/metric	Apache-2.0
go.opentelemetry.io/otel/sdk	Apache-2.0
go.opentelemetry.io/otel/sdk/log	Apache-2.0
go.opentelemetry.io/otel/sdk/metric	Apache-2.0
go.opentelemetry.io/otel/trace	Apache-2.0
go.opentelemetry.io/proto/otlp	Apache-2.0
golang.org/x/exp	BSD-3-Clause
golang.org/x/net	BSD-3-Clause
golang.org/x/sync/errgroup	BSD-3-Clause
golang.org/x/sys	BSD-3-Clause
golang.org/x/text	BSD-3-Clause
golang.org/x/xerrors	BSD-3-Clause
google.golang.org/genproto/googleapis/api/httpbody	Apache-2.0
google.golang.org/genproto/googleapis/rpc/status	Apache-2.0
google.golang.org/grpc	Apache-2.0
google.golang.org/protobuf	BSD-3-Clause
gopkg.in/yaml.v2	Apache-2.0
gopkg.in/yaml.v3	MIT
