Overview #
S3 Destination Plugin
This destination plugin lets you sync data from a CloudQuery source to remote S3 storage in various formats such as CSV, JSON and Parquet.
This is useful in various use cases, especially in data lakes, where you can query the data directly from Athena or load it into data warehouses such as BigQuery, Redshift, Snowflake and others.
Example #
This example uses the parquet format to create parquet files in s3://bucket_name/path/to/files, with each table placed in its own directory.
The (top level) spec section is described in the Destination Spec Reference.
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  registry: "cloudquery"
  version: "v7.5.2"
  write_mode: "append"
  # Learn more about the configuration options at https://cql.ink/s3_destination
  spec:
    bucket: "bucket_name"
    region: "region-name" # Example: us-east-1
    path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
    format: "parquet" # options: parquet, json, csv
    format_spec:
      # CSV specific parameters:
      # delimiter: ","
      # skip_header: false

      # Parquet specific parameters:
      # version: "v2Latest"
      # root_repetition: "repeated"
      # max_row_group_length: 134217728 # 128 * 1024 * 1024

    # Optional parameters
    # compression: "" # options: gzip
    # no_rotate: false
    # athena: false # <- set this to true for Athena compatibility
    # write_empty_objects_for_empty_tables: false # <- set this to true if using with the CloudQuery Compliance policies
    # test_write: true # tests the ability to write to the bucket before processing the data
    # endpoint: "" # Endpoint to use for S3 API calls.
    # endpoint_skip_tls_verify: false # Disable TLS verification if using an untrusted certificate
    # use_path_style: false
    # batch_size: 10000 # 10K entries
    # batch_size_bytes: 52428800 # 50 MiB
    # batch_timeout: 30s # 30 seconds
It is also possible to use {{YEAR}}, {{MONTH}}, {{DAY}} and {{HOUR}} in the path to create a directory structure based on the current time. For example:
path: "path/to/files/{{TABLE}}/dt={{YEAR}}-{{MONTH}}-{{DAY}}/{{UUID}}.parquet"
Other supported formats are json and csv.
Note that the S3 plugin only supports the append write_mode. The (top level) spec section is described in the Destination Spec Reference.
The S3 destination utilizes batching, and supports batch_size, batch_size_bytes and batch_timeout options (see below).
S3 Spec #
This is the (nested) spec used by the S3 destination plugin.
- bucket (string) (required)
  Bucket where to sync the files.
- region (string) (required)
  Region where bucket is located.
- path (string) (required)
  Path to where the files will be uploaded in the above bucket, for example path/to/files/{{TABLE}}/{{UUID}}.parquet.
  The path supports the following placeholder variables:
  - {{TABLE}} will be replaced with the table name
  - {{TABLE_HYPHEN}} will be replaced with the table name with hyphens instead of underscores.
  - {{SYNC_ID}} will be replaced with the unique identifier of the sync. This value is a UUID and is randomly generated for each sync.
  - {{FORMAT}} will be replaced with the file format, such as csv, json or parquet. If compression is enabled, the format will be csv.gz, json.gz etc.
  - {{UUID}} will be replaced with a random UUID to uniquely identify each file
  - {{YEAR}} will be replaced with the current year in YYYY format
  - {{MONTH}} will be replaced with the current month in MM format
  - {{DAY}} will be replaced with the current day in DD format
  - {{HOUR}} will be replaced with the current hour in HH format
  - {{MINUTE}} will be replaced with the current minute in mm format
  Note that timestamps are in UTC and will be the current time at the time the file is written, not when the sync started.
- format (string) (required)
  Format of the output file. Supported values are csv, json and parquet.
- format_spec (format_spec) (optional)
  Optional parameters to change the format of the file.
- server_side_encryption_configuration (server_side_encryption_configuration) (optional)
  Optional parameters to enable server-side encryption.
- compression (string) (optional) (default: "")
  Compression algorithm to use. Supported values are "" or gzip. Not supported for parquet format.
- no_rotate (boolean) (optional) (default: false)
  If set to true, the plugin will write to one file per table. Otherwise, for every batch a new file will be created with a different .<UUID> suffix.
- athena (boolean) (optional) (default: false)
  When athena is set to true, the S3 plugin will sanitize keys in JSON columns to be compatible with the Hive Metastore / Athena. This allows tables to be created with a Glue Crawler and then queried via Athena, without changes to the table schema.
- write_empty_objects_for_empty_tables (boolean) (optional) (default: false)
  By default, only tables with resources are persisted to objects during the sync. If you'd like to persist empty objects for empty tables, enable this option. This is useful when using CloudQuery Compliance policies, to ensure all tables have their schema populated by a query engine like Athena.
- test_write (boolean) (optional) (default: true)
  Ensure write access to the given bucket and path by writing a test object on each sync. If you are sure that the bucket and path are writable, you can set this to false to skip the test.
- endpoint (string) (optional) (default: "")
  Endpoint to use for S3 API calls. This is useful for S3-compatible storage services such as MinIO. Note: if you want to use path-style addressing, i.e., https://s3.amazonaws.com/BUCKET/KEY, use_path_style should be enabled, too.
- acl (string) (optional) (default: "")
  Canned ACL to apply to the object. Supported values are private, public-read, public-read-write, authenticated-read, aws-exec-read, bucket-owner-read, bucket-owner-full-control.
- endpoint_skip_tls_verify (boolean) (optional) (default: false)
  Disable TLS verification for requests to your S3 endpoint. This option is intended to be used together with a custom endpoint set via the endpoint option.
- use_path_style (boolean) (optional) (default: false)
  Enables path-style addressing for the endpoint option, i.e., https://s3.amazonaws.com/BUCKET/KEY. By default, the S3 client will use virtual hosted bucket addressing when possible (https://BUCKET.s3.amazonaws.com/KEY).
- batch_size (integer) (optional) (default: 10000)
  Number of records to write before starting a new object.
- batch_size_bytes (integer) (optional) (default: 52428800 (= 50 MiB))
  Number of bytes (as Arrow buffer size) to write before starting a new object.
- batch_timeout (duration) (optional) (default: 30s (30 seconds))
  Maximum interval between batch writes.
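As a rough illustration of how several of the options above combine, the sketch below writes gzip-compressed JSON under a date-partitioned prefix and tunes the batching options. The bucket name, region, the exports/ prefix and the batch values are placeholders chosen for this example, not recommendations.

```yaml
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  registry: "cloudquery"
  version: "v7.5.2"
  write_mode: "append" # the only write_mode the S3 plugin supports
  spec:
    bucket: "bucket_name"   # placeholder
    region: "us-east-1"     # placeholder
    # One directory per table, sync and day; with compression enabled,
    # {{FORMAT}} is documented to expand to json.gz.
    path: "exports/{{TABLE_HYPHEN}}/sync={{SYNC_ID}}/dt={{YEAR}}-{{MONTH}}-{{DAY}}/{{UUID}}.{{FORMAT}}"
    format: "json"
    compression: "gzip"     # not supported for the parquet format
    # Illustrative batching values: a new object is started once any limit is reached.
    batch_size: 5000
    batch_size_bytes: 10485760 # 10 MiB
    batch_timeout: 60s
```

Because the date placeholders are evaluated when each file is written, a sync that runs across midnight UTC may place files for the same sync under two different dt= directories.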
format_spec #
CSV
- delimiter (string) (optional) (default: ,)
  Delimiter to use in the CSV file.
- skip_header (boolean) (optional) (default: false)
  If set to true, the CSV file will not contain a header row as the first row.
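For instance, a nested spec that emits semicolon-separated CSV without a header row might look like the fragment below. This is only a sketch: the bucket, region and the choice of ";" as delimiter are assumptions made for illustration.

```yaml
# Fragment of the nested (plugin-level) spec
bucket: "bucket_name"   # placeholder
region: "us-east-1"     # placeholder
path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
format: "csv"
format_spec:
  delimiter: ";"        # default is ","
  skip_header: true     # omit the header row
```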
JSON
Reserved for future use.
Parquet
- version (string) (optional) (default: v2Latest)
  Parquet format version to use. Supported values are v1.0, v2.4, v2.6 and v2Latest. v2Latest is an alias for the latest version available in the Parquet library, which is currently v2.6.
  Useful when the reader consuming the Parquet files does not support the latest version.
- root_repetition (string) (optional) (default: repeated)
  Repetition option to use for the root node. Supported values are undefined, required, optional and repeated.
  Some Parquet readers require a specific root repetition option to be able to read the file. For example, importing Parquet files into Snowflake requires the root repetition to be undefined.
- max_row_group_length (integer) (optional) (default: 134217728 (= 128 * 1024 * 1024))
  The maximum number of rows in a single row group. Use a lower number to reduce memory usage when reading the Parquet files, and a higher number to increase the efficiency of reading the Parquet files.
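A minimal sketch of the Parquet options, assuming the files are destined for a Snowflake import (which, as noted above, needs an undefined root repetition). Pinning version to v2.4 and lowering max_row_group_length are illustrative choices for an older or memory-constrained reader, not requirements.

```yaml
# Fragment of the nested (plugin-level) spec
bucket: "bucket_name"   # placeholder
region: "us-east-1"     # placeholder
path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
format: "parquet"
format_spec:
  version: "v2.4"                # illustrative: pin below v2Latest for an older reader
  root_repetition: "undefined"   # required for importing into Snowflake
  max_row_group_length: 1048576  # illustrative: smaller row groups, less reader memory
```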
server_side_encryption_configuration #
- sse_kms_key_id (string) (required)
  KMS Key ID appended to S3 API calls header. Used in conjunction with server_side_encryption.
- server_side_encryption (string) (required)
  The server-side encryption algorithm used when storing the object in S3. Supported values are AES256, aws:kms and aws:kms:dsse.
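As an example of how this block nests inside the plugin spec, the fragment below enables encryption with a KMS key. The key ID is a placeholder you would replace with your own; the other values are the same illustrative ones used elsewhere on this page.

```yaml
# Fragment of the nested (plugin-level) spec
bucket: "bucket_name"   # placeholder
region: "us-east-1"     # placeholder
path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
format: "parquet"
server_side_encryption_configuration:
  sse_kms_key_id: "<YOUR_KMS_KEY_ID>"  # placeholder, not a real key
  server_side_encryption: "aws:kms"    # one of AES256, aws:kms, aws:kms:dsse
```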
Authentication #
The plugin needs to be authenticated with your AWS account(s) in order to write synced data to your S3 bucket.
The plugin requires only PutObject permissions (we will never make any changes to your cloud setup), so, following the principle of least privilege, it's recommended to grant it only PutObject permissions.
There are multiple ways to authenticate with AWS, and the plugin respects the AWS credential provider chain. This means that CloudQuery will try the following sources, in order, when attempting to authenticate:
- The AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN environment variables.
- The credentials and config files in ~/.aws (the credentials file takes priority).
- You can also use aws sso to authenticate CloudQuery; you can read more about it here.
- IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers).
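To make the least-privilege recommendation above more concrete, here is one possible way to express a PutObject-only policy as a CloudFormation snippet. The resource name CloudQueryS3WritePolicy, the bucket name and the prefix are hypothetical placeholders for this sketch; adapt the Resource ARN to your own bucket and path, and attach the policy to whichever user or role CloudQuery runs as.

```yaml
# Sketch of a CloudFormation managed policy; all names are placeholders.
Resources:
  CloudQueryS3WritePolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - s3:PutObject
            # Limit writes to the prefix used in the plugin's path option
            Resource: "arn:aws:s3:::bucket_name/path/to/files/*"
```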
Environment Variables #
CloudQuery can use the credentials from the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables (AWS_SESSION_TOKEN can be optional for some accounts). For information on obtaining credentials, see the AWS guide.
To export the environment variables (on Linux/Mac; similar for Windows):
export AWS_ACCESS_KEY_ID='{Your AWS Access Key ID}'
export AWS_SECRET_ACCESS_KEY='{Your AWS secret access key}'
export AWS_SESSION_TOKEN='{Your AWS session token}'
Shared Configuration files #
The plugin can use credentials from your credentials and config files in the .aws directory in your home folder.
The contents of these files are practically interchangeable, but CloudQuery will prioritize credentials in the credentials file.
For information about obtaining credentials, see the AWS guide.
Here are example contents for a credentials file:
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
You can also specify credentials for a different profile, and instruct CloudQuery to use the credentials from this profile instead of the default one.
For example:
[myprofile]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
Then, you can export the AWS_PROFILE environment variable (on Linux/Mac; similar for Windows):
export AWS_PROFILE=myprofile
IAM Roles for AWS Compute Resources #
The plugin can use IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers).
If you configured your AWS compute resources with IAM, the plugin will use these roles automatically.
For more information on configuring IAM, see the AWS docs here and here.
User Credentials with MFA #
In order to leverage IAM User credentials with MFA, the STS "get-session-token" command may be used with the IAM User's long-term security credentials (Access Key and Secret Access Key). For more information, see here.
aws sts get-session-token --serial-number <YOUR_MFA_SERIAL_NUMBER> --token-code <YOUR_MFA_TOKEN_CODE> --duration-seconds 3600
Then export the temporary credentials to your environment variables.
export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
export AWS_SESSION_TOKEN=<YOUR_SESSION_TOKEN>
Using a Custom S3 Endpoint #
If you are using a custom S3 endpoint, you can specify it using the endpoint spec option. If you're using authentication, the region option in the spec determines the signing region used.
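A minimal sketch of such a setup, assuming a MinIO server reachable at http://localhost:9000 (a hypothetical address) that is addressed path-style. The bucket name and region are placeholders; as described above, region here only selects the signing region.

```yaml
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  registry: "cloudquery"
  version: "v7.5.2"
  write_mode: "append"
  spec:
    bucket: "bucket_name"              # placeholder
    region: "us-east-1"                # signing region (placeholder)
    path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
    format: "parquet"
    endpoint: "http://localhost:9000"  # hypothetical MinIO address
    use_path_style: true               # address objects as ENDPOINT/BUCKET/KEY
    # Only consider this for an HTTPS endpoint with an untrusted certificate:
    # endpoint_skip_tls_verify: true
```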