Back to source list
Official
Premium
S3
The CloudQuery S3 source plugin reads parquet files and loads them into any supported CloudQuery destination (e.g. PostgreSQL, BigQuery, Snowflake, and more)
Publisher
cloudquery
Latest version
v1.6.1
Type
Source
Platforms
Date Published
Overview #
The CloudQuery S3 source plugin reads parquet files and loads them into any supported CloudQuery destination (e.g. PostgreSQL, BigQuery, Snowflake, and more). The S3 source plugin assumes that all unique prefixes in the S3 bucket are unique tables, and for those objects in the root of the bucket, the table name is the name of the object. For example if you have the following objects in your s3 bucket:
s3://<bucket>/datafile_0.parquet
s3://<bucket>/datafile_1.parquet
s3://<bucket>/data/2024/datafile_1.parquet
s3://<bucket>/data/2024/02/14/14/15/datafile_3.parquet
s3://<bucket>/data/2024/02/14/14/15/datafile_4.parquet
CloudQuery will sync the following tables:
datafile_0.parquet --> datafile_0
datafile_1.parquet --> datafile_1
data/2024/datafile_1.parquet --> data_2024
data/2024/02/14/14/15/datafile_3.parquet --> data_2024_02_14_14_15
data/2024/02/14/14/15/datafile_4.parquet --> data_2024_02_14_14_15
Incremental Syncing #
The S3 plugin supports incremental syncing. This means that only new files will be fetched from S3 and loaded into your destination. This is done by keeping track of the time of the last sync and comparing it against the last modified date of each file to only fetch new files. This assumes that S3 files are immutable.
To enable this,
backend_options
must be set in the spec (as shown below). This is documented in the Managing Incremental Tables section.Configuration #
kind: source
spec:
name: s3
path: cloudquery/s3
registry: cloudquery
version: "v1.6.1"
tables: ["*"]
destinations: ["postgresql"]
backend_options:
table_name: "cq_state_s3"
connection: "@@plugins.postgresql.connection"
# Learn more about the configuration options at https://cql.ink/s3_source
spec:
# TODO: Update it with the actual spec
bucket: "<BUCKET_NAME>"
region: "<REGION>"
# Optional parameters
# path_prefix: ""
# rows_per_record: 500
# concurrency: 50
S3 spec #
This is the (nested) spec used by the S3 source plugin.
bucket
(string
) (required)The name of the S3 bucket that will be synced.region
(string
) (required)The AWS region of the S3 bucket.local_profile
(string
) (optional) (default: will use current credentials)Local profile to use to authenticate this account with. Please note this should be set to the name of the profile.For example, with the following credentials file:[default] aws_access_key_id=xxxx aws_secret_access_key=xxxx [user1] aws_access_key_id=xxxx aws_secret_access_key=xxxx
local_profile
should be set to eitherdefault
oruser1
.path_prefix
(string
) (optional) (default:""
)The path prefix that will limit the files to sync.filetype
(string
) (optional) (default:parquet
)Type of file that will be synced. Currently onlyparquet
is supported.rows_per_record
(integer
) (optional) (default:500
)Amount of rows to be packed into a single Apache Arrow record to be sent over the wire during sync.concurrency
(integer
) (optional) (default:50
)Number of objects to sync in parallel. Negative values mean no limit.
Licenses #
The following tools / packages are used in this plugin:
Name | License |
---|---|
github.com/JohnCGriffin/overflow | MIT |
github.com/adrg/xdg | MIT |
github.com/andybalholm/brotli | MIT |
github.com/apache/arrow-go/v18 | Apache-2.0 |
github.com/apache/arrow/go/v13 | Apache-2.0 |
github.com/apache/thrift/lib/go/thrift | Apache-2.0 |
github.com/apapsch/go-jsonmerge/v2 | MIT |
github.com/aws/aws-sdk-go-v2 | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/config | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/credentials | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/feature/ec2/imds | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/internal/configsources | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/internal/ini | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/internal/sync/singleflight | BSD-3-Clause |
github.com/aws/aws-sdk-go-v2/internal/v4a | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/internal/checksum | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/internal/s3shared | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/licensemanager | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/marketplacemetering | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/s3 | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/sso | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/ssooidc | Apache-2.0 |
github.com/aws/aws-sdk-go-v2/service/sts | Apache-2.0 |
github.com/aws/smithy-go | Apache-2.0 |
github.com/aws/smithy-go/internal/sync/singleflight | BSD-3-Clause |
github.com/cenkalti/backoff/v4 | MIT |
github.com/cloudquery/cloudquery-api-go | MPL-2.0 |
github.com/cloudquery/plugin-pb-go | MPL-2.0 |
github.com/cloudquery/plugin-sdk/v2/internal/glob | MIT |
github.com/cloudquery/plugin-sdk/v2/schema | MIT |
github.com/cloudquery/plugin-sdk/v2/types | MPL-2.0 |
github.com/cloudquery/plugin-sdk/v4 | MPL-2.0 |
github.com/cloudquery/plugin-sdk/v4/glob | MIT |
github.com/cloudquery/plugin-sdk/v4/scalar | MIT |
github.com/davecgh/go-spew/spew | ISC |
github.com/ghodss/yaml | MIT |
github.com/go-logr/logr | Apache-2.0 |
github.com/go-logr/stdr | Apache-2.0 |
github.com/goccy/go-json | MIT |
github.com/golang/snappy | BSD-3-Clause |
github.com/google/flatbuffers/go | Apache-2.0 |
github.com/google/uuid | BSD-3-Clause |
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors | Apache-2.0 |
github.com/grpc-ecosystem/grpc-gateway/v2 | BSD-3-Clause |
github.com/hashicorp/go-cleanhttp | MPL-2.0 |
github.com/hashicorp/go-retryablehttp | MPL-2.0 |
github.com/klauspost/compress | Apache-2.0 |
github.com/klauspost/compress/internal/snapref | BSD-3-Clause |
github.com/klauspost/compress/zstd/internal/xxhash | MIT |
github.com/klauspost/cpuid/v2 | MIT |
github.com/mattn/go-colorable | MIT |
github.com/mattn/go-isatty | MIT |
github.com/oapi-codegen/runtime | Apache-2.0 |
github.com/pierrec/lz4/v4 | BSD-3-Clause |
github.com/pmezard/go-difflib/difflib | BSD-3-Clause |
github.com/rs/zerolog | MIT |
github.com/santhosh-tekuri/jsonschema/v6 | Apache-2.0 |
github.com/spf13/cobra | Apache-2.0 |
github.com/spf13/pflag | BSD-3-Clause |
github.com/stretchr/testify | MIT |
github.com/thoas/go-funk | MIT |
github.com/zeebo/xxh3 | BSD-2-Clause |
go.opentelemetry.io/auto/sdk | Apache-2.0 |
go.opentelemetry.io/otel | Apache-2.0 |
go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploghttp | Apache-2.0 |
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp | Apache-2.0 |
go.opentelemetry.io/otel/exporters/otlp/otlptrace | Apache-2.0 |
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp | Apache-2.0 |
go.opentelemetry.io/otel/log | Apache-2.0 |
go.opentelemetry.io/otel/metric | Apache-2.0 |
go.opentelemetry.io/otel/sdk | Apache-2.0 |
go.opentelemetry.io/otel/sdk/log | Apache-2.0 |
go.opentelemetry.io/otel/sdk/metric | Apache-2.0 |
go.opentelemetry.io/otel/trace | Apache-2.0 |
go.opentelemetry.io/proto/otlp | Apache-2.0 |
golang.org/x/exp | BSD-3-Clause |
golang.org/x/net | BSD-3-Clause |
golang.org/x/sync/errgroup | BSD-3-Clause |
golang.org/x/sys | BSD-3-Clause |
golang.org/x/text | BSD-3-Clause |
golang.org/x/xerrors | BSD-3-Clause |
google.golang.org/genproto/googleapis/api/httpbody | Apache-2.0 |
google.golang.org/genproto/googleapis/rpc/status | Apache-2.0 |
google.golang.org/grpc | Apache-2.0 |
google.golang.org/protobuf | BSD-3-Clause |
gopkg.in/yaml.v2 | Apache-2.0 |
gopkg.in/yaml.v3 | MIT |