Google Cloud Storage
Overview #
GCS (Google Cloud Storage) Destination Plugin
This destination plugin lets you sync data from a CloudQuery source to remote GCS (Google Cloud Storage) storage in various formats such as CSV, JSON and Parquet.
This is useful in various use cases, especially in data lakes, where you can query the data directly from Athena or load it into data warehouses such as BigQuery, Redshift, Snowflake and others.
Example #
This example configures a GCS destination to create Parquet files in gcs://bucket_name/path/to/files:
kind: destination
spec:
  name: "gcs"
  path: "cloudquery/gcs"
  registry: "cloudquery"
  version: "v5.3.1"
  write_mode: "append"
  spec:
    bucket: "bucket_name"
    path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
    format: "parquet" # options: parquet, json, csv
    format_spec:
      # CSV specific parameters:
      # delimiter: ","
      # skip_header: false
      # Parquet specific parameters:
      # version: "v2Latest"
      # root_repetition: "repeated"
      # max_row_group_length: 134217728 # 128 * 1024 * 1024

    # Optional parameters
    # compression: "" # options: gzip
    # no_rotate: false
    # batch_size: 10000
    # batch_size_bytes: 52428800 # 50 MiB
    # batch_timeout: 30s
Note that the GCS plugin only supports the append write_mode. The (top level) spec section is described in the Destination Spec Reference. The GCS destination utilizes batching, and supports batch_size, batch_size_bytes and batch_timeout options (see below).
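As an illustration, the batching options are set in the nested spec; the values below are arbitrary examples (not recommended defaults) chosen to flush smaller objects more frequently:
kind: destination
spec:
  name: "gcs"
  path: "cloudquery/gcs"
  registry: "cloudquery"
  version: "v5.3.1"
  write_mode: "append" # the only supported write mode
  spec:
    bucket: "bucket_name"
    path: "path/to/files/{{TABLE}}/{{UUID}}.{{FORMAT}}"
    format: "json"
    # example values only: flush smaller objects more often than the defaults
    batch_size: 1000
    batch_size_bytes: 5242880 # 5 MiB
    batch_timeout: 10s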
GCS Spec #
This is the (nested) spec used by the GCS destination plugin.
bucket (string) (required)
Bucket where to sync the files.

path (string) (required)
Path to where the files will be uploaded in the above bucket, for example path/to/files/{{TABLE}}/{{UUID}}.parquet. If no path variables are present, the path will be appended with TABLE, FORMAT and Compression extension by default.
The path supports the following placeholder variables (see the example after this list):
- {{TABLE}} will be replaced with the table name
- {{SYNC_ID}} will be replaced with the unique identifier of the sync. This value is a UUID and is randomly generated for each sync.
- {{FORMAT}} will be replaced with the file format, such as csv, json or parquet. If compression is enabled, the format will be csv.gz, json.gz etc.
- {{UUID}} will be replaced with a random UUID to uniquely identify each file
- {{YEAR}} will be replaced with the current year in YYYY format
- {{MONTH}} will be replaced with the current month in MM format
- {{DAY}} will be replaced with the current day in DD format
- {{HOUR}} will be replaced with the current hour in HH format
- {{MINUTE}} will be replaced with the current minute in mm format

format (string) (required)
Format of the output file. Supported values are csv, json and parquet.

format_spec (format_spec) (optional)
Optional parameters to change the format of the file.

compression (string) (optional) (default: empty)
Compression algorithm to use. Supported values are empty or gzip. Not supported for parquet format.

no_rotate (boolean) (optional) (default: false)
If set to true, the plugin will write to one file per table. Otherwise, for every batch a new file will be created with a different .<UUID> suffix.

batch_size (integer) (optional) (default: 10000)
Number of records to write before starting a new object.

batch_size_bytes (integer) (optional) (default: 52428800 (50 MiB))
Number of bytes (as Arrow buffer size) to write before starting a new object.

batch_timeout (duration) (optional) (default: 30s (30 seconds))
Maximum interval between batch writes.
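As an example of the path placeholders, the nested spec below (only the nested portion is shown; the bucket name and prefix are placeholders you would replace) sketches a date-partitioned layout with gzip-compressed CSV output:
spec:
  bucket: "bucket_name"
  # one folder per table, partitioned by the date of the sync
  path: "path/to/files/{{TABLE}}/{{YEAR}}/{{MONTH}}/{{DAY}}/{{UUID}}.{{FORMAT}}"
  format: "csv"
  compression: "gzip" # {{FORMAT}} resolves to csv.gz when compression is enabled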
format_spec #
CSV
delimiter (string) (optional) (default: ,)
Delimiter to use in the CSV file.

skip_header (boolean) (optional) (default: false)
If set to true, the CSV file will not contain a header row as the first row.
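As a sketch, a CSV format_spec that changes both options might look like this (the semicolon delimiter assumes the plugin accepts any single-character delimiter string):
format: "csv"
format_spec:
  delimiter: ";"     # assumption: any single-character delimiter string is accepted
  skip_header: true  # omit the header row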
JSON
Reserved for future use.
Parquet
version (string) (optional) (default: v2Latest)
Parquet format version to use. Supported values are v1.0, v2.4, v2.6 and v2Latest. v2Latest is an alias for the latest version available in the Parquet library, which is currently v2.6. Useful when the reader consuming the Parquet files does not support the latest version.

root_repetition (string) (optional) (default: repeated)
Repetition option to use for the root node. Supported values are undefined, required, optional and repeated. Some Parquet readers require a specific root repetition option to be able to read the file. For example, importing Parquet files into Snowflake requires the root repetition to be undefined.

max_row_group_length (integer) (optional) (default: 134217728 (= 128 * 1024 * 1024))
The maximum number of rows in a single row group. Use a lower number to reduce memory usage when reading the Parquet files, and a higher number to increase the efficiency of reading the Parquet files.
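For instance, a Parquet format_spec aimed at older or stricter readers might look like the sketch below; the values are illustrative, so pick whatever your reader actually requires:
format: "parquet"
format_spec:
  version: "v2.4"              # for readers that do not support v2.6
  root_repetition: "undefined" # e.g. required when importing into Snowflake
  max_row_group_length: 16777216 # 16 * 1024 * 1024; smaller row groups reduce reader memory use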
Authentication #
The GCS plugin authenticates using your Application Default Credentials. The available options are the same ones described in detail in Google's Application Default Credentials documentation:
Local environment:
- Run gcloud auth application-default login (recommended when running locally).
Google Cloud-based development environment:
- When you run on Cloud Shell or Cloud Code, credentials are already available.
Google Cloud containerized environment:
- When running on GKE, use Workload Identity.
- Services such as Compute Engine, App Engine and Cloud Functions support attaching a user-managed service account, which CloudQuery will be able to utilize.
On-premises or another cloud provider:
- The suggested way is to use Workload Identity Federation.
- If that is not available, you can always use service account keys and export the location of the key via GOOGLE_APPLICATION_CREDENTIALS. (Not recommended, as long-lived keys are a security risk.)