BigQuery
Sync from BigQuery to any destination

Publisher: cloudquery (official, premium)
Latest version: v1.9.2
Type: Source
Price per 1M rows: starting from $15
Monthly free quota: 1M rows
Set up process #
1. Download the CLI and log in:
   brew install cloudquery/tap/cloudquery
   cloudquery login
2. Create source and destination configs
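Once both configs are in place, a sync can be run with the CLI. A minimal sketch, assuming the source and destination specs are saved as bigquery.yml and postgresql.yml (the file names are illustrative):

cloudquery sync bigquery.yml postgresql.yml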
Overview #
The CloudQuery BigQuery source plugin syncs tables from BigQuery to any of the supported CloudQuery destinations (e.g. PostgreSQL, BigQuery, Snowflake, and more).
Example Configuration #
This example syncs from BigQuery to a Postgres destination. The (top level) source spec section is described in the Source Spec Reference.
kind: source
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v1.9.2"
  tables: ["*"]
  destinations: ["postgresql"]
  # Learn more about the configuration options at https://cql.ink/bigquery_source
  spec:
    project_id: ${PROJECT_ID}
    dataset_id: ${DATASET_ID}
    # Optional parameters
    # dataset_location: ""
    # service_account_key_json: ""
    # endpoint: ""
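For completeness, a matching PostgreSQL destination spec might look like the sketch below. The version and connection string are placeholders, not defaults; check the PostgreSQL destination plugin documentation for current values.

kind: destination
spec:
  name: postgresql
  path: cloudquery/postgresql
  registry: cloudquery
  version: "vX.Y.Z" # placeholder; use the latest released version
  spec:
    # Placeholder connection string; substitute your own,
    # ideally via an environment variable such as ${PG_CONNECTION_STRING}
    connection_string: "postgresql://user:pass@localhost:5432/db"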
Configuration Reference #
This is the (nested) spec used by the BigQuery source plugin.
project_id (string) (required)
The ID of the project where the BigQuery dataset resides.

dataset_id (string) (required)
The ID of the BigQuery dataset within the project, e.g. my_dataset.

dataset_location (string) (optional)
The data location of the BigQuery dataset. If set, it will be used as the default location for job operations. Pro tip: this can solve "dataset not found" issues for newly created datasets.

service_account_key_json (string) (optional) (default: empty)
GCP service account key content. This allows using different service accounts for the GCP source and BigQuery destination. If using service account keys, it is best to use environment or file variable substitution.

endpoint (string) (optional)
The BigQuery API endpoint to use. This is useful for testing against a local emulator.
table_options (map of Table Options Spec) (optional)
Table options to set for specific tables. Untemplated table names are the keys. Only valid if queries is empty.

queries (list of Queries Spec) (optional)
(Preview feature) List of queries to run instead of directly reading tables. The tables list in the top-level spec should be left as * or be a subset of the query names.
destination_table_name (string) (optional) (default: {{TABLE}})
(Preview feature) The destination table name to write the data to. Supports the following placeholder variables (see the example after this reference):
- {{TABLE}} will be replaced with the table name
- {{DATASET}} will be replaced with the dataset name
- {{UUID}} will be replaced with a random UUID in raw format to uniquely identify each table
- {{YEAR}} will be replaced with the current year in YYYY format
- {{MONTH}} will be replaced with the current month in MM format
- {{DAY}} will be replaced with the current day in DD format
- {{HOUR}} will be replaced with the current hour in HH format
- {{MINUTE}} will be replaced with the current minute in mm format
At least one of {{TABLE}} or {{UUID}} is required. Note that timestamps are in UTC and reflect the time the sync started.
concurrency (integer) (optional) (default: 100)
Number of tables to sync concurrently. Decrease or increase this number based on your database size and available resources.

discovery_concurrency (integer) (optional) (default: 100)
Number of goroutines to use for discovering table schemas.

scheduler (string) (optional) (default: dfs)
The scheduler to use when determining the priority of resources to sync. Supported values are dfs (depth-first search), round-robin, shuffle and shuffle-queue. For more information about this, see performance tuning.
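To make the optional parameters above concrete, here is a hedged sketch of a nested spec that combines file substitution for the service account key with a templated destination table name. The file path and table-name pattern are illustrative assumptions, not defaults, and the ${file:...} substitution syntax should be checked against the CloudQuery configuration docs:

spec:
  project_id: my-project
  dataset_id: my_dataset
  # Assumption: ${file:...} substitution loads the key content from disk,
  # keeping the key itself out of the config file.
  service_account_key_json: ${file:./sa-key.json}
  # Illustrative pattern: writes each table to a date-suffixed destination
  # table, e.g. my_table_20240101.
  destination_table_name: "{{TABLE}}_{{YEAR}}{{MONTH}}{{DAY}}"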
Table Options Spec (Preview) #
Enables setting table options for specific tables. Map key is the name of the table in untemplated form.
incremental_column (string) (optional)
Name of the incremental column in the table. If empty, no incremental column will be used.
Queries Spec (Preview) #
Allows running arbitrary queries instead of fetching existing tables. Each query's result set is synced as a separate table.
name (string) (required)
Name of the table to be generated from the query.

query (string) (required)
SQL query to run. Should have {incremental_column} > @cursor or similar in the WHERE clause if incremental_column is set.

incremental_column (string) (optional)
Name of the incremental column in the query result set. The query must reference @cursor in its WHERE clause and have a column with this name in the result set. If empty, no incremental column will be used.

incremental_column_type (enum) (required if incremental_column is set)
Type of the incremental column, one of TIMESTAMP, INTEGER or STRING.
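Queries do not have to be incremental. A minimal sketch of a query that is fully re-run on every sync, with illustrative table and column names:

queries:
  - name: daily_totals
    # No incremental_column, so this result set is rebuilt on each sync
    query: "SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date"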
Authentication #
The BigQuery plugin authenticates using your Application Default Credentials. The available options are the same ones described in detail in Google Cloud's Application Default Credentials documentation:
Local environment:
- Run gcloud auth application-default login (recommended when running locally).
Google Cloud-based development environment:
- When you run on Cloud Shell or Cloud Code, credentials are already available.
Google Cloud containerized environment:
- When running on GKE, use workload identity.
- Services such as Compute Engine, App Engine and Cloud Functions support attaching a user-managed service account, which CloudQuery will be able to utilize.
On-premises or another cloud provider:
- The suggested way is to use Workload Identity Federation.
- If that is not available, you can use service account keys and export the location of the key via GOOGLE_APPLICATION_CREDENTIALS. (Not recommended, as long-lived keys are a security risk.)
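As a sketch of that last (discouraged) option, assuming a key file was downloaded to an illustrative path such as /path/to/sa-key.json:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa-key.json
cloudquery sync bigquery.yml postgresql.yml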
Underlying library #
We use the official cloud.google.com/go/bigquery package for the database connection.
Configuration Examples for Incremental Sync #
To sync tables incrementally, each table needs an incremental_column specified in the table_options section of the configuration. Tables without an incremental_column will be synced fully.

Sync a Table Incrementally #
To sync a table incrementally, specify the incremental_column in the table_options section of the configuration. Here's an example source spec that syncs a list of tables incrementally:
kind: source
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v1.9.2"
  tables:
    - "my_table"
    - "another_table"
    - "yet_another_table"
  destinations: ["postgresql"]
  backend_options:
    table_name: "cq_state_bq"
    connection: "@@plugins.postgresql.connection"
  spec:
    project_id: my-project
    dataset_id: my_dataset
    table_options:
      my_table:
        incremental_column: updated_at
      another_table:
        incremental_column: id
In the example above, the my_table table will be synced incrementally based on the updated_at column, another_table will be synced incrementally based on the id column, and yet_another_table will be synced fully.

The incremental column is used to fetch only the new or updated rows from the source table, and the sync state is stored in the cq_state_bq table in the PostgreSQL destination.
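To verify that incremental state is being tracked, you can inspect the backend table in the destination. A hedged sketch using psql, assuming the cq_state_bq table from the config above and a PG_CONNECTION_STRING environment variable (the exact schema of the state table is internal to CloudQuery):

psql "$PG_CONNECTION_STRING" -c 'SELECT * FROM cq_state_bq;'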
Sync a Custom Query Incrementally #
To sync a custom query incrementally, specify the incremental_column as well as the incremental_column_type in the Queries Spec. Here's an example source spec that syncs a custom query incrementally:
kind: source
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v1.9.2"
  tables:
    - "my_query_result"
  destinations: ["postgresql"]
  backend_options:
    table_name: "cq_state_bq"
    connection: "@@plugins.postgresql.connection"
  spec:
    project_id: my-project
    dataset_id: my_dataset
    queries:
      - name: my_query_result
        query: "SELECT * FROM my_table WHERE updated_at > @cursor"
        incremental_column: updated_at
        incremental_column_type: timestamp
In the example above, the my_query_result table will be generated from the custom query SELECT * FROM my_table WHERE updated_at > @cursor and synced incrementally based on the updated_at column.

@cursor is a placeholder that will be replaced with the last sync time. If no sync time is available, it will be replaced with the positive minimum value of that type.