databricks
Official

Databricks

This plugin is in preview.

Sync your data from any supported CloudQuery source into the Databricks Data Intelligence Platform.

Publisher: cloudquery
Repository: github.com
Latest version: v1.0.1
Type: Destination
Price: Free

Overview

Databricks destination plugin

This destination plugin lets you sync data from a CloudQuery source to Databricks.
Supported Databricks versions: >= 12

Configuration

Example

kind: destination
spec:
  name: "databricks"
  path: "cloudquery/databricks"
  registry: "cloudquery"
  version: "v1.0.1"
  write_mode: "append"
  spec:
    hostname: ${DATABRICKS_HOSTNAME} # may optionally include the protocol, e.g. https://abc.cloud.databricks.com
    http_path: ${DATABRICKS_HTTP_PATH} # HTTP path for SQL compute
    staging_path: ${DATABRICKS_STAGING_PATH} # Databricks FileStore or Unity volume path to store temporary files for staging
    auth:
      access_token: ${DATABRICKS_ACCESS_TOKEN}
    # Optional parameters
    # protocol: https
    # port: 443
    # catalog: ""
    # schema: "default"
    # migration_concurrency: 10
    # timeout: 1m
    # batch:
    #   size: 10000
    #   bytes: 5242880 # 5 MiB
    #   timeout: 20s
The (top level) spec section is described in the Destination Spec Reference.
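
Because the example reads its connection settings from environment variables, an unset variable can be caught before a sync starts. The following is a minimal sketch and not part of the plugin: the helper name `missing_vars` is hypothetical, and the variable names are the ones used in the example above.

```python
import os
from typing import List, Mapping

# Environment variables referenced by the example spec above.
REQUIRED_VARS = [
    "DATABRICKS_HOSTNAME",
    "DATABRICKS_HTTP_PATH",
    "DATABRICKS_STAGING_PATH",
    "DATABRICKS_ACCESS_TOKEN",
]


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_vars()` with no arguments checks the current process environment. An empty list means every variable referenced by the example has a value.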

Databricks spec

This is the (nested) spec used by the Databricks destination plugin.
  • hostname (string) (required)
    SQL compute hostname. May optionally include the protocol (for example, https://server.databricks.com).
  • http_path (string) (required)
    SQL compute HTTP path.
  • staging_path (string) (required)
    Databricks FileStore or Unity volume path where temporary (staging) files are uploaded.
  • auth (Auth spec) (required)
    Authentication options.
  • protocol (string) (optional) (default: https)
    Protocol for connecting to Databricks. Can also be specified as part of hostname.
  • port (integer) (optional) (default: 443)
    Port for connecting to Databricks.
  • catalog (string) (optional) (default: not used)
    Catalog to be used.
  • schema (string) (optional) (default: default)
    Schema to be used.
  • batch (Batching spec) (optional)
    Batching options.
  • migration_concurrency (integer) (optional) (default: 10)
    How many table operations will be performed in parallel during migration.
  • timeout (duration) (optional) (default: 1m (= 1 minute))
    Timeout for the queries.
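
How hostname, protocol, and port combine can be pictured with a short sketch. This is not the plugin's code. The function name `resolve_endpoint` is hypothetical, and the rule that a protocol embedded in hostname takes precedence over the protocol option is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def resolve_endpoint(hostname: str, protocol: str = "https", port: int = 443) -> str:
    """Build a base URL from the three connection options (hypothetical helper).

    Assumption: if `hostname` already contains a protocol, that protocol is
    used and the separate `protocol` option is ignored.
    """
    if "://" in hostname:
        parts = urlsplit(hostname)
        protocol = parts.scheme
        hostname = parts.hostname or ""
    return f"{protocol}://{hostname}:{port}"
```

Under that assumption, `resolve_endpoint("abc.cloud.databricks.com")` and `resolve_endpoint("https://abc.cloud.databricks.com")` produce the same URL.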

Databricks authentication spec

This section specifies the authentication method used to connect to Databricks. Currently, only personal access tokens are supported.
  • access_token (string) (required)
    Personal access token.

Batching spec

This section controls how data is batched for writing.
  • size (integer) (optional) (default: 10000)
    Maximum number of items that may be grouped together to be written in a single write.
  • bytes (integer) (optional) (default: 5242880 (= 5 MiB))
    Maximum total size, in bytes, of items that may be grouped together to be written in a single write.
  • timeout (duration) (optional) (default: 20s (= 20 seconds))
    Maximum interval between batch writes.
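
The three batching limits work together: a batch is written as soon as any one of them is reached. The sketch below illustrates that idea only and is not the plugin's implementation. The `Batcher` class and its parameter names are hypothetical, the timeout is checked only when an item is added (a real writer would more likely use a timer), and the timeout is passed explicitly in seconds.

```python
import time
from typing import Callable, List


class Batcher:
    """Illustrative batcher: flushes when the size, bytes, or timeout limit is hit."""

    def __init__(
        self,
        write: Callable[[List[bytes]], None],
        timeout_s: float,
        size: int = 10000,
        max_bytes: int = 5242880,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.write = write
        self.timeout_s = timeout_s
        self.size = size
        self.max_bytes = max_bytes
        self.clock = clock
        self.items: List[bytes] = []
        self.nbytes = 0
        self.last_flush = clock()

    def add(self, item: bytes) -> None:
        # Flush first if this item would push the batch over the byte limit.
        if self.items and self.nbytes + len(item) > self.max_bytes:
            self.flush()
        self.items.append(item)
        self.nbytes += len(item)
        if len(self.items) >= self.size or self.nbytes >= self.max_bytes:
            self.flush()
        elif self.clock() - self.last_flush >= self.timeout_s:
            self.flush()

    def flush(self) -> None:
        # Hand the current batch to the writer, then start a fresh one.
        if self.items:
            self.write(self.items)
        self.items = []
        self.nbytes = 0
        self.last_flush = self.clock()
```

Larger limits mean fewer, bigger writes at the cost of more buffered data. Smaller limits make rows visible sooner but increase the number of write operations.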


Types

Apache Arrow type conversion

The Databricks destination plugin supports most Apache Arrow types. The following table shows the supported types and the Databricks data types they are mapped to.

| Arrow Column Type | Databricks Type |
|---|---|
| Binary | BINARY |
| Binary View | BINARY |
| Boolean | BOOLEAN |
| Date32 | DATE |
| Date64 | DATE |
| Decimal128 (Decimal) | DECIMAL |
| Decimal256 | DECIMAL |
| Fixed Size Binary | BINARY |
| Fixed Size List | ARRAY |
| Float16 | FLOAT |
| Float32 | FLOAT |
| Float64 | DOUBLE |
| Int8 | TINYINT |
| Int16 | SMALLINT |
| Int32 | INTEGER |
| Int64 | BIGINT |
| Large Binary | BINARY |
| Large List | ARRAY |
| Large String | STRING |
| List | ARRAY |
| Null | VOID |
| Map | MAP |
| String | STRING |
| String View | STRING |
| Struct | STRUCT |
| Time32 | TIMESTAMP |
| Time64 | TIMESTAMP |
| Timestamp | TIMESTAMP |
| UUID (CloudQuery extension) | STRING |
| Uint8 | SMALLINT |
| Uint16 | INTEGER |
| Uint32 | BIGINT |
| Uint64 | BIGINT |
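
In the mapping above, unsigned Arrow integers go to wider signed Databricks integer types, except for Uint64, which maps to the 64-bit signed BIGINT. The sketch below checks which of those mappings can hold every possible value. It is illustrative arithmetic only: the names `UNSIGNED_MAPPING` and `fits_without_overflow` are hypothetical, and how the plugin handles out-of-range Uint64 values is not stated in this document.

```python
# Bit widths of the signed Databricks integer types.
SIGNED_BITS = {"TINYINT": 8, "SMALLINT": 16, "INTEGER": 32, "BIGINT": 64}

# Unsigned rows copied from the mapping above:
# Arrow type -> (unsigned bit width, Databricks type).
UNSIGNED_MAPPING = {
    "Uint8": (8, "SMALLINT"),
    "Uint16": (16, "INTEGER"),
    "Uint32": (32, "BIGINT"),
    "Uint64": (64, "BIGINT"),
}


def fits_without_overflow(arrow_type: str) -> bool:
    """True if every value of the unsigned Arrow type fits in its target type."""
    bits, target = UNSIGNED_MAPPING[arrow_type]
    max_unsigned = 2 ** bits - 1
    max_signed = 2 ** (SIGNED_BITS[target] - 1) - 1
    return max_unsigned <= max_signed
```

Only Uint64 fails the check: values above 2^63 - 1 cannot be represented in a BIGINT column, which is worth keeping in mind for sources that emit large unsigned identifiers or counters.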

