Skip to main content

BigQuery

Setting up the BigQuery destination connector involves setting up the data loading method and configuring the BigQuery destination connector using the Airbyte UI.

This page guides you through setting up the BigQuery destination connector.

Prerequisites

Setup guide

Step 1: Set up a data loading method

Using Batched Standard Inserts

You can use the BigQuery driver's built-in conversion to take INSERT statements and convert that to file uploads which are then loaded into BigQuery in batches. This is the simplest way to load data into BigQuery in a performant way. These staging files are managed by BigQuery and deleted automatically after the load is complete.

Using a Google Cloud Storage bucket

If you want more control of how and where your staging files are stored, you can opt to use a GCS bucket.

To use a Google Cloud Storage bucket:

  1. Create a Cloud Storage bucket with the Protection Tools set to none or Object versioning. Make sure the bucket does not have a retention policy.
  2. Create an HMAC key and access ID.
  3. Grant the Storage Object Admin role to the Google Cloud Service Account. This must be the same service account as the one you configure for BigQuery access in the BigQuery connector setup step.
  4. Make sure your Cloud Storage bucket is accessible from the machine running Airbyte. The easiest way to verify if Airbyte is able to connect to your bucket is via the check connection tool in the UI.

Your bucket must be encrypted using a Google-managed encryption key (this is the default setting when creating a new bucket). We currently do not support buckets using customer-managed encryption keys (CMEK). You can view this setting under the "Configuration" tab of your GCS bucket, in the Encryption type row.

Step 2: Set up the BigQuery connector

  1. Log into your Airbyte Cloud or Airbyte Open Source account.
  2. Click Destinations and then click + New destination.
  3. On the Set up the destination page, select BigQuery from the Destination type dropdown.
  4. Enter the name for the BigQuery connector.
  5. For Project ID, enter your Google Cloud project ID.
  6. For Dataset Location, select the location of your BigQuery dataset.
warning

You cannot change the location later.

  1. For Default Dataset ID, enter the BigQuery Dataset ID.
  2. For Loading Method, select Batched Standard Inserts or GCS Staging.
  3. For Service Account Key JSON (Required for cloud, optional for open-source), enter the Google Cloud Service Account Key in JSON format.
note

Be sure to copy all contents in the Account Key JSON file including the brackets.

  1. For Transformation Query Run Type (Optional), select interactive to have BigQuery run interactive query jobs or batch to have BigQuery run batch queries.
note

Interactive queries are executed as soon as possible and count towards daily concurrent quotas and limits, while batch queries are executed as soon as idle resources are available in the BigQuery shared resource pool. If BigQuery hasn't started the query within 24 hours, BigQuery changes the job priority to interactive. Batch queries don't count towards your concurrent rate limit, making it easier to start many queries at once.

  1. For Google BigQuery Client Chunk Size (Optional), use the default value of 15 MiB. Later, if you see networking or memory management problems with the sync (specifically on the destination), try decreasing the chunk size. In that case, the sync will be slower but more likely to succeed.

Supported sync modes

The BigQuery destination connector supports the following sync modes:

  • Full Refresh Sync
  • Incremental - Append Sync
  • Incremental - Append + Deduped

Output schema

Airbyte outputs each stream into its own raw table in airbyte_internal dataset by default (can be overriden by user) and a final table with Typed columns. Contents in raw table are NOT deduplicated.

Raw Table schema

Airbyte fieldDescriptionColumn type
_airbyte_raw_idA UUID assigned to each processed eventSTRING
_airbyte_extracted_atA timestamp for when the event was pulled from the data sourceTIMESTAMP
_airbyte_loaded_atTimestamp to indicate when the record was loaded into Typed tablesTIMESTAMP
_airbyte_dataA JSON blob with the event data.STRING

Note: Although the contents of the _airbyte_data are fairly stable, schema of the raw table could be subject to change in future versions.

Final Table schema

  • airbyte_raw_id: A UUID assigned by Airbyte to each event that is processed. The column type in BigQuery is String.
  • airbyte_extracted_at: A timestamp representing when the event was pulled from the data source. The column type in BigQuery is Timestamp.
  • _airbyte_meta: A JSON blob representing typing errors. You can query these results to audit misformatted or unexpected data. The column type in BigQuery is JSON. ... and a column of the proper data type for each of the top-level properties from your source's schema. Arrays and Objects will remain as JSON columns in BigQuery. Learn more about Typing and Deduping here

The output tables in BigQuery are partitioned by the Time-unit column airbyte_extracted_at at a daily granularity and clustered by airbyte_extracted_at and the table Primary Keys. Partitions boundaries are based on UTC time. This is useful to limit the number of partitions scanned when querying these partitioned tables, by using a predicate filter (a WHERE clause). Filters on the partitioning column are used to prune the partitions and reduce the query cost. (The parameter Require partition filter is not enabled by Airbyte, but you may toggle it by updating the produced tables.)

BigQuery Naming Conventions

Follow BigQuery Datasets Naming conventions.

Airbyte converts any invalid characters into _ characters when writing data. However, since datasets that begin with _ are hidden on the BigQuery Explorer panel, Airbyte prepends the namespace with n for converted namespaces.

Data type map

Airbyte typeBigQuery type
STRINGSTRING
STRING (BASE64)STRING
STRING (BIG_NUMBER)STRING
STRING (BIG_INTEGER)STRING
NUMBERNUMERIC
INTEGERINT64
BOOLEANBOOL
STRING (TIMESTAMP_WITH_TIMEZONE)TIMESTAMP
STRING (TIMESTAMP_WITHOUT_TIMEZONE)DATETIME
STRING (TIME_WITH_TIMEZONE)STRING
STRING (TIME_WITHOUT_TIMEZONE)TIME
DATEDATE
OBJECTJSON
ARRAYJSON

Troubleshooting permission issues

The service account does not have the proper permissions.

  • Make sure the BigQuery service account has BigQuery User and BigQuery Data Editor roles or equivalent permissions as those two roles.
  • If the GCS staging mode is selected, ensure the BigQuery service account has the right permissions to the GCS bucket and path or the Cloud Storage Admin role, which includes a superset of the required permissions.

The HMAC key is wrong.

  • Make sure the HMAC key is created for the BigQuery service account, and the service account has permission to access the GCS bucket and path.

Tutorials

Now that you have set up the BigQuery destination connector, check out the following BigQuery tutorials:

Changelog

Expand to review
VersionDatePull RequestSubject
2.8.62024-07-3042511Added a copy operation to validate copy permissions in the check function
2.8.52024-07-2242407Batched Standard Inserts is default loading mode
2.8.42024-07-1541968Don't hang forever on empty stream list; shorten error message on INCOMPLETE stream status
2.8.32024-07-1241674Upgrade to latest CDK
2.8.22024-07-0841041Fix resume logic in truncate refreshes to prevent data loss
2.8.12024-06-2539379Removing requirement of a redundant permission bigquery.datasets.create permission
2.8.02024-06-2139904Convert all production code to kotlin
2.7.12024-06-1739526Internal code change for improved error reporting in case of source/platform failure (INCOMPLETE stream status / empty ConfiguredCatalog).
2.7.02024-06-1738713Support for refreshes and resumable full refresh. WARNING: You must upgrade to platform 0.63.7 before upgrading to this connector version.
2.6.32024-06-1038331Internal code changes in preparation for future feature release
2.6.22024-06-0738764Increase message length limit to 50MiB
2.6.12024-05-2938770Internal code change (switch to CDK artifact)
2.6.02024-05-2838359Propagate airbyte_meta from sources; add generation_id column
2.5.12024-05-2238591Bugfix to include forward-slash when cleaning up stage
2.5.02024-05-2238132Major rewrite of existing code, Adapting to CDK changes introduced in 38107
2.4.202024-05-1338131Cleanup BigQueryWriteConfig and reuse StreamConfig; Adapt to StreamConfig signature changes
2.4.192024-05-1038125adopt latest CDK code
2.4.182024-05-1038111No functional changes, deleting unused code
2.4.172024-05-0938098Internal build structure change
2.4.162024-05-0837714Adopt CDK 0.34.0
2.4.152024-05-0734611Adopt CDK 0.33.2
2.4.142024-02-2537584Remove unused insecure dependencies from CDK
2.4.132024-02-2536899adopt latest CDK
2.4.122024-03-0435315Adopt CDK 0.23.11
2.4.112024-02-2235569Fix logging bug.
2.4.102024-02-1535240Adopt CDK 0.20.9
2.4.92024-02-1535285Adopt CDK 0.20.8
2.4.82024-02-1235144Adopt CDK 0.20.2
2.4.72024-02-1235111Adopt CDK 0.20.1
2.4.62024-02-0934575Adopt CDK 0.20.0
2.4.52024-02-0834745Adopt CDK 0.19.0
2.4.42024-02-0835027Upgrade CDK to 0.17.1
2.4.32024-02-0134728Upgrade CDK to 0.16.4; Notable changes from 0.14.2, 0.15.1 and 0.16.3
2.4.22024-01-2434451Improve logging for unparseable input
2.4.12024-01-2434458Improve error reporting
2.4.02024-01-2434468Upgrade CDK to 0.14.0
2.3.312024-01-22#34023Combine DDL operations into a single execution
2.3.302024-01-12#34226Upgrade CDK to 0.12.0; Cleanup dependencies
2.3.292024-01-09#34003Fix loading credentials from GCP Env
2.3.282024-01-08#34021Add idempotency ids in dummy insert for check call
2.3.272024-01-05#33948Skip retrieving initial table state when setup fails
2.3.262024-01-04#33730Internal code structure changes
2.3.252023-12-20#33704Update to java CDK 0.10.0 (no changes)
2.3.242023-12-20#33697Stop creating unnecessary tmp tables
2.3.232023-12-18#33124Make Schema Creation Separate from Table Creation
2.3.222023-12-14#33451Remove old spec option
2.3.212023-12-13#33232Only run typing+deduping for a stream if the stream had any records
2.3.202023-12-08#33263Adopt java CDK version 0.7.0
2.3.192023-12-07#32326Update common T&D interfaces
2.3.182023-12-04#33084T&D SQL statements moved to debug log level
2.3.172023-12-04#33078Further increase gcs COPY timeout
2.3.162023-11-14#32526Clean up memory manager logs.
2.3.152023-11-13#32468Further error grouping enhancements
2.3.142023-11-06#32234Remove unused config option.
2.3.132023-11-08#32125fix compiler warnings
2.3.122023-11-08#32309Revert: Use Typed object for connection config
2.3.112023-11-07#32147Use Typed object for connection config
2.3.102023-11-07#32261Further improve error reporting
2.3.92023-11-07#32112GCS staging mode: reduce flush frequency to use rate limit more efficiently
2.3.82023-11-06#32026Move SAFE_CAST transaction to separate transactions
2.3.72023-11-06#32190Further improve error reporting
2.3.62023-11-06#32193Adopt java CDK version 0.4.1.
2.3.52023-11-02#31983Improve error reporting
2.3.42023-10-31#32010Add additional data centers.
2.3.32023-10-30#31985Delay upgrade deadline to Nov 7
2.3.22023-10-30#31960Adopt java CDK version 0.2.0.
2.3.12023-10-27#31529Performance enhancement (switch to a merge statement for incremental-dedup syncs)
2.3.02023-10-25#31686Opt out flag for typed and deduped tables
2.2.02023-10-25#31520Stop deduping raw table
2.1.62023-10-23#31717Remove inadvertent Destination v2 check
2.1.52023-10-17#30069Staging destination async
2.1.42023-10-17#31191Improve typing+deduping performance by filtering new raw records on extracted_at
2.1.32023-10-10#31358Stringify array and object types for type:string column in final table
2.1.22023-10-10#31194Deallocate unused per stream buffer memory when empty
2.1.12023-10-10#31083Fix precision of numeric values in async destinations
2.1.02023-10-09#31149No longer fail syncs when PKs are null - try do dedupe anyway
2.0.262023-10-09#31198Clarify configuration groups
2.0.252023-10-09#31185Increase staging file upload timeout to 5 minutes
2.0.242023-10-06#31139Bump CDK version
2.0.232023-10-06#31129Reduce async buffer size
2.0.222023-10-04#31082Revert null PK checks
2.0.212023-10-03#31028Update timeout
2.0.202023-09-26#30779Final table PK columns become non-null and skip check for null PKs in raw records (performance)
2.0.192023-09-26#30775Increase async block size
2.0.182023-09-27#30739Fix column name collision detection
2.0.172023-09-26#30696Attempt unsafe typing operations with an exception clause
2.0.162023-09-22#30697Improve resiliency to unclean exit during schema change
2.0.152023-09-21#30640Handle streams with identical name and namespace
2.0.142023-09-20#30069Staging destination async
2.0.132023-09-19#30592Internal code changes
2.0.122023-09-19#30319Improved testing
2.0.112023-09-18#30551GCS Staging is first loading method option
2.0.102023-09-15#30491Improve error message display
2.0.92023-09-14#30439Fix a transient error
2.0.82023-09-12#30364Add log message
2.0.72023-08-29#29878Internal code changes
2.0.62023-09-05#29917Improve performance by changing metadata error array construction from ARRAY_CONCAT to ARRAY_AGG
2.0.52023-08-31#30020Run typing and deduping tasks in parallel
2.0.42023-09-05#30117Type and Dedupe at sync start and then every 6 hours
2.0.32023-09-01#30056Internal refactor, no behavior change
2.0.22023-09-01#30120Improve performance on very wide streams by skipping SAFE_CAST on strings
2.0.12023-08-29#29972Publish a new version to supersede old v2.0.0
2.0.02023-08-27#29783Destinations V2
1.10.22023-08-24#29805Destinations v2: Don't soft reset in migration
1.10.12023-08-23#29774Destinations v2: Don't soft reset overwrite syncs
1.10.02023-08-21#29636Destinations v2: Several Critical Bug Fixes (cursorless dedup, improved floating-point handling, improved special characters handling; improved error handling)
1.9.12023-08-21#28687Under the hood: Add dependency on Java CDK v0.0.1.
1.9.02023-08-17#29560Destinations v2: throw an error on disallowed column name prefixes
1.8.12023-08-17#29522Migration BugFix - ensure raw dataset created
1.8.02023-08-17#29498Fix checkpointing logic in GCS staging mode
1.7.82023-08-15#29461Migration BugFix - ensure migration happens before table creation for GCS staging.
1.7.72023-08-11#29381Destinations v2: Add support for streams with no columns
1.7.62023-08-04#28894Destinations v2: Add v1 -> v2 migration Logic
1.7.52023-08-04#29106Destinations v2: handle unusual CDC deletion edge case
1.7.42023-08-04#29089Destinations v2: improve special character handling in column names
1.7.32023-08-03#28890Internal code updates; improved testing
1.7.22023-08-02#28976Fix composite PK handling in v1 mode
1.7.12023-08-02#28959Destinations v2: Fix CDC syncs in non-dedup mode
1.7.02023-08-01#28894Destinations v2: Open up early access program opt-in
1.6.02023-07-26#28723Destinations v2: Change raw table dataset and naming convention
1.5.82023-07-25#28721Destinations v2: Handle cursor change across syncs
1.5.72023-07-24#28625Destinations v2: Limit Clustering Columns to 4
1.5.62023-07-21#28580Destinations v2: Create dataset in user-specified location
1.5.52023-07-20#28490Destinations v2: Fix schema change detection in OVERWRITE mode when existing table is empty; other code refactoring
1.5.42023-07-17#28382Destinations v2: Schema Change Detection
1.5.32023-07-14#28345Increment patch to trigger a rebuild
1.5.22023-07-05#27936Internal scaffolding change for future development
1.5.12023-06-30#27891Revert bugged update
1.5.02023-06-27#27781License Update: Elv2
1.4.62023-06-28#27268Internal scaffolding change for future development
1.4.52023-06-21#27555Reduce image size
1.4.42023-05-25#26585Small tweak in logs for clarity
1.4.32023-05-17#26213Fix bug in parsing file buffer config count
1.4.22023-05-10#25925Testing update. Normalization tests are now done in the destination container.
1.4.12023-05-11#25993Internal library update
1.4.02023-04-29#25570Internal library update. Bumping version to stay in sync with BigQuery-denormalized.
1.3.42023-04-28#25588Internal scaffolding change for future development
1.3.32023-04-27#25346Internal code cleanup
1.3.12023-04-20#25097Internal scaffolding change for future development
1.3.02023-04-19#25287Add parameter to configure the number of file buffers when GCS is used as the loading method
1.2.202023-04-12#25122Add additional data centers
1.2.192023-03-29#24671Fail faster in certain error cases
1.2.182023-03-23#24447Set the Service Account Key JSON field to always_show: true so that it isn't collapsed into an optional fields section
1.2.172023-03-17#23788S3-Parquet: added handler to process null values in arrays
1.2.162023-03-10#23931Added support for periodic buffer flush
1.2.152023-03-10#23466Changed S3 Avro type from Int to Long
1.2.142023-02-08#22497Fixed table already exists error
1.2.132023-01-26#20631Added support for destination checkpointing with staging
1.2.122023-01-18#21087Wrap Authentication Errors as Config Exceptions
1.2.112023-01-18#21144Added explicit error message if sync fails due to a config issue
1.2.92022-12-14#20501Report GCS staging failures that occur during connection check
1.2.82022-11-22#19489Added non-billable projects handle to check connection stage
1.2.72022-11-11#19358Fixed check method to capture mismatch dataset location
1.2.62022-11-10#18554Improve check connection method to handle more errors
1.2.52022-10-19#18162Improve error logs
1.2.42022-09-26#16890Add user-agent header
1.2.32022-09-22#17054Respect stream namespace
1.2.12022-09-14#15668(bugged, do not use) Wrap logs in AirbyteLogMessage
1.2.02022-09-09#14023(bugged, do not use) Cover arrays only if they are nested
1.1.162022-09-01#16243Fix Json to Avro conversion when there is field name clash from combined restrictions (anyOf, oneOf, allOf fields)
1.1.152022-08-22#15787Throw exception if job failed
1.1.142022-08-03#14784Enabling Application Default Credentials
1.1.132022-08-02#14801Fix multiple log bindings
1.1.122022-08-02#15180Fix standard loading mode
1.1.112022-06-24#14114Remove "additionalProperties": false from specs for connectors with staging
1.1.102022-06-16#13852Updated stacktrace format for any trace message errors
1.1.92022-06-17#13753Deprecate and remove PART_SIZE_MB fields from connectors based on StreamTransferManager
1.1.82022-06-07#13579Always check GCS bucket for GCS loading method to catch invalid HMAC keys.
1.1.72022-06-07#13424Reordered fields for specification.
1.1.62022-05-15#12768Clarify that the service account key json field is required on cloud.
1.1.52022-05-12#12805Updated to latest base-java to emit AirbyteTraceMessage on error.
1.1.42022-05-04#12578In JSON to Avro conversion, log JSON field values that do not follow Avro schema for debugging.
1.1.32022-05-02#12528Update Dataset location field description
1.1.22022-04-29#12477Dataset location is a required field
1.1.12022-04-15#12068Fixed bug with GCS bucket conditional binding
1.1.02022-04-06#11776Use serialized buffering strategy to reduce memory consumption.
1.0.22022-03-30#11620Updated spec
1.0.12022-03-24#11350Improve check performance
1.0.02022-03-18#11238Updated spec and documentation
0.6.122022-03-18#10793Fix namespace with invalid characters
0.6.112022-03-03#10755Make sure to kill children threads and stop JVM
0.6.82022-02-14#10256Add -XX:+ExitOnOutOfMemoryError JVM option
0.6.62022-02-01#9959Fix null pointer exception from buffered stream consumer.
0.6.62022-01-29#9745Integrate with Sentry.
0.6.52022-01-18#9573BigQuery Destination : update description for some input fields
0.6.42022-01-17#8383Support dataset-id prefixed by project-id
0.6.32022-01-12#9415BigQuery Destination : Fix GCS processing of Facebook data
0.6.22022-01-10#9121Fixed check method for GCS mode to verify if all roles assigned to user
0.6.12021-12-22#9039Added part_size configuration to UI for GCS staging
0.6.02021-12-17#8788BigQuery/BiqQuery denorm Destinations : Add possibility to use different types of GCS files
0.5.12021-12-16#8816Update dataset locations
0.5.02021-10-26#7240Output partitioned/clustered tables
0.4.12021-10-04#6733Support dataset starting with numbers
0.4.02021-08-26#5296Added GCS Staging uploading option
0.3.122021-08-03#3549Add optional arg to make a possibility to change the BigQuery client's chunk\buffer size
0.3.112021-07-30#5125Enable additionalPropertities in spec.json
0.3.102021-07-28#3549Add extended logs and made JobId filled with region and projectId
0.3.92021-07-28#5026Add sanitized json fields in raw tables to handle quotes in column names
0.3.62021-06-18#3947Service account credentials are now optional.
0.3.42021-06-07#3277Add dataset location option