Skip to content

BigQuery Functions

The following GoogleSQL functions are deployed to transform and augment inbound data. The original published article can be found on Medium here.

FORMAT_DATE

The FORMAT_DATE function is used to convert a DATE value into a STRING with a specific date syntax. In this case the '%Y%m%d' date syntax formats the output into a string in the format YYYYMMDD.

SELECT FORMAT_DATE('%Y%m%d', '2024-01-07')
> 20240107

PARSE_DATE

The PARSE_DATE function is the inverse of the FORMAT_DATE function, and is used to convert a STRING value into a DATE value, given an expected date syntax in the input STRING value. 

SELECT PARSE_DATE('%Y%m%d', '20240107')
> 2024-01-07

_TABLE_SUFFIX

The _TABLE_SUFFIX is technically a pseudo-column (a metadata column which is hidden but can be queried), which can be used in conjunction with a wildcard query and the FORMAT_DATE function to efficiently query the events_YYYYMMDD table. Note that this query assumes that the start_date and end_date DATE variables have been set as script variables or table function arguments.

SELECT 
_TABLE_SUFFIX AS table_suffix, *
FROM `[project_id].analytics_#########].events_*`
WHERE _TABLE_SUFFIX BETWEEN 
FORMAT_DATE("%Y%m%d", start_date) AND 
FORMAT_DATE("%Y%m%d", end_date)

This query syntax returns all source table columns with the _TABLE_SUFFIX as a STRING column called table_suffix before the rest of the columns.

TO_JSON_STRING

The TO_JSON_STRING function returns a JSON-formatted STRING representation of the data passed to it as an argument. This argument can be a variable, column value or the name of a preceding common table expression, as in the example below.

WITH
add_ids AS (
  SELECT 
  'project_id' AS project_id,
  'analytics_123456789' AS dataset_name)

SELECT 
TO_JSON_STRING(add_ids) AS add_ids_json
FROM add_ids
> {"project_id":"project_id","dataset_name":"analytics_123456789"}

NULL values are also preserved in the output, which makes it robust to unpredictable input data. This can be observed in the output from the example below.

WITH
add_ids AS (
  SELECT 
  'project_id' AS project_id,
  'analytics_123456789' AS dataset_name,
  NULL AS example_value)

SELECT 
TO_JSON_STRING(add_ids) AS add_ids_json
FROM add_ids
> {"project_id":"project_id","dataset_name":"analytics_123456789","example_value":null}

ARRAY and STRUCT values also directly map to JSON, so any BigQuery data structure can be represented in JSON.

Note that the function TO_JSON can be used interchangeably with this function, in which case the function will return a JSON value instead of a JSON-formatted STRING value. 

SHA256

The SHA256 function is a hashing function, which returns a 44 character BYTES value based on an input STRING or BYTES of any length. For a given input it will always return the same output, which makes it a useful function to check for uniqueness or identicality of inputs. Adding this to the example above gives a unique id for each row of data in the add_ids common table expression.

WITH
add_ids AS (
  SELECT 
  'project_id' AS project_id,
  'analytics_123456789' AS dataset_name,
  NULL AS example_value)

SELECT 
SHA256(TO_JSON_STRING(add_ids)) AS row_id
FROM add_ids
> XBbBYHTigXnQLJwZQ9dJ0y8TJexcWfhBiJWfp2pdVfE=

TO_HEX

The TO_HEX function is a function which converts a sequence of BYTES (as returned by the SHA256 function above) into a hexadecimal STRING. This means that the resulting STRING will only contain the characters 0..9 and a..f, making it cleaner for certain additional operations and copy/paste actions in the user interface.

WITH
add_ids AS (
  SELECT 
  'project_id' AS project_id,
  'analytics_123456789' AS dataset_name,
  NULL AS example_value)

SELECT 
TO_HEX(SHA256(TO_JSON_STRING(add_ids))) AS row_id
FROM add_ids
> 5c16c16074e28179d02c9c1943d749d32f1325ec5c59f84188959fa76a5d55f1

COALESCE

The COALESCE function returns the first non-null element from a sequence of values. Note that all elements in the sequence need to be coercible to a common supertype.

SELECT COALESCE (NULL,"value_x","value_y", NULL) AS value
> value_x

SAFE_CAST

The SAFE_CAST function is the SAFE version of the CAST function, meaning that if the input value cannot be CAST to the target data type, it will return NULL value instead of an error. Note that for many other functions this behaviour can be enforced by using the SAFE. prefix.

SELECT SAFE_CAST("value_x" AS INT64) AS value
> null

Conversely, if using the CAST function, the function will error.

SELECT CAST("value_x" AS INT64) AS value
> Query error: Bad int64 value: value_x at [1:1]

TIMESTAMP_MICROS

The TIMESTAMP_MICROS function converts a 16-digit unix timestamp (microseconds) to a TIMESTAMP.

SELECT TIMESTAMP_MICROS(1711353064567254) 
> 2024-03-25 07:51:04.567254 UTC

DATE

The DATE function extracts the DATE from a TIMESTAMP. Used in conjunction with the TIMESTAMP_MICROS function, it enables conversion from a unix timestamp to a DATE.

SELECT DATE(TIMESTAMP_MICROS(1711353064567254));
> 2024-03-25

TIME

The TIME function extracts the TIME (independent of the date) from a TIMESTAMP. Used in conjunction with the TIMESTAMP_MICROS function, it enables conversion from a unix timestamp to a TIME.

SELECT TIME(TIMESTAMP_MICROS(1711353064567254));
> 07:51:04.567254

REGEXP_CONTAINS

The REGEXP_CONTAINS function returns a BOOL, evaluating to TRUE if the input STRING matches the input regular expression:

SELECT REGEXP_CONTAINS('somebody@email.com', r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') AS email_is_valid
> true

If the STRING does not match the regular expression, it will evaluate to FALSE.

SELECT REGEXP_CONTAINS('not_an_email.com', r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') AS email_is_valid
> false

One important characteristic to note is that if either function argument is NULL, the expression will evaluate to NULL.

SELECT REGEXP_CONTAINS(NULL , r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') AS email_is_valid
> null
SELECT REGEXP_CONTAINS('somebody@email.com', NULL) AS email_is_valid
> null