Overview
Objectives
The Google Analytics 4 export to BigQuery has a complex nested structure which is difficult to query in SQL, making analytics tasks time-consuming and the resulting SQL long, hard to understand and error-prone.
Decode GA4 simplifies and augments the GA4 BigQuery export, making subsequent data operations simpler, quicker and cheaper.
It automates pre-modelling of the GA4 BigQuery events_YYYYMMDD export, providing flattened, date-partitioned events data in a compressed file format in Google Cloud Storage, along with a corresponding external table for reading the data back into BigQuery. The underlying folder structure is hive-partitioned, supporting efficient date-filtered querying.
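To illustrate the problem (the project and dataset names below are placeholders), retrieving even a single parameter from the raw export means unnesting the repeated event_params records and coalescing across the typed value fields - a pattern that must be repeated for every parameter of interest:

```sql
-- Raw GA4 export: extracting one parameter requires unnesting the
-- repeated event_params records and coalescing across typed values.
SELECT
  event_date,
  event_name,
  (
    SELECT COALESCE(
      value.string_value,
      CAST(value.int_value AS STRING),
      CAST(value.double_value AS STRING)
    )
    FROM UNNEST(event_params)
    WHERE key = 'page_location'
  ) AS page_location
FROM `my-project.analytics_123456789.events_20240101`;  -- placeholder table
```

Decode GA4 removes this boilerplate by flattening these structures once, upstream of all downstream queries.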
Philosophy
Data engineering and modelling form a broad, complex field, with no clearly agreed-upon methods for determining which activities take place at which stage of the data lifecycle. Our philosophy is that clear separation of activities is extremely valuable when developing any pipeline that delivers verified data to downstream consumers. A few of the practical decisions which have informed this solution are outlined below:
Structure
- Build segregation between pipeline stages - separating source data, transformation functions, base data and transformed data into different datasets to enforce clean boundaries and support data quality assurance.
- Build transformation logic into idempotent functions - for stability, predictability and recoverability of data.
- Automate structural profiling - to derive the transformation configuration from the observed data profile, minimizing the need for human interaction and manual configuration updates.
- Standardise core transforms - building core transformations from a single, standardised base transformation function, improving structural simplicity, readability and maintainability.
- Auto-generate custom transforms - to automatically profile data and build custom transform functions for decoding complex nested structures (a sketch of such a generated function follows this list).
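As a purely hypothetical sketch (the function name, dataset and parameter keys are illustrative, not the solution's actual generated code), an auto-generated event_params decoder might take the following shape:

```sql
-- Hypothetical sketch of an auto-generated decoding UDF; the real
-- generated functions are derived from the observed data profile.
CREATE OR REPLACE FUNCTION `my-project.transforms.decode_event_params`(
  params ARRAY<STRUCT<key STRING,
    value STRUCT<string_value STRING, int_value INT64,
                 double_value FLOAT64, float_value FLOAT64>>>)
AS ((
  SELECT AS STRUCT
    -- One field per observed key, typed from the profiled value column.
    MAX(IF(key = 'page_location', value.string_value, NULL)) AS page_location,
    MAX(IF(key = 'ga_session_id', value.int_value, NULL)) AS ga_session_id
  FROM UNNEST(params)
));
```

Generating one typed struct field per observed key is what turns the repeated key/value records into flat, directly selectable columns.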
Deployment
- Leverage metadata-driven automation logic - for cost efficiency, performance and scalability (see the profiling sketch after this list).
- Handle schema updates elegantly - updating to the latest schema, manually or automatically, whenever it changes, so new fields are incorporated quickly without code edits or the risk of breaking downstream modelling.
- Design for configurability - with clean, well-documented defaults and configuration options to cover all potential requirements.
- Design for interoperability - using standardised file formats and existing interfaces to simplify integration and cross-system workflows.
- Design for extensibility - with an execution structure which can be trivially extended with additional transformation or export stages.
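As a hedged sketch of the metadata-driven approach (table paths are placeholders), the configuration driving code generation can be derived by profiling which parameter keys occur in the export and which typed value column each key populates:

```sql
-- Hypothetical profiling query: observe which event_params keys occur
-- and which typed value column each populates, to drive code generation.
SELECT
  ep.key AS param_key,
  COUNTIF(ep.value.string_value IS NOT NULL) AS string_count,
  COUNTIF(ep.value.int_value IS NOT NULL) AS int_count,
  COUNTIF(ep.value.double_value IS NOT NULL) AS double_count
FROM `my-project.analytics_123456789.events_*` AS e,
  UNNEST(e.event_params) AS ep
GROUP BY param_key
ORDER BY param_key;
```

Because the configuration is observed rather than hand-maintained, new parameters appearing in the export can be picked up without manual code updates.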
Output
This results in a very concise, clean code structure and architecture: the base events transformation function is less than 100 lines of SQL, supported by custom, auto-generated user-defined functions to decode event_params, user_properties and item_params, and an event_count function to simplify event metric comparison and conversions.
The local struct automatically adjusts event timestamps to local time - derived from the timezone of the city, region and country identified in the GA4 data - supporting more nuanced behavioural analysis and targeting by local time of day.
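The underlying conversion can rely on BigQuery's native timezone handling; for example (the timezone string here is hard-coded for illustration, where the solution derives it from the geo fields):

```sql
-- Convert the UTC microsecond event timestamp to local wall-clock time,
-- given an IANA timezone string (hard-coded here for illustration).
SELECT
  event_timestamp,
  DATETIME(TIMESTAMP_MICROS(event_timestamp), 'America/New_York')
    AS local_event_datetime
FROM `my-project.analytics_123456789.events_20240101`;  -- placeholder table
```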
Source data is transformed, augmented, compressed and exported to date-partitioned folders in a Google Cloud Storage bucket, and an external table is created in BigQuery to enable efficient, cost-effective querying of the base events data.
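For reference, an external table over a hive-partitioned bucket can be declared with standard BigQuery DDL; the bucket path, partition column name and Parquet format below are assumptions, not necessarily the solution's defaults:

```sql
-- Hypothetical external table over hive-partitioned files in GCS.
-- Paths and format are placeholders; the solution creates its own.
CREATE EXTERNAL TABLE `my-project.base.events`
WITH PARTITION COLUMNS (event_date DATE)
OPTIONS (
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://my-bucket/events',
  uris = ['gs://my-bucket/events/*']
);
```

Because the partition column is encoded in the folder path, date-filtered queries read only the matching partitions, which is what keeps querying efficient and cost-effective.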