Transform
Run Transform
Once installed, the transform is run by calling the RUN() function in the installed Decode GA4 dataset.
Updating the transform_configs run_mode before executing the RUN function gives fine-grained control over transformation behaviour:
| RUN MODE | REFRESH DESCRIPTION |
|---|---|
| full | All date partitions are refreshed. |
| incremental | Only new date partitions, plus any modified date partitions within the past n days determined by automation.window_days. |
| test | Only the first n date partitions, determined by the automation.window_days in each transform configuration. |
By default, the first execution runs in full mode and subsequent executions run incrementally.
Note that incremental mode only replaces date partitions within automation.window_days that have actually been modified; it does not reprocess every date in the window regardless. It can therefore be worth setting the window larger than the 72-hour period within which GA4 states that data may be modified.
As an aside, we have observed modifications to GA4 source data significantly beyond 72 hours.
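The selection rule for incremental mode can be modelled as follows. This is a minimal sketch of the behaviour described above, not the Decode GA4 implementation: the function name incremental_partitions, the representation of source and destination state as dictionaries of timestamps, and the inclusive window boundary are all assumptions made for illustration.

```python
from datetime import date, datetime, timedelta


def incremental_partitions(source_modified, destination_processed, today, window_days):
    """Return the date partitions an incremental run would refresh (illustrative model).

    source_modified: {partition date: last-modified time of the source data}
    destination_processed: {partition date: time the partition was last written}
    """
    window_start = today - timedelta(days=window_days)
    selected = []
    for partition, modified_at in sorted(source_modified.items()):
        if partition not in destination_processed:
            # New partition: always processed.
            selected.append(partition)
        elif partition >= window_start and modified_at > destination_processed[partition]:
            # Existing partition: reprocessed only if modified AND inside the window.
            selected.append(partition)
    return selected


today = date(2024, 1, 10)
processed_at = datetime(2024, 1, 9, 6, 0)
source = {
    date(2024, 1, 4): datetime(2024, 1, 9, 12, 0),   # modified late, 6 days old
    date(2024, 1, 8): datetime(2024, 1, 9, 12, 0),   # modified, 2 days old
    date(2024, 1, 9): datetime(2024, 1, 9, 5, 0),    # unchanged since the last run
    date(2024, 1, 10): datetime(2024, 1, 10, 1, 0),  # new, not yet in the destination
}
destination = {d: processed_at for d in source if d < today}

narrow = incremental_partitions(source, destination, today, window_days=3)
wide = incremental_partitions(source, destination, today, window_days=7)
print("window_days=3:", narrow)
print("window_days=7:", wide)
```

In this model the late modification to the 4 January partition is picked up only with the wider window, which is the reason for considering a window larger than 72 hours.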
Transform Configuration
The transform configuration is automatically generated as part of the installation process, and therefore does not need any modification to run Decode GA4 out of the box. This configuration is deployed as the function transform_configs() in the deployment_dataset_id dataset for reference and subsequent update and usage.
The transform_configs JSON array contains the transformation-level configurations, as well as information regarding the source and destination. You should not need to update this configuration unless you want to change any of the transformation configuration values.
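Changing a configuration value amounts to editing one field of an entry in the array. The sketch below does this in Python on a trimmed copy of a configuration. The helper name set_run_mode is hypothetical, and how the edited array is redeployed as transform_configs() is outside the scope of this example.

```python
import json

# Trimmed example entry; a real configuration also has destination and source blocks.
transform_configs = json.loads("""
[
  {
    "automation": {"run_mode": "incremental", "window_days": 3},
    "config": {"transform_id": "events", "transform_index": 0}
  }
]
""")


def set_run_mode(configs, transform_id, run_mode, window_days=None):
    """Return a copy of configs with the automation settings of one transform changed."""
    updated = json.loads(json.dumps(configs))  # deep copy via a JSON round trip
    for entry in updated:
        if entry["config"]["transform_id"] == transform_id:
            entry["automation"]["run_mode"] = run_mode
            if window_days is not None:
                entry["automation"]["window_days"] = window_days
            return updated
    raise KeyError(f"no transform with transform_id {transform_id!r}")


# Request a one-off full refresh of the events transform.
full_refresh = set_run_mode(transform_configs, "events", "full")
print(json.dumps(full_refresh, indent=2))
```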
External Table Configuration
This configuration processes all identified dates and exports them to sub-folders segregated by project, dataset, transform and date in the specified GCS bucket. An external table is deployed to access the partitioned data in BigQuery.
[
{
"automation": {
"run_mode": "incremental",
"window_days": 3
},
"config": {
"transform_config_template": "events_external",
"transform_function": "events",
"transform_id": "events",
"transform_index": 0
},
"destination": {
"compression": "GZIP",
"dataset_id": "project_id.deployment_dataset_id",
"description": "base events external table, one row is a single event",
"format": "PARQUET",
"gcs_bucket_name": "ugg-data/ga4",
"granularity": [
"event_id"
],
"hive_partition_column_name": "partition_date",
"hive_partitioned": true,
"table_name": "events",
"table_type": "external"
},
"source": {
"dataset_id": "project_id.ga4_dataset_id",
"source_data_name": "source_data",
"table_prefix": "events_",
"table_type": "date_sharded"
}
}
]
Partitioned Table Configuration
This configuration exports all data to a date-partitioned table in the specified deployment dataset.
[
{
"automation": {
"run_mode": "incremental",
"window_days": 3
},
"config": {
"transform_index": 0,
"transform_config_template": "events_partitioned",
"transform_id": "events",
"transform_function": "events"
},
"destination": {
"dataset_id": "project_id.deployment_dataset_id",
"table_type": "date_partitioned",
"table_name": "events",
"description": "base events date-partitioned table, one row is a single event",
"granularity": ["event_id"],
"partition_expression": "event_date",
"clustering_column_list": ["event_name"]
},
"source": {
"dataset_id": "project_id.ga4_dataset_id",
"table_type": "date_sharded",
"table_prefix": "events_",
"source_id": "source_data"
}
}
]
Run Modes
Updating the transform_configs run_mode before executing the RUN function gives fine-grained control over transformation behaviour:
| RUN MODE | REFRESH DESCRIPTION |
|---|---|
| full | All date partitions are refreshed. |
| incremental | Only new date partitions, plus any modified date partitions within the past n days determined by automation.window_days (default: 7 days) in each transform configuration. |
| first | Only the first n date partitions, determined by the automation.window_days (default: 30 days) in each transform configuration. |
| last | Only the last n date partitions, determined by the automation.window_days (default: 30 days) in each transform configuration. |
| gapfill | Any date partitions identified in the source data which are not observed in the destination data. |
| modified | Any new or modified date partitions in the source data. |
| range | A specific date range, determined by the automation.start_date and automation.end_date in each transform configuration. |
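The modes that depend only on which dates exist can be illustrated with a short Python model. It is a sketch under stated assumptions, not the Decode GA4 implementation: select_partitions is a hypothetical name, start_date and end_date are assumed to be ISO-formatted strings with inclusive bounds, and the incremental and modified modes are left out because they also need modification times.

```python
from datetime import date, timedelta


def select_partitions(run_mode, source_dates, destination_dates, automation):
    """Pick the date partitions to refresh for a subset of run modes (illustrative)."""
    ordered = sorted(source_dates)
    n = automation.get("window_days", 30)
    if run_mode == "full":
        return ordered
    if run_mode == "first":
        return ordered[:n]
    if run_mode == "last":
        # Slice from an explicit index so that n == 0 yields an empty list.
        return ordered[max(len(ordered) - n, 0):]
    if run_mode == "gapfill":
        return [d for d in ordered if d not in destination_dates]
    if run_mode == "range":
        # Assumption: ISO date strings, both bounds inclusive.
        start = date.fromisoformat(automation["start_date"])
        end = date.fromisoformat(automation["end_date"])
        return [d for d in ordered if start <= d <= end]
    raise ValueError(f"run mode not covered by this sketch: {run_mode}")


# Ten source dates (1 to 10 January 2024), two of which are missing downstream.
source_dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(10)]
destination_dates = set(source_dates) - {date(2024, 1, 4), date(2024, 1, 7)}
automation = {"window_days": 3, "start_date": "2024-01-02", "end_date": "2024-01-03"}

for mode in ("first", "last", "gapfill", "range"):
    chosen = select_partitions(mode, source_dates, destination_dates, automation)
    print(mode, [d.isoformat() for d in chosen])
```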
Extension
Additional transformations can be included by adding their configuration to the transform_configs JSON array, with the transformation logic defined in a date-bounded table-valued function (with start_date and end_date as the required arguments). This enables subsequent transformations to be executed and orchestrated without any additional third-party tool.
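Adding an entry to the array might look like the following Python sketch. Everything specific in it is an assumption for illustration: the helper add_transform, the sessions transform and its function name are hypothetical, and the idea that transform_index simply increases for each added transform is inferred from the examples rather than documented here. The table-valued function itself would still need to be defined separately in BigQuery.

```python
import copy


def add_transform(configs, template_entry, transform_id, transform_function, table_name):
    """Return configs plus a new entry modelled on an existing one (hypothetical helper)."""
    entry = copy.deepcopy(template_entry)
    entry["config"]["transform_id"] = transform_id
    entry["config"]["transform_function"] = transform_function
    # Assumption: transform_index orders the transforms, so the new one goes last.
    entry["config"]["transform_index"] = (
        max(c["config"]["transform_index"] for c in configs) + 1
    )
    entry["destination"]["table_name"] = table_name
    return configs + [entry]


# Trimmed version of the events entry; source and other destination fields omitted.
events = {
    "automation": {"run_mode": "incremental", "window_days": 3},
    "config": {"transform_id": "events", "transform_function": "events", "transform_index": 0},
    "destination": {"table_name": "events", "table_type": "date_partitioned"},
}
configs = [events]

# "sessions" is a made-up follow-on transform used only to show the mechanics.
extended = add_transform(configs, events, "sessions", "sessions", "sessions")
print([c["config"]["transform_id"] for c in extended])
```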