Set up Probabilistic IDR (Google Cloud Storage)

Overview

Probabilistic matching does some of its work outside your data warehouse, so it needs a Google Cloud Storage (GCS) bucket in your cloud account to store data during the process. Hightouch uses this bucket as a workspace during each graph run: it reads your source records, runs the matching process, and writes results back to your data warehouse. All data stays in your cloud account and under your control.

Some data persists in the bucket between runs so Hightouch doesn't have to start from scratch each time.

Step 1: Create a custom IAM role

Create a new role in your GCP project with the following permissions:

storage.buckets.get
storage.objects.list
storage.objects.create
storage.objects.get
storage.objects.delete

gcloud iam roles create YOUR_CUSTOM_ROLE_NAME \
  --project=YOUR_PROJECT_ID \
  --title="YOUR CUSTOM ROLE NAME" \
  --description="Custom role for Hightouch probabilistic IDR" \
  --permissions="storage.buckets.get,storage.objects.list,storage.objects.create,storage.objects.get,storage.objects.delete"
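
To confirm that the role was created with the expected permissions, you can describe it. The following is a minimal sketch, not an official Hightouch step: show_role_permissions is a hypothetical helper name, and its arguments are the same placeholders used in the command above.

```shell
# Hypothetical helper: print the permissions attached to a custom role.
# $1 = custom role ID, $2 = GCP project ID
show_role_permissions() {
  gcloud iam roles describe "$1" \
    --project="$2" \
    --format="value(includedPermissions)"
}

# Usage (placeholders, replace with your own values):
# show_role_permissions YOUR_CUSTOM_ROLE_NAME YOUR_PROJECT_ID
```

The output should list the five storage permissions from this step.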

Step 2: Grant the role to the Hightouch service account

Hightouch will provide a service account specifically for probabilistic IDR.

Once you have the service account email, grant it the custom role:

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:HT_PROVIDED_SERVICE_ACCOUNT@YOUR_DOMAIN.iam.gserviceaccount.com" \
  --role="projects/YOUR_PROJECT_ID/roles/YOUR_CUSTOM_ROLE_NAME"
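
To check that the binding took effect, you can list the roles that the project's IAM policy grants to the service account. This is a sketch under assumptions: list_member_roles is a hypothetical helper name, and the filter matches on the service account email you received from Hightouch.

```shell
# Hypothetical helper: list the roles bound to one member in a project.
# $1 = GCP project ID, $2 = service account email
list_member_roles() {
  gcloud projects get-iam-policy "$1" \
    --flatten="bindings[].members" \
    --filter="bindings.members:$2" \
    --format="value(bindings.role)"
}

# Usage (placeholders, replace with your own values):
# list_member_roles YOUR_PROJECT_ID HT_PROVIDED_SERVICE_ACCOUNT@YOUR_DOMAIN.iam.gserviceaccount.com
```

The custom role should appear in the output by its full resource name, in the form projects/YOUR_PROJECT_ID/roles/YOUR_CUSTOM_ROLE_NAME.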

Step 3: Create a GCS bucket

Create a GCS bucket that meets the following requirements:

It is located in the same cloud provider and region as your Hightouch workspace.
It has no object lifecycle rules that delete or expire objects in the following path:

/workspace-$WORKSPACE_ID/datalake
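
One way to create the bucket and inspect its lifecycle configuration from the command line is sketched below. The helper names create_idr_bucket and show_bucket_lifecycle are hypothetical, and the bucket name, project ID, and location are placeholders you supply; choose the location that matches your Hightouch workspace region.

```shell
# Hypothetical helper: create the bucket in a given location.
# $1 = bucket name, $2 = GCP project ID, $3 = bucket location (for example, a region)
create_idr_bucket() {
  gcloud storage buckets create "gs://$1" \
    --project="$2" \
    --location="$3"
}

# Hypothetical helper: print the bucket's lifecycle configuration so you can
# check that no rule deletes or expires objects under the datalake path.
# $1 = bucket name
show_bucket_lifecycle() {
  gcloud storage buckets describe "gs://$1" \
    --format="default(lifecycle_config)"
}

# Usage (placeholders, replace with your own values):
# create_idr_bucket YOUR_BUCKET_NAME YOUR_PROJECT_ID YOUR_BUCKET_LOCATION
# show_bucket_lifecycle YOUR_BUCKET_NAME
```

If you reuse an existing bucket, the lifecycle check is the more relevant of the two helpers.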

Share the bucket name with your Hightouch team so they can complete setup on their side.

Step 4: Grant bucket access to the BigQuery service account

Grant the same custom IAM role from Step 1 to the BigQuery service account used by Hightouch in your project.

This ensures Hightouch can read and write IDR data as part of warehouse workflows.
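
The grant can be sketched with the same gcloud command used for the Hightouch-provided service account. This assumes a project-level binding, as in Step 2; grant_idr_role is a hypothetical helper name, and the BigQuery service account email is a placeholder you need to look up in your own project. Custom roles are referenced by their full resource name, projects/PROJECT_ID/roles/ROLE_ID.

```shell
# Hypothetical helper: bind a project-level custom role to a service account.
# $1 = GCP project ID, $2 = service account email, $3 = custom role ID
grant_idr_role() {
  gcloud projects add-iam-policy-binding "$1" \
    --member="serviceAccount:$2" \
    --role="projects/$1/roles/$3"
}

# Usage (placeholders, replace with your own values):
# grant_idr_role YOUR_PROJECT_ID YOUR_BIGQUERY_SERVICE_ACCOUNT_EMAIL YOUR_CUSTOM_ROLE_NAME
```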

FAQs

Can I reuse an existing GCS bucket?

Yes. You can use the same GCS bucket you use for self-hosted external storage, as long as the bucket does not have lifecycle rules that delete objects in the required path.

Note that the service account used for probabilistic IDR is different from the service account generated by the Hightouch app and from any other service account you may already use.