GCP Provider
Status: ✅ Production Ready
Version: 1.0.0
Services: BigQuery, Cloud Storage, IAM, Cloud Run, Pub/Sub
Overview
The Google Cloud Platform provider is the flagship Fluid Forge implementation, offering production-grade support for BigQuery, Cloud Storage, and comprehensive GCP services.
Why GCP?
- Serverless Analytics - BigQuery eliminates infrastructure management
- Cost-Effective - Pay-per-query pricing with generous free tier
- Enterprise Scale - Petabyte-scale analytics out of the box
- ML Integration - Native BigQuery ML and Vertex AI
- Security Built-In - Column-level security, data masking, audit logs
Quick Start
Prerequisites
# Install gcloud SDK
curl https://sdk.cloud.google.com | bash
# Authenticate
gcloud auth application-default login
# Set project
gcloud config set project YOUR_PROJECT_ID
# Enable APIs
gcloud services enable bigquery.googleapis.com
gcloud services enable storage.googleapis.com
Minimal Contract
fluidVersion: "0.7"
kind: DataProduct
exposeId: my-data-product
binding:
provider: gcp
project: my-project-id
region: us-central1
exposes:
- type: dataset
name: analytics
tables:
- name: customers
schema:
- name: id
type: INTEGER
required: true
- name: name
type: STRING
Deploy:
fluid apply contract.yaml --provider gcp
Generate Orchestration Code:
# Generate Airflow DAG
fluid generate-airflow contract.yaml -o dags/my_pipeline.py
# Export to Dagster
fluid export contract.yaml --engine dagster -o pipelines/
# Export to Prefect
fluid export contract.yaml --engine prefect -o flows/
Supported Features
✅ BigQuery
| Feature | Support | Notes |
|---|---|---|
| Datasets | ✅ Full | Multi-region, labels, access control |
| Tables | ✅ Full | Partitioning, clustering, expiration |
| Views | ✅ Full | Standard and materialized views |
| External Tables | ✅ Full | GCS, Google Sheets, Bigtable |
| Routines | ✅ Full | UDFs, stored procedures |
| Authorized Views | ✅ Full | Fine-grained access control |
| Policy Tags | ✅ Full | Column-level security (Phase 1-3) |
| Data Masking | ✅ Full | Dynamic data masking |
| Row-Level Security | 🔜 Q2 2026 | RLS policies |
✅ Cloud Storage
| Feature | Support | Notes |
|---|---|---|
| Buckets | ✅ Full | Multi-region, versioning |
| Objects | ✅ Full | Upload, download, lifecycle |
| Lifecycle Policies | ✅ Full | Auto-delete, archival |
| Signed URLs | ✅ Full | Temporary access |
| Notifications | ✅ Full | Pub/Sub integration |
✅ Airflow DAG Generation (v0.7.1)
| Feature | Support | Notes |
|---|---|---|
| Airflow DAGs | ✅ Full | Cloud Composer compatible |
| BigQuery Operators | ✅ Full | Query, table, dataset, view operations |
| GCS Operators | ✅ Full | Bucket and object management |
| Pub/Sub Operators | ✅ Full | Topic and subscription operations |
| Dataflow Operators | ✅ Full | Beam pipeline execution |
| Contract Validation | ✅ Full | Structure checks + circular dependency detection |
| Dagster Pipelines | ✅ Full | Type-safe ops with resources |
| Prefect Flows | ✅ Full | Retry logic and deployment configs |
Performance:
- Average generation time: 0.8-2ms
- Average output size: 2-10KB
- Test coverage: 100% (all provider tests passing)
✅ IAM & Security
| Feature | Support | Notes |
|---|---|---|
| Service Accounts | ✅ Full | Auto-creation, key management |
| IAM Bindings | ✅ Full | Least-privilege access |
| Policy Tags | ✅ Full | Taxonomy management |
| Audit Logs | ✅ Full | Admin, data access logs |
| VPC Service Controls | 🔜 Q2 2026 | Network isolation |
⏳ Cloud Run (Preview)
| Feature | Support | Notes |
|---|---|---|
| Services | ✅ Beta | Container deployment |
| Jobs | ✅ Beta | Batch processing |
| Auto-scaling | ✅ Beta | Request-based scaling |
| Custom Domains | 🔜 Q2 2026 | HTTPS endpoints |
Configuration
Provider Settings
platform:
provider: gcp
# Required
project: my-project-id
region: us-central1
# Optional
location: US # BigQuery multi-region (US, EU)
zone: us-central1-a # Specific zone for compute
# Cost controls
cost_controls:
enable_bi_engine: true # Query acceleration
bi_engine_gb: 10 # BI Engine cache size
default_table_expiration_days: 365
max_bytes_billed: 10000000000 # 10 GB query limit
# Networking
network:
vpc: default
subnet: default
private_google_access: true
# Labels (applied to all resources)
labels:
environment: production
team: data-engineering
cost-center: analytics
BigQuery Best Practices
Partitioning
Partition tables by date for performance and cost savings:
tables:
- name: events
partitioning:
field: event_timestamp
type: DAY # or HOUR, MONTH, YEAR
require_partition_filter: true # Enforce partitioned queries
expiration_days: 90 # Auto-delete old partitions
Cost savings: Up to 90% reduction for time-based queries
Clustering
Cluster columns for better query performance:
tables:
- name: events
clustering:
fields: [user_id, event_type, country] # Max 4 fields
Performance: Up to 10x faster queries on clustered columns
Materialized Views
Pre-compute aggregations:
tables:
- name: daily_metrics
materialized: true
query: |
SELECT
DATE(event_timestamp) as date,
user_id,
COUNT(*) as event_count,
SUM(revenue) as total_revenue
FROM `${project}.events.raw_events`
GROUP BY date, user_id
# Refresh settings
refresh:
enabled: true
interval_minutes: 60 # Refresh hourly
Benefit: Sub-second queries on complex aggregations
Security & Governance
Column-Level Security
Protect sensitive data with policy tags:
governance:
policy_tags:
- taxonomy: data_classification
tags:
- name: PII
description: Personally Identifiable Information
columns: [email, phone, ssn]
- name: Financial
description: Financial data
columns: [salary, credit_card]
tables:
- name: customers
schema:
- name: email
type: STRING
policy_tag: PII # Restricted access
- name: name
type: STRING # No policy tag = public
IAM Integration:
# Grant access to PII data
gcloud data-catalog taxonomies add-iam-policy-binding \
data_classification \
--member="user:analyst@company.com" \
--role="roles/datacatalog.categoryFineGrainedReader"
Data Masking
Automatically mask sensitive data:
governance:
data_masking:
- column: email
masking_type: DEFAULT # user@example.com → u***@e***.com
policy_tag: PII
- column: credit_card
masking_type: SHA256 # One-way hash
policy_tag: Financial
Access Control
Define granular permissions:
access:
dataset_access:
- role: READER
members:
- user:analyst@company.com
- group:data-analysts@company.com
- domain:company.com # Everyone in domain
- role: WRITER
members:
- serviceAccount:etl@project.iam.gserviceaccount.com
- role: OWNER
members:
- user:data-admin@company.com
table_access:
- table: customers
role: READER
members:
- user:marketing@company.com
Loading Data
From Cloud Storage
tables:
- name: sales
load_from_gcs:
uri: gs://my-bucket/data/*.csv
format: CSV
schema_auto_detect: true
skip_leading_rows: 1
allow_jagged_rows: false
encoding: UTF-8
From Local Files
# Use bq CLI for one-time loads
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
my_dataset.my_table \
data/file.csv
Streaming Inserts
from google.cloud import bigquery
client = bigquery.Client()
table_id = "project.dataset.table"
rows = [
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25}
]
errors = client.insert_rows_json(table_id, rows)
if not errors:
print("Rows inserted successfully")
Cost Optimization
Query Optimization
-- ❌ BAD: Scans entire table
SELECT * FROM `project.dataset.events`
WHERE DATE(event_time) = '2026-01-20'
-- ✅ GOOD: Uses partition filter
SELECT * FROM `project.dataset.events`
WHERE event_time >= '2026-01-20'
AND event_time < '2026-01-21'
Storage Classes
buckets:
- name: analytics-archive
storage_class: NEARLINE # For infrequent access
lifecycle:
- action: SetStorageClass
storage_class: COLDLINE
age_days: 90 # Move to coldline after 90 days
- action: Delete
age_days: 365 # Delete after 1 year
Cost Monitoring
# Check current month costs
bq query --use_legacy_sql=false \
'SELECT
SUM(total_bytes_processed) / POW(10, 12) as tb_processed,
SUM(total_bytes_processed) / POW(10, 12) * 5 as estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE DATE(creation_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)'
Advanced Features
BigQuery ML
Train models directly in BigQuery:
routines:
- name: churn_prediction_model
type: ML_MODEL
training_query: |
CREATE OR REPLACE MODEL `${project}.${dataset}.churn_model`
OPTIONS(
model_type='LOGISTIC_REG',
input_label_cols=['churned']
) AS
SELECT
* EXCEPT(customer_id)
FROM `${project}.${dataset}.customer_features`
Authorized Views
Share data without granting direct access:
views:
- name: public_customer_summary
authorized: true # Can access source tables user can't see
authorized_datasets:
- project: partner-project
dataset: shared_data
query: |
SELECT
customer_id,
total_purchases,
avg_order_value
-- Excludes PII like email, name
FROM `${project}.${dataset}.customers`
Monitoring
Built-in Metrics
Fluid Forge automatically exports metrics:
monitoring:
enabled: true
metrics:
- name: query_performance
query: |
SELECT
AVG(total_slot_ms) as avg_slot_ms,
MAX(total_bytes_processed) as max_bytes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(creation_time) = CURRENT_DATE()
alerts:
- name: high_query_cost
condition: max_bytes > 10000000000 # 10 GB
notification: slack://data-team
Troubleshooting
"Access Denied" Errors
Grant yourself BigQuery Admin:
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:YOUR_EMAIL" \
--role="roles/bigquery.admin"
"Quota Exceeded"
Request quota increase:
gcloud services quota list \
--service=bigquery.googleapis.com \
--consumer="projects/PROJECT_ID"
Slow Queries
Enable query plan visualization:
-- Add to query
OPTIONS(use_query_cache=false)
-- View execution plan
SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = 'YOUR_JOB_ID'
Limitations
- Max dataset size: Unlimited
- Max table size: 10 TB (contact support for larger)
- Max query size: 100 KB SQL text
- Max columns: 10,000 per table
- Max concurrent queries: 100 (can be increased)
- Query timeout: 6 hours (distributed queries)
Roadmap
Q2 2026
- ✅ Row-Level Security (RLS) policies
- ✅ Dataflow integration
- ✅ Cloud Composer orchestration
- ✅ VPC Service Controls
Q3 2026
- ✅ BigQuery Omni (multi-cloud)
- ✅ Data transfer service automation
- ✅ Advanced BI Engine features
- ✅ Cross-project analytics
Next Steps
- Getting Started - First GCP deployment
- GCP Walkthrough - Hands-on tutorial
- CLI Reference - GCP-specific commands
- Governance Guide - Security deep-dive
GCP Provider maintained by the Fluid Forge core team
