GCP Provider

Status: ✅ Production Ready
Version: 1.0.0
Services: BigQuery, Cloud Storage, IAM, Cloud Run, Pub/Sub


Overview

The Google Cloud Platform provider is the flagship Fluid Forge implementation, offering production-grade support for BigQuery, Cloud Storage, and the surrounding GCP services (IAM, Cloud Run, Pub/Sub).

Why GCP?

  • Serverless Analytics - BigQuery eliminates infrastructure management
  • Cost-Effective - Pay-per-query pricing with generous free tier
  • Enterprise Scale - Petabyte-scale analytics out of the box
  • ML Integration - Native BigQuery ML and Vertex AI
  • Security Built-In - Column-level security, data masking, audit logs

Quick Start

Prerequisites

# Install gcloud SDK
curl https://sdk.cloud.google.com | bash

# Authenticate
gcloud auth application-default login

# Set project
gcloud config set project YOUR_PROJECT_ID

# Enable APIs
gcloud services enable bigquery.googleapis.com
gcloud services enable storage.googleapis.com

Minimal Contract

fluidVersion: "0.7"
kind: DataProduct
exposeId: my-data-product

binding:
  provider: gcp
  project: my-project-id
  region: us-central1

exposes:
  - type: dataset
    name: analytics
    
    tables:
      - name: customers
        schema:
          - name: id
            type: INTEGER
            required: true
          - name: name
            type: STRING

Deploy:

fluid apply contract.yaml --provider gcp

Generate Orchestration Code:

# Generate Airflow DAG
fluid generate-airflow contract.yaml -o dags/my_pipeline.py

# Export to Dagster
fluid export contract.yaml --engine dagster -o pipelines/

# Export to Prefect
fluid export contract.yaml --engine prefect -o flows/
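Before `apply`, the CLI validates the contract's structure. As an illustration only (the real validator's rules are not documented on this page), a minimal sketch of that kind of required-field check, using the field names from the contract above:

```python
# Hypothetical sketch of a contract structure check; the actual
# `fluid validate` rules are richer than this.
REQUIRED_TOP_LEVEL = ("fluidVersion", "kind", "exposeId", "binding", "exposes")

def check_contract(contract: dict) -> list[str]:
    """Return a list of human-readable problems (empty list = looks valid)."""
    problems = [f"missing field: {f}" for f in REQUIRED_TOP_LEVEL if f not in contract]
    if contract.get("binding", {}).get("provider") != "gcp":
        problems.append("binding.provider must be 'gcp' for this provider")
    return problems

# The minimal contract from above, as a parsed dict:
contract = {
    "fluidVersion": "0.7",
    "kind": "DataProduct",
    "exposeId": "my-data-product",
    "binding": {"provider": "gcp", "project": "my-project-id", "region": "us-central1"},
    "exposes": [{"type": "dataset", "name": "analytics"}],
}
print(check_contract(contract))  # → []
```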

Supported Features

✅ BigQuery

| Feature | Support | Notes |
|---|---|---|
| Datasets | ✅ Full | Multi-region, labels, access control |
| Tables | ✅ Full | Partitioning, clustering, expiration |
| Views | ✅ Full | Standard and materialized views |
| External Tables | ✅ Full | GCS, Google Sheets, Bigtable |
| Routines | ✅ Full | UDFs, stored procedures |
| Authorized Views | ✅ Full | Fine-grained access control |
| Policy Tags | ✅ Full | Column-level security (Phase 1-3) |
| Data Masking | ✅ Full | Dynamic data masking |
| Row-Level Security | 🔜 Q2 2026 | RLS policies |

✅ Cloud Storage

| Feature | Support | Notes |
|---|---|---|
| Buckets | ✅ Full | Multi-region, versioning |
| Objects | ✅ Full | Upload, download, lifecycle |
| Lifecycle Policies | ✅ Full | Auto-delete, archival |
| Signed URLs | ✅ Full | Temporary access |
| Notifications | ✅ Full | Pub/Sub integration |

✅ Airflow DAG Generation (v0.7.1)

| Feature | Support | Notes |
|---|---|---|
| Airflow DAGs | ✅ Full | Cloud Composer compatible |
| BigQuery Operators | ✅ Full | Query, table, dataset, view operations |
| GCS Operators | ✅ Full | Bucket and object management |
| Pub/Sub Operators | ✅ Full | Topic and subscription operations |
| Dataflow Operators | ✅ Full | Beam pipeline execution |
| Contract Validation | ✅ Full | Structure checks + circular dependency detection |
| Dagster Pipelines | ✅ Full | Type-safe ops with resources |
| Prefect Flows | ✅ Full | Retry logic and deployment configs |

Performance:

  • Average generation time: 0.8-2ms
  • Average output size: 2-10KB
  • Test coverage: 100% (all provider tests passing)

✅ IAM & Security

| Feature | Support | Notes |
|---|---|---|
| Service Accounts | ✅ Full | Auto-creation, key management |
| IAM Bindings | ✅ Full | Least-privilege access |
| Policy Tags | ✅ Full | Taxonomy management |
| Audit Logs | ✅ Full | Admin, data access logs |
| VPC Service Controls | 🔜 Q2 2026 | Network isolation |

⏳ Cloud Run (Preview)

| Feature | Support | Notes |
|---|---|---|
| Services | ✅ Beta | Container deployment |
| Jobs | ✅ Beta | Batch processing |
| Auto-scaling | ✅ Beta | Request-based scaling |
| Custom Domains | 🔜 Q2 2026 | HTTPS endpoints |

Configuration

Provider Settings

platform:
  provider: gcp
  
  # Required
  project: my-project-id
  region: us-central1
  
  # Optional
  location: US  # BigQuery multi-region (US, EU)
  zone: us-central1-a  # Specific zone for compute
  
  # Cost controls
  cost_controls:
    enable_bi_engine: true  # Query acceleration
    bi_engine_gb: 10  # BI Engine cache size
    default_table_expiration_days: 365
    max_bytes_billed: 10000000000  # 10 GB query limit
  
  # Networking
  network:
    vpc: default
    subnet: default
    private_google_access: true
  
  # Labels (applied to all resources)
  labels:
    environment: production
    team: data-engineering
    cost-center: analytics
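`max_bytes_billed` caps how much a single query may scan. To pick a sensible value, it helps to translate the byte budget into an on-demand cost, here assuming the $5-per-TB list price used elsewhere on this page (actual pricing varies by region and edition):

```python
# Translate a max_bytes_billed budget into a worst-case on-demand cost.
# Assumes $5 per TB scanned (list price; varies by region and edition).
PRICE_PER_TB_USD = 5.0

def estimated_query_cost(bytes_scanned: int) -> float:
    """Worst-case on-demand cost in USD for a query scanning `bytes_scanned` bytes."""
    return bytes_scanned / 10**12 * PRICE_PER_TB_USD

# The 10 GB limit above caps any single query at about five cents:
print(f"${estimated_query_cost(10_000_000_000):.4f}")  # → $0.0500
```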

BigQuery Best Practices

Partitioning

Partition tables by date for performance and cost savings:

tables:
  - name: events
    partitioning:
      field: event_timestamp
      type: DAY  # or HOUR, MONTH, YEAR
      require_partition_filter: true  # Enforce partitioned queries
      expiration_days: 90  # Auto-delete old partitions

Cost savings: partition pruning can cut scanned bytes (and on-demand cost) by up to 90% for time-based queries

Clustering

Cluster columns for better query performance:

tables:
  - name: events
    clustering:
      fields: [user_id, event_type, country]  # Max 4 fields

Performance: Up to 10x faster queries on clustered columns

Materialized Views

Pre-compute aggregations:

tables:
  - name: daily_metrics
    materialized: true
    
    query: |
      SELECT 
        DATE(event_timestamp) as date,
        user_id,
        COUNT(*) as event_count,
        SUM(revenue) as total_revenue
      FROM `${project}.events.raw_events`
      GROUP BY date, user_id
    
    # Refresh settings
    refresh:
      enabled: true
      interval_minutes: 60  # Refresh hourly

Benefit: Sub-second queries on complex aggregations


Security & Governance

Column-Level Security

Protect sensitive data with policy tags:

governance:
  policy_tags:
    - taxonomy: data_classification
      tags:
        - name: PII
          description: Personally Identifiable Information
          columns: [email, phone, ssn]
        
        - name: Financial
          description: Financial data
          columns: [salary, credit_card]

tables:
  - name: customers
    schema:
      - name: email
        type: STRING
        policy_tag: PII  # Restricted access
      
      - name: name
        type: STRING  # No policy tag = public

IAM Integration:

# Grant access to PII data
gcloud data-catalog taxonomies add-iam-policy-binding \
  data_classification \
  --member="user:analyst@company.com" \
  --role="roles/datacatalog.categoryFineGrainedReader"
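Before applying a contract, it can be useful to see exactly which columns a taxonomy covers. A small helper (hypothetical, mirroring the shape of the governance block above):

```python
# Hypothetical helper: index columns by policy tag from a governance
# block shaped like the YAML above.
def columns_by_tag(governance: dict) -> dict[str, list[str]]:
    index: dict[str, list[str]] = {}
    for taxonomy in governance.get("policy_tags", []):
        for tag in taxonomy.get("tags", []):
            index.setdefault(tag["name"], []).extend(tag.get("columns", []))
    return index

governance = {
    "policy_tags": [{
        "taxonomy": "data_classification",
        "tags": [
            {"name": "PII", "columns": ["email", "phone", "ssn"]},
            {"name": "Financial", "columns": ["salary", "credit_card"]},
        ],
    }]
}
print(columns_by_tag(governance))
# → {'PII': ['email', 'phone', 'ssn'], 'Financial': ['salary', 'credit_card']}
```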

Data Masking

Automatically mask sensitive data:

governance:
  data_masking:
    - column: email
      masking_type: DEFAULT  # user@example.com → u***@e***.com
      policy_tag: PII
    
    - column: credit_card
      masking_type: SHA256  # One-way hash
      policy_tag: Financial
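The two masking types behave very differently: DEFAULT-style masking stays human-recognizable, while SHA256 is a one-way hash that only preserves equality. A rough illustration of the distinction (the exact mask format the provider applies is not specified here):

```python
import hashlib

def mask_default(email: str) -> str:
    """Roughly mimic a recognizable email mask, e.g. user@example.com → u***@e***.com.
    Illustrative only; not the provider's actual masking routine."""
    local, _, domain = email.partition("@")
    host, _, tld = domain.rpartition(".")
    return f"{local[:1]}***@{host[:1]}***.{tld}"

def mask_sha256(value: str) -> str:
    """One-way hash: equal inputs stay comparable, but the original is unrecoverable."""
    return hashlib.sha256(value.encode()).hexdigest()

print(mask_default("user@example.com"))  # → u***@e***.com
print(len(mask_sha256("4111-1111-1111-1111")))  # → 64 (hex digest length)
```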

Access Control

Define granular permissions:

access:
  dataset_access:
    - role: READER
      members:
        - user:analyst@company.com
        - group:data-analysts@company.com
        - domain:company.com  # Everyone in domain
    
    - role: WRITER
      members:
        - serviceAccount:etl@project.iam.gserviceaccount.com
    
    - role: OWNER
      members:
        - user:data-admin@company.com
  
  table_access:
    - table: customers
      role: READER
      members:
        - user:marketing@company.com

Loading Data

From Cloud Storage

tables:
  - name: sales
    load_from_gcs:
      uri: gs://my-bucket/data/*.csv
      format: CSV
      schema_auto_detect: true
      skip_leading_rows: 1
      allow_jagged_rows: false
      encoding: UTF-8

From Local Files

# Use bq CLI for one-time loads
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  my_dataset.my_table \
  data/file.csv

Streaming Inserts

from google.cloud import bigquery

client = bigquery.Client()
table_id = "project.dataset.table"

rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25}
]

errors = client.insert_rows_json(table_id, rows)
if errors:
    print(f"Insert failed: {errors}")
else:
    print("Rows inserted successfully")

Cost Optimization

Query Optimization

-- ❌ BAD: Scans entire table
SELECT * FROM `project.dataset.events`
WHERE DATE(event_time) = '2026-01-20'

-- ✅ GOOD: Uses partition filter
SELECT * FROM `project.dataset.events`
WHERE event_time >= '2026-01-20'
  AND event_time < '2026-01-21'
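The difference between the two queries is that the second compares the partitioning column directly, so BigQuery can prune partitions instead of scanning the whole table. A helper that builds the pruning-friendly half-open predicate for a given day (hypothetical, pure string construction):

```python
from datetime import date, timedelta

def partition_day_predicate(column: str, day: date) -> str:
    """Half-open date range on the raw partition column, so pruning applies."""
    next_day = day + timedelta(days=1)
    return (f"{column} >= '{day.isoformat()}' "
            f"AND {column} < '{next_day.isoformat()}'")

print(partition_day_predicate("event_time", date(2026, 1, 20)))
# → event_time >= '2026-01-20' AND event_time < '2026-01-21'
```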

Storage Classes

buckets:
  - name: analytics-archive
    storage_class: NEARLINE  # For infrequent access
    
    lifecycle:
      - action: SetStorageClass
        storage_class: COLDLINE
        age_days: 90  # Move to coldline after 90 days
      
      - action: Delete
        age_days: 365  # Delete after 1 year
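Lifecycle rules are evaluated by object age; a quick way to reason about where an object ends up under this policy (a sketch of the policy's effect, not the GCS evaluation engine):

```python
def storage_class_at(age_days: int) -> str:
    """Expected state of an object under the lifecycle policy above."""
    if age_days >= 365:
        return "DELETED"
    if age_days >= 90:
        return "COLDLINE"
    return "NEARLINE"  # the bucket's base storage class

for age in (10, 120, 400):
    print(age, storage_class_at(age))
# → 10 NEARLINE / 120 COLDLINE / 400 DELETED
```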

Cost Monitoring

# Check current month costs
bq query --use_legacy_sql=false \
  'SELECT 
    SUM(total_bytes_processed) / POW(10, 12) as tb_processed,
    SUM(total_bytes_processed) / POW(10, 12) * 5 as estimated_cost_usd
  FROM `region-us`.INFORMATION_SCHEMA.JOBS
  WHERE DATE(creation_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)'

Advanced Features

BigQuery ML

Train models directly in BigQuery:

routines:
  - name: churn_prediction_model
    type: ML_MODEL
    
    training_query: |
      CREATE OR REPLACE MODEL `${project}.${dataset}.churn_model`
      OPTIONS(
        model_type='LOGISTIC_REG',
        input_label_cols=['churned']
      ) AS
      SELECT 
        * EXCEPT(customer_id)
      FROM `${project}.${dataset}.customer_features`

Authorized Views

Share data without granting direct access:

views:
  - name: public_customer_summary
    authorized: true  # View may read source tables the querying user cannot access directly
    
    authorized_datasets:
      - project: partner-project
        dataset: shared_data
    
    query: |
      SELECT 
        customer_id,
        total_purchases,
        avg_order_value
        -- Excludes PII like email, name
      FROM `${project}.${dataset}.customers`

Monitoring

Built-in Metrics

Fluid Forge automatically exports metrics:

monitoring:
  enabled: true
  
  metrics:
    - name: query_performance
      query: |
        SELECT 
          AVG(total_slot_ms) as avg_slot_ms,
          MAX(total_bytes_processed) as max_bytes
        FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
        WHERE DATE(creation_time) = CURRENT_DATE()
    
  alerts:
    - name: high_query_cost
      condition: max_bytes > 10000000000  # 10 GB
      notification: slack://data-team
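Alert conditions like `high_query_cost` compare a collected metric against a threshold. A minimal sketch of that evaluation step (field names follow the config above; the evaluation logic is hypothetical):

```python
# Hypothetical alert evaluation: compare collected metric values against
# the thresholds from the monitoring config above.
ALERTS = [
    {"name": "high_query_cost", "metric": "max_bytes", "threshold": 10_000_000_000},
]

def fired_alerts(metrics: dict[str, float]) -> list[str]:
    """Names of alerts whose metric exceeds its threshold."""
    return [a["name"] for a in ALERTS
            if metrics.get(a["metric"], 0) > a["threshold"]]

print(fired_alerts({"max_bytes": 25_000_000_000}))  # → ['high_query_cost']
print(fired_alerts({"max_bytes": 1_000_000}))       # → []
```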

Troubleshooting

"Access Denied" Errors

Grant yourself BigQuery Admin:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:YOUR_EMAIL" \
  --role="roles/bigquery.admin"

"Quota Exceeded"

Request quota increase:

gcloud services quota list \
  --service=bigquery.googleapis.com \
  --consumer="projects/PROJECT_ID"

Slow Queries

Disable the query cache while benchmarking (BigQuery SQL has no per-query cache option; use the bq CLI flag), then inspect the job's execution details:

# Run without the cache so timings reflect real work
bq query --use_legacy_sql=false --nouse_cache 'YOUR QUERY'

-- View execution details for the job
SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = 'YOUR_JOB_ID'

Limitations

  • Max dataset size: Unlimited
  • Max table size: 10 TB (contact support for larger)
  • Max query size: 100 KB SQL text
  • Max columns: 10,000 per table
  • Max concurrent queries: 100 (can be increased)
  • Query timeout: 6 hours (distributed queries)

Roadmap

Q2 2026

  • Row-Level Security (RLS) policies
  • Dataflow integration
  • Cloud Composer orchestration
  • VPC Service Controls

Q3 2026

  • BigQuery Omni (multi-cloud)
  • Data transfer service automation
  • Advanced BI Engine features
  • Cross-project analytics

Next Steps

  • Getting Started - First GCP deployment
  • GCP Walkthrough - Hands-on tutorial
  • CLI Reference - GCP-specific commands
  • Governance Guide - Security deep-dive

GCP Provider maintained by the Fluid Forge core team

Last Updated: 3/12/26, 1:03 PM
Contributors: khanya_ai