This project implements a data pipeline using Prefect for workflow orchestration, running on Google Kubernetes Engine (GKE). The infrastructure is managed with Terraform, and deployments are automated through GitHub Actions.
```mermaid
graph TD
    %% Data Source
    DS[NYC Taxi Data Source] -->|Download Parquet| PF[Prefect Flow]

    subgraph "GitHub & CI/CD"
        GH[GitHub Repository] -->|Push| GA[GitHub Actions]
        GA -->|Build & Push| GCR[Google Container Registry]
        GA -->|Deploy| TF[Terraform]
    end

    subgraph "Google Cloud Platform"
        %% Kubernetes Cluster
        subgraph "GKE Cluster"
            subgraph "data-pipeline namespace"
                PS[Prefect Server] -->|Create Jobs| PW[Prefect Worker]
                PW -->|Run| PF
                PF -->|Store Raw| GCS
                PF -->|Store Processed| GCS2[GCS Processed]
                PF -->|Load| BQ[BigQuery]
                PF -.->|Future Use| SQL[Cloud SQL]
            end
        end

        %% GCP Services
        GCS[GCS Raw]
        BQ
        SQL

        %% Looker Integration
        BQ -->|Query Data| LK[Looker]
        LK -->|Visualize| DB[Dashboards]
    end

    %% Infrastructure Management
    TF -->|Create| GCS
    TF -->|Create| BQ
    TF -->|Create| SQL

    %% Styling
    classDef gcp fill:#4285F4,stroke:#333,stroke-width:2px,color:white;
    classDef k8s fill:#326CE5,stroke:#333,stroke-width:2px,color:white;
    classDef github fill:#24292E,stroke:#333,stroke-width:2px,color:white;
    classDef flow fill:#00DB8B,stroke:#333,stroke-width:2px,color:black;
    classDef visualization fill:#FF69B4,stroke:#333,stroke-width:2px,color:white;

    class GCS,GCS2,BQ,SQL gcp;
    class PS,PW k8s;
    class GH,GA github;
    class PF flow;
    class LK,DB visualization;
```
```
.
├── .github/
│   └── workflows/
│       ├── docker-build-push.yaml      # CI/CD for Docker image
│       ├── infrastructure.yaml         # Infrastructure deployment
│       └── terraform.yaml              # Terraform automation
│
├── pipeline-project/
│   ├── config/                         # Configuration files
│   │
│   ├── docs/                           # Project documentation
│   │   ├── PrefectSetup.md
│   │   ├── README.md
│   │   └── Terraform.md
│   │
│   ├── k8s/                            # Kubernetes configurations
│   │   └── base/
│   │       ├── cloudsql-proxy.yaml
│   │       ├── cloudsql-secret.yaml
│   │       ├── config.yaml
│   │       ├── connection-test-pod.yaml
│   │       ├── kustomization.yaml
│   │       ├── namespace.yaml
│   │       ├── prefect-rbac.yaml
│   │       ├── prefect-server.yaml
│   │       ├── prefect-worker.yaml
│   │       └── taxi-data-processing-job.yaml
│   │
│   ├── src/
│   │   └── processing/                 # Data processing code
│   │       ├── flows/
│   │       │   ├── deploy.py           # Prefect deployment script
│   │       │   └── taxi_data_flow.py   # Main data flow
│   │       ├── Dockerfile
│   │       ├── requirements.txt
│   │       └── tests/
│   │           └── test_connections.py
│   │
│   └── terraform/                      # Infrastructure as Code
│       ├── .terraform/
│       ├── modules/
│       │   ├── bigquery/               # BigQuery setup
│       │   ├── cloudsql/               # Cloud SQL setup
│       │   └── storage/                # GCS setup
│       ├── environments/
│       └── main.tf
```
- Prefect Server
  - Orchestrates workflows
  - Provides a UI for running and monitoring flows
  - Runs in the `data-pipeline` namespace
- Prefect Worker
  - Executes workflow tasks
  - Handles job creation in Kubernetes
  - Runs in the `data-pipeline` namespace
- Data Flow (see the sketch after this list)
  - Downloads taxi data
  - Processes data using Pandas
  - Uploads to GCS
  - Loads into BigQuery
- Terraform
  - Provisions and maintains our GCP resources through code
  - Can be triggered manually via GitHub Actions
  - The pipeline also runs automatically on changes in the `terraform/` directory
- Kubernetes (GKE)
  - Orchestrates our data pipeline components (Prefect server, worker)
  - Configurations are automatically validated through GitHub Actions
  - Resources are defined as code in the `k8s/base/` directory and applied via kubectl
  - Provides a scalable and maintainable way to run our containerized applications
- GitHub & GitHub Actions
  - Pipeline for Terraform infrastructure deployment
  - Pipeline for validating Kubernetes configurations
  - Pipeline for building and pushing Docker images to Google Container Registry
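To make the Data Flow component concrete, here is a minimal sketch of what a Prefect flow of this shape could look like. It is an illustration, not the project's actual `taxi_data_flow.py`: the source URL, bucket names, and table ID are placeholders.

```python
# Illustrative sketch only -- the bucket names, table ID, and URL are
# placeholders, not the values used by the real taxi_data_flow.py.
import io

import pandas as pd
import requests
from google.cloud import bigquery, storage
from prefect import flow, task


@task
def download_taxi_data(url: str) -> pd.DataFrame:
    # NYC taxi data is published as Parquet files
    resp = requests.get(url, timeout=120)
    resp.raise_for_status()
    return pd.read_parquet(io.BytesIO(resp.content))


@task
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    # Example Pandas transformation: drop rows with missing fares
    return df.dropna(subset=["fare_amount"])


@task
def upload_to_gcs(df: pd.DataFrame, bucket: str, blob_name: str) -> None:
    data = df.to_parquet()  # returns bytes when no path is given
    storage.Client().bucket(bucket).blob(blob_name).upload_from_string(data)


@task
def load_to_bigquery(df: pd.DataFrame, table_id: str) -> None:
    # Blocks until the BigQuery load job finishes
    bigquery.Client().load_table_from_dataframe(df, table_id).result()


@flow
def taxi_data_flow() -> None:
    raw = download_taxi_data(
        "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"
    )
    upload_to_gcs(raw, "raw-bucket", "raw/yellow_tripdata_2024-01.parquet")
    processed = process_data(raw)
    upload_to_gcs(processed, "processed-bucket", "processed/yellow_tripdata_2024-01.parquet")
    load_to_bigquery(processed, "project.dataset.taxi_trips")


if __name__ == "__main__":
    taxi_data_flow()
```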
```mermaid
graph TD
    subgraph GKE-Cluster[GKE Cluster Info]
        subgraph Cluster-Details[Cluster Details]
            CN[Name: default-pool]
            CV[Version: 1.30.8-gke.1051000]
            CM[Machine: e2-medium]
            CN2[Nodes: 2]
            CAS[Autoscaling: 1-3 nodes per zone]
        end
        subgraph Namespace[data-pipeline namespace]
            D1[Deployment: cloudsql-proxy] --> RS1[ReplicaSet: cloudsql-proxy]
            D2[Deployment: prefect-server] --> RS2[ReplicaSet: prefect-server]
            D3[Deployment: prefect-worker] --> RS3[ReplicaSet: prefect-worker]
            RS1 --> P1[Pod: cloudsql-proxy]
            RS2 --> P2[Pod: prefect-server]
            RS3 --> P3[Pod: prefect-worker]
            S1[Service: cloudsql-proxy] --> P1
            S2[Service: prefect-server] --> P2
            P3 --> J1[Job: abstract-badger-2d5xr]
            J1 --> FP1[Pod: abstract-badger-2d5xr-b4fg2]
        end
    end

    style GKE-Cluster fill:#e6e6e6,stroke:#2c3e50,stroke-width:2px,color:#2c3e50
    style Cluster-Details fill:#d4e6f1,stroke:#2c3e50,stroke-width:1px,color:#2c3e50
    style Namespace fill:#eaecee,stroke:#2c3e50,stroke-width:2px,color:#2c3e50

    classDef clusterInfo fill:#5499c7,stroke:#2c3e50,color:white
    classDef deployment fill:#2471a3,stroke:#1a5276,color:white
    classDef replicaset fill:#27ae60,stroke:#196f3d,color:white
    classDef pod fill:#f4d03f,stroke:#b7950b,color:black
    classDef service fill:#8e44ad,stroke:#6c3483,color:white
    classDef job fill:#c0392b,stroke:#922b21,color:white
    classDef completedPod fill:#595959,stroke:#333333,color:white

    %% Darken all arrows
    linkStyle default stroke:#2c3e50,stroke-width:2px

    class CN,CV,CM,CN2,CAS clusterInfo
    class D1,D2,D3 deployment
    class RS1,RS2,RS3 replicaset
    class P1,P2,P3 pod
    class S1,S2 service
    class J1 job
    class FP1 completedPod
```
The project uses GitHub Actions for:
- Building and pushing Docker images
- Deploying infrastructure changes
- Running static checks on Kubernetes YAML files
Workflows are triggered on:
- Push to main branch
- Pull requests
- Manual triggers
Environment variables are managed through Kubernetes ConfigMaps and Secrets in the `data-pipeline` namespace.
- Kubernetes service account: `prefect-worker`
- GCP service account with roles:
  - Storage Admin
  - BigQuery Data Editor
  - Cloud SQL Client
  - GCR User
- GCP credentials stored as Kubernetes secrets
- Database credentials managed through secrets
- Secret mounting handled via Kubernetes volumes
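In practice this means the flow code rarely handles keys directly: the ConfigMap surfaces as environment variables and the secret as a mounted key file. A minimal sketch of how a pod might consume them (the variable name and mount path here are assumptions, not taken from the repo):

```python
# Sketch of consuming ConfigMap values and the mounted service account key.
# GCS_BUCKET and the /secrets/gcp/key.json mount path are assumed names.
import os

from google.cloud import storage

# Injected from the ConfigMap in the data-pipeline namespace (assumed name)
gcs_bucket = os.environ.get("GCS_BUCKET", "raw-bucket")

# Client libraries read the key file pointed to by GOOGLE_APPLICATION_CREDENTIALS,
# which the deployment sets to the secret's mount path.
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "/secrets/gcp/key.json")

client = storage.Client()  # picks up the mounted credentials automatically
print([blob.name for blob in client.list_blobs(gcs_bucket)])
```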
- Google Cloud Account
- GitHub account with necessary permissions
- The following tools installed locally for development:
- Google Cloud SDK
- kubectl
- Git
- Install the Google Cloud SDK
- Run `gcloud version` to verify the Google Cloud SDK is installed
- Run `gcloud components install kubectl`
- Run `gcloud auth login`
- Run `gcloud container clusters get-credentials cloud-computing-cluster --zone us-central1-c --project teak-gamma-442315-f8`

You can now use Kubernetes on our cluster from your local shell.
```bash
# Apply base configurations
kubectl apply -k k8s/base
```

- Fork and clone the repository:

  ```bash
  git clone https://github.com/ero67/Cloud-Computing-Project.git
  cd pipeline-project
  ```

- Set up GitHub Secrets:
  - `GCP_SA_KEY`: Your Google Cloud service account key
  - `TF_VAR_db_password`: Database password for Terraform
  - `GITHUB_TOKEN`: For GitHub Actions
The infrastructure is automatically deployed through GitHub Actions when changes are pushed to `main`. The workflow:
- Validates Terraform configurations
- Plans infrastructure changes
- Applies changes automatically on the `main` branch
To trigger manual deployment:
- Go to GitHub Actions tab
- Select "Terraform CI/CD"
- Click "Run workflow"
The data pipeline code is automatically built and deployed when changes are pushed to the `src/processing` directory:
- GitHub Actions builds the Docker image
- Pushes it to Google Container Registry
- Updates the Kubernetes deployments
To monitor deployment:
- Check GitHub Actions status
- Verify the image in GCR: `gcloud container images list-tags gcr.io/teak-gamma-442315-f8/taxi-flow`
- Connect to the cluster:

  ```bash
  gcloud container clusters get-credentials cloud-computing-cluster --zone us-central1-c --project teak-gamma-442315-f8
  ```

- Access the Prefect UI:

  ```bash
  kubectl port-forward svc/prefect-server 4200:4200 -n data-pipeline
  ```

  Open http://localhost:4200 in your browser.

- In the Prefect UI:
  - Go to the "Work Pools" tab
  - Click "+" to create a new work pool
  - Name: "k8s-pool"
  - Type: select "Kubernetes"

- Configure the work pool:
  - Set Namespace: "data-pipeline"
  - Set Service Account Name: "prefect-worker"
  - Set Image: "gcr.io/teak-gamma-442315-f8/taxi-flow:latest"
  - Set Image Pull Policy: "Always"

  A simpler alternative is to copy the contents of the `pipeline-project/config/workpool-backup` file into the work pool's base configuration under the advanced settings.
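For reference, `src/processing/flows/deploy.py` registers the flow against this work pool. In recent Prefect releases such a registration looks roughly like the sketch below; treat it as an approximation, since the actual script may differ:

```python
# Approximate sketch of a deployment script targeting the k8s-pool work pool.
# Assumes the flow object from taxi_data_flow.py; the real deploy.py may differ.
from taxi_data_flow import taxi_data_flow

if __name__ == "__main__":
    taxi_data_flow.deploy(
        name="taxi-data-flow",
        work_pool_name="k8s-pool",
        image="gcr.io/teak-gamma-442315-f8/taxi-flow:latest",
        build=False,  # the image is already built and pushed by GitHub Actions
        push=False,
    )
```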
- Deploy the flow:

  ```bash
  kubectl apply -f k8s/base/taxi-data-processing-job.yaml
  ```

- Monitor the flow deployment:

  ```bash
  kubectl logs -f job/taxi-data-flow-job -n data-pipeline
  ```

- Run the flow in the Prefect UI:
  - Go to "Deployments"
  - Find "taxi-data-flow"
  - Click "Run"
  - Monitor execution in the "Flow runs" tab

- Verify flow execution (a verification sketch follows this list):
  - Check flow logs in the Prefect UI
  - Verify data in the GCS bucket
  - Check BigQuery for loaded data
- The flow runs when:
  - It is triggered manually in the Prefect UI
  - A schedule fires (if the deployment is configured with one)

- Monitor ongoing operations:
  - Flow runs in the Prefect UI
  - Kubernetes pod status: `kubectl get pods -n data-pipeline`
  - Worker status: `kubectl logs -f -l app=prefect-worker -n data-pipeline`
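For the verification step above, a short script can confirm that both targets were written. The bucket and table names below are placeholders for the project's actual ones:

```python
# Sketch: confirm the flow's outputs landed in GCS and BigQuery.
# "processed-bucket" and the table ID are placeholders.
from google.cloud import bigquery, storage

blobs = list(storage.Client().list_blobs("processed-bucket", prefix="processed/"))
print(f"{len(blobs)} processed objects in GCS")

query = "SELECT COUNT(*) AS n FROM `project.dataset.taxi_trips`"
row = next(iter(bigquery.Client().query(query).result()))
print(f"{row.n} rows loaded into BigQuery")
```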
Users can now freely use the data stored in BigQuery for visualization, for example with Looker.

Looker Dashboard
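For example, a dashboard tile could be built on an aggregate like the following (assuming standard NYC yellow taxi columns; the table ID is again a placeholder):

```python
# Sketch: the kind of aggregate a Looker dashboard tile might be built on.
from google.cloud import bigquery

query = """
    SELECT DATE(tpep_pickup_datetime) AS day,
           COUNT(*) AS trips,
           ROUND(AVG(fare_amount), 2) AS avg_fare
    FROM `project.dataset.taxi_trips`
    GROUP BY day
    ORDER BY day
"""
for row in bigquery.Client().query(query).result():
    print(row.day, row.trips, row.avg_fare)
```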
- Create a new branch:

  ```bash
  git checkout -b feature/new-feature
  ```

- Make changes:
  - Infrastructure changes in `terraform/`
  - Pipeline code in `src/processing/`
  - Kubernetes configs in `k8s/`

- Push changes:

  ```bash
  git push origin feature/new-feature
  ```

- Create a pull request. GitHub Actions will automatically:
  - Validate Kubernetes configurations
  - Run Terraform plan
  - Build and test the Docker image
- Pipeline Issues: check pod logs with `kubectl logs -n data-pipeline <pod-name>`

- Work Pool Issues:
  - Verify the work pool configuration in the UI
  - Check worker logs: `kubectl logs -f -l app=prefect-worker -n data-pipeline`
  - Verify the service account and secrets

- Flow Run Issues:
  - Check flow run logs in the Prefect UI
  - Verify GCP credentials mounting (see the snippet after this list)
  - Check for permission issues in GCS/BigQuery

- GitHub Actions Failures:
  - Check the Actions tab for detailed logs
  - Verify secrets are properly set
  - Check repository permissions
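For the credentials check under "Flow Run Issues", a quick way to see which identity a pod's mounted key resolves to is a short `google-auth` script run inside the affected pod:

```python
# Sketch: run inside a pod to check that the mounted GCP credentials resolve.
import google.auth
from google.auth.transport.requests import Request

credentials, project = google.auth.default()
credentials.refresh(Request())  # fails loudly if the key is missing or invalid
print("Project:", project)
print("Identity:", getattr(credentials, "service_account_email", "unknown"))
```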
In this project we successfully implemented an automated data pipeline in the cloud.
Since the idea of the project was clear from the start, our task was to figure out how to implement each part of the project and how to integrate them together while applying Cloud Computing best practices.
We decided to host the essential component of our project, Prefect, in GKE (Google Kubernetes Engine).
This is where we applied the first technology we learned in this course.
We needed to figure out how to run the Prefect server and Prefect worker/agent in our Kubernetes cluster.
We decided to run a self-hosted Prefect server since it avoids an external dependency, Prefect Cloud.
After a fair amount of research and debugging, we managed to configure both services, the Prefect worker and the Prefect server, to communicate with each other.
The next challenge was setting up our infrastructure in a repeatable and maintainable way. We chose Terraform for this task as it aligned well with the course material and industry best practices.
We had to understand how to properly structure our Terraform code to manage different GCP resources, and which modules to use. This led us to organize our code into modules for the different services (BigQuery, Cloud SQL, Storage).
To maintain good development practices, we implemented automated deployment pipelines using GitHub Actions.
We needed to figure out how to:
- Automatically build and push our Docker images
- Validate Kubernetes configurations
- Apply Terraform changes safely
- Handle secrets and credentials securely
Setting up the pipeline for building and pushing Docker images was fairly straightforward, since it is a common type of CI/CD pipeline.
For validating Kubernetes configurations we used kubeval and kubeconform.
We also created a pipeline for applying Terraform changes and configuration. This is a great way to automate the creation of all the essential service instances for our project.
We had to figure out how to deploy our flows to our Prefect server and how to properly run them.
After researching Prefect's Kubernetes infrastructure, we implemented the flow as follows:
- Initial deployment of the flow using a Kubernetes YAML file
- After deployment is done, the user runs the flow from the Prefect server UI running in our Kubernetes pod
- The worker processes the run request and creates a temporary Kubernetes job which runs the flow
- The flow is configured to always pull the image from our GCR container registry and runs in the temporary pod created by the job
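Because each run materializes as a short-lived Kubernetes job, the runs can be inspected with `kubectl get jobs -n data-pipeline`, or programmatically with the official `kubernetes` Python client, as in this sketch:

```python
# Sketch: list the temporary jobs the Prefect worker creates for flow runs.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
for job in client.BatchV1Api().list_namespaced_job("data-pipeline").items:
    state = "complete" if job.status.succeeded else "running/failed"
    print(job.metadata.name, state)
```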
- Kubernetes & Prefect Integration
  - Self-hosted Prefect requires careful configuration of worker and server communication
  - Proper roles for service accounts and careful secret management are crucial for secure operation
  - Work pool configuration is essential for successful flow execution
- Infrastructure Management
  - Terraform automation through GitHub Actions ensures consistent deployments
  - Breaking infrastructure into modules (storage, database, compute) improves maintainability
- CI/CD Pipeline
  - Automated validation prevents misconfiguration
  - Regular testing of infrastructure changes reduces deployment issues
  - Keeping secrets secure while maintaining automation requires careful planning
Decisions that worked well:
- Using GKE for orchestration
- Implementing infrastructure as code
- Automating deployments with GitHub Actions
- Self-hosting Prefect for better control

Possible future improvements:
- Implement monitoring (e.g., via Grafana)
- A CI/CD pipeline for running Kubernetes commands and deployments could be added
- More testing
- Automatic work pool creation via Kubernetes