This repository contains the configuration for running hazard modeling scripts as a containerized application on Kubernetes, managed by ArgoCD, and deployed locally using Minikube.
.
├── 01-pet-process-1km.py        # Python processing script
├── 02-gef-chirps-process-1km.py # Python processing script
├── 03-imerg-process-1km.py      # Python processing script
├── data                         # Data directory
│   └── geofsm-input             # Input data
│       ├── PET                  # PET data
│       ├── WGS                  # Shapefile data
│       └── zone_wise_txt_files  # Zone text files
├── utils.py                     # Utility functions
├── Dockerfile                   # Docker image definition
├── requirements.txt             # Python dependencies
├── k8s                          # Kubernetes manifests
│   ├── deployment.yaml          # CronJob definition
│   ├── namespace.yaml           # Namespace definition
│   ├── pvc.yaml                 # Persistent Volume Claims
│   └── kustomization.yaml       # Kustomize configuration
├── argocd                       # ArgoCD configuration
│   └── application.yaml         # ArgoCD Application
└── deploy-local.sh              # Deployment script
- Docker
- Minikube
- kubectl
- Git
- Clone this repository:
  git clone <repository-url>
  cd <repository-directory>
- Make the deployment script executable:
  chmod +x deploy-local.sh
- Run the deployment script:
  ./deploy-local.sh
This will:
- Start Minikube if it's not running
- Build the Docker image
- Configure Kubernetes manifests
- Apply the manifests to create necessary resources
- Install ArgoCD if it's not already installed
- Create the ArgoCD application
- Trigger an initial job run
- Access the ArgoCD UI: the script prints the URL, username, and password for ArgoCD. A condensed sketch of deploy-local.sh follows.
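This sketch is not the actual script; the image tag and the ArgoCD install steps are assumptions inferred from the behavior listed above, while the `hazard-modeling` namespace and CronJob names come from the manifests described in this README.

```bash
#!/usr/bin/env bash
# Condensed sketch of deploy-local.sh; names and flags are illustrative.
set -euo pipefail

minikube status >/dev/null 2>&1 || minikube start   # start Minikube if needed
eval "$(minikube docker-env)"                       # build into Minikube's Docker daemon
docker build -t hazard-modeling:local .

kubectl apply -k k8s/                               # namespace, PVCs, CronJob

# Install ArgoCD if it is not already present, then register the application
kubectl get ns argocd >/dev/null 2>&1 || {
  kubectl create namespace argocd
  kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
}
kubectl apply -f argocd/application.yaml

# Trigger an initial run from the CronJob
kubectl create job --from=cronjob/hazard-modeling initial-run -n hazard-modeling
```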
The hazard modeling job is configured to run daily at midnight. You can modify the schedule in k8s/deployment.yaml:
spec:
  schedule: "0 0 * * *"  # Cron schedule (currently daily at midnight)

You can adjust the CPU and memory requirements in the k8s/deployment.yaml file:
resources:
  requests:
    memory: "2Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "1000m"

The application uses two Persistent Volume Claims:
- hazard-data-pvc: for input data (10Gi)
- hazard-output-pvc: for output data (5Gi)
You can adjust the storage sizes in k8s/pvc.yaml.
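For reference, here is a minimal sketch of what one claim in k8s/pvc.yaml could look like; the storageClassName and metadata details are assumptions (Minikube's default `standard` class is used here), while the name and size come from the list above.

```yaml
# Hypothetical sketch of hazard-data-pvc; adjust `storage` to resize the claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hazard-data-pvc
  namespace: hazard-modeling
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard   # Minikube's default StorageClass
  resources:
    requests:
      storage: 10Gi            # input-data size from the list above
```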
To manually trigger the job:
kubectl create job --from=cronjob/hazard-modeling hazard-modeling-manual -n hazard-modeling

To check logs from the most recent job:
kubectl get pods -n hazard-modeling
kubectl logs <pod-name> -n hazard-modeling

For production use:
- Use a container registry such as Docker Hub, GitHub Container Registry, or a private registry (see the kustomization sketch after this list)
- Set up a CI/CD pipeline to build and push images automatically
- Configure proper secrets management for sensitive data
- Use dedicated persistent storage solutions
- Consider implementing monitoring and alerting
- Set up proper backup and disaster recovery procedures
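When moving to a registry, Kustomize's images transformer can rewrite the image reference without editing deployment.yaml. A sketch, assuming the manifests name the image `hazard-modeling`; the registry path and tag are placeholders.

```yaml
# Hypothetical addition to k8s/kustomization.yaml; names are illustrative.
images:
  - name: hazard-modeling              # image name used in deployment.yaml
    newName: ghcr.io/your-org/hazard-modeling
    newTag: v1.0.0
```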
This project is licensed under the MIT License - see the LICENSE file for details.
Cloud-native pipeline for synchronizing hydrology data (river depth and streamflow) from an FTP server to Google Cloud Storage, using Prefect orchestration with GitHub-based deployment.
- Zero Local Storage: Uses temporary directories with automatic cleanup
- Smart Duplicate Prevention: skips files that already exist in GCS with the same size (see the sketch after this list)
- Google Cloud Storage Upload: Uploads files to GCS with organized folder structure
- GitHub Integration: Direct deployment from repository
- Prefect Cloud Native: Fully managed execution with retry logic
- Container Ready: Docker support for consistent environments
- CI/CD Pipeline: Automated testing and deployment via GitHub Actions
- Comprehensive Logging: Detailed logging for monitoring and debugging
- Scheduled Execution: Runs daily at midnight UTC with manual triggers
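The duplicate check can be as simple as comparing an FTP file's size against the size of the existing GCS blob. A minimal sketch, assuming google-cloud-storage is installed; the helper name and signature are illustrative, not the actual function in ftp_to_gcs_sync.py.

```python
# Hypothetical size-based duplicate check mirroring the feature above.
from google.cloud import storage

def needs_upload(bucket_name: str, blob_path: str, local_size: int) -> bool:
    """Return True unless a blob of identical size already exists in GCS."""
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(blob_path)  # None if absent
    return blob is None or blob.size != local_size
```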
GitHub Repository → Prefect Cloud → Managed Workers
        ↓                 ↓                ↓
   FTP Server  →  Temp Storage  →  GCS Upload  →  Cleanup
        ↓               ↓               ↓            ↓
    Discovery      Processing      Organized     No Files
    & Filter       in Memory        Storage     Left Behind
geosfm/gcs_upload/
├── ftp_to_gcs_sync.py # Main synchronization script
├── prefect.yaml       # Prefect deployment configuration
├── deploy.py          # Deployment automation script
├── Dockerfile         # Container configuration
├── requirements.txt   # Python dependencies
├── README.md          # This file
├── DEPLOYMENT.md      # Detailed deployment guide
└── .env.example       # Environment template
Option 1 - GitHub-based deployment:
- Set up GitHub Secrets in your repository settings:
  PREFECT_API_KEY=your_prefect_api_key
  PREFECT_ACCOUNT_ID=your_account_id
  PREFECT_WORKSPACE_ID=your_workspace_id
  FTP_HOST=your_ftp_server
  FTP_USERNAME=your_username
  FTP_PASSWORD=your_password
  FTP_PATH=/output/
  GCS_BUCKET=your_bucket_name
  GCS_CREDENTIALS={"type":"service_account",...}
- Push to the main branch - GitHub Actions will automatically (see the workflow sketch after these steps):
- Run tests and linting
- Deploy to Prefect Cloud
- Build and push Docker image
- Monitor in the Prefect Cloud UI
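A minimal sketch of the kind of workflow this implies; the file path, action versions, and steps are assumptions, not the repository's actual pipeline.

```yaml
# Hypothetical .github/workflows/deploy.yml; steps are illustrative.
name: deploy-hydrology-pipeline
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r geosfm/gcs_upload/requirements.txt
      - run: python geosfm/gcs_upload/deploy.py
        env:
          PREFECT_API_KEY: ${{ secrets.PREFECT_API_KEY }}
          PREFECT_ACCOUNT_ID: ${{ secrets.PREFECT_ACCOUNT_ID }}
          PREFECT_WORKSPACE_ID: ${{ secrets.PREFECT_WORKSPACE_ID }}
```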
Option 2 - Local deployment:
- Prerequisites:
  pip install "prefect>=3.4.0" google-cloud-storage python-dotenv
  prefect cloud login
- Set up environment:
  cp .env.example .env  # then edit .env with your credentials
- Deploy:
  python deploy.py
| Variable | Description | Example |
|---|---|---|
| `FTP_HOST` | FTP server hostname | `ftp.example.com` |
| `FTP_USERNAME` | FTP username | `hydro_user` |
| `FTP_PASSWORD` | FTP password | `secure_password` |
| `FTP_PATH` | FTP directory path | `/output/` |
| `GCS_BUCKET` | GCS bucket name | `icpac-hydrology-data` |
| `GCS_PREFIX` | GCS folder prefix | `hydrology_data` |
| `GCS_CREDENTIALS` | Service account JSON | `{"type":"service_account",...}` |
The system automatically creates these Prefect variables:
- ftp-host, ftp-username, ftp-password, ftp-path
- gcs-bucket, gcs-prefix, gcs-credentials
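At run time a flow can read these back through the Variables API. A minimal sketch, assuming Prefect 3.x; the flow name is illustrative.

```python
# Hypothetical example of reading the Prefect variables listed above.
from prefect import flow
from prefect.variables import Variable

@flow
def show_config():
    # Returns the stored value, or the default if the variable is unset
    ftp_host = Variable.get("ftp-host", default="")
    print(f"FTP host: {ftp_host}")

if __name__ == "__main__":
    show_config()
```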
Build and run locally:
docker build -t hydrology-pipeline .
docker run --env-file .env hydrology-pipeline

Or use the automatically built GitHub image:
docker pull ghcr.io/igad-icpac/devops-hazard-modeling/hydrology-pipeline:latest

- Scheduled Flow: hydrology-midnight-sync - runs daily at 00:00 UTC (sketched below)
- Manual Flow: hydrology-on-demand - trigger anytime
- Work Pool: geosfm-cloud-pool (managed infrastructure)
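For illustration, the scheduled deployment could be expressed like this in Prefect 3.x; the flow body is a stub, and the real definition lives in ftp_to_gcs_sync.py and prefect.yaml.

```python
# Sketch only: mirrors the deployment names above, not the real implementation.
from prefect import flow, get_run_logger

@flow(name="hydrology-midnight-sync", retries=2, retry_delay_seconds=300)
def hydrology_sync():
    logger = get_run_logger()
    logger.info("Starting FTP -> GCS synchronization")
    # ... discover files on the FTP server, download to a temp dir,
    # upload new/changed files to GCS, then clean up ...

if __name__ == "__main__":
    # Local alternative to the prefect.yaml deployment: serve on the
    # documented schedule (00:00 UTC daily).
    hydrology_sync.serve(name="hydrology-midnight-sync", cron="0 0 * * *")
```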
- Prefect Cloud UI: Real-time flow execution logs
- GitHub Actions: CI/CD pipeline logs
- Local Logs: logs/ directory (development only)
- Files discovered vs downloaded
- Upload success/failure rates
- Duplicate detection efficiency
- Execution duration and resource usage
- Set up GitHub Secrets with your credentials
- Test the pipeline with a manual trigger
- Monitor scheduled runs for the first week
- Set up alerting for failures (optional)
- Backup Strategy: consider GCS lifecycle policies for cost optimization (see the sketch after this list)
- Monitoring: Set up alerts for pipeline failures
- Scaling: Adjust work pool concurrency if needed
- Security: Regular credential rotation
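As one example of such a lifecycle policy, the rule below transitions objects to Coldline after 90 days; the bucket name comes from the table above, and the 90-day threshold is an assumption.

```python
# Hypothetical lifecycle rule via google-cloud-storage; threshold illustrative.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("icpac-hydrology-data")
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.patch()  # persist the updated lifecycle configuration
```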
- Make changes in geosfm/gcs_upload/
- Test locally using Option 2 (the local deployment steps above)
- Create PR - triggers automated testing
- Merge to main - triggers production deployment
- Missing credentials: Check Prefect variables in cloud UI
- FTP connection: Verify firewall and network access
- GCS upload: Validate service account permissions
- Work pool: Ensure geosfm-cloud-pool exists
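To verify the pool from the Prefect CLI (assuming Prefect 3.x; the managed pool type is an assumption about how it was created):

```bash
# List existing work pools; create the managed pool only if it is missing
prefect work-pool ls
prefect work-pool create geosfm-cloud-pool --type prefect:managed
```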
- Check DEPLOYMENT.md for detailed instructions
- Review logs in the Prefect Cloud UI
- Contact: Hillary Koros (hkoros@icpac.net)
- v3.0.0: GitHub integration, temporary storage, container support
- v2.0.0: Prefect Cloud deployment, managed workers
- v1.0.0: Initial FTP to GCS synchronization
Internal use - IGAD-ICPAC