A high-performance, production-grade monitoring engine for HTTP services, inspired by industry leaders like Datadog and UptimeRobot.
This project simulates a lean version of real-world monitoring platforms like Datadog and UptimeRobot. The goal is to demonstrate distributed systems design, incident detection, and observability concepts end to end.
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Monitoring    │     │  BullMQ Queue   │     │ Worker Service  │
│     Engine      │────▶│     (Redis)     │────▶│   (Consumer)    │
│   (Producer)    │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                             ┌───────────────────────────┤
                             │                           │
                   ┌─────────▼──────┐          ┌─────────▼──────┐
                   │    MongoDB     │          │  Notification  │
                   │     Atlas      │          │    Service     │
                   │  (CheckRuns,   │          │    (Resend)    │
                   │    Alerts)     │          │                │
                   └────────────────┘          └────────────────┘
```
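The queue-to-worker hop in the diagram is where retries and dead-lettering happen. The sketch below models that behaviour in plain TypeScript, with no Redis or BullMQ dependency: three attempts, an exponentially growing delay between them, and a "dead-lettered" outcome once attempts are exhausted. The names `runWithRetry`, `backoffDelayMs`, and `RetryOutcome`, the 1000 ms base delay, and the choice to record delays instead of sleeping are all assumptions for illustration, not the project's actual code.

```typescript
// Hypothetical in-memory model of "3 attempts, exponential backoff, then dead letter".
// It records the delays it would wait instead of sleeping, so it runs instantly.

type RetryOutcome = {
  status: "completed" | "dead-lettered";
  attempts: number;     // how many times the handler was invoked
  delaysMs: number[];   // backoff delays that would be applied between attempts
  lastError?: string;   // set only when the job is dead-lettered
};

// Delay after the Nth failed attempt (1-based): base, 2*base, 4*base, ...
function backoffDelayMs(failedAttempt: number, baseMs: number): number {
  return baseMs * 2 ** (failedAttempt - 1);
}

function runWithRetry(
  handler: (attempt: number) => void, // throws to signal a transient failure
  maxAttempts = 3,
  baseMs = 1000,
): RetryOutcome {
  const delaysMs: number[] = [];
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(attempt);
      return { status: "completed", attempts: attempt, delaysMs };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // Only schedule a delay if another attempt will follow.
      if (attempt < maxAttempts) delaysMs.push(backoffDelayMs(attempt, baseMs));
    }
  }
  // All attempts failed: the caller would move the job to the dead letter queue.
  return { status: "dead-lettered", attempts: maxAttempts, delaysMs, lastError };
}
```

With BullMQ itself, this policy is usually expressed through the `attempts` and `backoff` job options instead of a hand-written loop; the loop here only makes the timing visible.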
Deployment:
- Frontend: Vercel (Global Edge Network)
- Backend API: Railway
- Worker: Railway (independent service)
- Database: MongoDB Atlas
- Queue: Redis on Railway
- Secure Auth: JWT-based authentication with stateless session management
- Strict Multi-tenancy: Ownership enforcement at the database level — no data leakage between users
- Health Check Engine: Decoupled background producer that schedules periodic HTTP probes
- Producer/Consumer Pattern: Engine enqueues jobs into BullMQ — completely decoupled from execution
- Job Persistence: Jobs are stored in Redis rather than in process memory, so queued checks survive API and worker restarts
- Retry Logic: Automatic exponential backoff (3 attempts) on transient failures
- Dead Letter Queue: Jobs that exhaust all retries move to a separate queue, so failures are never dropped silently
- Horizontal Scaling: Multiple Worker instances process jobs concurrently; BullMQ's atomic Redis operations and job locks ensure each job is handled by only one worker at a time
- Performance Tracking: Real-time response time and availability metrics per monitor
- Historical Analysis: Time-series check-run storage for long-term reliability reporting
- P95 Response Time: 95th percentile response time calculated via MongoDB aggregation pipeline
- Dashboard Summary: Account-level overview — total monitors, uptime %, alerts today
- Queue Stats API: Live visibility into waiting, active, completed, and failed job counts
- Visual Analytics: Interactive latency and uptime charts using Recharts
- Consecutive Failure Thresholds: DOWN alerts fire only after N consecutive failures, which filters out one-off transient errors
- State Machine Logic:
  - `DOWN`: triggered after N consecutive failures
  - `RECOVERY`: triggered only if a previous DOWN was active
  - `SYSTEM_ERROR`: triggered when a job exhausts all retry attempts
- Email Notifications: Automatic email on DOWN and RECOVERY events via Resend
- Rate Limiting: 100 requests per 15 minutes globally, with a stricter 10 requests per 15 minutes on auth endpoints
- Helmet: Secure HTTP headers on all responses
- CORS: Configurable origin whitelist
- Redis Cache: Monitor list and single monitor responses cached with 30s TTL
- Cache Invalidation: Automatic invalidation on create, update, and delete
- Cache-aside Pattern: Check Redis first, fall back to MongoDB on miss
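The caching bullets above (30s TTL, cache-aside reads, invalidation on writes) can be sketched as follows. This is a minimal illustration, not the project's code: a `Map` stands in for Redis, `loadFromDb` stands in for a MongoDB query, and the class name `CacheAside` and the injectable `now` clock are hypothetical. Real Redis and MongoDB calls would be asynchronous; the sketch is synchronous to keep the control flow easy to follow.

```typescript
// Cache-aside sketch: check the cache first, fall back to the database on a miss,
// store the result with a TTL, and drop the entry whenever the record changes.

type CacheEntry<T> = { value: T; expiresAt: number };

class CacheAside<T> {
  private store = new Map<string, CacheEntry<T>>(); // stand-in for Redis
  private loadFromDb: (key: string) => T;           // stand-in for a MongoDB query
  private ttlMs: number;
  private now: () => number;

  constructor(
    loadFromDb: (key: string) => T,
    ttlMs = 30_000,                 // 30s TTL, matching the feature list
    now: () => number = Date.now,   // injectable clock so tests need no real waiting
  ) {
    this.loadFromDb = loadFromDb;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): T {
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value; // fresh cache hit

    const value = this.loadFromDb(key);                      // miss or expired entry
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Call after create, update, or delete so the next read goes back to the database.
  invalidate(key: string): void {
    this.store.delete(key);
  }
}
```

The short TTL bounds how stale a response can be if an invalidation is ever missed, while explicit invalidation keeps normal writes visible immediately.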
| Layer | Technologies |
|---|---|
| Frontend | React 18, Vite, Tailwind CSS, React Query, Recharts |
| Backend | Node.js, TypeScript, Express.js |
| Queue | BullMQ, Redis |
| Cache | Redis (IORedis) |
| Database | MongoDB (Mongoose) |
| Email | Resend |
| DevOps | Docker, Docker Compose, GitHub Actions |
| Deployment | Vercel, Railway, MongoDB Atlas |
```
apps/
├── api/                    # Node.js TypeScript Service
│   ├── engine/             # Monitoring Engine (Producer)
│   ├── queue/              # BullMQ Queue definition and helpers
│   ├── worker/             # BullMQ Worker (Consumer) + Dead Letter Queue
│   ├── notifications/      # Email notification service (Resend)
│   ├── config/             # Redis, Cache, DB, Env configuration
│   ├── middleware/         # Auth, Rate Limiting, Error handling
│   ├── modules/            # Domain-driven modules
│   │   ├── auth/           # JWT authentication
│   │   ├── monitors/       # Monitor CRUD + caching
│   │   ├── alerts/         # Alert management
│   │   ├── checkruns/      # Check history + summary stats
│   │   └── dashboard/      # Aggregated dashboard metrics
│   └── __tests__/          # 31 integration tests
└── web/                    # React Dashboard
    ├── ui/                 # Reusable Tailwind Components
    └── hooks/              # Custom React Query hooks
```
- Node.js ≥ 20
- Docker Desktop
```bash
# Clone the repo
git clone https://github.com/Aliromia21/monitoring-platform.git
cd monitoring-platform

# Create production env file
cp apps/api/.env.example apps/api/.env.production
# Edit .env.production with your secrets

# Start everything
docker compose up --build
```

This starts MongoDB, Redis, Redis Commander, API, and Worker, all connected and healthy.
```bash
# Install root dependencies
npm install

# Terminal 1: API + Engine
cd apps/api && npm run dev

# Terminal 2: Worker (optional, for horizontal scaling)
cd apps/api && npm run dev:worker

# Frontend
cd apps/web && npm run dev
```

```env
NODE_ENV=development
PORT=3001
MONGODB_URI=your_mongodb_uri
JWT_SECRET=your_jwt_secret
JWT_EXPIRES_IN=7d
REDIS_URL=redis://localhost:6379
RESEND_API_KEY=your_resend_api_key
SMTP_FROM=Monitoring Platform <onboarding@resend.dev>
```

| Method | Endpoint | Description |
|---|---|---|
| POST | `/auth/register` | Register new user |
| POST | `/auth/login` | Authenticate user |
| GET | `/monitors` | List all monitors (cached) |
| POST | `/monitors` | Create monitor |
| GET | `/monitors/:id` | Get monitor (cached) |
| PUT | `/monitors/:id` | Update monitor |
| DELETE | `/monitors/:id` | Delete monitor |
| GET | `/monitors/:id/checks` | Check run history |
| GET | `/monitors/:id/summary` | Monitor stats + P95 |
| GET | `/alerts` | List alerts |
| GET | `/dashboard/summary` | Account overview |
| GET | `/queue/stats` | Queue health |
| GET | `/health` | API health check |
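The summary endpoint reports a P95 response time, which the project computes in a MongoDB aggregation pipeline. For readers unfamiliar with the statistic, the sketch below computes the same kind of value in plain TypeScript using the nearest-rank method. The function name `percentile` and the choice of nearest-rank (as opposed to an interpolating definition) are assumptions; the pipeline may use a different definition and give slightly different numbers on small samples.

```typescript
// Nearest-rank percentile: sort the samples and take the value at
// rank ceil(p/100 * n), counting from 1. Returns null when there are no samples.
function percentile(values: number[], p: number): number | null {
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  if (values.length === 0) return null;

  const sorted = [...values].sort((a, b) => a - b); // copy so the input is not mutated
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical response times in milliseconds for one monitor.
const responseTimesMs = [120, 80, 100, 500, 90];
const p95 = percentile(responseTimesMs, 95); // the slowest sample dominates a small set
const p50 = percentile(responseTimesMs, 50);
```

P95 is useful next to the average because a single slow outlier, like the 500 ms sample here, shows up in the tail without being smoothed away.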
The system is built with a test-first mindset.
- 31 integration tests — all passing on every push
- Worker Tests: Dependency injection pattern — no real HTTP requests
- Queue Tests: Real Redis integration — tests actual job enqueueing
- Alert Rule Tests: Deterministic unit tests for all state machine transitions
- CI/CD: GitHub Actions runs full test suite on every push to main
- Coverage: Tracked via Codecov on every PR
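The alert rules are easy to test deterministically because they can be written as a pure function of the previous state and the latest check result. The sketch below is one way to model the DOWN and RECOVERY transitions; the names `applyCheck`, `replay`, and `MonitorState` are hypothetical, and emitting DOWN only once per incident is an assumption. SYSTEM_ERROR is left out because it is raised by retry exhaustion in the worker, not by check results.

```typescript
type CheckResult = "ok" | "fail";
type AlertEvent = "DOWN" | "RECOVERY" | null;
type MonitorState = { consecutiveFailures: number; downActive: boolean };

// Pure transition: given the previous state and one check result,
// return the next state and the alert event to emit (if any).
function applyCheck(
  state: MonitorState,
  result: CheckResult,
  threshold: number,
): { state: MonitorState; event: AlertEvent } {
  if (result === "fail") {
    const consecutiveFailures = state.consecutiveFailures + 1;
    // Fire DOWN only when the threshold is reached and no DOWN is already active.
    const fireDown = !state.downActive && consecutiveFailures >= threshold;
    return {
      state: { consecutiveFailures, downActive: state.downActive || fireDown },
      event: fireDown ? "DOWN" : null,
    };
  }
  // A successful check resets the counter; RECOVERY fires only after an active DOWN.
  return {
    state: { consecutiveFailures: 0, downActive: false },
    event: state.downActive ? "RECOVERY" : null,
  };
}

// Helper for tests: feed a sequence of results and collect the emitted events.
function replay(results: CheckResult[], threshold: number): AlertEvent[] {
  let state: MonitorState = { consecutiveFailures: 0, downActive: false };
  return results.map((result) => {
    const next = applyCheck(state, result, threshold);
    state = next.state;
    return next.event;
  });
}
```

With a threshold of 3, replaying `fail, fail, fail, fail, ok, ok` yields one DOWN on the third failure and one RECOVERY on the first success, while `fail, fail, ok, fail` yields no events at all.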
Ali Romia — Software Engineer
- GitHub: github.com/Aliromia21
- LinkedIn: linkedin.com/in/aliromia
MIT License © Ali Romia 2026