GraphRank is a high-performance distributed systems project designed to simulate the "social brain" of professional networking platforms. It processes millions of data points to identify influencers, detect communities, and rank content feeds in real-time.
In modern social networks, "who you know" and "what you see" are determined by massive graph computations. GraphRank tackles this challenge by combining big data engineering with graph theory.
The system processes:
- 100,000+ Synthetic Users
- 1,000,000+ Social Edges (Connections)
- 500,000+ Interaction Logs (Likes, Shares, Comments)
| Category | Technology |
|---|---|
| Data Processing | PySpark (Distributed Batch Processing) |
| Backend API | FastAPI (Asynchronous Python) |
| Graph Logic | NetworkX & Custom Adjacency Optimizations |
| Databases | PostgreSQL (Structured), Redis (Caching Layer) |
| DevOps | Docker, Docker Compose |
We calculate user "importance" with a multi-factor weighted formula that blends graph centrality (PageRank) with engagement signals.
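The formula itself is not reproduced here, so the sketch below is an assumption: hypothetical weights combining a PageRank score with log-damped engagement counts, just to show the shape of a multi-factor score.

```python
import math

# Hypothetical per-interaction weights; the real formula and its
# coefficients are project-specific and not stated in this README.
def influence_score(pagerank, likes, shares, comments):
    engagement = 1.0 * likes + 3.0 * shares + 2.0 * comments
    # log1p damps raw counts so whales don't dominate purely on volume
    return 0.7 * pagerank + 0.3 * math.log1p(engagement)

print(influence_score(pagerank=0.8, likes=120, shares=40, comments=25))
```

The log damping is one common choice; a linear or normalized blend works the same way structurally.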
Posts are ranked dynamically with a time-decay model, so recent, high-engagement content surfaces first.
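The decay constant is not specified in this README; a minimal sketch of exponential time decay, assuming a 24-hour half-life, looks like:

```python
def feed_score(engagement, age_hours, half_life_hours=24.0):
    """Exponential time decay: the score halves every `half_life_hours`.
    The actual half-life GraphRank uses is an assumption here."""
    return engagement * 0.5 ** (age_hours / half_life_hours)

# A day-old post keeps half of its engagement-based score.
print(feed_score(100, 24))  # 50.0
```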
"People You May Know" is driven by:
- Jaccard Similarity: To find overlap in mutual connections.
- Community Detection: Using the Louvain Algorithm to identify industry clusters.
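The Jaccard step above reduces to simple set arithmetic over each user's connection set; a sketch:

```python
def jaccard(friends_a: set, friends_b: set) -> float:
    """Overlap in mutual connections: |A ∩ B| / |A ∪ B|."""
    if not friends_a and not friends_b:
        return 0.0
    return len(friends_a & friends_b) / len(friends_a | friends_b)

alice = {"bob", "carol", "dave"}
eve = {"carol", "dave", "frank"}
print(jaccard(alice, eve))  # 2 shared / 4 total = 0.5
```

For the community side, recent versions of NetworkX ship a `louvain_communities` function that can drive the clustering step directly.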
- Synthetic Layer: Generates pseudo-realistic social data.
- Spark Layer: Aggregates raw logs and builds the weighted graph.
- Graph Engine: Calculates PageRank and clustering coefficients.
- API Layer: Serves ranked feeds and recommendations via REST endpoints.
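The Graph Engine's PageRank step can be illustrated with a plain power-iteration sketch; this is a simplified stand-in, not the project's NetworkX/custom-adjacency implementation:

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power iteration over an adjacency dict {node: [out-neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in adj.items():
            if not outs:  # dangling node: spread its rank uniformly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u in outs:
                    new[u] += damping * rank[v] / len(outs)
        rank = new
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # "b" collects the most link mass here
```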
- Algorithm Design: Lead on PageRank implementation and similarity metrics.
- Feature Engineering: Defining interaction weights and engagement scoring.
- Validation: Evaluating model performance using Precision@K and Recall.
- System Infrastructure: Dockerization, Redis integration, and API architecture.
- Data Pipelines: Optimizing PySpark jobs for large-scale joins and broadcast variables.
- Latency Optimization: Ensuring sub-150ms response times for feed generation.
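The interaction-weighting step under Feature Engineering can be sketched as a simple aggregation from raw logs to weighted edges; the weights below are illustrative assumptions, not the tuned values:

```python
from collections import defaultdict

# Hypothetical weights per interaction type (tuned values not given here).
WEIGHTS = {"like": 1.0, "comment": 2.0, "share": 3.0}

def edge_weights(interactions):
    """Aggregate raw interaction logs into weighted graph edges.
    `interactions` is an iterable of (src_user, dst_user, kind) tuples."""
    weights = defaultdict(float)
    for src, dst, kind in interactions:
        weights[(src, dst)] += WEIGHTS.get(kind, 0.0)
    return dict(weights)

logs = [("u1", "u2", "like"), ("u1", "u2", "share"), ("u2", "u3", "comment")]
print(edge_weights(logs))  # {("u1", "u2"): 4.0, ("u2", "u3"): 2.0}
```

In the actual pipeline this aggregation runs as a PySpark job; the logic per edge is the same.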
1. **Clone the repo:**

```bash
git clone <repo-url>   # repository URL omitted in this README
cd graphrank
```

2. **Spin up the environment:**

```bash
docker-compose up --build
```

3. **Run the Spark pipeline:**

```bash
docker exec -it graphrank_spark spark-submit /jobs/process_graph.py
```
- Scale: Support up to 1.5M interaction records.
- Latency: < 150ms for recommendation API calls.
- Efficiency: 40% reduction in processing time through Spark parallelization.
This project is licensed under the MIT License - see the LICENSE file for details.
Base URL (Local Sandbox): http://localhost:8000
- `GET /api/health`: returns pipeline and API status.
- `GET /api/top-influencers?limit=10`: queries PostgreSQL for users sorted by `pagerank_score`.
- `GET /api/recommendations/{user_id}`: runs the "People You May Know" logic, with the Redis cache layer serving predictions in under 10 ms.
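The sub-10 ms recommendation path is a cache-aside pattern. A minimal sketch, using an in-memory stand-in with a Redis-like `get`/`setex` shape (the real service calls an actual Redis client):

```python
import json

class FakeRedis:
    """Dict-backed stand-in with a Redis-like get/setex interface."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def setex(self, key, ttl, value):
        self.store[key] = value  # TTL ignored in this sketch

cache = FakeRedis()

def recommendations(user_id, compute):
    """Cache-aside: serve from cache if present, else compute and store."""
    key = f"rec:{user_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = compute(user_id)
    cache.setex(key, 3600, json.dumps(result))  # 1-hour TTL, illustrative
    return result

print(recommendations("42", lambda uid: ["u7", "u13"]))
```

The first call pays the compute cost; repeats for the same user are a single cache lookup, which is what keeps the endpoint fast.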
Create a `.env` file in the project root with the following values, matching your local Docker mapping:

```
POSTGRES_USER=admin
POSTGRES_PASSWORD=admin
POSTGRES_DB=graphrank
DB_HOST=postgres
REDIS_HOST=redis
REDIS_PORT=6379
```

To verify that the Redis layer holds up under high concurrency, run the Locust load test:

```bash
locust -f locustfile.py --host=http://localhost:8000
```

Then open http://localhost:8089 to launch the tests.