A Domain-Specific Language (DSL) compiler for data cleaning and exploratory data analysis (EDA) on CSV files.
PyClean DSL is a compiler-based domain-specific language that simplifies data cleaning and exploratory data analysis. It allows users to perform complex data operations using simple, English-like syntax without requiring programming knowledge.
Traditional approach (Python/Pandas):
import pandas as pd
df = pd.read_csv('data.csv')
df['age'] = df['age'].fillna(0)
df.drop_duplicates(inplace=True)
df['name'] = df['name'].str.strip()
df.to_csv('cleaned.csv')
PyClean DSL equivalent:
FILL_NULL age WITH 0;
REMOVE_DUPLICATES;
TRIM COLUMN name;
- Fill null values with custom values or methods (mean, median, mode)
- Remove duplicate rows
- Trim whitespace from columns
- Case conversion (uppercase/lowercase)
- Range-based data validation
- Column renaming and dropping
- Dataset information and statistics
- Univariate analysis with distribution plots
- Bivariate analysis with scatter plots
- Outlier detection (Z-score and IQR methods)
- Correlation analysis with heatmaps
Backend:
- Python 3.9+
- FastAPI - REST API framework
- Pandas - Data manipulation
- NumPy - Numerical computing
- Matplotlib & Seaborn - Visualization
Frontend:
- React 19 - UI framework
- TypeScript - Type safety
- Vite - Build tool
- Zustand - State management
- Monaco Editor - Code editor
Compiler:
- Custom Lexer (Tokenization)
- Recursive Descent Parser (AST Construction)
- Semantic Analyzer (Validation)
- Code Generator (Python/Pandas code generation)
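To illustrate the tokenization step listed above, here is a minimal sketch of a regex-based lexer for a few PyClean statements. It is not the project's `lexer.py`: the `Token` type, the `KEYWORDS` set, and the token names are assumptions made for this example.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int


# Hypothetical keyword subset, chosen for illustration only.
KEYWORDS = {"FILL_NULL", "WITH", "METHOD", "REMOVE_DUPLICATES", "TRIM", "COLUMN"}

# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("COMMENT", r"#[^\n]*"),
    ("NUMBER", r"-?\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SEMI", r";"),
    ("COMMA", r","),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t\r]+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> list[Token]:
    """Split DSL source into tokens, dropping whitespace and comments."""
    tokens = []
    line = 1
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "NEWLINE":
            line += 1
        elif kind in ("SKIP", "COMMENT"):
            continue
        elif kind == "MISMATCH":
            raise SyntaxError(f"Unexpected character {text!r} on line {line}")
        else:
            # Identifiers that spell a keyword are reclassified.
            if kind == "IDENT" and text in KEYWORDS:
                kind = "KEYWORD"
            tokens.append(Token(kind, text, line))
    return tokens


for token in tokenize("# fill\nFILL_NULL age WITH 0;"):
    print(token)
```

Recording the line number on each token lets later phases report errors against the user's original source.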
User (Web Interface)
↓
Frontend (React + TypeScript)
↓ HTTP Request
Backend API (FastAPI)
↓ Source Code
Compiler (Lexer → Parser → CodeGen)
↓ Generated Code
Execution Engine (Pandas + NumPy)
↓ Results
User (Cleaned Data + Visualizations)
- Python 3.9 or higher
- Node.js 16+ and npm
- Git
git clone https://github.com/Darkseid1729/pyclean.git
cd pyclean
Backend:
cd pyclean-backend
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt
Frontend:
cd pyclean-frontend
npm install
Option 1: Using Batch Files (Windows)
# From project root
START-ALL.bat
Option 2: Manual Start
Backend:
cd pyclean-backend
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS/Linux
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
Frontend:
cd pyclean-frontend
npm run dev
Access the application at http://localhost:5173
# Fill null values
FILL_NULL age WITH 0;
FILL_NULL salary WITH METHOD mean;
# Remove duplicates and trim
REMOVE_DUPLICATES;
TRIM COLUMN name;
# Case conversion
TO_UPPER COLUMN email;
TO_LOWER COLUMN address;
# Range validation
VALIDATE_RANGE age MIN 0 MAX 120;
# Column operations
RENAME COLUMN old_name TO new_name;
DROP COLUMN unnecessary_column;
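Most of the statements above name a column, so a typo would otherwise surface only when the generated code runs. The sketch below shows one way to check column names against the CSV header beforehand. It is an assumption-laden illustration, not the project's semantic analyzer: `COLUMN_REFS` and `unknown_columns` are hypothetical names, the check works line by line with regular expressions rather than on an AST, and it does not track columns introduced by `RENAME` or removed by `DROP`.

```python
import csv
import io
import re

# Patterns for a subset of the statements shown above; each captures one column name.
# Hypothetical: a real analyzer would more likely inspect AST nodes.
COLUMN_REFS = [
    re.compile(r"^FILL_NULL\s+(\w+)"),
    re.compile(r"^(?:TRIM|TO_UPPER|TO_LOWER|DROP)\s+COLUMN\s+(\w+)"),
    re.compile(r"^VALIDATE_RANGE\s+(\w+)"),
    re.compile(r"^RENAME\s+COLUMN\s+(\w+)"),
]


def unknown_columns(dsl_source: str, csv_text: str) -> list[tuple[int, str]]:
    """Return (line number, column) pairs for columns missing from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    known = {name.strip() for name in header}
    problems = []
    for line_no, line in enumerate(dsl_source.splitlines(), start=1):
        line = line.strip()
        for pattern in COLUMN_REFS:
            match = pattern.match(line)
            if match and match.group(1) not in known:
                problems.append((line_no, match.group(1)))
    return problems


csv_text = "name,age,salary\nAda,36,100\n"
dsl = (
    "FILL_NULL age WITH 0;\n"
    "TRIM COLUMN nmae;\n"
    "REMOVE_DUPLICATES;\n"
    "VALIDATE_RANGE height MIN 0 MAX 120;"
)
print(unknown_columns(dsl, csv_text))  # [(2, 'nmae'), (4, 'height')]
```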
# Basic information
EDA_INFO;
EDA_DESCRIBE;
# Analysis
EDA_UNIVARIATE age;
EDA_BIVARIATE age, salary;
EDA_OUTLIERS age;
EDA_CORRELATION;
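For readers unfamiliar with the two outlier rules behind `EDA_OUTLIERS` (Z-score and IQR), here is a minimal sketch using only the Python standard library. The project itself runs on Pandas and NumPy, so this is not its implementation. The function names, the default multiplier of 1.5, the default threshold of 3.0, and the use of the population standard deviation are all assumptions made for this example.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation (assumption)
    if spread == 0:
        return []  # all values identical, nothing to flag
    return [v for v in values if abs(v - mean) / spread > threshold]


ages = [21, 22, 23, 24, 25, 26, 27, 95]
print(iqr_outliers(ages))                    # [95]
print(zscore_outliers(ages, threshold=2.0))  # [95]
```

With only eight values, no point can reach a Z-score of 3 under the population standard deviation, which is why the example lowers the threshold to 2.0. The IQR rule is less sensitive to sample size because quartiles are not pulled toward the extreme value.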
PyClean DSL implements all classical compiler phases:
- Lexical Analysis (lexer.py) - Tokenizes DSL source code
- Syntax Analysis (parser.py) - Builds Abstract Syntax Tree (AST)
- Semantic Analysis (integrated) - Validates column names and types
- Code Generation (code_generator.py) - Generates Python/Pandas code
- Optimization - Uses in-place operations for efficiency
- Execution - Runs generated code and returns results
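The syntax analysis and code generation phases above can be sketched for three statements as follows. This is a simplified illustration, not the contents of `parser.py` or `code_generator.py`: the AST classes, the `Parser` methods, and `generate` are hypothetical, tokenization is reduced to a single `re.findall` call that ignores comments and unknown characters, and the emitted code uses reassignment rather than the in-place operations the project describes.

```python
import re
from dataclasses import dataclass

IDENT = r"[A-Za-z_]\w*"
NUMBER = r"-?\d+(?:\.\d+)?"


@dataclass
class FillNull:
    column: str
    value: str


@dataclass
class RemoveDuplicates:
    pass


@dataclass
class TrimColumn:
    column: str


class Parser:
    """Recursive descent parser: one method per grammar rule."""

    def __init__(self, source: str):
        # Simplified tokenization: identifiers, numbers, and ';' only.
        self.tokens = re.findall(rf"{IDENT}|{NUMBER}|;", source)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, pattern: str) -> str:
        """Consume the next token if it fully matches `pattern`, else fail."""
        token = self.peek()
        if token is None or not re.fullmatch(pattern, token):
            raise SyntaxError(f"Expected {pattern!r}, got {token!r}")
        self.pos += 1
        return token

    def parse_program(self) -> list:
        # program := (statement ';')*
        nodes = []
        while self.peek() is not None:
            nodes.append(self.parse_statement())
            self.take(";")
        return nodes

    def parse_statement(self):
        keyword = self.take(IDENT)
        if keyword == "FILL_NULL":
            column = self.take(IDENT)
            self.take("WITH")
            return FillNull(column, self.take(NUMBER))  # numeric fill values only
        if keyword == "REMOVE_DUPLICATES":
            return RemoveDuplicates()
        if keyword == "TRIM":
            self.take("COLUMN")
            return TrimColumn(self.take(IDENT))
        raise SyntaxError(f"Unknown statement: {keyword!r}")


def generate(nodes) -> str:
    """Walk the AST and emit one line of Pandas code per node."""
    lines = []
    for node in nodes:
        if isinstance(node, FillNull):
            lines.append(f"df[{node.column!r}] = df[{node.column!r}].fillna({node.value})")
        elif isinstance(node, RemoveDuplicates):
            lines.append("df = df.drop_duplicates()")
        elif isinstance(node, TrimColumn):
            lines.append(f"df[{node.column!r}] = df[{node.column!r}].str.strip()")
    return "\n".join(lines)


source = "FILL_NULL age WITH 0;\nREMOVE_DUPLICATES;\nTRIM COLUMN name;"
print(generate(Parser(source).parse_program()))
```

Keeping the AST as plain dataclasses separates parsing from generation, so a different backend could reuse the same parser.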
pyclean/
├── pyclean-backend/              # Python backend
│   ├── api/                      # FastAPI application
│   │   └── main.py               # API endpoints
│   ├── compiler/                 # Compiler components
│   │   ├── lexer.py              # Tokenizer (342 lines)
│   │   ├── parser.py             # Parser & AST (450+ lines)
│   │   └── code_generator.py     # Code generation (500+ lines)
│   ├── requirements.txt          # Python dependencies
│   └── start-server.bat          # Backend startup script
│
├── pyclean-frontend/             # React frontend
│   ├── src/
│   │   ├── components/           # React components
│   │   ├── hooks/                # Custom hooks
│   │   ├── services/             # API services
│   │   ├── store/                # State management
│   │   └── types/                # TypeScript types
│   ├── package.json              # Node dependencies
│   └── start-dev.bat             # Frontend startup script
│
├── Screenshots/                  # UI screenshots
├── START-ALL.bat                 # Launch both servers
└── README.md                     # This file
This project was developed as part of a Compiler Design course (Semester 5) and demonstrates:
- Complete compiler implementation from scratch
- Lexical analysis using regular expressions
- Recursive descent parsing
- Abstract Syntax Tree design and traversal
- Code generation techniques
- Full-stack application development
- RESTful API design
Team:
- Aditya Gautam (CS23B2037)
- Amit Anil Kamble (CS23B2034)
- Riyansh Singh Bhadouriya (CS23B2038)
- Vamsi (CS23B2027)
Future enhancements:
- Variables and expressions
- Conditional statements (IF-THEN-ELSE)
- Loops for batch operations
- User-defined functions
- Join/merge operations
- Advanced statistical operations
- Machine learning integration
- Support for additional file formats (Excel, JSON, SQL)
- Cloud deployment with Docker
- Multi-user support with authentication
This project is open source and available for educational purposes.
Contributions are welcome! Please feel free to submit a Pull Request.
- GitHub: @AmitAK1
- Email: cs23b2034@iiitdm.ac.in
Note: This is an academic project demonstrating compiler design principles. For production use, additional security measures and error handling should be implemented.