AmitAK1/pyclean

PyClean DSL

A Domain-Specific Language (DSL) compiler for data cleaning and exploratory data analysis (EDA) on CSV files.

Python React TypeScript FastAPI

🎯 Overview

PyClean DSL is a compiler-based domain-specific language that simplifies data cleaning and exploratory data analysis. It allows users to perform complex data operations using simple, English-like syntax without requiring programming knowledge.

Traditional Python/Pandas Approach

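import pandas as pd

df = pd.read_csv('data.csv')
# Assign the result back: a chained inplace fillna does not update df under pandas Copy-on-Write
df['age'] = df['age'].fillna(0)
df.drop_duplicates(inplace=True)
df['name'] = df['name'].str.strip()
# index=False keeps the row index out of the output file
df.to_csv('cleaned.csv', index=False)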

PyClean DSL Approach

FILL_NULL age WITH 0;
REMOVE_DUPLICATES;
TRIM COLUMN name;

✨ Features

Data Cleaning Operations

  • Fill null values with custom values or methods (mean, median, mode)
  • Remove duplicate rows
  • Trim whitespace from columns
  • Case conversion (uppercase/lowercase)
  • Range-based data validation
  • Column renaming and dropping

Exploratory Data Analysis (EDA)

  • Dataset information and statistics
  • Univariate analysis with distribution plots
  • Bivariate analysis with scatter plots
  • Outlier detection (Z-score and IQR methods)
  • Correlation analysis with heatmaps
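The two outlier rules named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not PyClean's actual implementation: the function names, the default thresholds (3.0 and 1.5), and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = [22, 25, 27, 29, 31, 33, 35, 120]
print(iqr_outliers(ages))     # [120]
print(zscore_outliers(ages))  # []
```

On a small sample a single extreme value inflates the standard deviation, so the Z-score rule with a threshold of 3 can miss a point that the IQR rule flags, as the `ages` example shows.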

Technology Stack

Backend:

  • Python 3.9+
  • FastAPI - REST API framework
  • Pandas - Data manipulation
  • NumPy - Numerical computing
  • Matplotlib & Seaborn - Visualization

Frontend:

  • React 19 - UI framework
  • TypeScript - Type safety
  • Vite - Build tool
  • Zustand - State management
  • Monaco Editor - Code editor

Compiler:

  • Custom Lexer (Tokenization)
  • Recursive Descent Parser (AST Construction)
  • Semantic Analyzer (Validation)
  • Code Generator (Python/Pandas code generation)
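To make the tokenization step concrete, here is a minimal sketch of a regex-based lexer for the statements shown in this README. It is not the project's `lexer.py`: the `Token` type, the token kinds, and the keyword set are hypothetical and cover only the data cleaning statements.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int


# Hypothetical keyword set, collected from the cleaning statements in this README.
KEYWORDS = {
    "FILL_NULL", "WITH", "METHOD", "REMOVE_DUPLICATES", "TRIM", "COLUMN",
    "TO_UPPER", "TO_LOWER", "VALIDATE_RANGE", "MIN", "MAX",
    "RENAME", "TO", "DROP",
}

# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("COMMENT", r"#[^\n]*"),
    ("NUMBER", r"-?\d+(?:\.\d+)?"),
    ("WORD", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SEMI", r";"),
    ("COMMA", r","),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t\r]+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> list[Token]:
    """Split DSL source into tokens, dropping whitespace and comments."""
    tokens = []
    line = 1
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "NEWLINE":
            line += 1
        elif kind in ("SKIP", "COMMENT"):
            continue
        elif kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {text!r} on line {line}")
        else:
            if kind == "WORD":
                kind = "KEYWORD" if text in KEYWORDS else "IDENT"
            tokens.append(Token(kind, text, line))
    return tokens


for token in tokenize("# Fill null values\nFILL_NULL age WITH 0;"):
    print(token)
```

Keeping the line number on each token lets later phases report errors against the user's source rather than against the generated code.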

๐Ÿ—๏ธ Architecture

User (Web Interface)
        ↓
Frontend (React + TypeScript)
        ↓ HTTP Request
Backend API (FastAPI)
        ↓ Source Code
Compiler (Lexer → Parser → CodeGen)
        ↓ Generated Code
Execution Engine (Pandas + NumPy)
        ↓ Results
User (Cleaned Data + Visualizations)

📦 Installation

Prerequisites

  • Python 3.9 or higher
  • Node.js 16+ and npm
  • Git

Clone the Repository

git clone https://github.com/Darkseid1729/pyclean.git
cd pyclean

Backend Setup

cd pyclean-backend
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

pip install -r requirements.txt

Frontend Setup

# From project root
cd pyclean-frontend
npm install

🚀 Usage

Start the Application

Option 1: Using Batch Files (Windows)

# From project root
START-ALL.bat

Option 2: Manual Start

Backend:

cd pyclean-backend
.venv\Scripts\activate  # Windows
# source .venv/bin/activate  # macOS/Linux
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000

Frontend:

cd pyclean-frontend
npm run dev

Access the application at http://localhost:5173

๐Ÿ“ DSL Syntax Examples

Data Cleaning Operations

# Fill null values
FILL_NULL age WITH 0;
FILL_NULL salary WITH METHOD mean;

# Remove duplicates and trim
REMOVE_DUPLICATES;
TRIM COLUMN name;

# Case conversion
TO_UPPER COLUMN email;
TO_LOWER COLUMN address;

# Range validation
VALIDATE_RANGE age MIN 0 MAX 120;

# Column operations
RENAME COLUMN old_name TO new_name;
DROP COLUMN unnecessary_column;
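As a rough reference for what two of these statements mean, the sketch below applies `FILL_NULL ... WITH METHOD ...` and a range check to plain Python rows using only the standard library. The function names and the row representation (a list of dicts with `None` for missing values) are assumptions for illustration. The README does not say whether `VALIDATE_RANGE` drops, clips, or reports out-of-range rows, so this sketch only reports them.

```python
from statistics import mean, median, mode

# Map the METHOD names from the DSL onto aggregate functions.
FILL_METHODS = {"mean": mean, "median": median, "mode": mode}


def fill_null_with_method(rows, column, method):
    """Return copies of `rows` with None in `column` replaced by an aggregate."""
    if method not in FILL_METHODS:
        raise ValueError(f"unknown fill method {method!r}")
    present = [row[column] for row in rows if row[column] is not None]
    if not present:
        return [dict(row) for row in rows]  # nothing to aggregate from
    fill_value = FILL_METHODS[method](present)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


def out_of_range(rows, column, minimum, maximum):
    """Return the rows whose non-null value lies outside [minimum, maximum]."""
    return [
        row for row in rows
        if row[column] is not None and not (minimum <= row[column] <= maximum)
    ]


rows = [
    {"age": 30, "salary": 100.0},
    {"age": 150, "salary": None},
    {"age": 45, "salary": 300.0},
]
filled = fill_null_with_method(rows, "salary", "mean")
print(filled[1]["salary"])                # 200.0
print(out_of_range(rows, "age", 0, 120))  # [{'age': 150, 'salary': None}]
```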

EDA Operations

# Basic information
EDA_INFO;
EDA_DESCRIBE;

# Analysis
EDA_UNIVARIATE age;
EDA_BIVARIATE age, salary;
EDA_OUTLIERS age;
EDA_CORRELATION;

🔧 Compiler Design Phases

PyClean DSL's pipeline follows the classical compiler phases and ends by executing the generated code:

  1. Lexical Analysis (lexer.py) - Tokenizes DSL source code
  2. Syntax Analysis (parser.py) - Builds Abstract Syntax Tree (AST)
  3. Semantic Analysis (integrated) - Validates column names and types
  4. Code Generation (code_generator.py) - Generates Python/Pandas code
  5. Optimization - Uses in-place operations for efficiency
  6. Execution - Runs generated code and returns results
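The syntax analysis and code generation phases can be illustrated with a small recursive descent parser for three statements. This is a simplified sketch, not the project's `parser.py` or `code_generator.py`: the AST classes, the grammar in the comments, and the exact Pandas code emitted are assumptions, and the inline tokenizer ignores comments and unrecognized characters.

```python
import re
from dataclasses import dataclass

NUMBER = r"-?\d+(?:\.\d+)?"


@dataclass
class FillNull:
    column: str
    value: str


@dataclass
class RemoveDuplicates:
    pass


@dataclass
class TrimColumn:
    column: str


class Parser:
    """Recursive descent: one method per grammar rule, one token of lookahead."""

    def __init__(self, source: str) -> None:
        # Simplified tokenizer: identifiers, numbers and ';' only.
        self.tokens = re.findall(rf"[A-Za-z_]\w*|{NUMBER}|;", source)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, expected=None) -> str:
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {token!r}")
        self.pos += 1
        return token

    def parse_program(self) -> list:
        # program := (statement ';')*
        statements = []
        while self.peek() is not None:
            statements.append(self.parse_statement())
            self.expect(";")
        return statements

    def parse_statement(self):
        keyword = self.expect()
        if keyword == "FILL_NULL":
            # fill_null := 'FILL_NULL' IDENT 'WITH' NUMBER
            column = self.expect()
            self.expect("WITH")
            value = self.expect()
            if not re.fullmatch(NUMBER, value):
                raise SyntaxError(f"expected a number, got {value!r}")
            return FillNull(column, value)
        if keyword == "REMOVE_DUPLICATES":
            return RemoveDuplicates()
        if keyword == "TRIM":
            # trim := 'TRIM' 'COLUMN' IDENT
            self.expect("COLUMN")
            return TrimColumn(self.expect())
        raise SyntaxError(f"unknown statement {keyword!r}")


def generate(statements) -> str:
    """Walk the AST and emit one line of Pandas code per statement."""
    lines = []
    for node in statements:
        if isinstance(node, FillNull):
            lines.append(f"df[{node.column!r}] = df[{node.column!r}].fillna({node.value})")
        elif isinstance(node, RemoveDuplicates):
            lines.append("df = df.drop_duplicates()")
        elif isinstance(node, TrimColumn):
            lines.append(f"df[{node.column!r}] = df[{node.column!r}].str.strip()")
    return "\n".join(lines)


source = "FILL_NULL age WITH 0;\nREMOVE_DUPLICATES;\nTRIM COLUMN name;"
print(generate(Parser(source).parse_program()))
```

A semantic analysis pass would sit between `parse_program` and `generate`, for example checking each `column` against the header of the uploaded CSV before any code is emitted.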

📂 Project Structure

pyclean/
├── pyclean-backend/           # Python backend
│   ├── api/                   # FastAPI application
│   │   └── main.py           # API endpoints
│   ├── compiler/              # Compiler components
│   │   ├── lexer.py          # Tokenizer (342 lines)
│   │   ├── parser.py         # Parser & AST (450+ lines)
│   │   └── code_generator.py # Code generation (500+ lines)
│   ├── requirements.txt       # Python dependencies
│   └── start-server.bat      # Backend startup script
│
├── pyclean-frontend/          # React frontend
│   ├── src/
│   │   ├── components/       # React components
│   │   ├── hooks/            # Custom hooks
│   │   ├── services/         # API services
│   │   ├── store/            # State management
│   │   └── types/            # TypeScript types
│   ├── package.json          # Node dependencies
│   └── start-dev.bat         # Frontend startup script
│
├── Screenshots/               # UI screenshots
├── START-ALL.bat             # Launch both servers
└── README.md                 # This file

🎓 Academic Context

This project was developed as part of a Compiler Design course (Semester 5) and demonstrates:

  • Complete compiler implementation from scratch
  • Lexical analysis using regular expressions
  • Recursive descent parsing
  • Abstract Syntax Tree design and traversal
  • Code generation techniques
  • Full-stack application development
  • RESTful API design

👥 Team Members

  • Aditya Gautam (CS23B2037)
  • Amit Anil Kamble (CS23B2034)
  • Riyansh Singh Bhadouriya (CS23B2038)
  • Vamsi (CS23B2027)

🔮 Future Enhancements

  • Variables and expressions
  • Conditional statements (IF-THEN-ELSE)
  • Loops for batch operations
  • User-defined functions
  • Join/merge operations
  • Advanced statistical operations
  • Machine learning integration
  • Support for additional file formats (Excel, JSON, SQL)
  • Cloud deployment with Docker
  • Multi-user support with authentication

📄 License

This project is open source and available for educational purposes.

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📧 Contact


Note: This is an academic project demonstrating compiler design principles. For production use, additional security measures and error handling should be implemented.
