Build a machine learning model that predicts whether a mushroom is poisonous or edible based on its physical and environmental attributes. The goal is to identify potentially harmful mushrooms early so that safer decisions can be made when handling or consuming them.
This application evaluates multiple classification models to determine the outcome:
- `e` → Edible Mushroom
- `p` → Poisonous Mushroom
This project uses a mushroom classification dataset containing real-world style observations of mushroom specimens. The dataset includes cap, gill, stem, veil, ring, and habitat-related features that are highly useful for predicting whether a mushroom is edible or poisonous.
The dataset used in this project is available locally in this repository at `data/mushroom.csv`.
- Total Records: 12,214
- Total Columns: 21
- Input Features: 20
- Target Column: `class`
- `cap-diameter`: diameter of the mushroom cap
- `cap-shape`: shape of the cap
- `cap-surface`: texture of the cap surface
- `cap-color`: color of the cap
- `does-bruise-or-bleed`: whether the mushroom bruises or bleeds
- `gill-attachment`: type of gill attachment
- `gill-spacing`: spacing between gills
- `gill-color`: color of the gills
- `stem-height`: height of the stem
- `stem-width`: width of the stem
- `stem-root`: root characteristic of the stem
- `stem-surface`: texture of the stem surface
- `stem-color`: color of the stem
- `veil-type`: type of veil present
- `veil-color`: color of the veil
- `has-ring`: whether a ring is present
- `ring-type`: type of ring
- `spore-print-color`: color of the spore print
- `habitat`: natural habitat of the mushroom
- `season`: season in which the mushroom appears
- `class`: mushroom class (`e` = edible, `p` = poisonous)
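Most of these features are categorical codes, so they need to be converted to numbers before model training. The snippet below is a minimal sketch of label encoding on a toy frame; the project's actual preprocessing lives in `data/preprocess_data.py` and may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative frame in the style of the dataset's categorical columns
df = pd.DataFrame({
    "cap-color": ["n", "y", "n", "w"],
    "has-ring": ["t", "f", "t", "t"],
    "class": ["e", "p", "e", "p"],
})

# Fit one encoder per column so the codes can be inverted later
encoders = {}
for col in df.columns:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

print(df["class"].tolist())  # 'e' < 'p' alphabetically, so e -> 0, p -> 1
```

Keeping the fitted encoders around lets the app map predictions back to the original `e`/`p` labels.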
Data is split using a stratified train-test split (`test_size=0.2`, `random_state=42`):
- Train set: 9,771 rows
- Test set: 2,443 rows
Data files:
- `data/mushroom_train.csv`
- `data/mushroom_test.csv`
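The split above can be reproduced with scikit-learn's `train_test_split`; the toy frame below stands in for `data/mushroom.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for data/mushroom.csv (10 rows, balanced classes)
df = pd.DataFrame({"feature": range(10), "class": ["e"] * 5 + ["p"] * 5})

train_df, test_df = train_test_split(
    df,
    test_size=0.2,          # 80/20 split, as stated above
    random_state=42,        # reproducible shuffle
    stratify=df["class"],   # keep the e/p ratio identical in both sets
)
print(len(train_df), len(test_df))  # 8 2
```

Stratifying on `class` is what guarantees the train and test sets carry the same edible/poisonous proportions as the full dataset.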
- Logistic Regression (`model/logistic_regression.py`)
- Decision Tree (`model/decision_tree.py`)
- KNN (`model/knn.py`)
- Naive Bayes (`model/naive_bayes.py`)
- Random Forest (`model/random_forest.py`)
- XGBoost (`model/xgboost_model.py`)
All 6 models are evaluated on:
- Accuracy
- AUC Score
- Precision
- Recall
- F1 Score
- Matthews Correlation Coefficient (MCC)
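All six metrics are available in scikit-learn. The self-contained sketch below computes them on toy predictions (`1` = poisonous); note that AUC is computed from predicted probabilities rather than hard labels.

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

# Toy labels and predictions for illustration (1 = poisonous, 0 = edible)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]  # predicted P(poisonous)

metrics = {
    "Accuracy":  accuracy_score(y_true, y_pred),
    "AUC":       roc_auc_score(y_true, y_prob),   # needs probabilities
    "Precision": precision_score(y_true, y_pred),
    "Recall":    recall_score(y_true, y_pred),
    "F1":        f1_score(y_true, y_pred),
    "MCC":       matthews_corrcoef(y_true, y_pred),
}
print(metrics)
```

Looping this dictionary over each trained model is enough to regenerate a table like `metrics_comparison.csv`.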
The following values are populated from `metrics_comparison.csv`:
| ML Model Name | Accuracy | AUC | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.8420 | 0.9184 | 0.8403 | 0.8438 | 0.8411 | 0.6841 |
| Decision Tree | 0.9947 | 0.9948 | 0.9945 | 0.9948 | 0.9946 | 0.9892 |
| KNN | 0.9992 | 1.0000 | 0.9992 | 0.9992 | 0.9992 | 0.9983 |
| Naive Bayes | 0.6013 | 0.8357 | 0.7550 | 0.6402 | 0.5666 | 0.3782 |
| Random Forest (Ensemble) | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| XGBoost (Ensemble) | 0.9476 | 0.9913 | 0.9462 | 0.9482 | 0.9471 | 0.8944 |
| ML Model Name | Observation about model performance |
|---|---|
| Logistic Regression | Good baseline performance and balanced precision/recall; lower than tree/ensemble methods on this dataset. |
| Decision Tree | Very strong performance across all metrics; captures non-linear splits effectively. |
| KNN | Near-perfect scores, indicating strong local separability of classes after preprocessing. |
| Naive Bayes | Lowest overall scores; conditional independence assumption appears less suitable for this data distribution. |
| Random Forest (Ensemble) | Best overall performance (perfect metrics in current run), indicating excellent robustness and generalization on this split. |
| XGBoost (Ensemble) | Excellent AUC and MCC; slightly below Random Forest/KNN but still high-performing and reliable. |
The Streamlit app (`app.py`) includes all required components:
- ✅ Dataset upload option (CSV)
- ✅ Model selection dropdown (multiple models)
- ✅ Display of evaluation metrics
- ✅ Confusion matrix / classification report
Additional implemented capabilities:
- Prediction preview table
- Downloadable prediction CSV
- Model comparison table and metric visualization
Run locally:

```bash
streamlit run app.py
```

Project structure:

```
project-folder/
├── app.py
├── requirements.txt
├── README.md
├── README_ASSIGNMENT.md
├── train.py
├── metrics_comparison.csv
├── data/
│   ├── mushroom.csv
│   ├── mushroom_train.csv
│   ├── mushroom_test.csv
│   └── preprocess_data.py
└── model/
    ├── logistic_regression.py
    ├── decision_tree.py
    ├── knn.py
    ├── naive_bayes.py
    ├── random_forest.py
    ├── xgboost_model.py
    └── saved_models/
        ├── logistic_regression.pkl
        ├── decision_tree.pkl
        ├── knn.pkl
        ├── naive_bayes.pkl
        ├── random_forest.pkl
        └── xgboost.pkl
```
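The `.pkl` files under `model/saved_models/` are pickle serializations of the trained estimators. Assuming they are scikit-learn models, the save/load round trip works as in this hypothetical toy example:

```python
import pickle

from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the encoded mushroom features and labels
X = [[0, 1], [1, 0], [0, 0], [1, 1]]
y = [0, 1, 0, 1]
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Serialize and reload, as train.py writes and app.py reads the
# files in model/saved_models/
blob = pickle.dumps(clf)
restored = pickle.loads(blob)
print(restored.predict(X))
```

The restored model predicts identically to the original, which is why the app can serve predictions without retraining.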
Install dependencies:

```bash
pip install -r requirements.txt
```

Train all models and regenerate metrics:

```bash
python train.py
```

Run the Streamlit app:

```bash
streamlit run app.py
```

- Name: Sumanth_T_P
- BITS ID: 2025AA05544
- Streamlit App Link: https://ml-models-app-sumanth-tp-2025aa05544.streamlit.app/