Here is a look under the hood at how I engineered the data, trained the model, and deployed it to the web.
The model is powered by a comprehensive dataset consisting of 13 distinct features. These range from categorical demographics (like parent education and school type) to behavioral metrics (study hours and attendance) and baseline test scores.
A major challenge during preprocessing was normalizing feature variance without introducing data leakage. To achieve this, I implemented a partial feature scaling strategy: instead of scaling the entire array, I isolated the continuous numerical variables (Age, Study Hours, and Attendance) and standardized them with the z-score formula:
`z = \frac{x-\mu}{\sigma}`
By leaving the categorical features and raw test scores unscaled, the decision trees could split on those inputs in their original units, without distortion.
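The partial scaling idea can be sketched with scikit-learn's `ColumnTransformer`. This is a minimal illustration, not the project's actual pipeline: the data is synthetic, the column positions are hypothetical, and fitting the scaler on the training split only is an assumption about how leakage was avoided.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: columns 0-2 play the role of Age, Study Hours
# and Attendance; columns 3-4 stand in for an encoded category and a raw score.
n = 200
X = np.column_stack([
    rng.integers(14, 19, n),   # age (to be standardized)
    rng.uniform(0, 30, n),     # weekly study hours (to be standardized)
    rng.uniform(50, 100, n),   # attendance % (to be standardized)
    rng.integers(0, 3, n),     # encoded category (left unscaled)
    rng.uniform(0, 100, n),    # raw test score (left unscaled)
])
y = rng.integers(0, 6, n)      # six hypothetical grade classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardize only the first three columns; pass the rest through untouched.
scaler = ColumnTransformer(
    [("standardize", StandardScaler(), [0, 1, 2])],
    remainder="passthrough",
)

# Fit on the training split only, then reuse the same mean and standard
# deviation on the test split so no test statistics leak into training.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

In the output, the standardized columns come first and the passthrough columns follow in their original order.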
For the predictive engine, I utilized a Random Forest Classifier. This specific ensemble algorithm was chosen because of its exceptional ability to handle mixed data types and map non-linear relationships.
For example, if a student has an outstanding attendance rate (95%) but critically low weekly study hours, the Random Forest weighs these conflicting signals across hundreds of decision nodes in its trees, then aggregates the trees' votes into a single grade prediction.
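A toy version of that scenario, as a sketch only: the data is synthetic, the labelling rule is made up for illustration, and the task is reduced to two classes rather than the project's full set of grades.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Hypothetical synthetic students: attendance (%) and weekly study hours.
n = 600
attendance = rng.uniform(50, 100, n)
study_hours = rng.uniform(0, 30, n)
X = np.column_stack([attendance, study_hours])

# Made-up rule: a student "passes" (1) only with BOTH good attendance and
# enough study time. This interaction is non-linear in the two features.
y = ((attendance > 80) & (study_hours > 10)).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

# A student with conflicting signals versus one who is strong on both.
conflicted = np.array([[95.0, 2.0]])
consistent = np.array([[95.0, 20.0]])

pred_conflicted = model.predict(conflicted)[0]
pred_consistent = model.predict(consistent)[0]

# Class probabilities are averaged over all trees in the forest.
proba_conflicted = model.predict_proba(conflicted)[0]
print(pred_conflicted, pred_consistent, proba_conflicted)
```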
Building the model locally is only half the battle; the real test is deployment. I chose Streamlit to build a clean, interactive graphical user interface (GUI) and deployed it via Streamlit Community Cloud.
However, this introduced a significant engineering hurdle. The trained Random Forest model (.pkl file) was over 54 MB, more than twice GitHub's 25 MB limit for files uploaded through the browser. Instead of relying on a Git Large File Storage (LFS) workaround, I shrank the file by applying aggressive compression during the export phase.
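The original export call is not shown here. One common way to compress a scikit-learn model at export time is the `compress` argument of `joblib.dump`; the sketch below assumes that approach, with a synthetic model and hypothetical file names.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in model: 13 features and 6 classes, random labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 13))
y = rng.integers(0, 6, size=2000)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "model_raw.pkl")          # hypothetical name
    small_path = os.path.join(tmp, "model_compressed.pkl")  # hypothetical name

    joblib.dump(model, raw_path)                # no compression
    joblib.dump(model, small_path, compress=9)  # zlib at the highest level

    raw_size = os.path.getsize(raw_path)
    small_size = os.path.getsize(small_path)

    # joblib.load detects the compression automatically.
    restored = joblib.load(small_path)

same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(f"raw: {raw_size / 1e6:.2f} MB, compressed: {small_size / 1e6:.2f} MB")
```

Higher compression levels trade slower save and load times for a smaller file, so the right level depends on how often the app reloads the model.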
A train-test split is the foundation of machine learning evaluation. The dataset is divided into a "Training" set that teaches the algorithm to recognize patterns and a held-out "Testing" set that measures its performance on data it has never seen.
Feature scaling is a mathematical technique used to normalize the range of independent variables. This ensures that features with large numerical ranges (like 'Study Hours') do not unfairly dominate the algorithm over features with smaller ranges.
Analysis of Variance (ANOVA) is a statistical method used to test differences between two or more means. In feature selection, it helps determine which variables have the most statistically significant relationship with the target outcome.
Classification is the task of predicting discrete categories (e.g., Grade A vs. Grade F), which is the focus of this project. In contrast, Regression predicts continuous numerical values (e.g., forecasting an exact test score of 87.5).
A Random Forest is an ensemble learning algorithm that builds a "forest" of decision trees during training and merges their outputs. Aggregating many trees yields more stable predictions and makes the model less prone to overfitting than a single decision tree.
A confusion matrix is a table layout that visualizes the performance of a classifier. It breaks down where the model succeeded and failed by counting True Positives, False Positives, True Negatives, and False Negatives.
A heat map is a color-coded graphical representation of data. Heat maps are frequently used to visualize confusion matrices or variable correlations, making large numerical arrays easier to read at a glance.
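The ANOVA-based feature selection mentioned in the glossary can be sketched with scikit-learn's `f_classif`, which computes a one-way ANOVA F-statistic per feature. The data and the choice of `k=1` are illustrative assumptions, not the project's settings.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 3, n)  # three hypothetical grade bands

# One feature whose mean shifts with the class, plus two pure-noise features.
informative = y * 2.0 + rng.normal(0, 1, n)
noise_a = rng.normal(0, 1, n)
noise_b = rng.normal(0, 1, n)
X = np.column_stack([noise_a, informative, noise_b])

# A high F-score (and low p-value) means the class means differ strongly.
f_scores, p_values = f_classif(X, y)

# Keep the single highest-scoring feature.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = selector.get_support(indices=True)
print(f_scores, selected)
```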
Raw mathematical outputs are transformed into actionable insights through strategic data visualization. Here are the core graphical techniques used to interpret the model's behavior.
The feature importance chart is a bar chart extracted directly from the trained Random Forest. It ranks all 13 input features to show which variables (e.g., study hours vs. historical scores) carry the most weight in determining the final grade.
The confusion matrix is a grid-based visual that maps the classifier's predictions against the actual grades. Correct predictions sit on the diagonal and errors fall off it, allowing educators to see exactly where the model might be misclassifying a 'C' student as a 'B' student.
The correlation heat map is a color-coded matrix displaying the correlation coefficients between variables. It helps check that the model isn't being skewed by redundant data and visualizes the relationships uncovered during the ANOVA feature selection phase.
Distribution plots were used during the initial preprocessing phase. These charts map the spread and frequency of continuous variables like attendance and age, highlighting outliers and motivating the standard scaling step.
Hover over the bars to explore the exact predictive weight of each attribute.
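The weights behind such a chart are available from a fitted forest through its `feature_importances_` attribute. A minimal sketch, assuming impurity-based importances and using synthetic data with hypothetical feature names:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
feature_names = ["study_hours", "attendance", "noise"]  # hypothetical names

# Synthetic data where study_hours matters most, attendance less, noise not at all.
n = 500
X = rng.uniform(0, 1, size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.75).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=7).fit(X, y)

# Impurity-based importances, normalized to sum to 1 across features.
importances = model.feature_importances_

ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name:12s} {score:.3f}")
```

Impurity-based importances can favor features with many distinct values, so a permutation-based check is a reasonable cross-validation of the ranking.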
Confusion matrix: correct predictions (True Positives) lie along the dark diagonal.
| Actual \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 88 | 65 | 2 | 0 | 0 | 0 |
| 1 | 16 | 223 | 75 | 0 | 0 | 0 |
| 2 | 0 | 74 | 573 | 105 | 0 | 0 |
| 3 | 0 | 0 | 83 | 567 | 71 | 0 |
| 4 | 0 | 0 | 0 | 92 | 521 | 57 |
| 5 | 0 | 0 | 0 | 0 | 95 | 293 |

| Metric | Value |
|---|---|
| Overall Accuracy | 94% |
| Precision | 92% |
| Recall (Sensitivity) | 93% |
| F1-Score | 0.92 |
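For reference, accuracy, precision, recall, and F1 can all be derived from a confusion matrix. The sketch below uses the counts from the table above and assumes macro averaging across classes. Its results describe that table only and may differ from the headline figures if those were computed on a different split or with a different averaging scheme.

```python
import numpy as np

# Counts copied from the confusion matrix table (rows = actual, columns = predicted).
cm = np.array([
    [88, 65, 2, 0, 0, 0],
    [16, 223, 75, 0, 0, 0],
    [0, 74, 573, 105, 0, 0],
    [0, 0, 83, 567, 71, 0],
    [0, 0, 0, 92, 521, 57],
    [0, 0, 0, 0, 95, 293],
])

true_positives = np.diag(cm)

# Accuracy: correct predictions over all predictions.
accuracy = true_positives.sum() / cm.sum()

# Per-class precision (column totals) and recall (row totals).
precision = true_positives / cm.sum(axis=0)
recall = true_positives / cm.sum(axis=1)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:        {accuracy:.3f}")
print(f"macro precision: {precision.mean():.3f}")
print(f"macro recall:    {recall.mean():.3f}")
print(f"macro F1:        {f1.mean():.3f}")
```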
Final-year researcher at Bundelkhand University, specializing in mathematical statistics, algorithmic modeling, and full-stack data science deployment. Passionate about leveraging Python and Scikit-Learn to bridge the gap between complex statistical theory and accessible web technologies.