End-to-End Machine Learning: Predicting Student Academic Performance

Published by Sumit Kumar On July 20, 2025

🚀 Try the Live Predictor App Here

Welcome back to Data Deep Dive. For my final year BSc project, I wanted to tackle a critical issue in educational systems: the reliance on reactive grading. Too often, at-risk students are identified only after they have failed a major examination.

To solve this, I built an end-to-end machine learning pipeline that shifts the focus from reactive grading to proactive academic intervention. By analyzing demographic, behavioral, and preliminary academic data, this application predicts a student's final grade category, allowing educators to step in early.

Here is a look under the hood at how I engineered the data, trained the model, and deployed it to the web.

Data Architecture & Feature Engineering

The model is powered by a comprehensive dataset consisting of 13 distinct features. These range from categorical demographics (like parent education and school type) to behavioral metrics (study hours and attendance) and baseline test scores.

A major challenge during preprocessing was preventing data leakage while normalizing the variance. To achieve this, I implemented a strict partial feature scaling strategy. Instead of scaling the entire array, I isolated the continuous numerical variables (Age, Study Hours, and Attendance) and applied a Standardization technique:

`z = \frac{x-\mu}{\sigma}`

Here $\mu$ and $\sigma$ are the mean and standard deviation of the feature being scaled. By keeping the categorical columns and raw test scores completely unscaled, the decision trees could split on those inputs in their original units.
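The partial scaling step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names and the synthetic data are assumptions, and scikit-learn's `ColumnTransformer` is one possible way to scale only some columns. The scaler is fitted on the training split only, which is the part that guards against leakage.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column names are assumptions for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(15, 20, size=200),
    "study_hours": rng.uniform(0, 30, size=200),
    "attendance": rng.uniform(50, 100, size=200),
    "school_type": rng.integers(0, 2, size=200),  # already-encoded category
    "test_score": rng.uniform(0, 100, size=200),  # baseline score, left raw
})

X_train, X_test = train_test_split(df, test_size=0.2, random_state=0)

continuous = ["age", "study_hours", "attendance"]
scaler = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous)],
    remainder="passthrough",  # all other columns pass through unchanged
)

# Fit on the training split only, so test-set statistics never leak in.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

In the output array, the three scaled columns come first, followed by the untouched columns in their original order.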

The Engine: Random Forest Classifier

For the predictive engine, I utilized a Random Forest Classifier. This specific ensemble algorithm was chosen because of its exceptional ability to handle mixed data types and map non-linear relationships.

For example, if a student has an outstanding attendance rate (95%) but critically low weekly study hours, the trees in the forest split on both signals across hundreds of decision nodes and then vote, so the predicted grade reflects how the two behaviors interact rather than either one alone.
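A toy version of that scenario might look like the sketch below. The data is synthetic and the labeling rule (a student does well only when attendance and study hours are both high) is a made-up assumption, chosen purely to show a non-linear interaction between two features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500
attendance = rng.uniform(50, 100, size=n)
study_hours = rng.uniform(0, 30, size=n)

# Hypothetical rule: good outcome only if BOTH conditions hold.
good_outcome = ((attendance > 75) & (study_hours > 10)).astype(int)
X = np.column_stack([attendance, study_hours])

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, good_outcome)

# A student with 95% attendance but only 2 study hours per week.
student = np.array([[95.0, 2.0]])
prediction = int(model.predict(student)[0])
vote_share = model.predict_proba(student)[0]  # averaged votes per class
```

Here `prediction` comes out as the "not good" class (0) despite the high attendance, because the forest has learned that attendance alone is not enough under the assumed rule.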

Overcoming Deployment Hurdles

Building the model locally is only half the battle; the real test is deployment. I chose Streamlit to build a clean, interactive graphical user interface (GUI) and deployed it via Streamlit Community Cloud.

However, this introduced a significant engineering hurdle. The trained Random Forest model (.pkl file) was over 54 MB, more than double GitHub's 25 MB limit for files uploaded through the browser. Instead of relying on Git Large File Storage (LFS), I compressed the model aggressively during the export phase using Joblib's level-9 compression.
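The export step can be sketched as below. The model here is a small stand-in trained on random data and the file names are placeholders, so the sizes will not match the 54 MB figure; the point is only the `compress=9` argument to `joblib.dump`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in model on random data (13 features, 6 classes), not the real one.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 13))
y = rng.integers(0, 6, size=2000)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "model_raw.pkl")          # placeholder name
    small_path = os.path.join(tmp, "model_compressed.pkl")  # placeholder name

    joblib.dump(model, raw_path)                # no compression
    joblib.dump(model, small_path, compress=9)  # maximum compression level

    raw_size = os.path.getsize(raw_path)
    small_size = os.path.getsize(small_path)

    # joblib.load detects the compression automatically.
    restored = joblib.load(small_path)

same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(f"raw: {raw_size} bytes, compressed: {small_size} bytes")
```

Compression trades a slower save and load for a smaller file, which is usually acceptable for a model that is loaded once when the app starts.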

🎓 End-to-End Predictor Architecture

The Theory: Traditional grading is reactive. This system leverages statistical modeling to provide proactive intervention. By processing continuous variables through partial feature scaling to normalize variance—while preserving raw categorical splits—the model maps complex, non-linear student behaviors without data leakage.

1. The Inputs: 13 demographic, behavioral, and academic features (e.g., Attendance, Study Hours).

2. The Engine: Random Forest Classifier analyzing inputs across hundreds of decision nodes.

3. The Output: Real-time inference translating raw arrays into human-readable Grade Categories (A-F).
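The final translation step, from a raw feature array to a grade letter, could look something like this. The helper name `predict_grade` and the mapping of class indices 0 to 5 onto the letters A to F are assumptions for illustration; the actual label encoding may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed label encoding: class index 0..5 corresponds to grades A..F.
GRADE_LABELS = {0: "A", 1: "B", 2: "C", 3: "D", 4: "E", 5: "F"}


def predict_grade(model, features):
    """Return a grade letter for one student's feature vector."""
    row = np.asarray(features, dtype=float).reshape(1, -1)  # one sample
    class_index = int(model.predict(row)[0])
    return GRADE_LABELS[class_index]


# Stand-in model trained on random data with 13 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 13))
y = rng.integers(0, 6, size=300)
model = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

grade = predict_grade(model, X[0])
```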

🛠️ Core Technologies & Libraries

🐍

Python

The foundational programming language powering the entire backend logic, data processing, and machine learning pipeline.

⚙️

Scikit-Learn

The primary machine learning library, providing both the StandardScaler used in preprocessing and the Random Forest Classifier.

🔢

NumPy & Pandas

Pandas managed the tabular data structures during training, while NumPy handled the mathematical matrix operations and multidimensional arrays required for real-time inference.

📦

Joblib

Handled model serialization. Its level-9 compression shrank the exported model enough to fit under the upload limit.

👑

Streamlit

The rapid-deployment web framework that translated the backend mathematical Python scripts into a highly interactive, user-friendly Graphical User Interface.

☁️

GitHub & Community Cloud

GitHub hosted the version-controlled repository, and Streamlit Community Cloud deployed the app directly from it for live, public hosting.

🧠 The Machine Learning Sequence

Train & Test Data

Data Split

The foundation of machine learning evaluation. The dataset is divided into a "Training" set that teaches the algorithm to recognize patterns, and a held-out "Testing" set used to evaluate its performance on completely unseen data.
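A split of this kind can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio, the random seed, and the stratification are assumptions here, and the arrays are random placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 students, 13 features, 6 grade classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))
y = rng.integers(0, 6, size=1000)

# stratify=y keeps the grade proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```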

Scaling Features

Preprocessing

A mathematical technique used to normalize the range of independent variables. This ensures that features with large numerical ranges (like 'Study Hours') do not unfairly dominate the algorithm over smaller features.

ANOVA

Statistics

Analysis of Variance (ANOVA) is a statistical method used to test differences between two or more means. In feature selection, it helps determine which variables have the most statistically significant relationship with the target outcome.
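As a rough illustration of ANOVA-based feature selection, the sketch below uses scikit-learn's `f_classif` scorer with `SelectKBest`. The data is synthetic: one feature is constructed so that its mean shifts with the class, while the other two are pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 3, size=n)

informative = y * 2.0 + rng.normal(size=n)  # class means differ
noise = rng.normal(size=(n, 2))             # class means are the same
X = np.column_stack([informative, noise])

# One F-statistic and p-value per feature.
f_scores, p_values = f_classif(X, y)

# Keep only the single highest-scoring feature.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = selector.get_support(indices=True)
```

A large F-score (and small p-value) means the feature's average differs strongly between classes, which is why the first column is the one selected here.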

Classification vs. Regression

Task Type

Classification is the task of predicting discrete categories (e.g., Grade A vs. Grade F), which is the focus of this project. In contrast, Regression predicts continuous numerical values (e.g., forecasting an exact test score of 87.5).

Random Forest

Algorithm

A powerful ensemble learning algorithm that operates by constructing a "forest" of multiple decision trees during training. It merges their outputs to provide highly accurate, stable predictions and naturally resists overfitting.

Confusion Matrix

Evaluation

A specific table layout that visualizes the performance of an algorithm. It explicitly breaks down where the model succeeded and failed by mapping True Positives, False Positives, True Negatives, and False Negatives.
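For the two-class case, the four counts can be read straight out of scikit-learn's `confusion_matrix`. The labels below are a made-up toy example.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```

With more than two classes, as in this project, the matrix grows to one row and one column per class, and the same four counts are defined separately for each class.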

Heat Map

Matrix Visualization

A highly visual, color-coded graphical representation of data. Heat maps are frequently used to visualize Confusion Matrices or variable correlations, making complex numerical arrays instantly understandable to human evaluators.

📊 Data Interpretation & Visualizations

Raw mathematical outputs are transformed into actionable insights through strategic data visualization. Here are the core graphical techniques used to interpret the model's behavior.

📉

Feature Importance Chart

Model Transparency

A bar chart built from the importance scores of the trained Random Forest. It ranks all 13 input features to show which variables (e.g., study hours vs. historical scores) carry the most weight in determining the final grade.
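The ranking behind such a chart can be sketched from the fitted model's `feature_importances_` attribute. The three feature names and the synthetic target, which depends on study hours only, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "study_hours": rng.uniform(0, 30, n),
    "attendance": rng.uniform(50, 100, n),
    "age": rng.integers(15, 20, n),
})
# Synthetic target that depends on study_hours only.
y = (X["study_hours"] > 15).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` would turn the ranking into a bar chart, assuming matplotlib is installed.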

🔲

Confusion Matrix Grid

Performance Evaluation

A grid that maps predicted grades against actual grades. Correct predictions sit on the diagonal, and the off-diagonal cells show educators exactly where the model misclassifies, for example labeling a 'C' student as a 'B' student.

🌡️

Correlation Heat Map

Multicollinearity Check

A color-coded matrix displaying the correlation coefficients between variables. It helps check that the model isn't being skewed by redundant features, and it complements the relationships uncovered during the ANOVA feature selection phase.
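The numbers behind a correlation heat map come from a plain correlation matrix. In the sketch below, the column names, the deliberately redundant `homework_hours` feature, and the 0.9 threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
study_hours = rng.uniform(0, 30, n)
df = pd.DataFrame({
    "study_hours": study_hours,
    # Nearly a rescaled copy of study_hours, so highly correlated with it.
    "homework_hours": study_hours * 0.5 + rng.normal(0, 0.5, n),
    "attendance": rng.uniform(50, 100, n),
})

corr = df.corr()  # Pearson correlation between every pair of columns

threshold = 0.9  # assumed cut-off for "redundant"
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(redundant_pairs)  # [('study_hours', 'homework_hours')]
```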

📊

Distribution Histograms

Exploratory Data Analysis

Used during the initial preprocessing phase, these charts map the spread and frequency of continuous variables like attendance and age, highlighting outliers and motivating the standard scaling step.

[Interactive charts: Feature Importance; Student Demographics; Impact of Study Methods on Overall Score; Average Scores Across Core Subjects]

Confusion Matrix: Predicted vs Actual Grades

Correct predictions for each grade lie along the dark diagonal.

Actual \ Predicted	0	1	2	3	4	5
0	88	65	2	0	0	0
1	16	223	75	0	0	0
2	0	74	573	105	0	0
3	0	0	83	567	71	0
4	0	0	0	92	521	57
5	0	0	0	0	95	293
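Summary metrics can be derived directly from a matrix like this one. The sketch below copies the counts shown above, with rows as actual classes and columns as predicted classes, and uses macro averaging for the per-class scores, which is an assumption about how the summary figures were aggregated.

```python
import numpy as np

# Counts copied from the table above (rows: actual, columns: predicted).
cm = np.array([
    [88, 65, 2, 0, 0, 0],
    [16, 223, 75, 0, 0, 0],
    [0, 74, 573, 105, 0, 0],
    [0, 0, 83, 567, 71, 0],
    [0, 0, 0, 92, 521, 57],
    [0, 0, 0, 0, 95, 293],
])

correct = np.diag(cm)                 # correct predictions per class
accuracy = correct.sum() / cm.sum()   # overall share of correct predictions
recall = correct / cm.sum(axis=1)     # per class: correct / actual count
precision = correct / cm.sum(axis=0)  # per class: correct / predicted count
f1 = 2 * precision * recall / (precision + recall)

macro_precision = precision.mean()
macro_recall = recall.mean()
macro_f1 = f1.mean()
print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.755
```

For the matrix as displayed, this gives 2265 correct predictions out of 3000, an accuracy of 0.755. That is lower than the 94% headline figure in the performance summary, so the two may come from different evaluation runs or datasets.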

🎯 Final Model Performance Evaluation

Overall Accuracy: 94%

Precision: 92%

Recall (Sensitivity): 93%

F1-Score: 0.92

Sumit Kumar

BSc (Hons) Mathematical Science & Computer Application

Final-year researcher at Bundelkhand University, specializing in mathematical statistics, algorithmic modeling, and full-stack data science deployment. Passionate about leveraging Python and Scikit-Learn to bridge the gap between complex statistical theory and accessible web technologies.

