Project Overview
I recently worked on a Credit Risk Modeling project using a dataset provided by the Codebasics.io team. The goal was to build a model that predicts the likelihood of a loan default and to create a credit scorecard system for a fictitious NBFC (Non-Banking Financial Company) similar to a CIBIL score. In practical terms, this meant categorizing loan applicants into risk buckets – Poor, Average, Good, or Excellent – based on their predicted creditworthiness. To achieve this, I needed to develop a predictive model using historical loan data (with indicators of whether customers defaulted) and then deploy it via a user-friendly web app. The scope of work included everything from data exploration and feature engineering to model training, evaluation, and building a Streamlit app for demonstration (with model serialization using joblib for reuse).
(Dataset Note: The data – courtesy of Codebasics – contained records of borrowers with demographic info, loan details, and credit bureau metrics like credit utilization, open accounts, delinquency history, etc. There were 50,000 entries, and roughly 8% of them were labeled as defaults.)
Exploratory Data Analysis (EDA)
Before modeling, I performed an extensive EDA using Pandas, Matplotlib, and Seaborn. I merged three tables (customers, loans, and bureau data) on a customer ID to get a comprehensive dataset for analysis. Key observations:
- Class Imbalance: Only a small fraction of borrowers had defaulted (around 1 in 12), which meant our target class was highly imbalanced. This would later influence how I handled model training and evaluation (a high overall accuracy could be misleading if the model simply predicts “No Default” for most cases).
- Feature Distributions: I checked distributions of numeric features (income, loan amounts, ages, etc.) and found some outliers, which I treated appropriately (e.g., capping extremely high debt-to-income ratios). Categorical variables (gender, loan purpose, residence type) were analyzed for any obvious trends with default rates.
- Correlations: Using a heatmap, I identified some strongly correlated features. For example, principal_outstanding and bank_balance_at_application had a correlation of ~0.89, meaning they carried redundant information. To avoid multicollinearity, I later dropped one of each such pair (keeping the more predictive feature). I also noted which factors correlated with the default flag: credit_utilization_ratio (r ≈ 0.40) and loan_to_income (r ≈ 0.32) showed moderate positive correlation with defaults (i.e., higher credit card utilization and larger loans relative to income tended to coincide with higher default risk), whereas something like age had little correlation with default. These insights guided feature selection and engineering (for instance, I created a loan_to_income ratio feature explicitly).
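For reference, here is a minimal sketch of the merge and correlation steps described above. The file names, column names, and customer-ID key are illustrative assumptions, not the exact ones from the project:

```python
import pandas as pd

# Assumed file and column names for illustration
customers = pd.read_csv("customers.csv")
loans = pd.read_csv("loans.csv")
bureau = pd.read_csv("bureau_data.csv")

# Merge the three tables on the customer ID to get one analysis dataset
df = customers.merge(loans, on="cust_id").merge(bureau, on="cust_id")

# Engineered feature: loan amount relative to income
df["loan_to_income"] = df["loan_amount"] / df["income"]

# Correlation of numeric features with the default flag (assumed to be 0/1)
corr = df.select_dtypes("number").corr()
print(corr["default"].sort_values(ascending=False))

# Drop one column of each highly correlated pair to limit multicollinearity
df = df.drop(columns=["bank_balance_at_application"])
```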
Modeling Approach
For model development, I used scikit-learn (and XGBoost for one advanced model). I started with a straightforward approach and then progressively incorporated strategies to address the class imbalance and improve recall for the minority class (defaulters). Here’s a rundown of the steps and iterations:
Baseline Models (No Resampling): I first trained a basic Logistic Regression and a Random Forest on the original training data (no special handling of imbalance). These models were evaluated on a hold-out test set. As expected, they achieved high overall accuracy (~96%) because the majority class (no default) dominated. However, the recall for the default class was poor – the logistic model caught only ~72% of defaulters, and the random forest about 77%. In other words, many actual defaults were being missed, which is unacceptable in a credit risk context (we’d rather err on the side of flagging potential defaults). Precision for the default class was relatively high (~85% for logistic), meaning that when it did predict a default, it was usually correct. The table below summarizes some performance metrics at this stage:
| Model | Sampling Strategy | Accuracy | Precision (Default) | Recall (Default) | F1-Score (Default) |
|---|---|---|---|---|---|
| Logistic Regression | None (baseline) | 96% | 85% | 72% | 78% |
| Random Forest | None (baseline) | 96% | 77% | 77% | 77% |
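A condensed sketch of this baseline step, continuing from the merged dataframe above and assuming categorical columns have already been encoded (variable names are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X = df.drop(columns=["default"])
y = df["default"]
# Stratified split keeps the ~8% default rate consistent in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    # Per-class precision/recall exposes how poorly the minority (default) class is captured
    print(type(model).__name__)
    print(classification_report(y_test, model.predict(X_test)))
```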
Addressing Imbalance – Under-Sampling: To boost the sensitivity to defaulters, I tried under-sampling the majority class. Using imblearn’s RandomUnderSampler, I down-sampled the non-default cases in the training set to match the number of defaults (drastically reducing the training data size). A logistic regression on this balanced subset achieved a default recall ~96% – a huge jump – but at the expense of precision (around 51%). Essentially, the model was now catching almost all real defaulters, but also raising a lot of false alarms (flagging many good customers as risky). The accuracy also dropped to ~92% due to those false positives.
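The under-sampling step looked roughly like the sketch below (a minimal version, reusing the split from the baseline sketch):

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Down-sample the majority (non-default) class to a 1:1 ratio with defaults
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

logreg_rus = LogisticRegression(max_iter=1000)
logreg_rus.fit(X_train_rus, y_train_rus)
# Recall on the default class jumps, but precision drops (more false alarms)
print(classification_report(y_test, logreg_rus.predict(X_test)))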
Addressing Imbalance – Over-Sampling (SMOTE): Next, I explored over-sampling the minority class using SMOTE (Synthetic Minority Over-sampling Technique) combined with Tomek links to generate additional synthetic default cases and clean out ambiguous samples. This gave me a much larger balanced training set (roughly 34k default vs 34k non-default after SMOTE). Training logistic regression on this data yielded a recall ~95% and precision ~56% – a better balance than under-sampling. The model still caught most defaulters, and with slightly fewer false positives.
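A minimal sketch of the SMOTE + Tomek links step, using imblearn's combined SMOTETomek resampler on the same training split:

```python
from imblearn.combine import SMOTETomek
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Over-sample defaults synthetically with SMOTE, then remove ambiguous Tomek-link pairs
smt = SMOTETomek(random_state=42)
X_train_smt, y_train_smt = smt.fit_resample(X_train, y_train)

logreg_smt = LogisticRegression(max_iter=1000)
logreg_smt.fit(X_train_smt, y_train_smt)
print(classification_report(y_test, logreg_smt.predict(X_test)))
```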
Advanced Model – XGBoost with SMOTE: Given the complex nonlinear relationships in credit data, I trained an XGBoost classifier on the SMOTE-balanced data, using Optuna for hyperparameter tuning. After trying many parameter combinations, the best XGBoost model delivered about 87% recall and 72% precision on defaults, with overall accuracy ~96%. This was a more balanced outcome: the model misses about 13% of defaulters (versus 28% missed by the baseline logistic), while keeping precision reasonably high (72% means the majority of flagged defaults are correct). The default F1-score improved to ~0.79. I found this trade-off acceptable, since in credit risk we value catching as many bad loans as possible, up to a point of manageable false positives.
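The tuning loop was along these lines (a simplified sketch: the search space is illustrative, and for brevity it scores trials on the hold-out split, whereas a separate validation split or cross-validation would be the more careful choice):

```python
import optuna
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

def objective(trial):
    # Illustrative search space; the real study may have tuned more parameters
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBClassifier(**params, eval_metric="logloss")
    model.fit(X_train_smt, y_train_smt)  # SMOTE-balanced training data from above
    # Optimize the F1-score on the default class
    return f1_score(y_test, model.predict(X_test))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

# Refit the best configuration for downstream evaluation
best_xgb = XGBClassifier(**study.best_params, eval_metric="logloss")
best_xgb.fit(X_train_smt, y_train_smt)
```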
To compare, here’s how the improved model stacks up against the baseline:
| Model | Accuracy | Precision (Default) | Recall (Default) | F1-Score (Default) |
|---|---|---|---|---|
| Baseline Logistic (no resampling) | 96% | 85% | 72% | 78% |
| Tuned XGBoost (SMOTE data) | 96% | 72% | 87% | 79% |
We can see the recall jump from 72% to 87%, greatly improving our capture of risky cases, though the precision dipped as a result. Deciding on this balance was a key learning – it depends on the business context how many false positives are acceptable. In an NBFC scenario, catching more defaults early may outweigh the inconvenience of a higher false-alarm rate.
Model Performance and Key Insights
Confusion matrix of the final XGBoost model on the test set. The model correctly identifies 934 of 1,074 defaulters and 11,061 of 11,423 non-defaulters.
The confusion matrix above illustrates the final model’s performance. Out of 1,074 actual default cases in the test set, the model caught 934 (true positives) and missed 140 (false negatives). It incorrectly flagged 362 out of 11,423 good loans as defaults (false positives). This aligns with the ~87% recall and ~72% precision discussed. Visually, the dominant blue diagonal in the matrix (especially for the “No Default” class) reflects the high overall accuracy, but the focus for us was really on that bottom row (Default actuals).
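The matrix itself can be produced in a couple of lines; a minimal sketch, assuming best_xgb is the tuned model from the Optuna step:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix of the tuned XGBoost model on the hold-out test set
ConfusionMatrixDisplay.from_estimator(
    best_xgb, X_test, y_test, display_labels=["No Default", "Default"], cmap="Blues"
)
plt.show()
```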
To interpret what the model learned, I also examined the feature importances. In the case of logistic regression, we can look at the coefficients of the features to understand their impact on default probability (see the Feature Importance in Logistic Regression chart below):
Feature importance (coefficient values) from the logistic regression model. Positive coefficients increase default risk, negative coefficients decrease it.
From the above chart, we see intuitive patterns. Features like loan_type_Unsecured, credit_utilization_ratio, loan_to_income, and delinquency-related metrics had the highest positive coefficients – meaning, for example, an unsecured loan (vs a secured one), maxed-out credit cards, or a high loan-to-income ratio all push the model toward predicting default (higher risk). On the flip side, factors such as owning a residence (negative coefficient for residence_type_Owned) or taking a home loan (which often implies collateral, as seen by a negative weight for loan_purpose_Home) reduced the predicted risk. Interestingly, age had a slight negative coefficient – older applicants were a bit less likely to default according to this data. These insights make sense and provide confidence in the model: they align with common credit risk intuition.
(For XGBoost, feature importance can be examined by split gain, but it tended to highlight similar factors – e.g., credit utilization and loan-to-income were among the top predictors.)
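Both views can be pulled directly from the fitted models; a rough sketch, assuming logreg_smt and best_xgb from the earlier steps and a DataFrame-based X_train:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Logistic regression coefficients: positive values push the prediction toward "Default"
coef = pd.Series(logreg_smt.coef_[0], index=X_train.columns).sort_values()
coef.plot(kind="barh", figsize=(8, 10), title="Feature Importance in Logistic Regression")
plt.tight_layout()
plt.show()

# XGBoost gain-based importance tends to highlight similar drivers
xgb_gain = pd.Series(best_xgb.get_booster().get_score(importance_type="gain"))
print(xgb_gain.sort_values(ascending=False).head(10))
```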
Deployment with Streamlit
After achieving a satisfactory model, I deployed it as a web application using Streamlit. The Streamlit UI allows a loan officer (or any user) to input borrower details and immediately get a default probability, a credit score, and a rating category. For deployment, I decided to use the logistic regression model because of its interpretability and ease of scaling into a score. I saved the trained logistic model and preprocessing objects using joblib (bundling the model, feature list, and scaler into one file).
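Bundling the artifacts is a one-liner with joblib; a minimal sketch, where the output path and the scaler object are assumptions for illustration:

```python
import joblib

# Bundle the trained model with its preprocessing artifacts so the app
# can reload everything from a single file
model_bundle = {
    "model": logreg_smt,
    "features": list(X_train.columns),
    "scaler": scaler,  # assumed: the scaler fit during training
}
joblib.dump(model_bundle, "artifacts/model_data.joblib")

# Later, inside the Streamlit app:
bundle = joblib.load("artifacts/model_data.joblib")
```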
In the app (powered by a simple main.py script and a prediction_helper.py module), when the user enters fields like age, income, loan amount, tenure, number of open accounts, etc., the app first computes some derived features (like the loan-to-income ratio) and applies the same scaling used during training. Then, the logistic model’s prediction is obtained. I convert the prediction probability into an actual credit score on a 300–900 scale (similar to real credit scores) by linearly scaling the non-default probability. For instance, a very low predicted default probability might translate to a score close to 900, whereas a high-risk applicant might end up near 300. Finally, based on the score, I assign a rating: “Poor” for scores below 500, “Average” for 500–649, “Good” for 650–749, and “Excellent” for 750+. This way, the output isn’t just a raw probability – it’s in a form that stakeholders in finance are familiar with (score and grade).
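The score-and-rating conversion is simple arithmetic; a minimal sketch of the kind of helper prediction_helper.py could contain (function and parameter names are hypothetical):

```python
def probability_to_score_and_rating(default_probability, base=300, top=900):
    """Linearly map the non-default probability onto a 300-900 credit score."""
    non_default_probability = 1 - default_probability
    score = base + non_default_probability * (top - base)

    # Rating buckets as described above
    if score < 500:
        rating = "Poor"
    elif score < 650:
        rating = "Average"
    elif score < 750:
        rating = "Good"
    else:
        rating = "Excellent"
    return int(score), rating

# Example: a 6% predicted default probability maps to a score of 864 ("Excellent")
print(probability_to_score_and_rating(0.06))
```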
