This project implements a credit risk modeling solution using logistic regression to predict loan default risk. It utilizes customer, loan, and credit bureau data to build a predictive model, incorporating data preprocessing, feature engineering, model training, and evaluation.
- Notebook:
  - `credit_risk_model_codebasics.ipynb`: the main Jupyter notebook containing the code for data loading, preprocessing, modeling, and evaluation.
- Dataset:
  - `dataset/customers.csv`: customer demographic information (e.g., age, gender, income).
  - `dataset/loans.csv`: loan details (e.g., loan amount, tenure, default status).
  - `dataset/bureau_data.csv`: credit bureau data (e.g., open accounts, credit utilization).
- Artifacts:
  - `artifacts/model_data.joblib`: saved model file containing the trained logistic regression model, feature names, scaler, and columns to scale.
To run this project, ensure you have the following Python libraries installed:
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- joblib
You can install them using pip:

```shell
pip install pandas numpy matplotlib seaborn scikit-learn joblib
```

The dataset consists of three CSV files:
- `customers.csv`: customer details such as:
  - `cust_id`: unique customer identifier
  - `age`, `gender`, `marital_status`, `employment_status`, `income`, etc.
- `loans.csv`: loan details such as:
  - `loan_id`, `cust_id`, `loan_purpose`, `loan_type`, `sanction_amount`, `default`, etc.
- `bureau_data.csv`: credit bureau data such as:
  - `cust_id`, `number_of_open_accounts`, `credit_utilization_ratio`, `delinquent_months`, etc.
Each dataset contains 50,000 records, and the three are merged on `cust_id` for analysis.
- Data Loading and Merging:
  - Load the three datasets using pandas.
  - Merge `customers.csv` and `loans.csv` on `cust_id`, then merge the result with `bureau_data.csv` to create a unified dataset.
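The load-and-merge step can be sketched as follows. The tiny in-memory frames below are illustrative stand-ins for the real files, which the notebook loads from the `dataset/` directory with `pd.read_csv`:

```python
import pandas as pd

# Illustrative stand-ins for dataset/customers.csv, dataset/loans.csv,
# and dataset/bureau_data.csv (each real file has 50,000 rows)
customers = pd.DataFrame({"cust_id": [1, 2], "age": [35, 42], "income": [50_000, 72_000]})
loans = pd.DataFrame({"loan_id": [10, 11], "cust_id": [1, 2],
                      "sanction_amount": [20_000, 15_000], "default": [False, True]})
bureau = pd.DataFrame({"cust_id": [1, 2], "number_of_open_accounts": [3, 5]})

# customers + loans on cust_id, then the result + bureau data on cust_id
df = customers.merge(loans, on="cust_id").merge(bureau, on="cust_id")
print(df.shape)
```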
- Feature Engineering:
  - Create new features such as `loan_to_income` and `avg_dpd_per_delinquency`.
  - Encode categorical variables (e.g., `residence_type`, `loan_purpose`, `loan_type`) using one-hot encoding.
  - Scale numerical features using a scaler (saved in `model_data.joblib`).
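A minimal sketch of these steps on a toy frame. The exact scaler the notebook uses is stored inside `model_data.joblib`; `MinMaxScaler` here is an assumption, not confirmed by the notebook:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"sanction_amount": [20_000, 15_000],
                   "income": [50_000, 75_000],
                   "loan_purpose": ["Home", "Auto"]})

# Ratio feature: loan amount relative to income
df["loan_to_income"] = df["sanction_amount"] / df["income"]

# One-hot encode a categorical column
df = pd.get_dummies(df, columns=["loan_purpose"], drop_first=True)

# Scale the numeric columns; keep the fitted scaler and column list for reuse
cols_to_scale = ["sanction_amount", "income", "loan_to_income"]
scaler = MinMaxScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
```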
- Model Training:
  - Use logistic regression to predict the `default` column (True/False).
  - Split the data into training and testing sets using `train_test_split`.
  - Train the model and evaluate feature importance based on the model coefficients.
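The training step might look like the sketch below. The synthetic feature matrix and the rule generating the toy labels are illustrative, not the project's data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged, preprocessed feature matrix
rng = np.random.default_rng(42)
X = pd.DataFrame({"loan_to_income": rng.uniform(0, 1, 200),
                  "credit_utilization_ratio": rng.uniform(0, 1, 200)})
y = (X["credit_utilization_ratio"] > 0.7).astype(int)  # toy default label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# "Feature importance" read off the fitted coefficients
importance = pd.Series(model.coef_[0], index=X.columns)
print(importance.sort_values(ascending=False))
print("test accuracy:", model.score(X_test, y_test))
```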
- Model Saving:
  - Save the trained model, feature names, scaler, and columns to scale in `artifacts/model_data.joblib`.
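Bundling all of these into one joblib file could be done as below; the dictionary key names are an assumed layout, not confirmed by the notebook:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Toy model and scaler standing in for the real trained objects
X = np.array([[0.1, 0.2], [0.8, 0.9], [0.3, 0.4], [0.9, 0.7]])
y = np.array([0, 1, 0, 1])
scaler = MinMaxScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

# Bundle everything needed to reproduce preprocessing at prediction time
model_data = {
    "model": model,
    "features": ["loan_to_income", "credit_utilization_ratio"],  # assumed key names
    "scaler": scaler,
    "cols_to_scale": ["loan_to_income", "credit_utilization_ratio"],
}
joblib.dump(model_data, "model_data.joblib")  # the notebook writes to artifacts/

loaded = joblib.load("model_data.joblib")
```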
- Clone the repository or download the project files.
- Ensure the dataset files are in the `dataset/` directory.
- Open and run the `credit_risk_model_codebasics.ipynb` notebook in a Jupyter environment.
- The notebook will:
  - Load and preprocess the data.
  - Train the logistic regression model.
  - Display feature importance using a bar plot.
  - Save the model to `artifacts/model_data.joblib`.
- The logistic regression model is trained to predict loan defaults.
- Feature importance is visualized to show which features (e.g., `credit_utilization_ratio`, `loan_to_income`) most influence the prediction.
- The model and preprocessing components are saved for future use.
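Reusing the saved components to score a new applicant could look like the sketch below. The artifact built here is a toy; the dictionary keys and feature names are assumptions about the saved layout:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Build a toy artifact in the assumed layout
# (in practice you would load the saved artifacts/model_data.joblib directly)
train = pd.DataFrame({"loan_to_income": [0.1, 0.8, 0.3, 0.9],
                      "credit_utilization_ratio": [0.2, 0.9, 0.4, 0.7]})
labels = [0, 1, 0, 1]
scaler = MinMaxScaler().fit(train)
model = LogisticRegression().fit(scaler.transform(train), labels)
joblib.dump({"model": model, "scaler": scaler, "features": list(train.columns)},
            "model_data.joblib")

# Later: load the bundle and score a new applicant with the same preprocessing
data = joblib.load("model_data.joblib")
new_row = pd.DataFrame({"loan_to_income": [0.5], "credit_utilization_ratio": [0.95]})
scaled = data["scaler"].transform(new_row[data["features"]])
prob_default = data["model"].predict_proba(scaled)[0, 1]
print(f"probability of default: {prob_default:.3f}")
```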
- Experiment with other algorithms (e.g., Random Forest, XGBoost) for better performance.
- Perform hyperparameter tuning to optimize the logistic regression model.
- Add cross-validation to ensure robust model evaluation.
- Include additional feature engineering to capture more complex patterns.
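Of these, cross-validation is the smallest change; a sketch with scikit-learn on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 1] > 0.6).astype(int)  # toy default label

# 5-fold CV (stratified by default for classifiers) instead of a single split
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```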