{"id":608751,"date":"2026-04-04T07:48:00","date_gmt":"2026-04-04T12:48:00","guid":{"rendered":"https:\/\/towardsdatascience.com\/?p=608751"},"modified":"2026-04-04T07:48:00","modified_gmt":"2026-04-04T12:48:00","slug":"building-robust-credit-scoring-models-with-python","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/building-robust-credit-scoring-models-with-python\/","title":{"rendered":"Building Robust Credit Scoring Models with Python"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Thank you for your feedback and interest in my previous <a href=\"https:\/\/towardsdatascience.com\/exploratory-data-analysis-for-credit-scoring-with-python\/\">article<\/a>. Since several readers asked how to replicate the analysis, I decided to share the full code on <a href=\"https:\/\/github.com\/Jumbong\/Credit_scoring_project\">GitHub<\/a> for both this article and the previous one. This will allow you to easily reproduce the results, better understand the methodology, and explore the project in more detail.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this post, we show that analyzing the relationships between variables in credit scoring serves two main purposes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Evaluating the ability of explanatory variables to discriminate default<\/strong> (see Section 1.1)<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Reducing dimensionality by studying the relationships between explanatory variables<\/strong> (see Section 1.2)<\/li>\n\n\n\n<li class=\"wp-block-list-item\">In Section 1.3, we apply these methods to the dataset introduced in our previous post. 
<\/li>\n\n\n\n<li class=\"wp-block-list-item\">In conclusion, we summarize the key takeaways and highlight the main points that can be useful for interviews, whether for an internship or a full-time position.<\/li>\n<\/ul>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">As we grow and improve our modeling skills, we often look back and smile at our early attempts, the first models we built, and the mistakes we made along the way.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I remember building a scoring model using Kaggle resources without truly understanding how to analyze relationships between variables. Whether it involved two continuous variables, a continuous and a categorical variable, or two categorical variables, I lacked both the graphical intuition and the statistical tools needed to study them properly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It wasn\u2019t until my third year, during a credit scoring project, that I fully grasped their importance. That experience is why I strongly encourage anyone building their first scoring model to take the analysis of relationships between variables seriously.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Why Studying Relationships Between Variables Matters<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The first objective is to identify the variables that best explain the phenomenon under study, for example, predicting default.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, correlation is not causation. Any insight must be supported by:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">academic research<\/li>\n\n\n\n<li class=\"wp-block-list-item\">domain expertise<\/li>\n\n\n\n<li class=\"wp-block-list-item\">data visualization<\/li>\n\n\n\n<li class=\"wp-block-list-item\">and expert judgment<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The second objective is dimensionality reduction. 
By defining appropriate thresholds, we can preselect variables that show meaningful associations with the target or with other predictors. This helps reduce redundancy and improve model performance.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It also provides early guidance on which variables are likely to be retained in the final model and helps detect potential modeling issues. For instance, if a variable with no meaningful relationship to the target ends up in the final model, this may indicate a weakness in the modeling process. In such cases, it is important to revisit earlier steps and identify possible shortcomings.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this article, we focus on three types of relationships:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Two continuous variables<\/li>\n\n\n\n<li class=\"wp-block-list-item\">One continuous and one qualitative variable<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Two qualitative variables<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">All analyses are conducted on the training dataset. In a <a href=\"https:\/\/towardsdatascience.com\/building-robust-credit-scoring-models-part-3\/\" data-type=\"link\" data-id=\"https:\/\/towardsdatascience.com\/building-robust-credit-scoring-models-part-3\/\">previous article<\/a>, we addressed outliers and missing values, an essential prerequisite before any statistical analysis. Therefore, we will work with a cleaned dataset to analyze relationships between variables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Outliers and missing values can significantly distort both statistical measures and visual interpretations of relationships. 
This is why it is critical to ensure that preprocessing steps, such as handling missing values and outliers, are performed carefully and appropriately.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The goal of this article is not to provide an exhaustive list of statistical tests for measuring associations between variables. Instead, it aims to give you the essential foundations needed to understand the importance of this step in building a reliable scoring model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The methods presented here are among the most commonly used in practice. However, depending on the context, analysts may rely on additional or more advanced techniques.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By the end of this article, you should be able to confidently answer the following three questions, which are often asked in internships or job interviews:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">How do you measure the relationship between two continuous variables?<\/li>\n\n\n\n<li class=\"wp-block-list-item\">How do you measure the relationship between two qualitative variables?<\/li>\n\n\n\n<li class=\"wp-block-list-item\">How do you measure the relationship between a qualitative variable and a continuous variable?<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Graphical Analysis<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">I initially wanted to skip this step and go straight to statistical testing. However, since this article is intended for beginners in modeling, this is arguably the most important part.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Whenever you have the opportunity to visualize your data, you should take it. 
Visualization can reveal a great deal about the underlying structure of the data, often more than a single statistical metric.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This step is particularly critical during the exploratory phase, as well as during decision-making and discussions with domain experts. The insights derived from visualizations should always be validated by:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">subject matter experts<\/li>\n\n\n\n<li class=\"wp-block-list-item\">the context of the study<\/li>\n\n\n\n<li class=\"wp-block-list-item\">and relevant academic or scientific literature<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">By combining these perspectives, we can eliminate variables that are not relevant to the problem or that may lead to misleading conclusions. At the same time, we can identify the most informative variables that truly help explain the phenomenon under study.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When this step is carefully executed and supported by academic research and expert validation, we can have greater confidence in the statistical tests that follow, which ultimately summarize the information into indicators such as p-values or correlation coefficients.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">1. Application to Credit Scoring<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">In credit scoring, the objective is to select, from a set of candidate variables, those that best explain the target, typically default.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is why we study relationships between variables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We will see later that some models are sensitive to <strong>multicollinearity<\/strong>, which occurs when multiple variables carry similar information. Reducing redundancy is therefore essential.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In our case, the target variable is binary (default vs. 
non-default), and we aim to discriminate it using explanatory variables that may be either continuous or categorical. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Graphically, we can assess the discriminative power of these variables, that is, their ability to predict the default outcome. In the following section, we present graphical methods and test statistics that can be automated to analyze the relationship between continuous or categorical explanatory variables and the target variable, using programming languages such as Python.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1.1 Evaluation of Predictive Power<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In this section, we present the graphical and statistical tools used to assess the ability of both continuous and categorical explanatory variables to capture the relationship with the target variable, namely default (def).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.1.1 Continuous Variable vs. Binary Target<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If the variable we are evaluating is continuous, the goal is to compare its distribution across the two target classes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">non-default (<math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><semantics><mrow><mi>d<\/mi><mi>e<\/mi><mi>f<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">def = 0<\/annotation><\/semantics><\/math>)<\/li>\n\n\n\n<li class=\"wp-block-list-item\">default (<math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><semantics><mrow><mi>d<\/mi><mi>e<\/mi><mi>f<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">def = 1<\/annotation><\/semantics><\/math>)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">We can use:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>boxplots<\/strong> to compare medians and dispersion<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>density plots 
(KDE)<\/strong> to compare distributions<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>cumulative distribution functions (ECDF)<\/strong><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The key idea is simple:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Does the distribution of the variable differ between defaulters and non-defaulters?<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">If the answer is yes, the variable may have discriminative power. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Assume we want to assess how well <strong><code>person_income<\/code><\/strong> discriminates between defaulting and non-defaulting borrowers. Graphically, we can compare summary statistics such as the mean or median, as well as the distribution through density plots or cumulative distribution functions (CDFs) for defaulted and non-defaulted counterparties. The resulting visualization is shown below.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def plot_continuous_vs_categorical(\n    df,\n    continuous_var,\n    categorical_var,\n    category_labels=None,\n    figsize=(12, 10),\n    sample=None\n):\n    &quot;&quot;&quot;\n     Compare a continuous variable across categories\n    using boxplot, KDE, and ECDF (2x2 layout).\n    &quot;&quot;&quot;\n\n    sns.set_style(&quot;white&quot;)\n\n    data = df[[continuous_var, categorical_var]].dropna().copy()\n\n    # Optional sampling\n    if sample:\n        data = data.sample(sample, random_state=42)\n\n    categories = sorted(data[categorical_var].unique())\n\n    # Labels mapping (optional)\n    if category_labels:\n        labels = [category_labels.get(cat, str(cat)) for cat in categories]\n    else:\n        labels = [str(cat) for cat in categories]\n\n    fig, axes = plt.subplots(2, 2, figsize=figsize)\n\n    # --- 1. 
Boxplot ---\n    sns.boxplot(\n        data=data,\n        x=categorical_var,\n        y=continuous_var,\n        ax=axes[0, 0]\n    )\n    axes[0, 0].set_title(&quot;Boxplot (median &amp; spread)&quot;, loc=&quot;left&quot;)\n\n    # --- 2. Boxplot with mean marker and median labels ---\n    sns.boxplot(\n        data=data,\n        x=categorical_var,\n        y=continuous_var,\n        ax=axes[0, 1],\n        showmeans=True,\n        meanprops={\n            &quot;marker&quot;: &quot;o&quot;,\n            &quot;markerfacecolor&quot;: &quot;white&quot;,\n            &quot;markeredgecolor&quot;: &quot;black&quot;,\n            &quot;markersize&quot;: 6\n        }\n    )\n\n    axes[0, 1].set_title(&quot;Median comparison (Boxplot)&quot;, loc=&quot;left&quot;)\n    medians = data.groupby(categorical_var)[continuous_var].median()\n\n    for i, cat in enumerate(categories):\n        axes[0, 1].text(\n            i,\n            medians[cat],\n            f&quot;{medians[cat]:.2f}&quot;,\n            ha=&#039;center&#039;,\n            va=&#039;bottom&#039;,\n            fontsize=10,\n            fontweight=&#039;bold&#039;\n        )\n    # --- 3. KDE only ---\n    for cat, label in zip(categories, labels):\n        subset = data[data[categorical_var] == cat][continuous_var]\n        sns.kdeplot(\n            subset,\n            ax=axes[1, 0],\n            label=label\n        )\n    axes[1, 0].set_title(&quot;Density comparison (KDE)&quot;, loc=&quot;left&quot;)\n    axes[1, 0].legend()\n\n    # --- 4. 
ECDF ---\n    for cat, label in zip(categories, labels):\n        subset = np.sort(data[data[categorical_var] == cat][continuous_var])\n        y = np.arange(1, len(subset) + 1) \/ len(subset)\n        axes[1, 1].plot(subset, y, label=label)\n    axes[1, 1].set_title(&quot;Cumulative distribution (ECDF)&quot;, loc=&quot;left&quot;)\n    axes[1, 1].legend()\n\n    # Clean style (Storytelling with Data)\n    for ax in axes.flat:\n        sns.despine(ax=ax)\n        ax.grid(axis=&quot;y&quot;, alpha=0.2)\n\n    plt.tight_layout()\n    plt.show()\n\nplot_continuous_vs_categorical(\n    df=train_imputed,\n    continuous_var=&quot;person_income&quot;,\n    categorical_var=&quot;def&quot;,\n    category_labels={0: &quot;No Default&quot;, 1: &quot;Default&quot;},\n    figsize=(14, 12),\n    sample=5000\n)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/03\/image-353-1024x877.png\" alt=\"\" class=\"wp-image-652693\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Defaulted borrowers tend to have lower incomes than non-defaulted borrowers. The distributions show a clear shift, with defaults concentrated at lower income levels. Overall, income has good discriminatory power for predicting default.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.1.2 Statistical Test: Kruskal\u2013Wallis for a Continuous Variable vs. a Binary Target<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To formally assess this relationship, we use the <strong>Kruskal\u2013Wallis test<\/strong>, a non-parametric method.<br>It evaluates whether multiple independent samples come from the same distribution.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">More precisely, it tests whether <strong>k samples (k \u2265 2)<\/strong> originate from the same population, or from populations with identical characteristics in terms of a <em>position parameter<\/em>. 
This parameter is conceptually close to the median, but the Kruskal\u2013Wallis test incorporates more information than the median alone.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The principle of the test is as follows. Let (<math data-latex=\"M_i\"><semantics><msub><mi>M<\/mi><mi>i<\/mi><\/msub><annotation encoding=\"application\/x-tex\">M_i<\/annotation><\/semantics><\/math>) denote the position parameter of sample i. The hypotheses are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><math data-latex=\"( H_0 ): ( M_1 = \\dots = M_k )\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>H<\/mi><mn>0<\/mn><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo lspace=\"0.2222em\" rspace=\"0.2222em\">:<\/mo><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>M<\/mi><mn>1<\/mn><\/msub><mo>=<\/mo><mo>\u22ef<\/mo><mo>=<\/mo><msub><mi>M<\/mi><mi>k<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">( H_0 ): ( M_1 = \\dots = M_k )<\/annotation><\/semantics><\/math><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><math data-latex=\"( H_1 )\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>H<\/mi><mn>1<\/mn><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">( H_1 )<\/annotation><\/semantics><\/math>: There exists at least one pair (i, j) such that <math data-latex=\"( M_i \\neq M_j )\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>M<\/mi><mi>i<\/mi><\/msub><mo>\u2260<\/mo><msub><mi>M<\/mi><mi>j<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">( M_i \\neq M_j )<\/annotation><\/semantics><\/math><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">When ( k = 2 ), the Kruskal\u2013Wallis test reduces to the <strong>Mann\u2013Whitney U test<\/strong>.<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">The test statistic approximately follows a Chi-square distribution with K-1 degrees of freedom (for sufficiently large samples).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">If the p-value &lt; 5%, we reject <math data-latex=\"H_0\"><semantics><msub><mi>H<\/mi><mn>0<\/mn><\/msub><annotation encoding=\"application\/x-tex\">H_0<\/annotation><\/semantics><\/math><\/li>\n\n\n\n<li class=\"wp-block-list-item\">This suggests that at least one group differs significantly<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Therefore, for a given quantitative explanatory variable, if the p-value is less than 5%, the null hypothesis is rejected, and we would conclude that the considered explanatory variable may be predictive in the model.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.1.3 Qualitative Variable vs. Binary Target<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If the explanatory variable is qualitative, the appropriate tool is the contingency table, which summarizes the joint distribution of the two variables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It shows how the categories of the explanatory variable are distributed across the two classes of the target. 
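<\/p>

<p class=\"wp-block-paragraph\">Before turning to the contingency-table example, note that the Kruskal\u2013Wallis screening rule from Section 1.1.2 is easy to automate. The sketch below is illustrative: the helper <code>kruskal_wallis_screen<\/code> and the toy data are assumptions for demonstration, not part of the project code.<\/p>

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def kruskal_wallis_screen(df, continuous_var, target="def", alpha=0.05):
    """Kruskal-Wallis test of a continuous variable against the target.

    Returns the H statistic, the p-value, and whether the variable
    passes the 5% screening rule described above.
    """
    groups = [
        df.loc[df[target] == level, continuous_var].dropna()
        for level in sorted(df[target].unique())
    ]
    stat, p_value = kruskal(*groups)
    return stat, p_value, p_value < alpha

# Toy data: incomes drawn from two shifted distributions
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "person_income": np.concatenate([
        rng.normal(60_000, 15_000, 500),  # non-defaulters
        rng.normal(45_000, 15_000, 500),  # defaulters
    ]),
    "def": [0] * 500 + [1] * 500,
})
stat, p, keep = kruskal_wallis_screen(demo, "person_income")
print(f"H = {stat:.1f}, p-value = {p:.3g}, keep variable: {keep}")
```

<p class=\"wp-block-paragraph\">With two classes, <code>scipy.stats.mannwhitneyu<\/code> gives an equivalent decision, consistent with the reduction to the Mann\u2013Whitney U test mentioned above.<\/p>

<p class=\"wp-block-paragraph\">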
For example, for the relationship between person_home_ownership and the default variable, the contingency table is given by:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\ndef contingency_analysis(\n    df,\n    var1,\n    var2,\n    normalize=None,   # None, &quot;index&quot;, &quot;columns&quot;, &quot;all&quot;\n    plot=True,\n    figsize=(8, 6)\n):\n    &quot;&quot;&quot;\n    Compute a contingency table between two categorical\n    variables and visualize it as a heatmap.\n    &quot;&quot;&quot;\n\n    # --- Contingency table ---\n    table = pd.crosstab(df[var1], df[var2], margins=False)\n\n    # --- Normalized version in %, optional ---\n    if normalize:\n        table_norm = (pd.crosstab(df[var1], df[var2], normalize=normalize, margins=False) * 100).round(2)\n    else:\n        table_norm = None\n\n    # --- Plot (heatmap) ---\n    if plot:\n        sns.set_style(&quot;white&quot;)\n        plt.figure(figsize=figsize)\n\n        data_to_plot = table_norm if table_norm is not None else table\n\n        sns.heatmap(\n            data_to_plot,\n            annot=True,\n            fmt=&quot;.2f&quot; if normalize else &quot;d&quot;,\n            cbar=True\n        )\n\n        plt.title(f&quot;{var1} vs {var2} (Contingency Table)&quot;, loc=&quot;left&quot;, weight=&quot;bold&quot;)\n        plt.xlabel(var2)\n        plt.ylabel(var1)\n\n        sns.despine()\n        plt.tight_layout()\n        plt.show()\n\n    return table, table_norm\n<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/03\/image-357.png\" alt=\"\" class=\"wp-image-652697\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">From this table, we can:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Compare conditional distributions across categories. 
<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/03\/image-354.png\" alt=\"\" class=\"wp-image-652694\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Borrowers who <strong>rent or fall into \u201cother\u201d categories default more often<\/strong>, while <strong>homeowners have the lowest default rate<\/strong>.<br>Mortgage holders are in between, suggesting moderate risk.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">For visualization, <strong>grouped bar charts<\/strong> are often used. They provide an intuitive way to compare conditional proportions across categories.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport matplotlib.patches as mpatches\n\ndef plot_grouped_bar(df, cat_var, subcat_var,\n                          normalize=&quot;index&quot;, title=&quot;&quot;):\n    ct = pd.crosstab(df[subcat_var], df[cat_var], normalize=normalize) * 100\n    modalities = ct.index.tolist()\n    categories = ct.columns.tolist()\n    n_mod = len(modalities)\n    n_cat = len(categories)\n    x = np.arange(n_mod)\n    width = 0.35\n\n    colors = [&#039;#0F6E56&#039;, &#039;#993C1D&#039;]  # teal = non-default, coral = default\n\n    fig, ax = plt.subplots(figsize=(7.24, 4.07), dpi=100)\n\n    for i, (cat, color) in enumerate(zip(categories, colors)):\n        offset = (i - n_cat \/ 2 + 0.5) * width\n        ax.bar(x + offset, ct[cat], width=width, color=color, label=str(cat))\n\n        # Annotate each bar with its rate\n        for j, val in enumerate(ct[cat]):\n            ax.text(x[j] + offset, val + 0.5, f&quot;{val:.1f}%&quot;,\n                    ha=&#039;center&#039;, va=&#039;bottom&#039;, fontsize=9, color=&#039;#444&#039;)\n\n    # Minimalist style (Storytelling with Data)\n    ax.spines[&#039;top&#039;].set_visible(False)\n    ax.spines[&#039;right&#039;].set_visible(False)\n    
ax.spines[&#039;left&#039;].set_visible(False)\n    ax.yaxis.grid(True, color=&#039;#e0e0e0&#039;, linewidth=0.8, zorder=0)\n    ax.set_axisbelow(True)\n    ax.set_xticks(x)\n    ax.set_xticklabels(modalities, fontsize=11)\n    ax.set_ylabel(&quot;Rate (%)&quot; if normalize else &quot;Count&quot;, fontsize=11, color=&#039;#555&#039;)\n    ax.tick_params(left=False, colors=&#039;#555&#039;)\n\n\n    handles = [mpatches.Patch(color=c, label=str(l))\n               for c, l in zip(colors, categories)]\n    ax.legend(handles=handles, title=cat_var, frameon=False,\n              fontsize=10, loc=&#039;upper right&#039;)\n\n    ax.set_title(title, fontsize=13, fontweight=&#039;normal&#039;, pad=14)\n    plt.tight_layout()\n    plt.savefig(&quot;default_by_ownership.png&quot;, dpi=150, bbox_inches=&#039;tight&#039;)\n    plt.show()\n\nplot_grouped_bar(\n    df=train_imputed,\n    cat_var=&quot;def&quot;,\n    subcat_var=&quot;person_home_ownership&quot;,\n    normalize=&quot;index&quot;,\n    title=&quot;Default Rate by Home Ownership&quot;\n)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/03\/image-355.png\" alt=\"\" class=\"wp-image-652695\"\/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">1.1.4 Statistical Test: Analysis of the Link Between Default and Qualitative Explanatory Variables<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The statistical test used is the <strong>chi-square test<\/strong>, which is a test of independence.<br>It aims to compare two variables in a contingency table to determine whether they are related. More generally, it assesses whether the distributions of categorical variables differ from each other.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A small chi-square statistic indicates that the observed data are close to the expected data under independence. 
In other words, there is no evidence of a relationship between the variables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Conversely, a large chi-square statistic indicates a greater discrepancy between observed and expected frequencies, suggesting a potential relationship between the variables. If the <strong>p-value<\/strong> of the chi-square test is below 5%, we reject the null hypothesis of independence and conclude that the variables are dependent.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, this test does not measure the strength of the relationship and is sensitive to both the <strong>sample size<\/strong> and the <strong>structure of the categories<\/strong>. This is why we turn to <strong>Cram\u00e9r\u2019s V<\/strong>, which provides a more informative measure of association.<br><strong>Cram\u00e9r&#8217;s V<\/strong> is derived from the chi-square independence test and quantifies the strength of the relationship between two qualitative variables <math data-latex=\"X_1 \"><semantics><msub><mi>X<\/mi><mn>1<\/mn><\/msub><annotation encoding=\"application\/x-tex\">X_1 <\/annotation><\/semantics><\/math> and <math data-latex=\"X_2\"><semantics><msub><mi>X<\/mi><mn>2<\/mn><\/msub><annotation encoding=\"application\/x-tex\">X_2<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The coefficient can be expressed as follows:<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><math data-latex=\"V = \\sqrt{\\frac{\\varphi^2}{\\min(k - 1, r - 1)}}  = \\sqrt{\\frac{\\chi^2 \/ n}{\\min(k - 1, r - 1)}} \"><semantics><mrow><mi>V<\/mi><mo>=<\/mo><msqrt><mfrac><msup><mi>\u03c6<\/mi><mn>2<\/mn><\/msup><mrow><mrow><mi>min<\/mi><mo>\u2061<\/mo><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>k<\/mi><mo>\u2212<\/mo><mn>1<\/mn><mo separator=\"true\">,<\/mo><mi>r<\/mi><mo>\u2212<\/mo><mn>1<\/mn><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/mfrac><\/msqrt><mo>=<\/mo><msqrt><mfrac><mrow><msup><mi>\u03c7<\/mi><mn>2<\/mn><\/msup><mi>\/<\/mi><mi>n<\/mi><\/mrow><mrow><mrow><mi>min<\/mi><mo>\u2061<\/mo><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>k<\/mi><mo>\u2212<\/mo><mn>1<\/mn><mo separator=\"true\">,<\/mo><mi>r<\/mi><mo>\u2212<\/mo><mn>1<\/mn><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/mfrac><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">V = \\sqrt{\\frac{\\varphi^2}{\\min(k - 1, r - 1)}}  = \\sqrt{\\frac{\\chi^2 \/ n}{\\min(k - 1, r - 1)}} <\/annotation><\/semantics><\/math><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"> <math data-latex=\"{\\displaystyle \\varphi }\"><semantics><mstyle scriptlevel=\"0\" displaystyle=\"true\"><mi>\u03c6<\/mi><\/mstyle><annotation encoding=\"application\/x-tex\">{\\displaystyle \\varphi }<\/annotation><\/semantics><\/math> is the phi coefficient.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><math data-latex=\"{\\displaystyle \\chi ^{2}}\"><semantics><mstyle scriptlevel=\"0\" displaystyle=\"true\"><msup><mi>\u03c7<\/mi><mn>2<\/mn><\/msup><\/mstyle><annotation encoding=\"application\/x-tex\">{\\displaystyle \\chi ^{2}}<\/annotation><\/semantics><\/math> is the statistic of Pearson&#8217;s chi-squared test on the contingency table.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">n is the total number of observations,<\/li>\n\n\n\n<li class=\"wp-block-list-item\">k is the number of columns of the contingency table, and<\/li>\n\n\n\n<li class=\"wp-block-list-item\">r is the number of rows of the contingency table.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Cram\u00e9r\u2019s V takes values between 0 and 1. 
Depending on its value, the strength of the association can be interpreted as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>&gt; 0.5<\/strong> \u2192 High association<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>0.3 \u2013 0.5<\/strong> \u2192 Moderate association<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>0.1 \u2013 0.3<\/strong> \u2192 Low association<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>0 \u2013 0.1<\/strong> \u2192 Little to no association<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">For example, we can consider that a variable is significantly associated with the target when Cram\u00e9r\u2019s V exceeds a given threshold (0.5 or 50%), depending on the level of selectivity required for the analysis.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Graphical tools are commonly used to assess the discriminatory power of variables. They can also help evaluate the relationships between explanatory variables. This analysis aims to reduce the number of variables by identifying those that provide redundant information.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It is typically conducted on variables of the same type\u2014continuous variables with continuous variables, or categorical variables with categorical variables\u2014since specific measures are designed for each case. For example, we can use <strong>Spearman correlation<\/strong> for continuous variables, or <strong>Cram\u00e9r\u2019s V<\/strong> and <strong>Tschuprow\u2019s T<\/strong> for categorical variables to quantify the strength of association.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the following section, we assume that the available variables are relevant for discriminating default. It therefore becomes appropriate to use statistical tests to further investigate the relationships between variables. 
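<\/p>

<p class=\"wp-block-paragraph\">As an illustration of the chi-square test and the Cram\u00e9r\u2019s V formula above, both can be computed with <code>scipy.stats.chi2_contingency<\/code>. The helper and the counts below are hypothetical and serve only as a sketch.<\/p>

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V of a contingency table, following the formula above:
    V = sqrt((chi2 / n) / min(k - 1, r - 1))."""
    chi2, p_value, dof, expected = chi2_contingency(table)
    n = np.asarray(table).sum()
    r, k = np.asarray(table).shape  # rows, columns
    v = np.sqrt((chi2 / n) / min(k - 1, r - 1))
    return v, p_value

# Hypothetical counts: home-ownership status vs. default flag
table = pd.DataFrame(
    {"no_default": [850, 620, 430], "default": [50, 180, 270]},
    index=["OWN", "MORTGAGE", "RENT"],
)
v, p = cramers_v(table)
print(f"Cramér's V = {v:.2f}, chi-square p-value = {p:.3g}")
```

<p class=\"wp-block-paragraph\">With the interpretation grid above, a value around 0.3 in this fictitious table would indicate a moderate association between home ownership and default.<\/p>

<p class=\"wp-block-paragraph\">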
We will describe a structured methodology for selecting the appropriate tests and provide clear justification for these choices.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The goal is not to cover every possible test, but rather to present a coherent and robust approach that can guide you in building a reliable scoring model.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1.2 Multicollinearity Between Variables<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In credit scoring, when we talk about multicollinearity, the first thing that usually comes to mind is the <strong>Variance Inflation Factor (VIF)<\/strong>. However, there is a much simpler approach that can be used when dealing with a large number of explanatory variables. This approach allows for an initial screening of relevant variables and helps reduce dimensionality by analyzing the relationships between variables of the same type.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the following sections, we show how studying the relationships between <strong>continuous variables<\/strong> and between <strong>categorical variables<\/strong> can help identify redundant information and support the preselection of explanatory variables.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.2.1 Statistical Test: Relationship Between Continuous Explanatory Variables<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In scoring models, analyzing the relationship between two continuous variables is generally used to <strong>pre-select variables<\/strong> and <strong>reduce dimensionality<\/strong>. This analysis becomes particularly relevant when the number of explanatory variables is very large (e.g., more than 100).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this section, we focus on the case of two continuous explanatory variables. 
In the next section, we will examine the case of two categorical variables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To study this association, the Pearson correlation can be used. However, in most cases, the Spearman correlation is preferred, as it is a non-parametric measure. In contrast, Pearson correlation only captures linear relationships between variables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Spearman correlation is often preferred in practice because it is robust to outliers and does not rely on distributional assumptions. It measures how well the relationship between two variables can be described by a monotonic function, whether linear or not.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Mathematically, it is computed by applying the Pearson correlation formula to the ranked variables:<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\"><math data-latex=\"\\rho_{X,Y} = \\frac{\\mathrm{Cov}(\\mathrm{Rank}_X, \\mathrm{Rank}_Y)}{\\sigma_{\\mathrm{Rank}_X} \\, \\sigma_{\\mathrm{Rank}_Y}}\"><semantics><mrow><msub><mi>\u03c1<\/mi><mrow><mi>X<\/mi><mo separator=\"true\">,<\/mo><mi>Y<\/mi><\/mrow><\/msub><mo>=<\/mo><mfrac><mrow><mrow><mtext><\/mtext><mi>Cov<\/mi><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mrow><mtext><\/mtext><mi>Rank<\/mi><\/mrow><mi>X<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mrow><mtext><\/mtext><mi>Rank<\/mi><\/mrow><mi>Y<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><mrow><msub><mi>\u03c3<\/mi><msub><mrow><mtext><\/mtext><mi>Rank<\/mi><\/mrow><mi scriptlevel=\"2\">X<\/mi><\/msub><\/msub><mspace width=\"0.1667em\"><\/mspace><msub><mi>\u03c3<\/mi><msub><mrow><mtext><\/mtext><mi>Rank<\/mi><\/mrow><mi scriptlevel=\"2\">Y<\/mi><\/msub><\/msub><\/mrow><\/mfrac><\/mrow><annotation encoding=\"application\/x-tex\">\\rho_{X,Y} = \\frac{\\mathrm{Cov}(\\mathrm{Rank}_X, \\mathrm{Rank}_Y)}{\\sigma_{\\mathrm{Rank}_X} \\, 
\\sigma_{\\mathrm{Rank}_Y}}<\/annotation><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Therefore, in this context, <strong>Spearman correlation<\/strong> is selected to assess the relationship between two continuous variables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If two or more independent continuous variables exhibit a high pairwise Spearman correlation (e.g., \u2265 0.6 or 60%), this suggests that they carry similar information. In such cases, it is appropriate to retain only one of them\u2014either the variable that is most strongly correlated with the target (default) or the one considered most relevant based on domain expertise.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.2.2 Test statistic for the study: Relationship Between Qualitative Explanatory Variables.<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">As in the analysis of the relationship between an explanatory variable and the target (default), Cram\u00e9r\u2019s V is used here to assess whether two or more qualitative variables provide the same information.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For example, if Cram\u00e9r\u2019s V exceeds 0.5 (50%), the variables are considered to be highly associated and may capture similar information. 
Therefore, they should not be included simultaneously in the model, as this would introduce redundancy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The choice of which variable to retain can be based on statistical criteria\u2014such as keeping the variable that is most strongly associated with the target (default)\u2014or on domain expertise, by selecting the variable considered the most relevant from a business perspective.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As you may have noticed, we do not study the relationship between a continuous variable and a categorical variable as part of the dimensionality reduction process, since, unlike Spearman correlation or Cram\u00e9r\u2019s V, there is no direct indicator to measure the strength of such an association.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For those interested, one possible approach is to use the Variance Inflation Factor (VIF). It is not discussed here because the methodology for computing the VIF may differ depending on whether you use Python or R; these aspects will be addressed in the next post.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the following section, we will apply everything discussed so far to real-world data, specifically the dataset introduced in our previous article.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1.3 Application to Real Data<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This section analyzes the correlations between variables and contributes to the pre-selection of variables. 
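<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The snippets in this section assume the following setup. The objects <em>train_imputed<\/em> (the cleaned training set) and <em>data_output_path<\/em> come from the previous article and its accompanying GitHub repository:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\nimport pandas as pd\nfrom scipy.stats import kruskal, chi2_contingency\n\n# train_imputed (cleaned training set) and data_output_path are defined\n# as in the previous article and the GitHub repository.<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">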
The data used are those from the previous article, where outliers and missing values were already treated.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Four types of correlations (each using a different statistical test seen above) are analyzed:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Correlation between continuous variables and the default variable (Kruskal\u2013Wallis test)<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Correlations between qualitative variables and the default variable (Cramer&#8217;s V)<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Multi-correlations between continuous variables (Spearman test)<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Multi-correlations between qualitative variables (Cramer&#8217;s V)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">1.3.1 Correlation between continuous variables and the default variable<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In the train database, we have seven continuous variables:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">person_income<\/li>\n\n\n\n<li class=\"wp-block-list-item\">person_age<\/li>\n\n\n\n<li class=\"wp-block-list-item\">person_emp_length<\/li>\n\n\n\n<li class=\"wp-block-list-item\">loan_amnt<\/li>\n\n\n\n<li class=\"wp-block-list-item\">loan_int_rate<\/li>\n\n\n\n<li class=\"wp-block-list-item\">loan_percent_income<\/li>\n\n\n\n<li class=\"wp-block-list-item\">cb_person_cred_hist_length<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The table below presents the p-values from the <strong>Kruskal\u2013Wallis test<\/strong>, which measure the relationship between these variables and the default variable.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">\ndef correlation_quanti_def_KW(database: pd.DataFrame,\n                              continuous_vars: list,\n                              target: str) -&gt; pd.DataFrame:\n    &quot;&quot;&quot;\n    Compute Kruskal-Wallis test p-values between continuous variables\n    and a categorical 
(binary or multi-class) target.\n\n    Parameters\n    ----------\n    database : pd.DataFrame\n        Input dataset\n    continuous_vars : list\n        List of continuous variable names\n    target : str\n        Target variable name (categorical)\n\n    Returns\n    -------\n    pd.DataFrame\n        Table with variables and corresponding p-values\n    &quot;&quot;&quot;\n\n    results = []\n\n    for var in continuous_vars:\n        # Drop NA for current variable + target\n        df = database[[var, target]].dropna()\n\n        # Group values by target categories\n        groups = [\n            group[var].values\n            for _, group in df.groupby(target)\n        ]\n\n        # Kruskal-Wallis requires at least 2 groups\n        stat, p_value = None, None\n        if len(groups) &gt;= 2:\n            try:\n                stat, p_value = kruskal(*groups)\n            except ValueError:\n                # Handles edge cases (e.g., constant values)\n                stat, p_value = None, None\n\n        results.append({\n            &quot;variable&quot;: var,\n            &quot;p_value&quot;: p_value,\n            &quot;stats_kw&quot;: stat\n        })\n\n    return pd.DataFrame(results).sort_values(by=&quot;p_value&quot;)\n\ncontinuous_vars = [\n    &quot;person_income&quot;,\n    &quot;person_age&quot;,\n    &quot;person_emp_length&quot;,\n    &quot;loan_amnt&quot;,\n    &quot;loan_int_rate&quot;,\n    &quot;loan_percent_income&quot;,\n    &quot;cb_person_cred_hist_length&quot;\n]\ntarget = &quot;def&quot;\nresult = correlation_quanti_def_KW(\n    database=train_imputed,\n    continuous_vars=continuous_vars,\n    target=target\n)\n\nprint(result)\n\n# Save results to xlsx\nresult.to_excel(f&quot;{data_output_path}\/correlation\/correlations_kw.xlsx&quot;, index=False)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" 
src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/03\/image-358.png\" alt=\"\" class=\"wp-image-652705\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">By comparing the p-values to the 5% significance level, we observe that all are below the threshold. Therefore, we reject the null hypothesis for all variables and conclude that each continuous variable is significantly associated with the default variable.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.3.2 Correlations between qualitative variables and the default variable (Cramer&#8217;s V)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In the database, we have four qualitative variables:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">person_home_ownership<\/li>\n\n\n\n<li class=\"wp-block-list-item\">cb_person_default_on_file<\/li>\n\n\n\n<li class=\"wp-block-list-item\">loan_intent<\/li>\n\n\n\n<li class=\"wp-block-list-item\">loan_grade<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The table below reports the strength of the association between these categorical variables and the default variable, as measured by Cram\u00e9r\u2019s V.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def cramers_v_with_target(database: pd.DataFrame,\n                          categorical_vars: list,\n                          target: str) -&gt; pd.DataFrame:\n    &quot;&quot;&quot;\n    Compute Chi-square statistic and Cram\u00e9r&#039;s V between multiple\n    categorical variables and a target variable.\n\n    Parameters\n    ----------\n    database : pd.DataFrame\n        Input dataset\n    categorical_vars : list\n        List of categorical variables\n    target : str\n        Target variable (categorical)\n\n    Returns\n    -------\n    pd.DataFrame\n        Table with variable, chi2 and Cram\u00e9r&#039;s V\n    &quot;&quot;&quot;\n\n    results = []\n\n    for var in categorical_vars:\n        # Drop 
missing values\n        df = database[[var, target]].dropna()\n\n        # Contingency table\n        contingency_table = pd.crosstab(df[var], df[target])\n\n        # Skip if not enough data\n        if contingency_table.shape[0] &lt; 2 or contingency_table.shape[1] &lt; 2:\n            results.append({\n                &quot;variable&quot;: var,\n                &quot;chi2&quot;: None,\n                &quot;cramers_v&quot;: None\n            })\n            continue\n\n        try:\n            chi2, _, _, _ = chi2_contingency(contingency_table)\n            n = contingency_table.values.sum()\n            r, k = contingency_table.shape\n\n            v = np.sqrt((chi2 \/ n) \/ min(k - 1, r - 1))\n\n        except Exception:\n            chi2, v = None, None\n\n        results.append({\n            &quot;variable&quot;: var,\n            &quot;chi2&quot;: chi2,\n            &quot;cramers_v&quot;: v\n        })\n\n    result_df = pd.DataFrame(results)\n\n    # Optional: sort by strength of association\n    return result_df.sort_values(by=&quot;cramers_v&quot;, ascending=False)\n\nqualitative_vars = [\n    &quot;person_home_ownership&quot;,\n    &quot;cb_person_default_on_file&quot;,\n    &quot;loan_intent&quot;,\n    &quot;loan_grade&quot;,\n]\nresult = cramers_v_with_target(\n    database=train_imputed,\n    categorical_vars=qualitative_vars,\n    target=target\n)\n\nprint(result)\n\n# Save results to xlsx\nresult.to_excel(f&quot;{data_output_path}\/correlation\/cramers_v.xlsx&quot;, index=False)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/03\/image-359.png\" alt=\"\" class=\"wp-image-652707\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The results indicate that most variables are associated with the default variable. 
A <strong>moderate association<\/strong> is observed for <em>loan_grade<\/em>, while the other categorical variables exhibit <strong>weak associations<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.3.3 Multi-correlations between continuous variables (Spearman test)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To identify continuous variables that provide similar information, we use the <strong>Spearman correlation<\/strong> with a threshold of <strong>60%<\/strong>. That is, if two continuous explanatory variables exhibit a Spearman correlation above 60%, they are considered redundant, as they capture similar information.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def correlation_matrix_quanti(database: pd.DataFrame,\n                              continuous_vars: list,\n                              method: str = &quot;spearman&quot;,\n                              as_percent: bool = False) -&gt; pd.DataFrame:\n    &quot;&quot;&quot;\n    Compute correlation matrix for continuous variables.\n\n    Parameters\n    ----------\n    database : pd.DataFrame\n        Input dataset\n    continuous_vars : list\n        List of continuous variables\n    method : str\n        Correlation method (&quot;pearson&quot; or &quot;spearman&quot;), default = &quot;spearman&quot;\n    as_percent : bool\n        If True, return values in percentage\n\n    Returns\n    -------\n    pd.DataFrame\n        Correlation matrix\n    &quot;&quot;&quot;\n\n    # Select relevant data and drop rows with NA\n    df = database[continuous_vars].dropna()\n\n    # Compute correlation matrix\n    corr_matrix = df.corr(method=method)\n\n    # Convert to percentage if required\n    if as_percent:\n        corr_matrix = corr_matrix * 100\n\n    return corr_matrix\n\ncorr = correlation_matrix_quanti(\n    database=train_imputed,\n    continuous_vars=continuous_vars,\n    method=&quot;spearman&quot;\n)\n\nprint(corr)\n\n# Save results to 
xlsx\ncorr.to_excel(f&quot;{data_output_path}\/correlation\/correlation_matrix_spearman.xlsx&quot;)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/03\/image-361-1024x178.png\" alt=\"\" class=\"wp-image-652709\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">We identify two pairs of variables that are highly correlated:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The pair <em>(cb_person_cred_hist_length, person_age)<\/em> with a correlation of <strong>85%<\/strong><\/li>\n\n\n\n<li class=\"wp-block-list-item\">The pair <em>(loan_percent_income, loan_amnt)<\/em> with a correlation above the 60% threshold<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Only one variable from each pair should be retained for modeling. We rely on statistical criteria to select the variable that is most strongly associated with the default variable. In this case, we retain <em>person_age<\/em> and <em>loan_percent_income<\/em>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.3.4 Multi-correlations between qualitative variables (Cramer&#8217;s V)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In this section, we analyze the relationships between categorical variables. If two categorical variables are associated with a Cram\u00e9r\u2019s V greater than 60%, one of them should be removed from the candidate risk driver list to avoid introducing highly correlated variables into the model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The choice between the two variables can be based on expert judgment. 
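<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a practical aid, this screening can also be automated. The helper below is an illustrative sketch (the function name is ours, not part of the article\u2019s repository); it lists the pairs whose association exceeds a chosen threshold and works on any square association matrix, such as the Spearman matrix from Section 1.3.3 or the Cram\u00e9r\u2019s V matrix computed in this section:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd\n\ndef high_association_pairs(assoc_matrix: pd.DataFrame, threshold: float = 0.6) -&gt; list:\n    # Return the (var1, var2, value) pairs above the threshold,\n    # strongest first, scanning only the upper triangle\n    pairs = []\n    cols = assoc_matrix.columns\n    for i, v1 in enumerate(cols):\n        for v2 in cols[i + 1:]:\n            value = assoc_matrix.loc[v1, v2]\n            if pd.notna(value) and abs(value) &gt; threshold:\n                pairs.append((v1, v2, float(value)))\n    return sorted(pairs, key=lambda t: -abs(t[2]))\n\n# Toy 3x3 matrix: only the first pair exceeds the 0.6 threshold\ndemo = pd.DataFrame([[1.0, 0.85, 0.2], [0.85, 1.0, 0.1], [0.2, 0.1, 1.0]])\nprint(high_association_pairs(demo))  # [(0, 1, 0.85)]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">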
However, in this case, we rely on a statistical approach and select the variable that is most strongly associated with the default variable.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The table below presents the Cram\u00e9r\u2019s V matrix computed for each pair of categorical explanatory variables.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def cramers_v_matrix(database: pd.DataFrame,\n                     categorical_vars: list,\n                     corrected: bool = False,\n                     as_percent: bool = False) -&gt; pd.DataFrame:\n    &quot;&quot;&quot;\n    Compute Cram\u00e9r&#039;s V correlation matrix for categorical variables.\n\n    Parameters\n    ----------\n    database : pd.DataFrame\n        Input dataset\n    categorical_vars : list\n        List of categorical variables\n    corrected : bool\n        Apply bias correction (recommended)\n    as_percent : bool\n        Return values in percentage\n\n    Returns\n    -------\n    pd.DataFrame\n        Cram\u00e9r&#039;s V matrix\n    &quot;&quot;&quot;\n\n    def cramers_v(x, y):\n        # Drop NA\n        df = pd.DataFrame({&quot;x&quot;: x, &quot;y&quot;: y}).dropna()\n\n        contingency_table = pd.crosstab(df[&quot;x&quot;], df[&quot;y&quot;])\n\n        if contingency_table.shape[0] &lt; 2 or contingency_table.shape[1] &lt; 2:\n            return np.nan\n\n        chi2, _, _, _ = chi2_contingency(contingency_table)\n        n = contingency_table.values.sum()\n        r, k = contingency_table.shape\n\n        phi2 = chi2 \/ n\n\n        if corrected:\n            # Bergsma correction\n            phi2_corr = max(0, phi2 - ((k-1)*(r-1)) \/ (n-1))\n            r_corr = r - ((r-1)**2) \/ (n-1)\n            k_corr = k - ((k-1)**2) \/ (n-1)\n            denom = min(k_corr - 1, r_corr - 1)\n        else:\n            denom = min(k - 1, r - 1)\n\n        if denom &lt;= 0:\n            return np.nan\n\n        return np.sqrt(phi2_corr \/ denom) if 
corrected else np.sqrt(phi2 \/ denom)\n\n    # Initialize matrix\n    n = len(categorical_vars)\n    matrix = pd.DataFrame(np.zeros((n, n)),\n                          index=categorical_vars,\n                          columns=categorical_vars)\n\n    # Fill matrix\n    for i, var1 in enumerate(categorical_vars):\n        for j, var2 in enumerate(categorical_vars):\n            if i &lt;= j:\n                value = cramers_v(database[var1], database[var2])\n                matrix.loc[var1, var2] = value\n                matrix.loc[var2, var1] = value\n\n    # Convert to percentage\n    if as_percent:\n        matrix = matrix * 100\n\n    return matrix\n\nmatrix = cramers_v_matrix(\n    database=train_imputed,\n    categorical_vars=qualitative_vars,\n)\n\nprint(matrix)\n\n# Save results to xlsx\nmatrix.to_excel(f&quot;{data_output_path}\/correlation\/cramers_v_matrix.xlsx&quot;)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/03\/image-362-1024x186.png\" alt=\"\" class=\"wp-image-652712\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">From this table, using a 60% threshold, we observe that only one pair of variables is strongly associated: <em>(loan_grade, cb_person_default_on_file)<\/em>. The variable we retain is <strong>loan_grade<\/strong>, as it is more strongly associated with the default variable.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Based on these analyses, we have pre-selected <strong>8 variables<\/strong> for the next steps. 
Two variables were removed during the analysis of correlations between continuous variables, and one variable was removed during the analysis of correlations between categorical variables.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The objective of this post was to present how to measure the different relationships that exist between variables in a credit scoring model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We have seen that this analysis can be used to evaluate the discriminatory power of explanatory variables, that is, their ability to predict the default variable. When the explanatory variable is continuous, we can rely on the <strong>non-parametric Kruskal\u2013Wallis test<\/strong> to assess the relationship between the variable and default.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When the explanatory variable is categorical, we use <strong>Cram\u00e9r\u2019s V<\/strong>, which measures the strength of the association and is less sensitive to sample size than the chi-square test alone.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we have shown that analyzing relationships between variables also helps reduce dimensionality by identifying multicollinearity, especially when variables are of the same type.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For two continuous explanatory variables, we can use the <strong>Spearman correlation<\/strong>, with a threshold (e.g., 60%). If the Spearman correlation exceeds this threshold, the two variables are considered redundant and should not both be included in the model. One can then be selected based on its relationship with the default variable or based on domain expertise.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For two categorical explanatory variables, we again use Cram\u00e9r\u2019s V. By setting a threshold (e.g., 50%), we can assume that if Cram\u00e9r\u2019s V exceeds this value, the variables provide similar information. 
In this case, only one of the two variables should be retained\u2014either based on its discriminatory power or through expert judgment.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, we applied these methods to the dataset processed in our previous post. While this approach is effective, it is not the most robust method for variable selection. In our next post, we will present a more robust approach for pre-selecting variables in a scoring model.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Image Credits<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and excel, unless otherwise stated.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">[1]&nbsp;<strong>Lorenzo Beretta and Alessandro Santaniello.<\/strong><br><em>Nearest Neighbor Imputation Algorithms: A Critical Evaluation.<\/em><br>National Library of Medicine, 2016.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[2]&nbsp;<strong>Nexialog Consulting.<\/strong><br><em>Traitement des donn\u00e9es manquantes dans le milieu bancaire.<\/em><br>Working paper, 2022.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[3]&nbsp;<strong>John T. Hancock and Taghi M. Khoshgoftaar.<\/strong><br><em>Survey on Categorical Data for Neural Networks.<\/em><br>Journal of Big Data, 7(28), 2020.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[4]&nbsp;<strong>Melissa J. Azur, Elizabeth A. Stuart, Constantine Frangakis, and Philip J. 
Leaf.<\/strong><br><em>Multiple Imputation by Chained Equations: What Is It and How Does It Work?<\/em><br>International Journal of Methods in Psychiatric Research, 2011.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[5]&nbsp;<strong>Majid Sarmad.<\/strong><br><em>Robust Data Analysis for Factorial Experimental Designs: Improved Methods and Software.<\/em><br>Department of Mathematical Sciences, University of Durham, England, 2006.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[6]&nbsp;<strong>Daniel J. Stekhoven and Peter B\u00fchlmann.<\/strong><br><em>MissForest\u2014Non-Parametric Missing Value Imputation for Mixed-Type Data.<\/em>Bioinformatics, 2011.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[7]&nbsp;<strong>Supriyanto Wibisono, Anwar, and Amin.<\/strong><br><em>Multivariate Weather Anomaly Detection Using the DBSCAN Clustering Algorithm.<\/em><br>Journal of Physics: Conference Series, 2021.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[8] <strong>Laborda, J., &amp; Ryoo, S.<\/strong> (2021). Feature selection in a credit scoring model.&nbsp;<em>Mathematics<\/em>,&nbsp;<em>9<\/em>(7), 746.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Data &amp; Licensing<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The dataset used in this article is licensed under the&nbsp;<strong>Creative Commons Attribution 4.0 International (CC BY 4.0)<\/strong>&nbsp;license.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For more details, see the official license text:&nbsp;<a href=\"https:\/\/creativecommons.org\/publicdomain\/zero\/1.0\/\" target=\"_blank\" rel=\"noreferrer noopener\">CC0: Public Domain<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Disclaimer<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Any remaining errors or inaccuracies are the author\u2019s responsibility. 
Feedback and corrections are welcome.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Practical Guide to Measuring Relationships between Variables for Feature Selection in a Credit Scoring.<\/p>\n","protected":false},"author":18,"featured_media":608752,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"is_member_only":false,"sub_heading":"A Practical Guide to Measuring Relationships between Variables for Feature Selection in a Credit Scoring.","footnotes":""},"categories":[44],"tags":[448,468,446,491,467],"sponsor":[],"coauthors":[29237],"class_list":["post-608751","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","tag-data-science","tag-deep-dives","tag-machine-learning","tag-programming","tag-python"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Building Robust Credit Scoring Models with Python | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/building-robust-credit-scoring-models-with-python\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Building Robust Credit Scoring Models with Python | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"A Practical Guide to Measuring Relationships between Variables for Feature Selection in a Credit Scoring.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/building-robust-credit-scoring-models-with-python\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2026-04-04T12:48:00+00:00\" \/>\n<meta property=\"og:image\" 
content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2026\/04\/image_by_autor.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1536\" \/>\n\t<meta property=\"og:image:height\" content=\"1024\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"JUNIOR JUMBONG\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"JUNIOR JUMBONG\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"18 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/building-robust-credit-scoring-models-with-python\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/building-robust-credit-scoring-models-with-python\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Building Robust Credit Scoring Models with Python\",\"datePublished\":\"2026-04-04T12:48:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/building-robust-credit-scoring-models-with-python\/\"},\"wordCount\":3731,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/building-robust-credit-scoring-models-with-python\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2026\/04\/image_by_autor.jpg\",\"keywords\":[\"Data Science\",\"Deep Dives\",\"Machine Learning\",\"Programming\",\"Python\"],\"articleSection\":[\"Data 