{"id":200,"date":"2024-07-22T12:42:06","date_gmt":"2024-07-22T16:42:06","guid":{"rendered":"https:\/\/www.econai.tech\/?page_id=200"},"modified":"2026-05-06T08:47:51","modified_gmt":"2026-05-06T12:47:51","slug":"exploratory-data-analysis-visualizations","status":"publish","type":"page","link":"https:\/\/tomomitanaka.ai\/?page_id=200","title":{"rendered":"Visualization"},"content":{"rendered":"\n<div class=\"wp-block-jin-gb-block-box-with-headline kaisetsu-box1\"><div class=\"kaisetsu-box1-title\">Safety by Design Expert&#8217;s Note:<\/div>\n<p>For safety experts, data visualization is a critical tool in identifying potential risks and biases in AI systems. Effective visualization can reveal hidden patterns, outliers, and relationships in data that may lead to unfair or unsafe outcomes. <\/p>\n\n\n\n<p>By mastering these techniques, you can better detect and mitigate safety issues early in the AI development process, ensuring more robust and equitable systems. <\/p>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Introduction<\/h3>\n\n\n\n<p>Data visualization is a critical step in the data analysis process. It helps us understand the distribution, trends, and relationships within the data. <\/p>\n\n\n\n<p>Using libraries like Matplotlib, Seaborn, and Plotly, we\u2019ll explore the &#8220;<a href=\"https:\/\/www.kaggle.com\/competitions\/house-prices-advanced-regression-techniques\">House Prices \u2013 Advanced Regression Techniques<\/a>&#8221; dataset from Kaggle to uncover hidden patterns and gain insights that can inform our predictive models.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Dataset Overview<\/h3>\n\n\n\n<p>The &#8220;<a href=\"https:\/\/www.kaggle.com\/competitions\/house-prices-advanced-regression-techniques\">House Prices \u2013 Advanced Regression Techniques<\/a>&#8221; dataset provides a comprehensive set of features describing the properties of houses in Ames, Iowa. The target variable is the sale price of the houses. 
The dataset includes features such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>MSSubClass<\/strong>: The building class<\/li>\n\n\n\n<li><strong>MSZoning<\/strong>: The general zoning classification<\/li>\n\n\n\n<li><strong>LotArea<\/strong>: Lot size in square feet<\/li>\n\n\n\n<li><strong>Street<\/strong>: Type of road access<\/li>\n\n\n\n<li><strong>YearBuilt<\/strong>: Original construction date<\/li>\n\n\n\n<li><strong>GrLivArea<\/strong>: Above grade (ground) living area in square feet<\/li>\n\n\n\n<li><strong>OverallQual<\/strong>: Rates the overall material and finish of the house<\/li>\n\n\n\n<li><strong>SalePrice<\/strong>: The property&#8217;s sale price<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Getting Started with Data Visualization<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">1. Matplotlib: Basic Plotting<\/h4>\n\n\n\n<p>Matplotlib is a versatile plotting library for creating static, interactive, and animated visualizations in Python. It provides control over every aspect of a figure.<\/p>\n\n\n\n<p><strong>Example: Sale Price Distribution<\/strong><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Sale Price Distribution\nplt.figure(figsize=(10, 6))\nplt.hist(df['SalePrice'], bins=50, color='skyblue', edgecolor='black')\nplt.title('Distribution of Sale Prices')\nplt.xlabel('Sale Price')\nplt.ylabel('Frequency')\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg 
xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\"># Sale Price Distribution<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.figure(<\/span><span style=\"color: #9CDCFE\">figsize<\/span><span style=\"color: #D4D4D4\">=(<\/span><span style=\"color: #B5CEA8\">10<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #B5CEA8\">6<\/span><span style=\"color: #D4D4D4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.hist(df[<\/span><span style=\"color: #CE9178\">&#39;SalePrice&#39;<\/span><span style=\"color: #D4D4D4\">], <\/span><span style=\"color: #9CDCFE\">bins<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #B5CEA8\">50<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">color<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;skyblue&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">edgecolor<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;black&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.title(<\/span><span style=\"color: #CE9178\">&#39;Distribution of Sale Prices&#39;<\/span><span 
style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.xlabel(<\/span><span style=\"color: #CE9178\">&#39;Sale Price&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.ylabel(<\/span><span style=\"color: #CE9178\">&#39;Frequency&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"855\" height=\"540\" src=\"https:\/\/www.econai.tech\/wp-content\/uploads\/2024\/07\/matplotlib_sale_price_distribution-1.png\" alt=\"\" class=\"wp-image-706\"\/><\/figure>\n\n\n\n<p>The histogram created using Matplotlib reveals that the distribution of sale prices is right-skewed, indicating that most houses are sold at a lower price range, with fewer houses sold at higher prices.<\/p>\n\n\n\n<div class=\"wp-block-jin-gb-block-icon-box jin-icon-caution jin-iconbox\"><div class=\"jin-iconbox-icons\"><i class=\"jic jin-ifont-caution jin-icons\"><\/i><\/div><div class=\"jin-iconbox-main\">\n<p><strong>Safety Implication<\/strong>: The right-skewed distribution of sale prices reveals a potential bias in our dataset towards lower-priced homes.<\/p>\n\n\n\n<p><strong>Why it matters<\/strong>: If not addressed, this skew could lead to a model that performs well for lower-priced homes but poorly for high-end properties. This could result in unfair or inaccurate predictions for certain segments of the housing market.<\/p>\n\n\n\n<p><strong>Mitigation strategy<\/strong>: Consider using techniques like oversampling expensive houses or using weighted loss functions to ensure the model performs equally well across all price ranges.<\/p>\n<\/div><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">2. 
Seaborn: Statistical Data Visualization<\/h4>\n\n\n\n<p>Seaborn is built on top of Matplotlib and provides a high-level interface for drawing attractive statistical graphics. It is particularly useful for visualizing complex relationships between variables.<\/p>\n\n\n\n<p><strong>Example: Sale Price vs. Overall Quality<\/strong><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Sale Price vs. Overall Quality\nplt.figure(figsize=(10, 6))\nsns.boxplot(x='OverallQual', y='SalePrice', data=df, palette='muted')\nplt.title('Sale Price vs. Overall Quality')\nplt.xlabel('Overall Quality')\nplt.ylabel('Sale Price')\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\"># Sale Price vs. 
Overall Quality<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.figure(<\/span><span style=\"color: #9CDCFE\">figsize<\/span><span style=\"color: #D4D4D4\">=(<\/span><span style=\"color: #B5CEA8\">10<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #B5CEA8\">6<\/span><span style=\"color: #D4D4D4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">sns.boxplot(<\/span><span style=\"color: #9CDCFE\">x<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;OverallQual&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">y<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;SalePrice&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">data<\/span><span style=\"color: #D4D4D4\">=df, <\/span><span style=\"color: #9CDCFE\">palette<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;muted&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.title(<\/span><span style=\"color: #CE9178\">&#39;Sale Price vs. 
Overall Quality&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.xlabel(<\/span><span style=\"color: #CE9178\">&#39;Overall Quality&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.ylabel(<\/span><span style=\"color: #CE9178\">&#39;Sale Price&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"884\" height=\"543\" src=\"https:\/\/www.econai.tech\/wp-content\/uploads\/2024\/07\/seaborn_sale_price_vs_quality.png\" alt=\"\" class=\"wp-image-708\" srcset=\"https:\/\/tomomitanaka.ai\/wp-content\/uploads\/2024\/07\/seaborn_sale_price_vs_quality.png 884w, https:\/\/tomomitanaka.ai\/wp-content\/uploads\/2024\/07\/seaborn_sale_price_vs_quality-300x184.png 300w, https:\/\/tomomitanaka.ai\/wp-content\/uploads\/2024\/07\/seaborn_sale_price_vs_quality-768x472.png 768w, https:\/\/www.econai.tech\/wp-content\/uploads\/2024\/07\/seaborn_sale_price_vs_quality.png 856w\" sizes=\"auto, (max-width: 884px) 100vw, 884px\" \/><\/figure>\n\n\n\n<p>The boxplot generated with Seaborn shows a clear trend: houses with higher overall quality tend to have higher sale prices. This suggests that the quality of construction and materials significantly impacts the house price.<\/p>\n\n\n\n<div class=\"wp-block-jin-gb-block-icon-box jin-icon-caution jin-iconbox\"><div class=\"jin-iconbox-icons\"><i class=\"jic jin-ifont-caution jin-icons\"><\/i><\/div><div class=\"jin-iconbox-main\">\n<p><strong>Safety Implication<\/strong>: This visualization helps identify potential biases related to housing quality assessment.<\/p>\n\n\n\n<p><strong>Why it matters<\/strong>: If the &#8216;OverallQual&#8217; feature is subjectively determined, it could introduce human biases into the model. 
For example, if quality assessments are influenced by neighborhood demographics, it could lead to discriminatory pricing predictions.<\/p>\n<\/div><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">3. Plotly: Interactive Visualizations<\/h4>\n\n\n\n<p>Plotly is a graphing library that makes interactive, publication-quality graphs online. It is especially useful for creating plots that you can interact with directly in a web browser.<\/p>\n\n\n\n<p><strong>Example: Lot Area vs. Sale Price<\/strong><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Create a scatter plot using Plotly\nfig = px.scatter(df, x='LotArea', y='SalePrice', color='OverallQual',\n                 title='Lot Area vs. 
Sale Price',\n                 labels={'LotArea': 'Lot Area (sq ft)', 'SalePrice': 'Sale Price'},\n                 hover_data=['YearBuilt'])\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\"># Create a scatter plot using Plotly<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">fig = px.scatter(df, <\/span><span style=\"color: #9CDCFE\">x<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;LotArea&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">y<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;SalePrice&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">color<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;OverallQual&#39;<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">                 <\/span><span style=\"color: #9CDCFE\">title<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;Lot Area vs. 
Sale Price&#39;<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">                 <\/span><span style=\"color: #9CDCFE\">labels<\/span><span style=\"color: #D4D4D4\">={<\/span><span style=\"color: #CE9178\">&#39;LotArea&#39;<\/span><span style=\"color: #D4D4D4\">: <\/span><span style=\"color: #CE9178\">&#39;Lot Area (sq ft)&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;SalePrice&#39;<\/span><span style=\"color: #D4D4D4\">: <\/span><span style=\"color: #CE9178\">&#39;Sale Price&#39;<\/span><span style=\"color: #D4D4D4\">},<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">                 <\/span><span style=\"color: #9CDCFE\">hover_data<\/span><span style=\"color: #D4D4D4\">=[<\/span><span style=\"color: #CE9178\">&#39;YearBuilt&#39;<\/span><span style=\"color: #D4D4D4\">])<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"679\" height=\"440\" src=\"https:\/\/www.econai.tech\/wp-content\/uploads\/2024\/07\/plotly_lot_area_vs_sale_price.png\" alt=\"\" class=\"wp-image-713\"\/><\/figure>\n\n\n\n<p>The interactive scatter plot created with Plotly allows us to explore the relationship between lot area and sale price dynamically. By coloring the points based on overall quality, we can see that larger lot areas and higher quality ratings generally correspond to higher sale prices.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">4. Correlation Heatmap<\/h4>\n\n\n\n<p>A correlation heatmap is a powerful tool for visualizing the relationships between multiple numerical variables in a dataset. <\/p>\n\n\n\n<p>For the House Prices dataset, this visualization can provide crucial insights into which features are most strongly related to the sale price and to each other.<\/p>\n\n\n\n<p>When dealing with a large number of features, the heatmap can become overwhelming. 
<\/p>\n\n\n\n<p>To make it more interpretable, you can focus on the subset of features that are most strongly correlated with the target variable SalePrice.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Step 1: Select Relevant Features<\/h5>\n\n\n\n<p>Rank the features by the absolute value of their correlation with SalePrice and keep the top ten. The code takes the first 11 entries because SalePrice itself ranks first, with a correlation of 1.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Step 2: Create a correlation matrix with the top features<\/h5>\n\n\n\n<p>Recompute the correlation matrix using only the selected features.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Step 3: <strong>Create and Save the Heatmap<\/strong><\/h5>\n\n\n\n<p>Generate the heatmap with the selected features.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Compute the correlation matrix\ncorr_matrix = df.corr()\n# Select the top features that correlate with 'SalePrice'\ntop_corr_features = corr_matrix['SalePrice'].abs().sort_values(ascending=False).head(11).index\nprint(&quot;Top correlated features with SalePrice:\\n&quot;, top_corr_features)\n# Create a new correlation matrix with the top features\ntop_corr_matrix = df[top_corr_features].corr()\n# Create the heatmap\nplt.figure(figsize=(11, 8))\nsns.heatmap(top_corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, cbar=True, square=True, \n            linewidths=0.5, cbar_kws={&quot;shrink&quot;: .5})\nplt.title('Top Correlated Features with Sale Price', fontsize=16, pad=20)\nplt.tight_layout()\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" 
style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\"># Compute the correlation matrix<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">corr_matrix = df.corr()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #6A9955\"># Select the top features that correlate with &#39;SalePrice&#39;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">top_corr_features = corr_matrix[<\/span><span style=\"color: #CE9178\">&#39;SalePrice&#39;<\/span><span style=\"color: #D4D4D4\">].abs().sort_values(<\/span><span style=\"color: #9CDCFE\">ascending<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #569CD6\">False<\/span><span style=\"color: #D4D4D4\">).head(<\/span><span style=\"color: #B5CEA8\">11<\/span><span style=\"color: #D4D4D4\">).index<\/span><\/span>\n<span class=\"line\"><span style=\"color: #DCDCAA\">print<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #CE9178\">&quot;Top correlated features with SalePrice:<\/span><span style=\"color: #D7BA7D\">\\n<\/span><span style=\"color: #CE9178\">&quot;<\/span><span style=\"color: #D4D4D4\">, top_corr_features)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #6A9955\"># Create a new correlation matrix with the top features<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#D4D4D4\">top_corr_matrix = df[top_corr_features].corr()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #6A9955\"># Create the heatmap<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.figure(<\/span><span style=\"color: #9CDCFE\">figsize<\/span><span style=\"color: #D4D4D4\">=(<\/span><span style=\"color: #B5CEA8\">11<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #B5CEA8\">8<\/span><span style=\"color: #D4D4D4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">sns.heatmap(top_corr_matrix, <\/span><span style=\"color: #9CDCFE\">annot<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #569CD6\">True<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">cmap<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;coolwarm&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">vmin<\/span><span style=\"color: #D4D4D4\">=-<\/span><span style=\"color: #B5CEA8\">1<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">vmax<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #B5CEA8\">1<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">cbar<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #569CD6\">True<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">square<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #569CD6\">True<\/span><span style=\"color: #D4D4D4\">, <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">            <\/span><span style=\"color: #9CDCFE\">linewidths<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #B5CEA8\">0.5<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">cbar_kws<\/span><span style=\"color: #D4D4D4\">={<\/span><span style=\"color: 
#CE9178\">&quot;shrink&quot;<\/span><span style=\"color: #D4D4D4\">: <\/span><span style=\"color: #B5CEA8\">.5<\/span><span style=\"color: #D4D4D4\">})<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.title(<\/span><span style=\"color: #CE9178\">&#39;Top Correlated Features with Sale Price&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">fontsize<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #B5CEA8\">16<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">pad<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #B5CEA8\">20<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">plt.tight_layout()<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"859\" height=\"783\" src=\"https:\/\/www.econai.tech\/wp-content\/uploads\/2024\/07\/seaborn_top_corr_heatmap-1.png\" alt=\"\" class=\"wp-image-775\" srcset=\"https:\/\/tomomitanaka.ai\/wp-content\/uploads\/2024\/07\/seaborn_top_corr_heatmap-1.png 859w, https:\/\/tomomitanaka.ai\/wp-content\/uploads\/2024\/07\/seaborn_top_corr_heatmap-1-300x273.png 300w, https:\/\/tomomitanaka.ai\/wp-content\/uploads\/2024\/07\/seaborn_top_corr_heatmap-1-768x700.png 768w, https:\/\/www.econai.tech\/wp-content\/uploads\/2024\/07\/seaborn_top_corr_heatmap-1.png 856w\" sizes=\"auto, (max-width: 859px) 100vw, 859px\" \/><\/figure>\n\n\n\n<h5 class=\"wp-block-heading\">Interpretation of the Heatmap<\/h5>\n\n\n\n<p>The heatmap displays the correlation coefficients between the top 10 features most correlated with &#8220;SalePrice&#8221;.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\"><strong>Top Correlated Features with SalePrice<\/strong>:<\/h5>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>OverallQual<\/strong>: The quality of the material and finish of the house has the highest positive 
correlation with <code>SalePrice<\/code>. This indicates that better quality houses tend to sell for higher prices.<\/li>\n\n\n\n<li><strong>GrLivArea<\/strong>: Above ground living area is also highly positively correlated with SalePrice. Larger living areas are associated with higher house prices.<\/li>\n\n\n\n<li><strong>GarageCars<\/strong>: The number of cars that fit in the garage has a strong positive correlation with SalePrice. More garage space is associated with higher house prices.<\/li>\n\n\n\n<li><strong>GarageArea<\/strong>: Similar to <code>GarageCars<\/code>, the area of the garage is positively correlated with SalePrice.<\/li>\n\n\n\n<li><strong>TotalBsmtSF<\/strong>: Total square feet of the basement area shows a positive correlation with SalePrice. Larger basements generally go with higher prices.<\/li>\n\n\n\n<li><strong>1stFlrSF<\/strong>: First floor square feet is positively correlated with SalePrice.<\/li>\n\n\n\n<li><strong>ExterQual_TA<\/strong>: An indicator that the quality of the material on the exterior of the house is rated Average\/Typical. It is negatively correlated with SalePrice, so houses with only typical exterior quality tend to sell for less.<\/li>\n\n\n\n<li><strong>FullBath<\/strong>: The number of full bathrooms above grade is positively correlated with SalePrice.<\/li>\n\n\n\n<li><strong>BsmtQual_Ex<\/strong>: An indicator that the height of the basement is rated Excellent (100+ inches). 
It is positively correlated with SalePrice.<\/li>\n\n\n\n<li><strong>TotRmsAbvGrd<\/strong>: Total rooms above grade (excluding bathrooms) have a positive correlation with SalePrice (around 0.53).<\/li>\n<\/ol>\n\n\n\n<div class=\"wp-block-jin-gb-block-icon-box jin-icon-caution jin-iconbox\"><div class=\"jin-iconbox-icons\"><i class=\"jic jin-ifont-caution jin-icons\"><\/i><\/div><div class=\"jin-iconbox-main\">\n<p><strong>Safety Implication<\/strong>: The heatmap helps identify multicollinearity and potential proxy variables for protected characteristics.<\/p>\n\n\n\n<p><strong>Why it matters<\/strong>: High correlation between features can lead to unstable models. More critically, some features might act as proxies for protected characteristics (like race or gender), leading to potentially discriminatory predictions.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n\n\n\n<p>Data visualization is a powerful tool for uncovering hidden patterns and insights in our data. By using libraries like Matplotlib, Seaborn, and Plotly, we can create a variety of visualizations that help us understand the underlying structure of our dataset. <\/p>\n\n\n\n<p>These insights are crucial for building accurate predictive models and making informed decisions.<\/p>\n\n\n\n<p>You can find&nbsp;<a href=\"https:\/\/github.com\/tomomitanaka00\/Blog-Price-Prediction\/blob\/main\/Housing_Price_New.ipynb\">the complete code<\/a>&nbsp;for this visualization process in my GitHub repository.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Next Steps<\/h3>\n\n\n\n<p>In the next section, we will dive deeper into <a href=\"https:\/\/www.econai.tech\/?page_id=163\">feature engineering<\/a> to prepare our dataset for advanced regression techniques. Stay tuned!<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Data visualization is a critical step in the data analysis process. 
It helps us understand the distribution, trends, and relationships within the data. Using libraries like Matplotlib, Seaborn, and Plotly, we\u2019ll explore the &#8220;House Prices \u2013 Advanced Regression Techniques&#8221; dataset from Kaggle to uncover hidden patterns and gain insights that can inform our predictive<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":107,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-200","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/200","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=200"}],"version-history":[{"count":95,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/200\/revisions"}],"predecessor-version":[{"id":6853,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/200\/revisions\/6853"}],"up":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/107"}],"wp:attachment":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=200"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}