{"id":898,"date":"2024-08-05T10:58:41","date_gmt":"2024-08-05T14:58:41","guid":{"rendered":"https:\/\/www.econai.tech\/?page_id=898"},"modified":"2026-05-06T08:54:04","modified_gmt":"2026-05-06T12:54:04","slug":"customer-segmentation","status":"publish","type":"page","link":"https:\/\/tomomitanaka.ai\/?page_id=898","title":{"rendered":"Customer Segmentation"},"content":{"rendered":"\n<p><br>Understanding your customer base at a deeper level is essential for developing targeted marketing strategies, improving customer experiences, and driving business success.<\/p>\n\n\n\n<p>Customer segmentation is a powerful method for achieving this understanding. By grouping customers based on shared characteristics and behaviors, businesses can create personalized experiences that better meet the needs of each segment, leading to more effective engagement and increased customer loyalty.<\/p>\n\n\n\n<p>In this post, we&#8217;ll explore how to use BigQuery ML to create features for effective customer segmentation, apply K-Means and PCA + Clustering to identify distinct customer segments, and evaluate the quality of these segments.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Contents<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Data Preparation<\/li>\n\n\n\n<li>Applying Clustering Algorithms\n<ul class=\"wp-block-list\">\n<li>K-Means Clustering<\/li>\n\n\n\n<li>PCA + K-Means Clustering<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Understanding Cluster Characteristics<\/li>\n\n\n\n<li>Evaluating Customer Segments<\/li>\n\n\n\n<li>Conclusion<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Data Preparation<\/h3>\n\n\n\n<p>Feature creation is a critical step in any customer segmentation analysis, as the quality and relevance of the features directly influence the effectiveness of the segmentation. The goal is to derive meaningful characteristics from raw data that can help distinguish between different customer segments. In the provided code, the features are derived from the Google Analytics sample dataset, focusing on key metrics that capture various aspects of customer behavior, engagement, and demographics.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">SQL<\/span><span role=\"button\" tabindex=\"0\" data-code=\"-- Create Customer Features Table\nCREATE OR REPLACE TABLE `predictive-behavior-analytics.Section6.customer_features` AS\nSELECT\n  CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS customer_id,\n  SUM(IFNULL(totals.transactionRevenue, 0)) \/ 1000000 AS total_revenue, \n  COUNT(totals.transactionRevenue) AS transaction_count,\n  AVG(IFNULL(totals.transactionRevenue, 0)) \/ 1000000 AS avg_transaction_value, \n  MAX(totals.timeOnSite) AS max_session_duration,\n  MIN(totals.timeOnSite) AS min_session_duration,\n  AVG(totals.timeOnSite) AS avg_session_duration,\n  device.deviceCategory AS device_type,\n  geoNetwork.country AS country\nFROM\n  `bigquery-public-data.google_analytics_sample.ga_sessions_*`\nWHERE\n  _TABLE_SUFFIX BETWEEN '20160801' AND '20170731'\nGROUP BY\n  customer_id, device_type, country;\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\">-- Create Customer Features Table<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">TABLE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.customer_features`<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">CONCAT<\/span><span style=\"color: #D4D4D4\">(fullVisitorId, <\/span><span style=\"color: #DCDCAA\">CAST<\/span><span style=\"color: #D4D4D4\">(visitId <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> STRING)) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> customer_id,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">SUM<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">IFNULL<\/span><span style=\"color: #D4D4D4\">(totals.transactionRevenue, <\/span><span style=\"color: #B5CEA8\">0<\/span><span style=\"color: #D4D4D4\">)) \/ <\/span><span style=\"color: #B5CEA8\">1000000<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> total_revenue, <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">COUNT<\/span><span style=\"color: #D4D4D4\">(totals.transactionRevenue) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> transaction_count,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">AVG<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">IFNULL<\/span><span style=\"color: #D4D4D4\">(totals.transactionRevenue, <\/span><span style=\"color: #B5CEA8\">0<\/span><span style=\"color: #D4D4D4\">)) \/ <\/span><span style=\"color: #B5CEA8\">1000000<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_transaction_value, <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(totals.timeOnSite) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> max_session_duration,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MIN<\/span><span style=\"color: #D4D4D4\">(totals.timeOnSite) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> min_session_duration,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">AVG<\/span><span style=\"color: #D4D4D4\">(totals.timeOnSite) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_session_duration,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  device.deviceCategory <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> device_type,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  geoNetwork.country <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> country<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">`bigquery-public-data.google_analytics_sample.ga_sessions_*`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">WHERE<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  _TABLE_SUFFIX <\/span><span style=\"color: #569CD6\">BETWEEN<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&#39;20160801&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AND<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&#39;20170731&#39;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">GROUP BY<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  customer_id, device_type, country;<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Breakdown of Features Created<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Customer Identifier:<\/strong>\n<ul class=\"wp-block-list\">\n<li>customer_id: This feature uniquely identifies each customer by combining the fullVisitorId and visitId. This ensures that each customer is tracked consistently across their interactions, enabling accurate aggregation of their behaviors and transactions.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Revenue-Related Features:<\/strong>\n<ul class=\"wp-block-list\">\n<li>total_revenue: The total revenue generated by each customer, calculated by summing up the <code>transactionRevenue<\/code> over all sessions for each customer. This is a critical feature as it directly reflects the monetary value a customer brings to the business, making it essential for identifying high-value customers.<\/li>\n\n\n\n<li>transaction_count: The total number of transactions a customer has made, providing insight into their purchasing frequency. This feature can help distinguish between frequent buyers and those who make fewer, potentially higher-value purchases.<\/li>\n\n\n\n<li>avg_transaction_value: The average value of each transaction, calculated by dividing the total revenue by the transaction count. This feature helps identify customers who make high-value purchases compared to those who might purchase frequently but with lower transaction values.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Engagement-Related Features:<\/strong>\n<ul class=\"wp-block-list\">\n<li>max_session_duration: The longest time a customer has spent in a session. This can indicate the maximum level of engagement a customer has shown, which might correlate with interest in the content or products offered.<\/li>\n\n\n\n<li>min_session_duration: The shortest session duration for a customer. This feature could help identify customers with minimal engagement or those who might visit the site frequently for very specific purposes.<\/li>\n\n\n\n<li>avg_session_duration: The average session duration, giving a general sense of how much time a customer typically spends per visit. This feature can help identify highly engaged customers who consistently spend more time on the site.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Demographic Features:<\/strong>\n<ul class=\"wp-block-list\">\n<li>device_type: The type of device (e.g., desktop, mobile, tablet) used by the customer. This feature is crucial for understanding customer preferences in terms of technology and can help tailor marketing strategies to different device users.<\/li>\n\n\n\n<li>country: The country from which the customer is accessing the site. Geographic location is an important demographic feature that can influence customer behavior and preferences, making it a valuable input for segmentation.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<h4 class=\"wp-block-heading\">Importance of These Features<\/h4>\n\n\n\n<p>Each of these features plays a significant role in the customer segmentation process:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Revenue-Related Features<\/strong>: These are vital for understanding the financial contribution of each customer and identifying high-value segments. They help in distinguishing between customers who generate high revenue through frequent or high-value purchases.<\/li>\n\n\n\n<li><strong>Engagement-Related Features<\/strong>: These features provide insights into how customers interact with the site. High engagement often correlates with higher loyalty and the potential for future purchases, making it a key factor in segmentation.<\/li>\n\n\n\n<li><strong>Demographic Features<\/strong>: Understanding the demographic makeup of your customer base allows for more targeted marketing strategies. Device type and geographic location can significantly influence how customers interact with the site and what products or services they prefer.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Applying Clustering Algorithms<\/h3>\n\n\n\n<p>In this section, we explore two clustering approaches to segment customers based on their behaviors and characteristics: <strong>K-Means Clustering<\/strong> on original features and <strong>PCA + K-Means Clustering<\/strong>. These methods help identify distinct customer segments that can be targeted with tailored marketing strategies.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">1. K-Means Clustering<\/h4>\n\n\n\n<p><strong>K-Means Clustering<\/strong> is a popular algorithm for partitioning data into distinct clusters based on the similarity of data points. <\/p>\n\n\n\n<p>In this analysis, we apply K-Means directly to the original customer features to identify natural groupings among customers.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">SQL<\/span><span role=\"button\" tabindex=\"0\" data-code=\"-- K-Means Clustering on Original Features\nCREATE OR REPLACE MODEL `predictive-behavior-analytics.Section6.kmeans_customer_segmentation_revised`\nOPTIONS(model_type='kmeans', num_clusters=5) AS\nSELECT\n  total_revenue,\n  transaction_count,\n  avg_transaction_value,\n  avg_session_duration\nFROM\n  `predictive-behavior-analytics.Section6.customer_features`;\n\n-- Retrieving Centroid Values\nSELECT\n  centroid_id,\n  MAX(CASE WHEN feature = 'total_revenue' THEN numerical_value END) AS avg_total_revenue,\n  MAX(CASE WHEN feature = 'transaction_count' THEN numerical_value END) AS avg_transaction_count,\n  MAX(CASE WHEN feature = 'avg_transaction_value' THEN numerical_value END) AS avg_transaction_value,\n  MAX(CASE WHEN feature = 'avg_session_duration' THEN numerical_value END) AS avg_session_duration\nFROM\n  ML.CENTROIDS(MODEL `predictive-behavior-analytics.Section6.kmeans_customer_segmentation_revised`)\nGROUP BY\n  centroid_id\nORDER BY\n  avg_total_revenue DESC;\n\n-- Assign Clusters to Customers for K-Means on Original Features\nCREATE OR REPLACE TABLE `predictive-behavior-analytics.Section6.customer_clusters` AS\nSELECT\n  customer_id,\n  CENTROID_ID AS cluster_id\nFROM\n  ML.PREDICT(MODEL `predictive-behavior-analytics.Section6.kmeans_customer_segmentation_revised`,\n    (\n      SELECT\n        customer_id,\n        total_revenue,\n        transaction_count,\n        avg_transaction_value,\n        avg_session_duration\n      FROM\n        `predictive-behavior-analytics.Section6.customer_features`\n    )\n  );\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\">-- K-Means Clustering on Original Features<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> MODEL <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.kmeans_customer_segmentation_revised`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">OPTIONS(model_type=<\/span><span style=\"color: #CE9178\">&#39;kmeans&#39;<\/span><span style=\"color: #D4D4D4\">, num_clusters=<\/span><span style=\"color: #B5CEA8\">5<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  total_revenue,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  transaction_count,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  avg_transaction_value,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  avg_session_duration<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.customer_features`<\/span><span style=\"color: #D4D4D4\">;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A9955\">-- Retrieving Centroid Values<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  centroid_id,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">CASE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">WHEN<\/span><span style=\"color: #D4D4D4\"> feature = <\/span><span style=\"color: #CE9178\">&#39;total_revenue&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">THEN<\/span><span style=\"color: #D4D4D4\"> numerical_value <\/span><span style=\"color: #569CD6\">END<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_total_revenue,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">CASE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">WHEN<\/span><span style=\"color: #D4D4D4\"> feature = <\/span><span style=\"color: #CE9178\">&#39;transaction_count&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">THEN<\/span><span style=\"color: #D4D4D4\"> numerical_value <\/span><span style=\"color: #569CD6\">END<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_transaction_count,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">CASE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">WHEN<\/span><span style=\"color: #D4D4D4\"> feature = <\/span><span style=\"color: #CE9178\">&#39;avg_transaction_value&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">THEN<\/span><span style=\"color: #D4D4D4\"> numerical_value <\/span><span style=\"color: #569CD6\">END<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_transaction_value,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">CASE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">WHEN<\/span><span style=\"color: #D4D4D4\"> feature = <\/span><span style=\"color: #CE9178\">&#39;avg_session_duration&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">THEN<\/span><span style=\"color: #D4D4D4\"> numerical_value <\/span><span style=\"color: #569CD6\">END<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_session_duration<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  ML.CENTROIDS(MODEL <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.kmeans_customer_segmentation_revised`<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">GROUP BY<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  centroid_id<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">ORDER BY<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  avg_total_revenue <\/span><span style=\"color: #569CD6\">DESC<\/span><span style=\"color: #D4D4D4\">;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A9955\">-- Assign Clusters to Customers for K-Means on Original Features<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">TABLE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.customer_clusters`<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  customer_id,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  CENTROID_ID <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> cluster_id<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  ML.PREDICT(MODEL <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.kmeans_customer_segmentation_revised`<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    (<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">      <\/span><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        customer_id,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        total_revenue,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        transaction_count,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        avg_transaction_value,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        avg_session_duration<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">      <\/span><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.customer_features`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    )<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  );<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">2. PCA + K-Means Clustering<\/h4>\n\n\n\n<p><strong>PCA + K-Means Clustering<\/strong> is an advanced approach that first applies Principal Component Analysis (PCA) to reduce the dimensionality of the feature space before clustering. This method can enhance the clustering process by focusing on the most critical aspects of customer behavior, which are captured in the principal components.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">SQL<\/span><span role=\"button\" tabindex=\"0\" data-code=\"-- Apply PCA to Customer Features\nCREATE OR REPLACE MODEL `predictive-behavior-analytics.Section6.pca_customer_features`\nOPTIONS(model_type='pca', num_principal_components=3) AS\nSELECT\n  total_revenue,\n  transaction_count,\n  avg_transaction_value,\n  avg_session_duration\nFROM\n  `predictive-behavior-analytics.Section6.customer_features`;\n\n-- Retrieve the Principal Components \n\nCREATE OR REPLACE TABLE `predictive-behavior-analytics.Section6.pca_transformed_features` AS\nSELECT\n  customer_id,\n  principal_component_1,\n  principal_component_2,\n  principal_component_3\nFROM\n  ML.PREDICT(MODEL `predictive-behavior-analytics.Section6.pca_customer_features`,\n    (\n      SELECT\n        customer_id,\n        total_revenue,\n        transaction_count,\n        avg_transaction_value,\n        avg_session_duration\n      FROM\n        `predictive-behavior-analytics.Section6.customer_features`\n    )\n  );\n\n  -- K-Means Clustering on PCA-Transformed Features\nCREATE OR REPLACE MODEL `predictive-behavior-analytics.Section6.kmeans_pca_customer_segmentation`\nOPTIONS(model_type='kmeans', num_clusters=5) AS\nSELECT\n  principal_component_1,\n  principal_component_2,\n  principal_component_3\nFROM\n  `predictive-behavior-analytics.Section6.pca_transformed_features`;\n\n  -- Retrieving Centroid Values for PCA + K-Means\nSELECT\n  centroid_id,\n  MAX(CASE WHEN feature = 'principal_component_1' THEN numerical_value END) AS avg_principal_component_1,\n  MAX(CASE WHEN feature = 'principal_component_2' THEN numerical_value END) AS avg_principal_component_2,\n  MAX(CASE WHEN feature = 'principal_component_3' THEN numerical_value END) AS avg_principal_component_3\nFROM\n  ML.CENTROIDS(MODEL `predictive-behavior-analytics.Section6.kmeans_pca_customer_segmentation`)\nGROUP BY\n  centroid_id\nORDER BY\n  avg_principal_component_1 DESC;\n\n-- Assign Clusters to Customers for PCA + K-Means\nCREATE OR REPLACE TABLE `predictive-behavior-analytics.Section6.customer_clusters_pca` AS\nSELECT\n  customer_id,\n  CENTROID_ID AS cluster_id\nFROM\n  ML.PREDICT(MODEL `predictive-behavior-analytics.Section6.kmeans_pca_customer_segmentation`,\n    (\n      SELECT\n        customer_id,\n        principal_component_1,\n        principal_component_2,\n        principal_component_3\n      FROM\n        `predictive-behavior-analytics.Section6.pca_transformed_features`\n    )\n  );\n\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\">-- Apply PCA to Customer Features<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> MODEL <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.pca_customer_features`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">OPTIONS(model_type=<\/span><span style=\"color: #CE9178\">&#39;pca&#39;<\/span><span style=\"color: #D4D4D4\">, num_principal_components=<\/span><span style=\"color: #B5CEA8\">3<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  total_revenue,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  transaction_count,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  avg_transaction_value,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  avg_session_duration<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.customer_features`<\/span><span style=\"color: #D4D4D4\">;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A9955\">-- Retrieve the Principal Components <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">TABLE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.pca_transformed_features`<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  customer_id,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  principal_component_1,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  principal_component_2,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  principal_component_3<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  ML.PREDICT(MODEL <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.pca_customer_features`<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    (<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">      <\/span><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        customer_id,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        total_revenue,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        transaction_count,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        avg_transaction_value,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        avg_session_duration<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">      <\/span><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.customer_features`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    )<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  );<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #6A9955\">-- K-Means Clustering on PCA-Transformed Features<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> MODEL <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.kmeans_pca_customer_segmentation`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">OPTIONS(model_type=<\/span><span style=\"color: #CE9178\">&#39;kmeans&#39;<\/span><span style=\"color: #D4D4D4\">, num_clusters=<\/span><span style=\"color: #B5CEA8\">5<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  principal_component_1,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  principal_component_2,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  principal_component_3<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.pca_transformed_features`<\/span><span style=\"color: #D4D4D4\">;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #6A9955\">-- Retrieving Centroid Values for PCA + K-Means<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  centroid_id,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">CASE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">WHEN<\/span><span style=\"color: #D4D4D4\"> feature = <\/span><span style=\"color: #CE9178\">&#39;principal_component_1&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">THEN<\/span><span style=\"color: #D4D4D4\"> numerical_value <\/span><span style=\"color: #569CD6\">END<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_principal_component_1,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">CASE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">WHEN<\/span><span style=\"color: #D4D4D4\"> feature = <\/span><span style=\"color: #CE9178\">&#39;principal_component_2&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">THEN<\/span><span style=\"color: #D4D4D4\"> numerical_value <\/span><span style=\"color: #569CD6\">END<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_principal_component_2,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">CASE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">WHEN<\/span><span style=\"color: #D4D4D4\"> feature = <\/span><span style=\"color: #CE9178\">&#39;principal_component_3&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">THEN<\/span><span style=\"color: #D4D4D4\"> numerical_value <\/span><span style=\"color: #569CD6\">END<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_principal_component_3<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  ML.CENTROIDS(MODEL <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.kmeans_pca_customer_segmentation`<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">GROUP BY<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  centroid_id<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">ORDER BY<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  avg_principal_component_1 <\/span><span style=\"color: #569CD6\">DESC<\/span><span style=\"color: #D4D4D4\">;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A9955\">-- Assign Clusters to Customers for PCA + K-Means<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">TABLE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.customer_clusters_pca`<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  customer_id,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  CENTROID_ID <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> cluster_id<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  ML.PREDICT(MODEL <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.kmeans_pca_customer_segmentation`<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    (<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">      <\/span><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        customer_id,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        principal_component_1,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        principal_component_2,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        principal_component_3<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">      <\/span><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #CE9178\">`predictive-behavior-analytics.Section6.pca_transformed_features`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    )<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  );<\/span><\/span>\n<span class=\"line\"><\/span><\/code><\/pre><\/div>\n\n\n\n<p>You can find&nbsp;<a href=\"https:\/\/github.com\/tomomitanaka00\/Blog-SQL\/blob\/main\/Section6.sql\">the complete code<\/a>&nbsp;in&nbsp;my GitHub repository.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Assigning Clusters to Customers<\/h4>\n\n\n\n<p>Once the clustering models have been trained, the next crucial step is to assign each customer to a specific cluster. This process allows us to categorize customers based on their behaviors and characteristics, enabling more targeted marketing strategies and personalized customer experiences.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Importance of Cluster Assignment<\/h4>\n\n\n\n<p>Assigning clusters to customers is a vital step in customer segmentation. It not only categorizes customers into meaningful groups but also serves as the foundation for subsequent analyses and actions. By understanding the characteristics of each cluster, businesses can:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Tailor Marketing Strategies:<\/strong> Design targeted campaigns that resonate with the specific needs and preferences of each customer segment.<\/li>\n\n\n\n<li><strong>Enhance Customer Engagement:<\/strong> Offer personalized experiences that improve customer satisfaction and loyalty.<\/li>\n\n\n\n<li><strong>Optimize Resource Allocation:<\/strong> Focus efforts and resources on high-value customer segments to maximize return on investment (ROI).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Understanding Cluster Characteristics<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">K-Means Clustering on Original Features<\/h4>\n\n\n\n<p>The K-Means clustering applied directly to the original features produced the following centroids:<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Results<\/h5>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td>Cluster<\/td><td class=\"has-text-align-right\" data-align=\"right\">Type<\/td><td class=\"has-text-align-right\" data-align=\"right\">Avg Total Revenue<\/td><td class=\"has-text-align-right\" data-align=\"right\">Avg Session Duration<\/td><\/tr><tr><td>Cluster 1<\/td><td class=\"has-text-align-right\" data-align=\"right\">High-value customers<\/td><td class=\"has-text-align-right\" data-align=\"right\">$10,109<\/td><td class=\"has-text-align-right\" data-align=\"right\">1,909 seconds<\/td><\/tr><tr><td>Cluster 2<\/td><td class=\"has-text-align-right\" data-align=\"right\">Mid-range customers<\/td><td class=\"has-text-align-right\" data-align=\"right\">$993<\/td><td class=\"has-text-align-right\" data-align=\"right\">1,517 seconds<\/td><\/tr><tr><td>Cluster 3<\/td><td class=\"has-text-align-right\" data-align=\"right\">Low-end customers<\/td><td class=\"has-text-align-right\" data-align=\"right\">$79<\/td><td class=\"has-text-align-right\" data-align=\"right\">1,037 seconds<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Cluster 1 (High-value customers):<\/strong> <strong>Ave. Total Revenue $10,109<\/strong><\/p>\n\n\n\n<p>This cluster represents high-value customers who have made large transactions. The long session duration suggests they spend a significant amount of time on the site per visit.<\/p>\n\n\n\n<p><strong>Cluster 2 (Mid-range customers):<\/strong> <strong>Ave. Total Revenue $993<\/strong><\/p>\n\n\n\n<p>This cluster represents mid-range customers who make moderate transactions. The transaction value is significantly lower than Cluster 1, but the session duration is still relatively high.<\/p>\n\n\n\n<p><strong>Cluster 3 (Low-end customers): Ave. Total Revenue $79<\/strong><\/p>\n\n\n\n<p>These customers represent the lower end of the spectrum, making small transactions. The shorter session duration also reflects less engagement compared to higher-value clusters.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">PCA + K-Means Clustering<\/h4>\n\n\n\n<p>When PCA was applied before K-Means clustering, the centroids were represented in terms of principal components:<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Results<\/h5>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Centroid 2:<\/strong><br><strong>Avg Principal Component 1:<\/strong> 256.71<br><strong>Avg Principal Component 2:<\/strong> 86.92<br><strong>Avg Principal Component 3:<\/strong> -6.01 <\/p>\n\n\n\n<p>This centroid is significantly larger in the first two principal components, suggesting that the corresponding cluster captures the most variance in the data. It likely represents the most distinct and impactful customer segment.<\/p>\n\n\n\n<p><strong>Centroid 1:<\/strong><br><strong>Avg Principal Component 1:<\/strong> 27.14<br><strong>Avg Principal Component 2:<\/strong> 0.46<br><strong>Avg Principal Component 3:<\/strong> -0.07<\/p>\n\n\n\n<p>This cluster is smaller but still distinct, representing customers who differ moderately from others.<\/p>\n\n\n\n<p><strong>Centroids 3,4,5:<\/strong><\/p>\n\n\n\n<p>Values close to zero These centroids suggest that the corresponding clusters are less distinct or represent customers whose behaviors do not vary significantly from the overall population.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Comparison of K-Means vs. PCA + K-Means<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Cluster Interpretability:<\/strong><\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>K-Means on Original Features:<\/strong> The centroids directly represent customer behaviors, making it easier to interpret the characteristics of each cluster. For example, high total revenue and transaction value directly indicate high-value customers.<\/li>\n\n\n\n<li><strong>PCA + K-Means:<\/strong> The clusters are defined in terms of principal components, which can be more challenging to interpret. However, PCA reduces dimensionality and might capture underlying patterns more effectively.<\/li>\n<\/ul>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Cluster Distinctiveness:<\/strong><\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>K-Means on Original Features:<\/strong> The clusters are highly distinct, with significant differences in revenue and transaction values across centroids.<\/li>\n\n\n\n<li><strong>PCA + K-Means:<\/strong> The clusters are distinct in terms of principal components, but the interpretation of what these components represent is less straightforward.<\/li>\n<\/ul>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Variance Capture:<\/strong><\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>K-Means on Original Features:<\/strong> The clustering is based on the original features, potentially leading to overfitting or less generalizable clusters.<\/li>\n\n\n\n<li><strong>PCA + K-Means:<\/strong> PCA helps capture the most significant variance in the data before clustering, potentially leading to more generalizable and robust clusters.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Conclusion<\/h4>\n\n\n\n<p>Both K-Means and PCA + K-Means have their strengths. <\/p>\n\n\n\n<p><strong>K-Means on Original Features<\/strong> is more interpretable, with centroids that directly reflect customer behaviors. <\/p>\n\n\n\n<p>In contrast, <strong>PCA + K-Means<\/strong> may capture more nuanced patterns in the data, but at the cost of interpretability. <\/p>\n\n\n\n<p>Depending on the business need\u2014whether it\u2019s understanding customer behavior directly or focusing on variance reduction\u2014one method may be preferred over the other.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Evaluating Customer Segments<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Silhouette Score Overview<\/h4>\n\n\n\n<p>The silhouette score is a metric used to evaluate the quality of clusters created by clustering algorithms like K-Means. It measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1, where:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A score close to 1 indicates that the data points are well-matched to their own cluster and poorly matched to neighboring clusters.<\/li>\n\n\n\n<li>A score close to 0 indicates that the data points are on or very close to the decision boundary between clusters.<\/li>\n\n\n\n<li>A negative score suggests that the data points may have been assigned to the wrong cluster.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Results Comparison<\/h4>\n\n\n\n<h5 class=\"wp-block-heading\"><strong>K-Means Clustering on Original Features:<\/strong><\/h5>\n\n\n\n<p><strong>Silhouette Score:<\/strong> 0.7139 <\/p>\n\n\n\n<p>The silhouette score of approximately 0.714 for K-Means on the original features suggests that the clusters are reasonably well-formed, with most data points being appropriately assigned to their respective clusters. <\/p>\n\n\n\n<p>However, the score is not very close to 1, indicating that there is still some overlap between clusters or that the clusters may not be optimally separated.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\"><strong>PCA + K-Means Clustering:<\/strong><\/h5>\n\n\n\n<p><strong>Silhouette Score:<\/strong> 0.9082 <\/p>\n\n\n\n<p>The silhouette score of approximately 0.908 for PCA + K-Means clustering is significantly higher than that of K-Means on the original features. <\/p>\n\n\n\n<p>This high score indicates that the clusters are well-separated and that the data points are much better matched to their respective clusters. <\/p>\n\n\n\n<p>The use of PCA before clustering appears to have enhanced the cluster separability, leading to more distinct and well-defined clusters.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Interpretation<\/h4>\n\n\n\n<p><strong>Cluster Quality:<\/strong><\/p>\n\n\n\n<p>The higher silhouette score for PCA + K-Means indicates superior cluster quality compared to K-Means on the original features. <\/p>\n\n\n\n<p>This suggests that applying PCA before clustering not only reduced the dimensionality but also captured the most significant variance in the data, leading to more coherent clusters.<\/p>\n\n\n\n<p><strong>Dimensionality Reduction:<\/strong><\/p>\n\n\n\n<p>By reducing the dimensionality with PCA, the data becomes more compact and focused on the most important features, which helps the K-Means algorithm perform better. <\/p>\n\n\n\n<p>The PCA transformation likely helped to eliminate noise and irrelevant features, which could have contributed to the improved clustering performance.<\/p>\n\n\n\n<p><strong>Interpretability:<\/strong><\/p>\n\n\n\n<p>While PCA + K-Means offers better clustering performance, the interpretation of clusters based on principal components can be more challenging compared to clusters formed directly on the original features. <\/p>\n\n\n\n<p>However, if the primary goal is to achieve the best possible clustering, PCA + K-Means is the preferable method based on these results.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n\n\n\n<p>The comparison of silhouette scores clearly indicates that <strong>PCA + K-Means<\/strong> is a more effective clustering approach in this scenario. It leads to better-defined clusters with less overlap and greater internal coherence. <\/p>\n\n\n\n<p>This method is particularly useful when the goal is to enhance the quality of clustering, even if it comes at the cost of some interpretability due to the transformation of original features into principal components. <\/p>\n\n\n\n<p>On the other hand, if interpretability of the original features is critical, you might prefer the K-Means clustering directly on the original features, despite the slightly lower clustering quality.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Understanding your customer base at a deeper level is essential for developing targeted marketing strategies, improving customer experiences, and driving business success. Customer segmentation is a powerful method for achieving this understanding. By grouping customers based on shared characteristics and behaviors, businesses can create personalized experiences that better meet the needs of each segment, leading<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":877,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-898","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/898","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=898"}],"version-history":[{"count":110,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/898\/revisions"}],"predecessor-version":[{"id":6371,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/898\/revisions\/6371"}],"up":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/877"}],"wp:attachment":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=898"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}