{"id":896,"date":"2024-08-05T10:58:11","date_gmt":"2024-08-05T14:58:11","guid":{"rendered":"https:\/\/www.econai.tech\/?page_id=896"},"modified":"2026-05-06T08:54:19","modified_gmt":"2026-05-06T12:54:19","slug":"predicting-user-conversion","status":"publish","type":"page","link":"https:\/\/tomomitanaka.ai\/?page_id=896","title":{"rendered":"Predicting User Conversion"},"content":{"rendered":"\n<p>In today&#8217;s data-driven marketing landscape, predicting which users are likely to convert is crucial for optimizing strategies and improving ROI. In this blog post, we&#8217;ll walk through the process of building user conversion prediction models using BigQuery ML and the Google Analytics Sample Dataset. We&#8217;ll cover feature development, model building, and evaluation, all using SQL.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Contents<\/h4>\n\n\n\n<p><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Feature Development<\/li>\n\n\n\n<li>Building and Training Predictive Models\n<ul class=\"wp-block-list\">\n<li>Logistic Regression Model<\/li>\n\n\n\n<li>Random Forest Model<\/li>\n\n\n\n<li>XGBoost Model<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Confusion Matrix Results<\/li>\n\n\n\n<li>Performance Comparison<\/li>\n\n\n\n<li>Feature Importance<\/li>\n\n\n\n<li>Conclusion<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">1. 
Feature Development<\/h3>\n\n\n\n<p>Let&#8217;s create features that could be indicative of user conversion:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">SQL<\/span><span role=\"button\" tabindex=\"0\" data-code=\"CREATE OR REPLACE TABLE `your-project.your-dataset.user_features` AS\nWITH user_sessions AS (\n  SELECT\n    fullVisitorId,\n    PARSE_DATE('%Y%m%d', date) AS visit_date,\n    totals.transactions,\n    totals.timeOnSite,\n    totals.pageviews,\n    device.deviceCategory,\n    geoNetwork.country,\n    trafficSource.medium\n  FROM\n    `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n  WHERE\n    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'\n)\nSELECT\n  fullVisitorId,\n  MAX(CASE WHEN transactions &gt; 0 THEN 1 ELSE 0 END) AS has_converted,\n  COUNT(DISTINCT visit_date) AS num_visits,\n  AVG(timeOnSite) AS avg_time_on_site,\n  AVG(pageviews) AS avg_pageviews,\n  MAX(deviceCategory) AS device_category,\n  MAX(country) AS country,\n  MAX(medium) AS traffic_medium,\n  SUM(pageviews) AS total_pageviews,\n  SUM(timeOnSite) AS total_time_on_site,\n  DATE_DIFF(MAX(visit_date), MIN(visit_date), DAY) AS days_since_first_visit\nFROM\n  user_sessions\nGROUP BY\n  fullVisitorId;\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 
2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">TABLE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_features`<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">WITH<\/span><span style=\"color: #D4D4D4\"> user_sessions <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> (<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    fullVisitorId,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    PARSE_DATE(<\/span><span style=\"color: #CE9178\">&#39;%Y%m%d&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #569CD6\">date<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> visit_date,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    totals.transactions,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    totals.timeOnSite,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">   
 totals.pageviews,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    device.deviceCategory,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    geoNetwork.country,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    trafficSource.medium<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #CE9178\">`bigquery-public-data.google_analytics_sample.ga_sessions_*`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #569CD6\">WHERE<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    _TABLE_SUFFIX <\/span><span style=\"color: #569CD6\">BETWEEN<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&#39;20170701&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AND<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">&#39;20170801&#39;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  fullVisitorId,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">CASE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">WHEN<\/span><span style=\"color: #D4D4D4\"> transactions &gt; <\/span><span style=\"color: #B5CEA8\">0<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">THEN<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #B5CEA8\">1<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">ELSE<\/span><span 
style=\"color: #D4D4D4\"> <\/span><span style=\"color: #B5CEA8\">0<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">END<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> has_converted,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">COUNT<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">DISTINCT<\/span><span style=\"color: #D4D4D4\"> visit_date) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> num_visits,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">AVG<\/span><span style=\"color: #D4D4D4\">(timeOnSite) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_time_on_site,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">AVG<\/span><span style=\"color: #D4D4D4\">(pageviews) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> avg_pageviews,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(deviceCategory) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> device_category,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(country) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> country,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(medium) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> traffic_medium,<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">SUM<\/span><span style=\"color: #D4D4D4\">(pageviews) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> total_pageviews,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #DCDCAA\">SUM<\/span><span style=\"color: #D4D4D4\">(timeOnSite) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> total_time_on_site,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  DATE_DIFF(<\/span><span style=\"color: #DCDCAA\">MAX<\/span><span style=\"color: #D4D4D4\">(visit_date), <\/span><span style=\"color: #DCDCAA\">MIN<\/span><span style=\"color: #D4D4D4\">(visit_date), <\/span><span style=\"color: #569CD6\">DAY<\/span><span style=\"color: #D4D4D4\">) <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> days_since_first_visit<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  user_sessions<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">GROUP BY<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  fullVisitorId;<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>In this query, we&#8217;ve created several features that might be predictive of conversion:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>has_converted: Whether the user has made a purchase (our target variable)<\/li>\n\n\n\n<li>num_visits: Number of distinct days on which the user visited<\/li>\n\n\n\n<li>avg_time_on_site: Average time spent on the site per visit<\/li>\n\n\n\n<li>avg_pageviews: Average number of pages viewed per visit<\/li>\n\n\n\n<li>device_category: The user&#8217;s device type<\/li>\n\n\n\n<li>country: The user&#8217;s country<\/li>\n\n\n\n<li>traffic_medium: The traffic source medium<\/li>\n\n\n\n<li>total_pageviews: Total number of pages viewed across all 
visits<\/li>\n\n\n\n<li>total_time_on_site: Total time spent on the site across all visits<\/li>\n\n\n\n<li>days_since_first_visit: Number of days between the user&#8217;s first and last visit<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">2. Building and Training Predictive Models<\/h3>\n\n\n\n<p>Let&#8217;s build three different types of models to predict user conversion: Logistic Regression, Random Forest, and XGBoost.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Logistic Regression Model<\/h4>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">SQL<\/span><span role=\"button\" tabindex=\"0\" data-code=\"CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_logistic`\nOPTIONS(model_type='logistic_reg', input_label_cols=['has_converted']) AS\nSELECT\n  * EXCEPT(fullVisitorId)\nFROM\n  `your-project.your-dataset.user_features`;\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 
2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> MODEL <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_conversion_logistic`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">OPTIONS(model_type=<\/span><span style=\"color: #CE9178\">&#39;logistic_reg&#39;<\/span><span style=\"color: #D4D4D4\">, input_label_cols=[&#39;has_converted&#39;]) <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  * <\/span><span style=\"color: #569CD6\">EXCEPT<\/span><span style=\"color: #D4D4D4\">(fullVisitorId)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_features`<\/span><span style=\"color: #D4D4D4\">;<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Random Forest Model<\/h4>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">SQL<\/span><span role=\"button\" tabindex=\"0\" data-code=\"CREATE OR REPLACE MODEL 
`your-project.your-dataset.user_conversion_random_forest`\nOPTIONS(model_type='random_forest_classifier', input_label_cols=['has_converted']) AS\nSELECT\n  * EXCEPT(fullVisitorId)\nFROM\n  `your-project.your-dataset.user_features`;\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> MODEL <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_conversion_random_forest`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">OPTIONS(model_type=<\/span><span style=\"color: #CE9178\">&#39;random_forest_classifier&#39;<\/span><span style=\"color: #D4D4D4\">, input_label_cols=[&#39;has_converted&#39;]) <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  * <\/span><span style=\"color: #569CD6\">EXCEPT<\/span><span style=\"color: #D4D4D4\">(fullVisitorId)<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_features`<\/span><span style=\"color: #D4D4D4\">;<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">XGBoost Model<\/h4>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">SQL<\/span><span role=\"button\" tabindex=\"0\" data-code=\"CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_xgboost`\nOPTIONS(model_type='boosted_tree_classifier', input_label_cols=['has_converted']) AS\nSELECT\n  * EXCEPT(fullVisitorId)\nFROM\n  `your-project.your-dataset.user_features`;\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: 
#569CD6\">CREATE<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">OR<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">REPLACE<\/span><span style=\"color: #D4D4D4\"> MODEL <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_conversion_xgboost`<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">OPTIONS(model_type=<\/span><span style=\"color: #CE9178\">&#39;boosted_tree_classifier&#39;<\/span><span style=\"color: #D4D4D4\">, input_label_cols=[&#39;has_converted&#39;]) <\/span><span style=\"color: #569CD6\">AS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  * <\/span><span style=\"color: #569CD6\">EXCEPT<\/span><span style=\"color: #D4D4D4\">(fullVisitorId)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_features`<\/span><span style=\"color: #D4D4D4\">;<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3. 
Confusion Matrix Results<\/h3>\n\n\n\n<p>Let&#8217;s evaluate the performance of all three models. The query below returns summary metrics for each model; the confusion matrices analyzed afterwards can be produced with BigQuery ML&#8217;s ML.CONFUSION_MATRIX function.<\/p>\n\n\n\n
<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">SQL<\/span><span role=\"button\" tabindex=\"0\" data-code=\"-- Evaluate all three models. ML.EVALUATE takes a literal model name,\n-- so each model is evaluated separately and the results are combined.\nSELECT\n  'user_conversion_logistic' AS model,\n  accuracy, precision, recall, f1_score, log_loss, roc_auc\nFROM\n  ML.EVALUATE(MODEL `your-project.your-dataset.user_conversion_logistic`,\n    (SELECT * EXCEPT(fullVisitorId) FROM `your-project.your-dataset.user_features`))\nUNION ALL\nSELECT\n  'user_conversion_random_forest' AS model,\n  accuracy, precision, recall, f1_score, log_loss, roc_auc\nFROM\n  ML.EVALUATE(MODEL `your-project.your-dataset.user_conversion_random_forest`,\n    (SELECT * EXCEPT(fullVisitorId) FROM `your-project.your-dataset.user_features`))\nUNION ALL\nSELECT\n  'user_conversion_xgboost' AS model,\n  accuracy, precision, recall, f1_score, log_loss, roc_auc\nFROM\n  ML.EVALUATE(MODEL `your-project.your-dataset.user_conversion_xgboost`,\n    (SELECT * EXCEPT(fullVisitorId) FROM `your-project.your-dataset.user_features`))\nORDER BY model;\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\">-- Evaluate all three models. ML.EVALUATE takes a literal model name,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #6A9955\">-- so each model is evaluated separately and the results are combined.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">&#39;user_conversion_logistic&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> model,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  accuracy, <\/span><span style=\"color: #569CD6\">precision<\/span><span style=\"color: #D4D4D4\">, recall, f1_score, log_loss, roc_auc<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  ML.EVALUATE(MODEL <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_conversion_logistic`<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    (<\/span><span style=\"color: #569CD6\">SELECT<\/span><span style=\"color: #D4D4D4\"> * <\/span><span style=\"color: #569CD6\">EXCEPT<\/span><span style=\"color: #D4D4D4\">(fullVisitorId) <\/span><span style=\"color: #569CD6\">FROM<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_features`<\/span><span style=\"color: #D4D4D4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">UNION ALL<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">&#39;user_conversion_random_forest&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> model,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  accuracy, <\/span><span style=\"color: #569CD6\">precision<\/span><span style=\"color: #D4D4D4\">, recall, f1_score, log_loss, roc_auc<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  ML.EVALUATE(MODEL <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_conversion_random_forest`<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    (<\/span><span style=\"color: #569CD6\">SELECT<\/span><span style=\"color: #D4D4D4\"> * <\/span><span style=\"color: #569CD6\">EXCEPT<\/span><span style=\"color: #D4D4D4\">(fullVisitorId) <\/span><span style=\"color: #569CD6\">FROM<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_features`<\/span><span style=\"color: #D4D4D4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">UNION ALL<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">SELECT<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  <\/span><span style=\"color: #CE9178\">&#39;user_conversion_xgboost&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">AS<\/span><span style=\"color: #D4D4D4\"> model,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  accuracy, <\/span><span style=\"color: #569CD6\">precision<\/span><span style=\"color: #D4D4D4\">, recall, f1_score, log_loss, roc_auc<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">FROM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">  ML.EVALUATE(MODEL <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_conversion_xgboost`<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    (<\/span><span style=\"color: #569CD6\">SELECT<\/span><span style=\"color: #D4D4D4\"> * <\/span><span style=\"color: #569CD6\">EXCEPT<\/span><span style=\"color: #D4D4D4\">(fullVisitorId) <\/span><span style=\"color: #569CD6\">FROM<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #CE9178\">`your-project.your-dataset.user_features`<\/span><span style=\"color: #D4D4D4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">ORDER BY<\/span><span style=\"color: #D4D4D4\"> model;<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n
<p><\/p>\n\n\n\n<p>Confusion matrices provide a detailed breakdown of the classification results by showing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Here&#8217;s an analysis of the confusion matrices for the Logistic Regression, Random Forest, and XGBoost models.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Logistic Regression<\/h4>\n\n\n\n<p>Logistic Regression shows a high number of true negatives (59,339) and a low number of false positives (184), which indicates that it correctly identifies non-converters most of the time. However, it struggles to identify converters: of the 1,000 actual converters it misses 791 (false negatives) and catches only 209 (true positives). 
<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><\/td><td class=\"has-text-align-right\" data-align=\"right\">Predicted: 0 (not converted)<\/td><td class=\"has-text-align-right\" data-align=\"right\">Predicted: 1 (converted)<\/td><\/tr><tr><td>Actual: 0 (not converted)<\/td><td class=\"has-text-align-right\" data-align=\"right\">(TN) 59,339<\/td><td class=\"has-text-align-right\" data-align=\"right\">(FP) 184<\/td><\/tr><tr><td>Actual: 1 (converted)<\/td><td class=\"has-text-align-right\" data-align=\"right\">(FN) 791<\/td><td class=\"has-text-align-right\" data-align=\"right\">(TP) 209<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Random Forest<\/h4>\n\n\n\n<p>The Random Forest model detects far more true positives than Logistic Regression (639 vs. 209), so it is better at identifying actual converters. False negatives drop to 361, meaning fewer actual converters are missed. The model also keeps the number of false positives low (126 vs. 184), so it rarely predicts a conversion where there is none. 
Overall, Random Forest offers a better balance between detecting converters and avoiding false alarms.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><\/td><td class=\"has-text-align-right\" data-align=\"right\">Predicted: 0 (not converted)<\/td><td class=\"has-text-align-right\" data-align=\"right\">Predicted: 1 (converted)<\/td><\/tr><tr><td>Actual: 0 (not converted)<\/td><td class=\"has-text-align-right\" data-align=\"right\">(TN) 59,397<\/td><td class=\"has-text-align-right\" data-align=\"right\">(FP) 126<\/td><\/tr><tr><td>Actual: 1 (converted)<\/td><td class=\"has-text-align-right\" data-align=\"right\">(FN) 361<\/td><td class=\"has-text-align-right\" data-align=\"right\">(TP) 639<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">XGBoost<\/h4>\n\n\n\n<p>XGBoost falls short of Random Forest on both types of error. It produces nearly twice as many false positives (246 vs. 126) and more false negatives (512 vs. 361), meaning it misses more actual converters. 
Despite having a higher true positive count (488) than Logistic Regression, XGBoost underperforms compared to Random Forest in detecting actual converters.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><\/td><td class=\"has-text-align-right\" data-align=\"right\">Predicted: 0 (not converted)<\/td><td class=\"has-text-align-right\" data-align=\"right\">Predicted: 1 (converted)<\/td><\/tr><tr><td>Actual: 0 (not converted)<\/td><td class=\"has-text-align-right\" data-align=\"right\">(TN) 59,277<\/td><td class=\"has-text-align-right\" data-align=\"right\">(FP) 246<\/td><\/tr><tr><td>Actual: 1 (converted)<\/td><td class=\"has-text-align-right\" data-align=\"right\"> (FN) 512<\/td><td class=\"has-text-align-right\" data-align=\"right\">(TP) 488<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Based on the confusion matrix analysis, <strong>Random Forest<\/strong> emerges as the top-performing model for predicting user conversion due to its superior balance between identifying actual conversions and minimizing false predictions.<\/p>\n\n\n\n<p><strong>XGBoost<\/strong> is a viable alternative but may require adjustments, while <strong>Logistic Regression<\/strong> is the least effective in this context, particularly due to its high false negative rate.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. 
Performance Comparison<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><\/td><td class=\"has-text-align-right\" data-align=\"right\">Logistic Regression<\/td><td class=\"has-text-align-right\" data-align=\"right\">Random Forest<\/td><td class=\"has-text-align-right\" data-align=\"right\">XGBoost<\/td><\/tr><tr><td>Precision<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.532<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.835<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.665<\/td><\/tr><tr><td>Recall<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.209<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.639<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.488<\/td><\/tr><tr><td>Accuracy<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.984<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.992<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.987<\/td><\/tr><tr><td>F1 Score<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.300<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.724<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.563<\/td><\/tr><tr><td>Log Loss<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.046<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.144<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.028<\/td><\/tr><tr><td>AUC<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.986<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.986<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.993<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>When evaluating predictive models, key metrics include precision, recall, accuracy, F1 score, Log Loss, and AUC.<\/p>\n\n\n\n<p><strong>Precision<\/strong> measures how often positive predictions are correct, while <strong>recall <\/strong>assesses how well the 
model identifies actual positives. <strong>Accuracy<\/strong> gives an overall correctness measure, but can be misleading in imbalanced datasets. <strong>F1 Score<\/strong> balances precision and recall. <strong>Log Loss<\/strong> penalizes incorrect predictions, and <strong>AUC<\/strong> shows the model&#8217;s ability to distinguish between classes.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>1. Precision<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Logistic Regression (0.532):<\/strong> This model shows moderate precision, meaning that 53.2% of the users it predicted as converters were indeed converters. However, this also indicates that nearly half of the positive predictions were false positives.<\/li>\n\n\n\n<li><strong>Random Forest (0.835):<\/strong> Random Forest excels in precision with 83.5%, making it very effective in minimizing false positives. This model is ideal if the goal is to avoid wasting resources on users who are unlikely to convert.<\/li>\n\n\n\n<li><strong>XGBoost (0.665):<\/strong> XGBoost offers a precision of 66.5%, which is higher than Logistic Regression but lower than Random Forest. <\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>2. Recall<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Logistic Regression (0.209):<\/strong> The recall of 20.9% suggests that Logistic Regression misses a significant portion of actual converters. While it is conservative in predicting conversions, it could lead to lost opportunities by not identifying potential converters.<\/li>\n\n\n\n<li><strong>Random Forest (0.639):<\/strong> With a recall of 63.9%, Random Forest is much better at capturing true positives. It balances between identifying converters and maintaining precision, making it a strong candidate for conversion prediction.<\/li>\n\n\n\n<li><strong>XGBoost (0.488):<\/strong> XGBoost\u2019s recall of 48.8% falls between Logistic Regression and Random Forest. 
It is less aggressive than Random Forest but still better than Logistic Regression in identifying potential converters.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>3. Accuracy<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Logistic Regression (0.984):<\/strong> Despite its lower recall and precision, Logistic Regression shows high accuracy at 98.4%. However, accuracy can be misleading in imbalanced datasets, where the majority class dominates.<\/li>\n\n\n\n<li><strong>Random Forest (0.992):<\/strong> Random Forest\u2019s accuracy of 99.2% is the highest of the three, reflecting fewer false positives and false negatives than the other models.<\/li>\n\n\n\n<li><strong>XGBoost (0.987):<\/strong> XGBoost also shows high accuracy at 98.7%, slightly below Random Forest but still indicating strong overall performance.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>4. F1 Score<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Logistic Regression (0.300):<\/strong> The F1 score of 30.0% is low, reflecting the model&#8217;s struggle to balance precision and recall. This low score indicates that the model is not well-suited for conversion prediction, especially if both precision and recall are important.<\/li>\n\n\n\n<li><strong>Random Forest (0.724):<\/strong> Random Forest achieves the highest F1 score at 72.4%, suggesting a good balance between precision and recall. This makes it a robust choice for conversion prediction where both metrics are crucial.<\/li>\n\n\n\n<li><strong>XGBoost (0.563):<\/strong> XGBoost\u2019s F1 score of 56.3% suggests it provides a better balance than Logistic Regression but is noticeably less effective than Random Forest. <\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>5. 
Log Loss<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Logistic Regression (0.046):<\/strong> Logistic Regression has a low log loss of 0.046, second only to XGBoost, indicating that its predicted probabilities are well calibrated, even though its overall effectiveness is compromised by low recall.<\/li>\n\n\n\n<li><strong>Random Forest (0.144):<\/strong> Random Forest has the highest log loss at 0.144, indicating that although its class predictions are accurate, its predicted probabilities are less well calibrated than those of the other two models.<\/li>\n\n\n\n<li><strong>XGBoost (0.028):<\/strong> XGBoost has the lowest log loss of the three models at 0.028, indicating that its predicted probabilities track the actual outcomes most closely. This, combined with its balanced performance, makes it a strong contender.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>6. AUC<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Logistic Regression (0.986):<\/strong> The AUC of 98.6% indicates that Logistic Regression is very good at distinguishing between converters and non-converters, despite its lower recall.<\/li>\n\n\n\n<li><strong>Random Forest (0.986):<\/strong> Random Forest also achieves an AUC of 98.6%, matching Logistic Regression and showing strong overall performance in distinguishing between classes.<\/li>\n\n\n\n<li><strong>XGBoost (0.993):<\/strong> XGBoost has the highest AUC at 99.3%, indicating that it is the most effective model at distinguishing between converters and non-converters, making it particularly powerful for this task.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Logistic Regression:<\/strong> Offers high accuracy and low log loss, but its low recall and F1 score suggest it is less effective for predicting user conversions when both precision and recall are essential.<\/li>\n\n\n\n<li><strong>Random Forest:<\/strong> Strikes a strong balance between precision, recall, and overall accuracy. 
Its high F1 score and solid AUC make it a very reliable model for conversion prediction.<\/li>\n\n\n\n<li><strong>XGBoost:<\/strong> Delivers the best AUC and the lowest log loss, indicating strong predictive power and confidence. It provides a good balance, though it may require fine-tuning to optimize precision and recall fully.<\/li>\n<\/ul>\n\n\n\n<p>For predicting user conversion, <strong>XGBoost<\/strong> stands out as a robust model, particularly for scenarios where distinguishing between converters and non-converters is critical. <\/p>\n\n\n\n<p>However, <strong>Random Forest<\/strong> remains a strong alternative, offering a slightly better balance between precision and recall.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5. Feature Importance<\/h3>\n\n\n\n<p>Given that both the Confusion Matrix results and the performance metrics comparison indicate the Random Forest model outperforms the other two models, I will focus on the Random Forest model to discuss feature importance. <\/p>\n\n\n\n<p>Understanding the key features driving the Random Forest model&#8217;s success is crucial. The model highlights &#8220;avg_time_on_site&#8221; and &#8220;total_pageviews&#8221; as the top factors, indicating that user engagement and browsing behavior are pivotal in predicting conversion potential within a non-linear context. 
<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td>avg_time_on_site<\/td><td class=\"has-text-align-right\" data-align=\"right\">14436.0<\/td><\/tr><tr><td>total_pageviews<\/td><td class=\"has-text-align-right\" data-align=\"right\">14121.0<\/td><\/tr><tr><td>avg_pageviews<\/td><td class=\"has-text-align-right\" data-align=\"right\">7509.0<\/td><\/tr><tr><td>total_time_on_site<\/td><td class=\"has-text-align-right\" data-align=\"right\">7108.0<\/td><\/tr><tr><td>days_since_first_visit<\/td><td class=\"has-text-align-right\" data-align=\"right\">7085.0<\/td><\/tr><tr><td>num_visits<\/td><td class=\"has-text-align-right\" data-align=\"right\">5667.0<\/td><\/tr><tr><td>traffic_medium<\/td><td class=\"has-text-align-right\" data-align=\"right\">3484.0<\/td><\/tr><tr><td>device_category<\/td><td class=\"has-text-align-right\" data-align=\"right\">1852.0<\/td><\/tr><tr><td>country<\/td><td class=\"has-text-align-right\" data-align=\"right\">825.0<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>While the Random Forest model stands out in this analysis, it&#8217;s also worth noting that Logistic Regression and XGBoost, while slightly less effective, provide valuable insights. <\/p>\n\n\n\n<p>However, they may not fully capture the complexity of user behavior that the Random Forest model does, particularly in terms of how engagement metrics and traffic sources interact to influence conversion outcomes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6. Conclusion<\/h3>\n\n\n\n<p>In the competitive landscape of digital marketing, accurately predicting user conversion is essential for optimizing strategies and maximizing ROI.<\/p>\n\n\n\n<p>I built and compared three models: Logistic Regression, Random Forest, and XGBoost. <\/p>\n\n\n\n<p>Each model provided valuable insights, with Random Forest emerging as the most balanced in terms of precision, recall, and overall accuracy. 
<\/p>\n\n\n\n<p>XGBoost, while showing exceptional AUC and low log loss, also proved to be a strong contender, especially in distinguishing between converters and non-converters.<\/p>\n\n\n\n<p>The analysis of feature importance within the Random Forest model underscored the significance of user engagement metrics like average time on site and total page views in predicting conversions. <\/p>\n\n\n\n<p>These findings highlight the importance of understanding user behavior holistically to develop more effective marketing strategies.<\/p>\n\n\n\n<p>You can find&nbsp;<a href=\"https:\/\/github.com\/tomomitanaka00\/Blog-SQL\/blob\/main\/Section7.sql\">the complete code<\/a>&nbsp;in&nbsp;my GitHub repository.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In today&#8217;s data-driven marketing landscape, predicting which users are likely to convert is crucial for optimizing strategies and improving ROI. In this blog post, we&#8217;ll walk through the process of building user conversion prediction models using BigQuery ML and the Google Analytics Sample Dataset. 
We&#8217;ll cover feature development, model building, and evaluation, all using SQL.<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":877,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-896","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/896","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=896"}],"version-history":[{"count":150,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/896\/revisions"}],"predecessor-version":[{"id":6372,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/896\/revisions\/6372"}],"up":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/877"}],"wp:attachment":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=896"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}