{"id":4107,"date":"2024-08-17T20:54:12","date_gmt":"2024-08-18T00:54:12","guid":{"rendered":"https:\/\/www.econai.tech\/?page_id=4107"},"modified":"2024-08-21T14:53:13","modified_gmt":"2024-08-21T18:53:13","slug":"revenue-prediction","status":"publish","type":"page","link":"https:\/\/tomomitanaka.ai\/?page_id=4107","title":{"rendered":"Revenue Prediction"},"content":{"rendered":"\n<p>Accurate revenue prediction is crucial for businesses to optimize their strategies and maximize profits. <\/p>\n\n\n\n<p>Previously, we explored <a href=\"https:\/\/www.econai.tech\/?page_id=902\">revenue prediction using BigQuery ML<\/a>, which provided a solid foundation but left room for improvement. <\/p>\n\n\n\n<p>In this post, we&#8217;ll dive into how we can enhance our revenue prediction model using Python, leveraging its flexibility and powerful libraries to achieve better results.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Table of Contents<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Motivation: Limitations of BigQuery ML<\/li>\n\n\n\n<li>Data Preparation<\/li>\n\n\n\n<li>Feature Engineering<\/li>\n\n\n\n<li>Model Selection<\/li>\n\n\n\n<li>Results and Comparison<\/li>\n\n\n\n<li>Conclusion<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Motivation: Limitations of BigQuery ML<\/h3>\n\n\n\n<p>Revenue prediction is a crucial aspect of understanding customer behavior and optimizing business strategies. While BigQuery ML offers an accessible platform for building models directly within your data warehouse, it can have limitations, particularly in handling complex interactions and feature engineering tasks.<\/p>\n\n\n\n<p>In my analysis using BigQuery ML, I implemented several models: Linear Regression, Lasso Regression, Ridge Regression, and Random Forest. <\/p>\n\n\n\n<p>The results revealed some critical insights:<\/p>\n\n\n\n<h5 class=\"wp-block-heading\"><strong>Results<\/strong> of the BigQuery ML analysis<\/h5>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><\/td><td class=\"has-text-align-right\" data-align=\"right\">Linear Regression<\/td><td class=\"has-text-align-right\" data-align=\"right\">Lasso Regression<\/td><td class=\"has-text-align-right\" data-align=\"right\">Ridge Regression<\/td><td class=\"has-text-align-right\" data-align=\"right\">Random Forest<\/td><\/tr><tr><td>Mean Absolute Error (MAE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">4.532<\/td><td class=\"has-text-align-right\" data-align=\"right\">3.927<\/td><td class=\"has-text-align-right\" data-align=\"right\">4.532<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>2.420<\/strong><\/td><\/tr><tr><td>Mean Squared Error (MSE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">2711.89<\/td><td class=\"has-text-align-right\" data-align=\"right\">2719.70<\/td><td class=\"has-text-align-right\" data-align=\"right\">2710.97<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>2503.89<\/strong><\/td><\/tr><tr><td>Mean Squared Log Error (MSLE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">2.573<\/td><td class=\"has-text-align-right\" data-align=\"right\">1.768<\/td><td class=\"has-text-align-right\" data-align=\"right\">2.568<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>0.373<\/strong><\/td><\/tr><tr><td>Median Absolute Error (MedAE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">1.554<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>1.083<\/strong><\/td><td class=\"has-text-align-right\" data-align=\"right\">1.554<\/td><td class=\"has-text-align-right\" data-align=\"right\">4.571<\/td><\/tr><tr><td>R-Squared (R\u00b2)<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.025<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.022<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.025<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>0.100<\/strong><\/td><\/tr><tr><td>Root Mean Squared Error (RMSE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">52.075<\/td><td class=\"has-text-align-right\" data-align=\"right\">52.150<\/td><td class=\"has-text-align-right\" data-align=\"right\">52.067<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>50.039<\/strong><\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><br><strong>Mean Absolute Error (MAE)<\/strong>: The Random Forest model in BigQuery ML significantly outperformed the linear models, with an MAE of 2.420 compared to over 4.5 for the others.<br><strong>Mean Squared Error (MSE)<\/strong>: Again, Random Forest had the lowest MSE, indicating better overall predictive accuracy.<br><strong>Mean Squared Log Error (MSLE)<\/strong>: This metric showed a stark difference, with Random Forest achieving an MSLE of 0.373, while the linear models were significantly higher.<br><strong>R-Squared (R\u00b2)<\/strong>: Despite the improvements, the R\u00b2 values were relatively low across the board, with the highest being 0.100 for Random Forest, indicating that the models were not capturing much of the variance in the data.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Data Preparation<\/h3>\n\n\n\n<p>Working with the Google Analytics Sample Dataset presents challenges due to its size and complexity. To overcome these, we implemented the following strategies:<\/p>\n\n\n\n<p><strong>Chunked Data Processing<\/strong>: Instead of loading the entire dataset at once, we processed it in smaller, manageable chunks. This approach significantly reduced memory usage and allowed for more efficient processing.<\/p>\n\n\n\n<p><strong>Data Cleaning and Flattening<\/strong>: The raw data contains nested structures that needed to be flattened for analysis. We performed this process in chunks, saving intermediate results to disk to manage memory effectively.<\/p>\n\n\n\n<p><strong>Imputation and Scaling<\/strong>: We handled missing values through imputation and normalized features through scaling, all done in chunks to maintain efficiency.<\/p>\n\n\n\n<p><strong>Leveraging Colab and A100<\/strong>: To handle the large dataset and intensive computations, I utilize Google Colab paired with an A100 GPU. This setup provided the necessary computational power to efficiently process and analyze the data.<\/p>\n\n\n\n<p>Here&#8217;s a snippet of our data preparation process:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"def clean_and_engineer_data(df):\n    logger.info(&quot;Starting data cleaning and feature engineering...&quot;)\n\n    # Flatten any nested columns\n    df_cleaned = flatten_nested_columns(df)\n    logger.info(f&quot;Flattened DataFrame shape: {df_cleaned.shape}&quot;)\n\n    # Identify columns with complex data types (arrays, lists, etc.)\n    complex_columns = []\n    for col in df_cleaned.columns:\n        sample = df_cleaned[col].head(100)\n        if sample.apply(lambda x: isinstance(x, (list, np.ndarray))).any():\n            logger.warning(f&quot;Column '{col}' contains complex data types. Dropping this column.&quot;)\n            complex_columns.append(col)\n\n    # Drop complex columns\n    df_cleaned = df_cleaned.drop(columns=complex_columns)\n\n    # Optimize data types to reduce memory usage\n    df_cleaned = optimize_dtypes(df_cleaned)\n\n    df_cleaned['date'] = pd.to_datetime(df_cleaned['date'], format='%Y%m%d')\n    numeric_columns = df_cleaned.select_dtypes(include=[np.number]).columns\n    categorical_columns = df_cleaned.select_dtypes(exclude=[np.number, 'datetime64']).columns\n\n    # Handle all-NaN columns\n    all_nan_columns = df_cleaned.columns[df_cleaned.isna().all()].tolist()\n    if all_nan_columns:\n        df_cleaned = df_cleaned.drop(columns=all_nan_columns)\n        numeric_columns = [col for col in numeric_columns if col not in all_nan_columns]\n        categorical_columns = [col for col in categorical_columns if col not in all_nan_columns]\n\n    # Imputation\n    for col in numeric_columns:\n        df_cleaned[col] = df_cleaned[col].astype('float32')  # Ensure the column is float\n        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mean())  # Fill NaN with mean\n    \n    for col in categorical_columns:\n        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mode().iloc[0])\n\n\n    # Feature engineering\n    df_cleaned['day_of_week'] = df_cleaned['date'].dt.dayofweek\n    df_cleaned['is_weekend'] = df_cleaned['day_of_week'].isin([5, 6]).astype(int)\n    df_cleaned['month'] = df_cleaned['date'].dt.month\n    df_cleaned['quarter'] = df_cleaned['date'].dt.quarter\n\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #569CD6\">def<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #DCDCAA\">clean_and_engineer_data<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #9CDCFE\">df<\/span><span style=\"color: #D4D4D4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    logger.info(<\/span><span style=\"color: #CE9178\">&quot;Starting data cleaning and feature engineering...&quot;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Flatten any nested columns<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_cleaned = flatten_nested_columns(df)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    logger.info(<\/span><span style=\"color: #569CD6\">f<\/span><span style=\"color: #CE9178\">&quot;Flattened DataFrame shape: <\/span><span style=\"color: #569CD6\">{<\/span><span style=\"color: #D4D4D4\">df_cleaned.shape<\/span><span style=\"color: #569CD6\">}<\/span><span style=\"color: #CE9178\">&quot;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Identify columns with complex data types (arrays, lists, etc.)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    complex_columns = []<\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #C586C0\">for<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #C586C0\">in<\/span><span style=\"color: #D4D4D4\"> df_cleaned.columns:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        sample = df_cleaned[col].head(<\/span><span style=\"color: #B5CEA8\">100<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> sample.apply(<\/span><span style=\"color: #569CD6\">lambda<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #9CDCFE\">x<\/span><span style=\"color: #D4D4D4\">: <\/span><span style=\"color: #DCDCAA\">isinstance<\/span><span style=\"color: #D4D4D4\">(x, (<\/span><span style=\"color: #4EC9B0\">list<\/span><span style=\"color: #D4D4D4\">, np.ndarray))).any():<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">            logger.warning(<\/span><span style=\"color: #569CD6\">f<\/span><span style=\"color: #CE9178\">&quot;Column &#39;<\/span><span style=\"color: #569CD6\">{<\/span><span style=\"color: #D4D4D4\">col<\/span><span style=\"color: #569CD6\">}<\/span><span style=\"color: #CE9178\">&#39; contains complex data types. Dropping this column.&quot;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">            complex_columns.append(col)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Drop complex columns<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_cleaned = df_cleaned.drop(<\/span><span style=\"color: #9CDCFE\">columns<\/span><span style=\"color: #D4D4D4\">=complex_columns)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Optimize data types to reduce memory usage<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_cleaned = optimize_dtypes(df_cleaned)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">] = pd.to_datetime(df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">], <\/span><span style=\"color: #9CDCFE\">format<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;%Y%m<\/span><span style=\"color: #569CD6\">%d<\/span><span style=\"color: #CE9178\">&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    numeric_columns = df_cleaned.select_dtypes(<\/span><span style=\"color: #9CDCFE\">include<\/span><span style=\"color: #D4D4D4\">=[np.number]).columns<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    categorical_columns = df_cleaned.select_dtypes(<\/span><span style=\"color: #9CDCFE\">exclude<\/span><span style=\"color: #D4D4D4\">=[np.number, <\/span><span style=\"color: #CE9178\">&#39;datetime64&#39;<\/span><span style=\"color: #D4D4D4\">]).columns<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Handle all-NaN columns<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    all_nan_columns = df_cleaned.columns[df_cleaned.isna().all()].tolist()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> all_nan_columns:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        df_cleaned = df_cleaned.drop(<\/span><span style=\"color: #9CDCFE\">columns<\/span><span style=\"color: #D4D4D4\">=all_nan_columns)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        numeric_columns = [col <\/span><span style=\"color: #C586C0\">for<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #C586C0\">in<\/span><span style=\"color: #D4D4D4\"> numeric_columns <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #569CD6\">not<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">in<\/span><span style=\"color: #D4D4D4\"> all_nan_columns]<\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        categorical_columns = [col <\/span><span style=\"color: #C586C0\">for<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #C586C0\">in<\/span><span style=\"color: #D4D4D4\"> categorical_columns <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #569CD6\">not<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">in<\/span><span style=\"color: #D4D4D4\"> all_nan_columns]<\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Imputation<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #C586C0\">for<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #C586C0\">in<\/span><span style=\"color: #D4D4D4\"> numeric_columns:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        df_cleaned[col] = df_cleaned[col].astype(<\/span><span style=\"color: #CE9178\">&#39;float32&#39;<\/span><span style=\"color: #D4D4D4\">)  <\/span><span style=\"color: #6A9955\"># Ensure the column is float<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mean())  <\/span><span style=\"color: #6A9955\"># Fill NaN with mean<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #C586C0\">for<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #C586C0\">in<\/span><span style=\"color: #D4D4D4\"> categorical_columns:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mode().iloc[<\/span><span style=\"color: #B5CEA8\">0<\/span><span style=\"color: #D4D4D4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Feature engineering<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;day_of_week&#39;<\/span><span style=\"color: #D4D4D4\">] = df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">].dt.dayofweek<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;is_weekend&#39;<\/span><span style=\"color: #D4D4D4\">] = df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;day_of_week&#39;<\/span><span style=\"color: #D4D4D4\">].isin([<\/span><span style=\"color: #B5CEA8\">5<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #B5CEA8\">6<\/span><span style=\"color: #D4D4D4\">]).astype(<\/span><span style=\"color: #4EC9B0\">int<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;month&#39;<\/span><span style=\"color: #D4D4D4\">] = df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">].dt.month<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;quarter&#39;<\/span><span style=\"color: #D4D4D4\">] = df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">].dt.quarter<\/span><\/span>\n<span class=\"line\"><\/span><\/code><\/pre><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\"><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Model Selection: Random Forest<\/h3>\n\n\n\n<p>For our revenue prediction task, we chose the Random Forest Regressor. This decision was based on several factors:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Ability to handle high-dimensional data:<\/strong> Random Forest can effectively manage the large number of features we engineered.<\/li>\n\n\n\n<li><strong>Robustness:<\/strong> It&#8217;s less prone to overfitting compared to single decision trees.<\/li>\n\n\n\n<li><strong>Feature importance:<\/strong> Random Forest provides insights into feature importance, helping us understand which factors most influence revenue.<\/li>\n\n\n\n<li><strong>Non-linear relationships:<\/strong> It can capture complex, non-linear relationships in the data.<\/li>\n\n\n\n<li><strong>Superior performance:<\/strong> The Random Forest model outperformed other models in key metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-Squared (R\u00b2) based on the BigQuery ML analysis, demonstrating its effectiveness in predicting revenue.<\/li>\n<\/ol>\n\n\n\n<h5 class=\"wp-block-heading\"><strong>Results<\/strong> of the BigQuery ML analysis<\/h5>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><\/td><td class=\"has-text-align-right\" data-align=\"right\">Linear Regression<\/td><td class=\"has-text-align-right\" data-align=\"right\">Lasso Regression<\/td><td class=\"has-text-align-right\" data-align=\"right\">Ridge Regression<\/td><td class=\"has-text-align-right\" data-align=\"right\">Random Forest<\/td><\/tr><tr><td>Mean Absolute Error (MAE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">4.532<\/td><td class=\"has-text-align-right\" data-align=\"right\">3.927<\/td><td class=\"has-text-align-right\" data-align=\"right\">4.532<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>2.420<\/strong><\/td><\/tr><tr><td>Mean Squared Error (MSE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">2711.89<\/td><td class=\"has-text-align-right\" data-align=\"right\">2719.70<\/td><td class=\"has-text-align-right\" data-align=\"right\">2710.97<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>2503.89<\/strong><\/td><\/tr><tr><td>Mean Squared Log Error (MSLE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">2.573<\/td><td class=\"has-text-align-right\" data-align=\"right\">1.768<\/td><td class=\"has-text-align-right\" data-align=\"right\">2.568<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>0.373<\/strong><\/td><\/tr><tr><td>Median Absolute Error (MedAE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">1.554<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>1.083<\/strong><\/td><td class=\"has-text-align-right\" data-align=\"right\">1.554<\/td><td class=\"has-text-align-right\" data-align=\"right\">4.571<\/td><\/tr><tr><td>R-Squared (R\u00b2)<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.025<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.022<\/td><td class=\"has-text-align-right\" data-align=\"right\">0.025<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>0.100<\/strong><\/td><\/tr><tr><td>Root Mean Squared Error (RMSE)<\/td><td class=\"has-text-align-right\" data-align=\"right\">52.075<\/td><td class=\"has-text-align-right\" data-align=\"right\">52.150<\/td><td class=\"has-text-align-right\" data-align=\"right\">52.067<\/td><td class=\"has-text-align-right\" data-align=\"right\"><strong>50.039<\/strong><\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><br><\/figcaption><\/figure>\n\n\n\n<pre class=\"wp-block-preformatted\"><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Results and Comparison<\/h3>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n\n\n\n<div class=\"wp-block-jin-gb-block-icon-box jin-icon-caution jin-iconbox\"><div class=\"jin-iconbox-icons\"><i class=\"jic jin-ifont-caution jin-icons\"><\/i><\/div><div class=\"jin-iconbox-main\">\n<p>Due to the large size of the dataset, I am currently encountering challenges in obtaining prediction results. Stay tuned for updates as I work to resolve these issues!<\/p>\n<\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Accurate revenue prediction is crucial for businesses to optimize their strategies and maximize profits. Previously, we explored revenue prediction using BigQuery ML, which provided a solid foundation but left room for improvement. In this post, we#8217;ll dive into how we can enhance our revenue prediction model using Python, leveraging its flexibility and powerful libraries to<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":3822,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-4107","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/4107","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4107"}],"version-history":[{"count":20,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/4107\/revisions"}],"predecessor-version":[{"id":5010,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/4107\/revisions\/5010"}],"up":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/3822"}],"wp:attachment":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4107"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}