{"id":4105,"date":"2024-08-17T20:53:34","date_gmt":"2024-08-18T00:53:34","guid":{"rendered":"https:\/\/www.econai.tech\/?page_id=4105"},"modified":"2024-08-18T20:43:21","modified_gmt":"2024-08-19T00:43:21","slug":"sales-prediction","status":"publish","type":"page","link":"https:\/\/tomomitanaka.ai\/?page_id=4105","title":{"rendered":"Sales Prediction"},"content":{"rendered":"\n<p>In this analysis, I explore the process of predicting e-commerce sales using Python, leveraging data from the &nbsp;<a href=\"https:\/\/support.google.com\/analytics\/answer\/7586738?hl=en#zippy=%2Cin-this-article\">Google Analytics Sample Dataset in BigQuery<\/a>. <\/p>\n\n\n\n<p>While BigQuery ML provides a powerful and scalable platform for building machine learning models, Python offers more flexibility and control, which can be particularly beneficial when working with complex data processing and custom model development. <\/p>\n\n\n\n<p>However, handling large datasets in Python presents unique challenges, especially in terms of memory management and processing efficiency.<\/p>\n\n\n\n<p>You can find&nbsp;<a href=\"https:\/\/github.com\/tomomitanaka00\/Blog-SQL\/blob\/main\/Sales_Prediction.ipynb\">the complete code<\/a>&nbsp;in&nbsp;my GitHub repository.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Contents<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Results of BigQuery ML Analysis as Motivation<\/li>\n\n\n\n<li>Data Preparation<\/li>\n\n\n\n<li>Feature Engineering<\/li>\n\n\n\n<li>Random Forest Model<\/li>\n\n\n\n<li>Results<\/li>\n\n\n\n<li>Conclusion<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">1. Results of BigQuery ML Analysis as Motivation<\/h3>\n\n\n\n<p>The <a href=\"https:\/\/www.econai.tech\/?page_id=883\">initial BigQuery ML analysis <\/a>provided a solid foundation for understanding the predictive capabilities of models like Logistic Regression, Random Forest, and XGBoost. 
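<\/p>\n\n\n\n<p>The classification metrics reported in this section (AUC, precision, recall, accuracy, F1) can be reproduced with scikit-learn. The sketch below uses synthetic labels and scores, purely to show how each number in the table is computed; it is not the project&#8217;s data:<\/p>

```python
# Sketch: how the classification metrics in the table are computed with
# scikit-learn. The labels and scores below are synthetic stand-ins.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # actual sale / no-sale
y_prob = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7, 0.2, 0.95]  # predicted P(sale)
y_pred = [int(p >= 0.5) for p in y_prob]             # hard labels at 0.5

print("AUC      :", roc_auc_score(y_true, y_prob))   # AUC uses probabilities
print("Precision:", precision_score(y_true, y_pred)) # the rest use hard labels
print("Recall   :", recall_score(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```

<p>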
<\/p>\n\n\n\n<p>While BigQuery ML&#8217;s scalability and SQL integration enabled rapid model development, limitations in model complexity and feature engineering prompted further exploration using Python.<\/p>\n\n\n\n<p>Upon expanding the feature set, we observed near-perfect performance across all models:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Model<\/th><th>AUC<\/th><th>Precision<\/th><th>Recall<\/th><th>Accuracy<\/th><th>F1 Score<\/th><\/tr><\/thead><tbody><tr><td>Logistic Regression<\/td><td>1.00<\/td><td>0.99<\/td><td>0.99<\/td><td>1.00<\/td><td>0.99<\/td><\/tr><tr><td>Random Forest Classifier<\/td><td>1.00<\/td><td>1.00<\/td><td>1.00<\/td><td>1.00<\/td><td>1.00<\/td><\/tr><tr><td>XGBoost<\/td><td>1.00<\/td><td>1.00<\/td><td>1.00<\/td><td>1.00<\/td><td>1.00<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The perfect precision, recall, and F1 scores achieved by Random Forest and XGBoost highlight their ability to accurately classify sales events without errors.<\/p>\n\n\n\n<p>This significant improvement underscores the value of comprehensive feature engineering and the power of ensemble methods like Random Forest in managing complex datasets. Given its robustness and interpretability, Random Forest was chosen as the optimal model for predicting e-commerce sales in this analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2. Data Preparation<\/h3>\n\n\n\n<p>Working with large datasets in Python presents significant challenges, especially regarding memory usage and processing time. 
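<\/p>\n\n\n\n<p>The chunked-processing pattern described in the box below can be sketched as follows. The file path and the revenue column name are illustrative stand-ins, not the repository&#8217;s actual code:<\/p>

```python
# Sketch of chunked processing: stream the file in fixed-size chunks and
# aggregate incrementally, so the full dataset is never resident in memory.
# The path and column name are illustrative, not the project's schema.
import pandas as pd

def total_revenue_chunked(path, chunksize=100_000):
    """Sum a revenue column one chunk at a time."""
    total = 0.0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Only this chunk is in memory; missing values count as zero revenue.
        total += chunk["totalTransactionRevenue"].fillna(0).sum()
    return total
```

<p>In the actual pipeline the unit of work was one day of sessions fetched from BigQuery rather than a CSV chunk, but the memory profile is the same: only one increment is resident at a time.<\/p>\n\n\n\n<p>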
The Google Analytics Sample Dataset, with millions of rows, requires careful handling to avoid memory overload and ensure efficient processing.<\/p>\n\n\n\n<div class=\"wp-block-jin-gb-block-box-with-headline kaisetsu-box1\"><div class=\"kaisetsu-box1-title\"><strong>Techniques Used to Handle Big Data<\/strong><\/div>\n<p>1) <strong>Chunked Data Processing<\/strong><\/p>\n\n\n\n<p>Instead of loading the entire dataset into memory at once, I processed the data in smaller chunks. <\/p>\n\n\n\n<p>By fetching and processing data in one-day increments, I reduced the memory footprint and allowed for more manageable processing.<\/p>\n\n\n\n<p>2) <strong>Data Cleaning and Flattening<\/strong><\/p>\n\n\n\n<p>The raw data contains nested and JSON-like structures that need to be flattened for easier analysis. <\/p>\n\n\n\n<p>This process is memory-intensive, so it was important to clean and flatten the data in chunks, saving the intermediate results to disk.<\/p>\n\n\n\n<p>3) <strong>Imputation and Scaling<\/strong><\/p>\n\n\n\n<p>To prepare the data for modeling, I applied imputation to handle missing values and scaling to normalize the features. <\/p>\n\n\n\n<p>These steps were also performed in chunks to manage memory usage.<\/p>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">3. Feature Engineering<\/h3>\n\n\n\n<p>Feature engineering is a critical step in preparing the dataset for modeling. In this analysis, I derived new features that capture time-based patterns, user engagement, device type, traffic source, and geographical information.<\/p>\n\n\n\n<div class=\"wp-block-jin-gb-block-box-with-headline kaisetsu-box1\"><div class=\"kaisetsu-box1-title\"><strong>Challenges and Solutions<\/strong><\/div>\n<p><strong>High Dimensionality:<\/strong> The dataset contains many categorical features that, when encoded, can lead to high dimensionality. This increases the computational burden and can negatively impact model performance. 
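<\/p>\n\n\n\n<p>A small pandas illustration of this blow-up, with hypothetical device and traffic-source columns standing in for the dataset&#8217;s fields, including one crossed (interaction) feature of the kind discussed below:<\/p>

```python
# Illustration of the dimensionality problem: one-hot encoding a few
# categorical columns multiplies the feature count. Column names are
# hypothetical stand-ins for the dataset's device and traffic-source fields.
import pandas as pd

df = pd.DataFrame({
    "deviceCategory": ["desktop", "mobile", "tablet", "mobile"],
    "trafficSource":  ["organic", "cpc", "referral", "organic"],
})

# A crossed (interaction) feature keeps the device/source pairing explicit,
# at the cost of yet more columns once encoded.
df["device_x_source"] = df["deviceCategory"] + "_" + df["trafficSource"]

encoded = pd.get_dummies(df)
print(df.shape[1], "raw columns ->", encoded.shape[1], "encoded columns")
```

<p>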
To address this, I carefully selected and engineered features that are most likely to influence sales predictions.<\/p>\n\n\n\n<p><strong>Feature Interactions:<\/strong> Capturing interactions between features (e.g., between traffic source and device type) can be crucial for model performance. In Python, I used custom feature engineering techniques to create these interactions.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Example of feature engineering\ndf_engineered = engineer_features(df_cleaned)\n\n# Optional: Save the engineered features for later use\ndf_engineered.to_csv('engineered_features.csv', index=False)\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\"># 
Example of feature engineering<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">df_engineered = engineer_features(df_cleaned)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A9955\"># Optional: Save the engineered features for later use<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">df_engineered.to_csv(<\/span><span style=\"color: #CE9178\">&#39;engineered_features.csv&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">index<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #569CD6\">False<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. Random Forest Model<\/h3>\n\n\n\n<p>To predict e-commerce sales, I selected the Random Forest Regressor due to its robustness and ability to handle high-dimensional data with minimal preprocessing. The model was trained on the engineered features, and its performance was evaluated using standard regression metrics.<\/p>\n\n\n\n<div class=\"wp-block-jin-gb-block-box-with-headline kaisetsu-box1\"><div class=\"kaisetsu-box1-title\"><strong>Techniques for Efficient Model Training<\/strong><\/div>\n<p><strong>Feature Selection:<\/strong> By carefully selecting features that have the most predictive power, I reduced the model&#8217;s complexity and training time.<\/p>\n\n\n\n<p><strong>Parallel Processing:<\/strong> Random Forest models can be trained in parallel, which speeds up the training process, especially when dealing with large datasets.<\/p>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">5. Results<\/h3>\n\n\n\n<p>The sales prediction analysis using a Random Forest Regressor yielded highly accurate results, as indicated by the evaluation metrics. 
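<\/p>\n\n\n\n<p>As a self-contained sketch of this training-and-evaluation step, the snippet below trains a parallel <code>RandomForestRegressor<\/code> (<code>n_jobs=-1<\/code>) and computes the metrics summarized below; synthetic features stand in for the engineered Google Analytics features, and the repository&#8217;s actual code may differ:<\/p>

```python
# Sketch: parallel Random Forest regression plus the evaluation metrics
# reported in this section. The data is synthetic, standing in for the
# engineered Google Analytics features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = X[:, 0] * 2.0 + X[:, 1] + rng.normal(scale=0.1, size=1_000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_jobs=-1 trains the trees in parallel across all available cores.
model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("MAE: ", mean_absolute_error(y_test, pred))
print("R2:  ", r2_score(y_test, pred))
```

<p>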
Here\u2019s a summary of the key findings:<\/p>\n\n\n\n<p><strong>Model Performance<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Mean Squared Error (MSE)<\/strong>: The model achieved a very low MSE of <code>6.6751e-05<\/code>, indicating that the average squared difference between the actual and predicted sales values is minimal.<\/li>\n\n\n\n<li><strong>Root Mean Squared Error (RMSE)<\/strong>: The RMSE, which represents the standard deviation of the prediction errors, was also very low at <code>0.00817<\/code>. This suggests that the model\u2019s predictions are close to the actual sales values.<\/li>\n\n\n\n<li><strong>Mean Absolute Error (MAE)<\/strong>: The MAE, representing the average absolute difference between actual and predicted sales, was <code>7.26e-05<\/code>, further confirming the model\u2019s accuracy.<\/li>\n\n\n\n<li><strong>R-squared (R\u00b2) Score<\/strong>: The model achieved an impressive R\u00b2 score of <code>0.99988<\/code>, indicating that nearly all of the variance in the sales data is explained by the model.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Visualization<\/strong><\/h4>\n\n\n\n<p>We generated an <strong>Actual vs. 
Predicted Sales<\/strong> scatter plot, showing a near-perfect alignment between the actual and predicted values, further illustrating the model\u2019s accuracy.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"600\" src=\"https:\/\/www.econai.tech\/wp-content\/uploads\/2024\/08\/actual_vs_predicted_sales.png\" alt=\"\" class=\"wp-image-4344\" srcset=\"https:\/\/tomomitanaka.ai\/wp-content\/uploads\/2024\/08\/actual_vs_predicted_sales.png 1000w, https:\/\/tomomitanaka.ai\/wp-content\/uploads\/2024\/08\/actual_vs_predicted_sales-300x180.png 300w, https:\/\/tomomitanaka.ai\/wp-content\/uploads\/2024\/08\/actual_vs_predicted_sales-768x461.png 768w, https:\/\/www.econai.tech\/wp-content\/uploads\/2024\/08\/actual_vs_predicted_sales.png 856w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">6. <strong>Conclusion<\/strong><\/h3>\n\n\n\n<p>In this analysis, I demonstrated the process of predicting e-commerce sales using Python, building upon the foundation established by an initial exploration with BigQuery ML. 
<\/p>\n\n\n\n<p>While BigQuery ML offered rapid model development and scalability within a SQL-based environment, Python provided the flexibility and control necessary for more advanced data processing and custom model development.<\/p>\n\n\n\n<div class=\"wp-block-jin-gb-block-box-with-headline kaisetsu-box1\"><div class=\"kaisetsu-box1-title\"><strong>Advantages of BigQuery ML<\/strong><\/div>\n<p><strong>Scalability and Speed:<\/strong> <br>BigQuery ML excels at handling large datasets directly within the data warehouse, allowing for quick iteration and evaluation of machine learning models without the need to export data to external platforms.<\/p>\n\n\n\n<p><strong>Ease of Use:<\/strong> <br>Its seamless integration with SQL makes it accessible to data analysts familiar with SQL, streamlining the process of model creation and evaluation.<\/p>\n\n\n\n<p><strong>Rapid Prototyping:<\/strong> <br>BigQuery ML is ideal for rapidly developing models and gaining initial insights, especially when working within the Google Cloud ecosystem.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-jin-gb-block-box-with-headline kaisetsu-box1\"><div class=\"kaisetsu-box1-title\"><strong>Disadvantages of BigQuery ML<\/strong><\/div>\n<p><strong>Limited Customization:<\/strong> <br>BigQuery ML offers less flexibility in feature engineering and model customization compared to Python, making it challenging to implement complex models or tailor the modeling process to specific needs.<\/p>\n\n\n\n<p><strong>Model Complexity Constraints:<\/strong> <br>The platform has limitations on model complexity and size, which can hinder the development of highly sophisticated models, particularly when working with large feature sets or complex data interactions.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-jin-gb-block-box-with-headline kaisetsu-box1\"><div class=\"kaisetsu-box1-title\"><strong>Advantages of Python<\/strong><\/div>\n<p><strong>Flexibility and Control:<\/strong> <br>Python provides 
unparalleled flexibility in data processing, feature engineering, and model development. It allows for the implementation of advanced techniques and custom workflows tailored to specific data and business requirements.<\/p>\n\n\n\n<p><strong>Advanced Feature Engineering:<\/strong> <br>Python enables the creation of complex features and interactions, which can significantly enhance model performance, particularly in cases where the relationships within the data are intricate.<\/p>\n\n\n\n<p><strong>Robust Model Selection:<\/strong> <br>With Python, I could choose and fine-tune a Random Forest Regressor, a model well-suited for handling high-dimensional data and providing interpretability, ultimately leading to highly accurate sales predictions.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-jin-gb-block-box-with-headline kaisetsu-box1\"><div class=\"kaisetsu-box1-title\"><strong>Disadvantages of Python<\/strong><\/div>\n<p><strong>Memory Management:<\/strong> <br>Handling large datasets in Python can be challenging, particularly in terms of memory usage and processing efficiency. Careful management and optimization strategies, such as chunked data processing, are essential to prevent memory overload.<\/p>\n\n\n\n<p><strong>Processing Time:<\/strong> <br>Python may require longer processing times for large-scale data operations, especially when compared to the optimized infrastructure provided by BigQuery ML.<\/p>\n<\/div>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Final Thoughts<\/strong><\/h4>\n\n\n\n<p>The combined use of BigQuery ML and Python allowed for a comprehensive approach to e-commerce sales prediction, leveraging the strengths of both platforms. <\/p>\n\n\n\n<p>BigQuery ML provided a quick and scalable environment for initial exploration, while Python enabled deeper analysis and refinement, resulting in a robust and highly accurate predictive model. 
<\/p>\n\n\n\n<p>This approach highlights the importance of selecting the right tools for different stages of data analysis, balancing the need for speed, flexibility, and precision when working with big data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this analysis, I explore the process of predicting e-commerce sales using Python, leveraging data from the Google Analytics Sample Dataset in BigQuery. While BigQuery ML provides a powerful and scalable platform for building machine learning models, Python offers more flexibility and control, which can be particularly beneficial when working with complex data processing and<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":3822,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-4105","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/4105","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4105"}],"version-history":[{"count":75,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/4105\/revisions"}],"predecessor-version":[{"id":4390,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/4105\/revisions\/4390"}],"up":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/3822"}],"wp:attachment":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4105"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated"
:true}]}}