{"id":4103,"date":"2024-08-17T20:52:35","date_gmt":"2024-08-18T00:52:35","guid":{"rendered":"https:\/\/www.econai.tech\/?page_id=4103"},"modified":"2024-08-18T07:04:33","modified_gmt":"2024-08-18T11:04:33","slug":"data-cleaning","status":"publish","type":"page","link":"https:\/\/tomomitanaka.ai\/?page_id=4103","title":{"rendered":"Data Cleaning"},"content":{"rendered":"\n<p>Data cleaning is a crucial step in the data science pipeline. It involves preparing and transforming raw data into a format suitable for analysis. <\/p>\n\n\n\n<p>In this post, we\u2019ll explore how to clean and prepare data from the <a href=\"https:\/\/support.google.com\/analytics\/answer\/7586738?hl=en#zippy=%2Cin-this-article\">Google Analytics Sample Dataset in BigQuery<\/a> for various machine learning tasks. <\/p>\n\n\n\n<p>We\u2019ll go through key steps such as handling missing values, feature scaling, and encoding categorical variables, using Python and popular data processing libraries.<\/p>\n\n\n\n<p>You can find&nbsp;<a href=\"https:\/\/github.com\/tomomitanaka00\/Blog-SQL\/blob\/main\/Ecommerce.ipynb\">the complete code<\/a>&nbsp;in my GitHub repository.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Importance of Data Cleaning<\/h3>\n\n\n\n<p>This data cleaning process transforms our raw, messy data into a structured format that&#8217;s ready for analysis. 
<\/p>\n\n\n\n<p>By flattening nested structures, handling missing data, converting data types, and encoding categorical variables, we&#8217;ve addressed many common data quality issues.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-jin-gb-block-box-with-headline kaisetsu-box1\"><div class=\"kaisetsu-box1-title\">Clean data allows us to<\/div>\n<ol class=\"wp-block-list\">\n<li>Perform more accurate analyses<\/li>\n\n\n\n<li>Build more reliable machine learning models<\/li>\n\n\n\n<li>Make data-driven decisions with confidence<\/li>\n\n\n\n<li>Save time in the long run by avoiding errors and inconsistencies<\/li>\n<\/ol>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">The Dataset<\/h3>\n\n\n\n<p>We&#8217;re working with Google Analytics data, which is stored in BigQuery. This dataset contains rich information about website visits, including user interactions, traffic sources, and transaction data. <\/p>\n\n\n\n<p>However, it comes with its own set of challenges, including nested data structures.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Data Cleaning Process<\/h3>\n\n\n\n<p>Let&#8217;s break down our data cleaning process into several key steps.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">1. 
Data Extraction<\/h4>\n\n\n\n<p>First, we extract the data from BigQuery using a SQL query:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"query = &quot;&quot;&quot;\nSELECT *\nFROM bigquery-public-data.google_analytics_sample.ga_sessions_20170801\nLIMIT 10000\n&quot;&quot;&quot;\ndf = client.query(query).to_dataframe()\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D4D4D4\">query = <\/span><span style=\"color: #CE9178\">&quot;&quot;&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #CE9178\">SELECT *<\/span><\/span>\n<span class=\"line\"><span style=\"color: #CE9178\">FROM bigquery-public-data.google_analytics_sample.ga_sessions_20170801<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #CE9178\">LIMIT 10000<\/span><\/span>\n<span class=\"line\"><span style=\"color: #CE9178\">&quot;&quot;&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">df = client.query(query).to_dataframe()<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2. Flattening Nested Columns<\/h4>\n\n\n\n<p>One of the challenges with this dataset is that some columns contain nested data. We flatten these nested structures to make the data easier to work with:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"def flatten_nested_columns(df):\n    nested_columns = ['totals', 'trafficSource', 'device', 'geoNetwork']\n    flattened_dfs = []\n\n    for col in nested_columns:\n        flattened = pd.json_normalize(df[col])\n        flattened.columns = [f'{col}_{subcol}' for subcol in flattened.columns]\n        flattened_dfs.append(flattened)\n\n    df_flattened = pd.concat([df.drop(nested_columns, axis=1)] + flattened_dfs, axis=1)\n    \n    return df_flattened\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 
2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #569CD6\">def<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #DCDCAA\">flatten_nested_columns<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #9CDCFE\">df<\/span><span style=\"color: #D4D4D4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    nested_columns = [<\/span><span style=\"color: #CE9178\">&#39;totals&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;trafficSource&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;device&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;geoNetwork&#39;<\/span><span style=\"color: #D4D4D4\">]<\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    flattened_dfs = []<\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #C586C0\">for<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #C586C0\">in<\/span><span style=\"color: #D4D4D4\"> nested_columns:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        flattened = pd.json_normalize(df[col])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        flattened.columns = [<\/span><span style=\"color: #569CD6\">f<\/span><span style=\"color: #CE9178\">&#39;<\/span><span style=\"color: #569CD6\">{<\/span><span style=\"color: #D4D4D4\">col<\/span><span style=\"color: #569CD6\">}<\/span><span style=\"color: #CE9178\">_<\/span><span style=\"color: 
#569CD6\">{<\/span><span style=\"color: #D4D4D4\">subcol<\/span><span style=\"color: #569CD6\">}<\/span><span style=\"color: #CE9178\">&#39;<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #C586C0\">for<\/span><span style=\"color: #D4D4D4\"> subcol <\/span><span style=\"color: #C586C0\">in<\/span><span style=\"color: #D4D4D4\"> flattened.columns]<\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        flattened_dfs.append(flattened)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_flattened = pd.concat([df.drop(nested_columns, <\/span><span style=\"color: #9CDCFE\">axis<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #B5CEA8\">1<\/span><span style=\"color: #D4D4D4\">)] + flattened_dfs, <\/span><span style=\"color: #9CDCFE\">axis<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #B5CEA8\">1<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #C586C0\">return<\/span><span style=\"color: #D4D4D4\"> df_flattened<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">3. 
Handling Missing Data<\/h4>\n\n\n\n<p>Missing data can skew our analysis, so we handle it using imputation techniques:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"numeric_imputer = SimpleImputer(strategy='median')\ndf_cleaned[numeric_columns] = numeric_imputer.fit_transform(df_cleaned[numeric_columns])\n\ncategorical_imputer = SimpleImputer(strategy='most_frequent')\ndf_cleaned[categorical_columns] = categorical_imputer.fit_transform(df_cleaned[categorical_columns])\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D4D4D4\">numeric_imputer = SimpleImputer(<\/span><span style=\"color: #9CDCFE\">strategy<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: 
#CE9178\">&#39;median&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">df_cleaned[numeric_columns] = numeric_imputer.fit_transform(df_cleaned[numeric_columns])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">categorical_imputer = SimpleImputer(<\/span><span style=\"color: #9CDCFE\">strategy<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;most_frequent&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">df_cleaned[categorical_columns] = categorical_imputer.fit_transform(df_cleaned[categorical_columns])<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">4. Data Type Conversion<\/h4>\n\n\n\n<p>We ensure that our data types are consistent and appropriate for analysis:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"df_cleaned['date'] = pd.to_datetime(df_cleaned['date'], format='%Y%m%d')\n\nfor col in df_cleaned.columns:\n    if col in numeric_columns:\n        df_cleaned[col] = pd.to_numeric(df_cleaned[col], errors='coerce')\n    elif col in categorical_columns:\n        df_cleaned[col] = df_cleaned[col].astype(str)\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" 
stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D4D4D4\">df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">] = pd.to_datetime(df_cleaned[<\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">], <\/span><span style=\"color: #9CDCFE\">format<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;%Y%m<\/span><span style=\"color: #569CD6\">%d<\/span><span style=\"color: #CE9178\">&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #C586C0\">for<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #C586C0\">in<\/span><span style=\"color: #D4D4D4\"> df_cleaned.columns:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #569CD6\">in<\/span><span style=\"color: #D4D4D4\"> numeric_columns:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        df_cleaned[col] = pd.to_numeric(df_cleaned[col], <\/span><span style=\"color: #9CDCFE\">errors<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;coerce&#39;<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #D4D4D4\">    <\/span><span style=\"color: #C586C0\">elif<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #569CD6\">in<\/span><span style=\"color: #D4D4D4\"> categorical_columns:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        df_cleaned[col] = df_cleaned[col].astype(<\/span><span style=\"color: #4EC9B0\">str<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">5. Feature Scaling<\/h4>\n\n\n\n<p>Scaling is essential for machine learning algorithms that are sensitive to feature magnitudes, such as SVMs and gradient-based models like linear regression. We\u2019ll standardize our numeric features using StandardScaler.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"scaler = StandardScaler()\ndf_cleaned[numeric_columns] = scaler.fit_transform(df_cleaned[numeric_columns])\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 
002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D4D4D4\">scaler = StandardScaler()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">df_cleaned[numeric_columns] = scaler.fit_transform(df_cleaned[numeric_columns])<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">6. Encoding Categorical Variables<\/h4>\n\n\n\n<p>Categorical variables need to be converted into numerical format before they can be used in machine learning models. We\u2019ll use LabelEncoder for this purpose. Note that LabelEncoder assigns arbitrary integer codes to categories, so one-hot encoding may be a better choice for nominal features in linear models.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"le = LabelEncoder()\nfor col in categorical_columns:\n    df_cleaned[col] = le.fit_transform(df_cleaned[col])\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 
002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D4D4D4\">le = LabelEncoder()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #C586C0\">for<\/span><span style=\"color: #D4D4D4\"> col <\/span><span style=\"color: #C586C0\">in<\/span><span style=\"color: #D4D4D4\"> categorical_columns:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_cleaned[col] = le.fit_transform(df_cleaned[col])<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">7. Extracting Hit-Level Data<\/h4>\n\n\n\n<p>In addition to session-level data, we can extract hit-level data, which provides more granular information about user interactions.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 10px 16px;margin-bottom:-2px;width:100%;text-align:left;background-color:#2b2b2b;color:#c7c7c7\">Python<\/span><span role=\"button\" tabindex=\"0\" data-code=\"def extract_hit_level_data(df):\n    # Flatten the main dataframe\n    df_flat = pd.json_normalize(df.to_dict('records'))\n    \n    # Extract hit-level data\n    hits_data = df_flat.apply(lambda row: pd.json_normalize(row['hits']), axis=1).explode().reset_index()\n    \n    # Rename 'index' to 'session_id' for clarity\n    hits_data = hits_data.rename(columns={'index': 'session_id'})\n    \n    # Add session-level identifiers to the hit-level data\n    hits_data['fullVisitorId'] = df_flat['fullVisitorId'].repeat(df_flat['hits'].str.len()).reset_index(drop=True)\n    
hits_data['visitId'] = df_flat['visitId'].repeat(df_flat['hits'].str.len()).reset_index(drop=True)\n    hits_data['date'] = pd.to_datetime(df_flat['date'].repeat(df_flat['hits'].str.len()).reset_index(drop=True), format='%Y%m%d')\n    \n    # Select and rename relevant columns\n    columns_to_keep = [\n        'session_id', 'fullVisitorId', 'visitId', 'date',\n        'hitNumber', 'time', 'hour', 'minute',\n        'isEntrance', 'isExit', 'page.pagePath', 'page.pageTitle',\n        'eventInfo.eventCategory', 'eventInfo.eventAction', 'eventInfo.eventLabel',\n        'transaction.transactionId', 'transaction.transactionRevenue',\n        'item.productName', 'item.productCategory', 'item.productSKU', 'item.itemRevenue'\n    ]\n    \n    hits_df = hits_data[columns_to_keep].copy()\n    \n    return hits_df\n\nhit_level_df = extract_hit_level_data(df)\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #569CD6\">def<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #DCDCAA\">extract_hit_level_data<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #9CDCFE\">df<\/span><span style=\"color: #D4D4D4\">):<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Flatten the main dataframe<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    df_flat = pd.json_normalize(df.to_dict(<\/span><span style=\"color: #CE9178\">&#39;records&#39;<\/span><span style=\"color: #D4D4D4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Extract hit-level data<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    hits_data = df_flat.apply(<\/span><span style=\"color: #569CD6\">lambda<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #9CDCFE\">row<\/span><span style=\"color: #D4D4D4\">: pd.json_normalize(row[<\/span><span style=\"color: #CE9178\">&#39;hits&#39;<\/span><span style=\"color: #D4D4D4\">]), <\/span><span style=\"color: #9CDCFE\">axis<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #B5CEA8\">1<\/span><span style=\"color: #D4D4D4\">).explode().reset_index()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Rename &#39;index&#39; to &#39;session_id&#39; for clarity<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    hits_data = hits_data.rename(<\/span><span style=\"color: #9CDCFE\">columns<\/span><span style=\"color: #D4D4D4\">={<\/span><span style=\"color: #CE9178\">&#39;index&#39;<\/span><span style=\"color: #D4D4D4\">: <\/span><span style=\"color: #CE9178\">&#39;session_id&#39;<\/span><span style=\"color: #D4D4D4\">})<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Add session-level identifiers to the hit-level data<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D4D4D4\">    hits_data[<\/span><span style=\"color: #CE9178\">&#39;fullVisitorId&#39;<\/span><span style=\"color: #D4D4D4\">] = df_flat[<\/span><span style=\"color: #CE9178\">&#39;fullVisitorId&#39;<\/span><span style=\"color: #D4D4D4\">].repeat(df_flat[<\/span><span style=\"color: #CE9178\">&#39;hits&#39;<\/span><span style=\"color: #D4D4D4\">].str.len()).reset_index(<\/span><span style=\"color: #9CDCFE\">drop<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #569CD6\">True<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    hits_data[<\/span><span style=\"color: #CE9178\">&#39;visitId&#39;<\/span><span style=\"color: #D4D4D4\">] = df_flat[<\/span><span style=\"color: #CE9178\">&#39;visitId&#39;<\/span><span style=\"color: #D4D4D4\">].repeat(df_flat[<\/span><span style=\"color: #CE9178\">&#39;hits&#39;<\/span><span style=\"color: #D4D4D4\">].str.len()).reset_index(<\/span><span style=\"color: #9CDCFE\">drop<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #569CD6\">True<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    hits_data[<\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">] = pd.to_datetime(df_flat[<\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">].repeat(df_flat[<\/span><span style=\"color: #CE9178\">&#39;hits&#39;<\/span><span style=\"color: #D4D4D4\">].str.len()).reset_index(<\/span><span style=\"color: #9CDCFE\">drop<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #569CD6\">True<\/span><span style=\"color: #D4D4D4\">), <\/span><span style=\"color: #9CDCFE\">format<\/span><span style=\"color: #D4D4D4\">=<\/span><span style=\"color: #CE9178\">&#39;%Y%m<\/span><span style=\"color: #569CD6\">%d<\/span><span style=\"color: #CE9178\">&#39;<\/span><span 
style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #6A9955\"># Select and rename relevant columns<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    columns_to_keep = [<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #CE9178\">&#39;session_id&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;fullVisitorId&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;visitId&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;date&#39;<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #CE9178\">&#39;hitNumber&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;time&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;hour&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;minute&#39;<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #CE9178\">&#39;isEntrance&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;isExit&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;page.pagePath&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;page.pageTitle&#39;<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #CE9178\">&#39;eventInfo.eventCategory&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: 
#CE9178\">&#39;eventInfo.eventAction&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;eventInfo.eventLabel&#39;<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #CE9178\">&#39;transaction.transactionId&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;transaction.transactionRevenue&#39;<\/span><span style=\"color: #D4D4D4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">        <\/span><span style=\"color: #CE9178\">&#39;item.productName&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;item.productCategory&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;item.productSKU&#39;<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #CE9178\">&#39;item.itemRevenue&#39;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    ]<\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    hits_df = hits_data[columns_to_keep].copy()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #C586C0\">return<\/span><span style=\"color: #D4D4D4\"> hits_df<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">hit_level_df = extract_hit_level_data(df)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h3>\n\n\n\n<p>In this blog post, we walked through the process of cleaning and preparing the Google Analytics Sample Dataset for machine learning tasks. 
<\/p>\n\n\n\n<p>By flattening nested columns, handling missing values, scaling features, and encoding categorical variables, we prepared the data for advanced analysis, such as predicting e-commerce sales, customer segmentation, and more. <\/p>\n\n\n\n<p>This cleaned dataset can now be used to train and evaluate machine learning models, leading to actionable insights and data-driven decisions.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data cleaning is a crucial step in the data science pipeline. It involves preparing and transforming raw data into a format suitable for analysis. In this post, we\u2019ll explore how to clean and prepare data from Google Analytics Sample Dataset in BigQuery for various machine learning tasks. We\u2019ll go through key steps such as handling<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":3822,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-4103","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/4103","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4103"}],"version-history":[{"count":35,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/4103\/revisions"}],"predecessor-version":[{"id":4187,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/4103\/revisions\/4187"}],"up":[{"embeddable":true,"href":"https:\/\/tomomitanaka.ai\/index.php?rest_route=\/wp\/v2\/pages\/3822"}],"wp:attachment":[{"href":"h
ttps:\/\/tomomitanaka.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4103"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}