Missing Value Imputation, Explained

Google and Microsoft underreport emissions by 662%

Welcome to the learning edition of the Data Pragmatist, your dose of all things data science and AI.

📖 Estimated Reading Time: 5 minutes.

Google and Microsoft underreport emissions by 662%

  • According to The Guardian, actual greenhouse gas emissions from company-owned data centers of Google, Microsoft, Meta, and Apple were about 662% higher than officially reported between 2020 and 2022.

  • Amazon was identified as the largest emitter among these tech firms in 2022, with its emissions being more than twice those of Apple.

  • If the emissions from these five tech giants were combined, they would rank as the 33rd highest-emitting nation, according to industry experts.

👀 Elon Musk and Larry Ellison begged Nvidia CEO Jensen Huang for AI GPUs

  • Oracle co-founder Larry Ellison revealed he and Elon Musk begged Nvidia CEO Jensen Huang for more AI chips during a dinner meeting, citing high demand for graphics processing units.

  • Speaking at an Oracle investor event last week, Ellison said he stressed the urgency by repeatedly asking Nvidia to take more money, a plea he said succeeded.

  • Ellison's fortune has grown significantly as Oracle's cloud business thrives, with the company's shares rising almost 61% this year and indicating increased demand for Nvidia GPU clusters to support AI models.

🧠 Missing Value Imputation, Explained

The problem of missing values in data is one every data scientist or analyst will encounter. Let's explore six methods to tackle this issue, showing how different approaches can be applied to impute missing values based on the nature of your data.

What Are Missing Values?

Missing values, often represented as NaN or NULL, refer to the absence of data in a dataset. They arise for various reasons, such as data entry errors, sensor malfunctions, and survey non-responses.
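
As a minimal sketch of how these markers look in pandas, the toy frame below (column names and values are illustrative, not taken from the golf dataset) shows that `isna()` flags both `np.nan` and `None`:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only): np.nan and None both count as missing.
toy = pd.DataFrame({
    "temp": [25.1, np.nan, 24.3],
    "outlook": ["rainy", None, "sunny"],
})

# Count missing entries per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same count on a real dataset is usually the first step before choosing an imputation method.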

Why Do Missing Values Occur?

Here are some common reasons:

  • Data Entry Errors: Someone might forget to input a value.

  • Sensor Malfunctions: In scientific experiments, sensors may fail.

  • Survey Non-Response: Respondents skip questions they don't wish to answer.

  • Merged Datasets: Missing entries may occur when combining multiple data sources.

  • Data Corruption: Values can be corrupted during transfer.

  • Sampling Issues: Some data points might be missed during collection.

Types of Missing Data

  1. MCAR (Missing Completely at Random): Missingness is unrelated to any data, observed or missing.

  2. MAR (Missing at Random): Missingness depends only on other observed variables, not on the missing value itself.

  3. MNAR (Missing Not at Random): The missingness depends on the value of the missing data itself.
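
The three mechanisms can be simulated on synthetic data. The sketch below is an illustration under assumed numbers (the distributions, the 25-degree and 80% cut-offs, and the missingness probabilities are all made up for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (assumed for illustration): temperature and humidity.
sim = pd.DataFrame({
    "temp": rng.normal(25, 3, n),
    "humidity": rng.normal(80, 10, n),
})

# MCAR: every humidity value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: humidity is more likely to be missing on hot days (depends on observed temp).
mar_mask = rng.random(n) < np.where(sim["temp"] > 25, 0.4, 0.05)

# MNAR: high humidity readings themselves are more likely to be missing.
mnar_mask = rng.random(n) < np.where(sim["humidity"] > 80, 0.4, 0.05)

# Under MNAR the observed values are a biased sample of the full column.
humidity_mnar = sim["humidity"].mask(mnar_mask)
print(sim["humidity"].mean(), humidity_mnar.mean())
```

Because the MNAR mask removes mostly high readings, the mean of the remaining values is pulled below the true mean, which is why MNAR is the hardest case to repair with simple imputation.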

The Dataset

We're working with a golf course dataset with columns such as Date, Weekday, Holiday, Temp, Humidity, Wind, Outlook, and Crowdedness. It contains several missing values, which makes it well suited to comparing several imputation methods.

import pandas as pd
import numpy as np

# Each '...' marks rows elided here; supply the remaining values before running
data = {
    'Date': ['08-01', '08-02', '08-03', ...],
    'Weekday': [0, 1, 2, ...],
    'Holiday': [0.0, 0.0, 0.0, ...],
    'Temp': [25.1, 26.4, np.nan, ...],
    'Humidity': [99.0, np.nan, 96.0, ...],
    'Wind': [0.0, 0.0, 0.0, ...],
    'Outlook': ['rainy', 'sunny', 'rainy', ...],
    'Crowdedness': [0.14, np.nan, 0.21, ...]
}

df = pd.DataFrame(data)

Six Imputation Methods

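  1. Listwise Deletion

    Removes rows with missing data; useful when missingness is minimal and MCAR. Strict listwise deletion drops every row that has any missing value. The snippet here is more lenient: it drops only rows with four or more missing values, leaving the rest for the imputation methods that follow.

# Keep only rows with fewer than 4 missing values
df_clean = df[df.isnull().sum(axis=1) < 4].copy()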
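
To make the difference between strict and lenient deletion concrete, here is a small sketch on a toy frame (values are illustrative) using `dropna`, with and without its `thresh` argument:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Temp": [25.1, np.nan, np.nan, 24.0],
    "Humidity": [99.0, 90.0, np.nan, 85.0],
    "Wind": [0.0, 1.0, np.nan, np.nan],
})

# Strict listwise deletion: drop every row with at least one missing value.
strict = toy.dropna()

# Lenient variant: keep rows that have at least 2 non-missing values.
lenient = toy.dropna(thresh=2)

print(len(toy), len(strict), len(lenient))  # 4 1 3
```

Strict deletion keeps one row out of four here, which shows how quickly it can shrink a dataset once several columns have gaps.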

  2. Simple Imputation (Mean & Mode)

    Mean Imputation: Used for numerical variables.

    Mode Imputation: Used for categorical variables.

# Fill numeric Humidity with its mean; fill the binary Holiday flag with its most frequent value
df_clean['Humidity'] = df_clean['Humidity'].fillna(df_clean['Humidity'].mean())
df_clean['Holiday'] = df_clean['Holiday'].fillna(df_clean['Holiday'].mode()[0])
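
The same idea can be applied to a whole frame at once. The helper below, `simple_impute`, is a hypothetical name introduced for this sketch; it picks mean or mode from each column's dtype:

```python
import numpy as np
import pandas as pd

def simple_impute(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the mean and all other columns with the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().all():
            continue  # no observed values to learn a fill value from
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Humidity": [90.0, np.nan, 80.0],
    "Outlook": ["sunny", "sunny", None],
})
filled = simple_impute(toy)
print(filled)
```

One caveat with a dtype-based rule: a numerically coded category such as a 0/1 Holiday flag would get the mean, so columns like that are better handled explicitly with the mode.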

  3. Linear Interpolation

    Estimates each missing value by drawing a straight line between the nearest known data points on either side. It suits ordered data, such as daily readings.

df_clean['Temp'] = df_clean['Temp'].interpolate(method='linear')
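
A small worked example (toy values, assumed for illustration) shows what the straight-line assumption does to a two-value gap:

```python
import numpy as np
import pandas as pd

# Two missing readings between 20.0 and 26.0.
temps = pd.Series([20.0, np.nan, np.nan, 26.0])

# The gap is split into equal steps of (26 - 20) / 3 = 2 degrees.
filled = temps.interpolate(method="linear")
print(filled.tolist())  # [20.0, 22.0, 24.0, 26.0]
```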

  4. Forward/Backward Fill

    Fills each missing value with the last known value (forward fill) or the next known value (backward fill).

# Forward fill first, then backward fill any gap left at the start
df_clean['Outlook'] = df_clean['Outlook'].ffill().bfill()
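
The order of the two fills matters at the edges. This toy series (illustrative values) has a gap at the start that forward fill alone cannot reach:

```python
import pandas as pd

outlook = pd.Series([None, "sunny", None, "rainy", None])

# Forward fill copies the last known value downward; the first entry stays missing.
forward = outlook.ffill()

# Backward fill then closes the remaining gap at the start.
both = forward.bfill()
print(both.tolist())  # ['sunny', 'sunny', 'sunny', 'rainy', 'rainy']
```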
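
  5. Constant Value Imputation

    Fills missing values with a specified constant (e.g., -1 for missing wind). Choosing a value outside the valid range keeps imputed entries distinguishable from real measurements.

df_clean['Wind'] = df_clean['Wind'].fillna(-1)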
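
A sentinel such as -1 can distort averages or models that treat it as a real reading. One common companion, sketched here on toy values with a hypothetical `Wind_missing` column, is to record which rows were filled:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({"Wind": [0.0, np.nan, 3.0]})

# Record which rows were missing before overwriting them.
toy["Wind_missing"] = toy["Wind"].isna()

# Then fill with the sentinel value.
toy["Wind"] = toy["Wind"].fillna(-1)
print(toy)
```

The flag column lets later analysis exclude or separately model the filled rows.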
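
  6. KNN Imputation

    Estimates missing values based on similar samples using K-Nearest Neighbors (KNN). Similarity has to be measured on other columns: given only the column being imputed, KNNImputer has no features to compare rows on and falls back to the column mean.

from sklearn.impute import KNNImputer

# Feature choice (Temp, Humidity) is an assumption for illustration
features = ['Temp', 'Humidity', 'Crowdedness']
knn_imputer = KNNImputer(n_neighbors=3)
# Impute on all three columns, then keep only the Crowdedness result
df_clean['Crowdedness'] = knn_imputer.fit_transform(df_clean[features])[:, -1]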
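
To show the mechanism, here is a hand-rolled sketch in plain NumPy. It is not scikit-learn's implementation: `knn_fill` is a hypothetical helper, and it assumes the feature columns are complete and that plain Euclidean distance is appropriate.

```python
import numpy as np

def knn_fill(features: np.ndarray, target: np.ndarray, k: int = 2) -> np.ndarray:
    """Fill NaNs in `target` with the mean target of the k rows closest in `features`."""
    filled = target.copy()
    known = ~np.isnan(target)
    for i in np.where(~known)[0]:
        # Euclidean distance from row i to every row with a known target.
        dists = np.linalg.norm(features[known] - features[i], axis=1)
        nearest = np.argsort(dists)[:k]
        filled[i] = target[known][nearest].mean()
    return filled

# Toy data (illustrative): crowdedness is missing on a 20.5-degree day.
temp = np.array([[20.0], [21.0], [30.0], [31.0], [20.5]])
crowd = np.array([0.2, 0.3, 0.8, 0.9, np.nan])

result = knn_fill(temp, crowd, k=2)
print(result[4])
```

The two nearest days by temperature are the 20.0 and 21.0 degree days, so the missing value becomes the average of their crowdedness (0.2 and 0.3) rather than the overall column mean.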

Different imputation methods serve different purposes, and choosing the right method requires understanding your data. No method is perfect, and each comes with trade-offs, such as potential biases or data distortion.
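
One such distortion is easy to demonstrate. The sketch below uses synthetic data (distribution, sample size, and 30% missing rate are assumptions for the example) to show that mean imputation preserves the mean but shrinks the spread:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
full = pd.Series(rng.normal(loc=70, scale=10, size=500))

# Remove about 30% of the values completely at random, then mean-impute.
with_gaps = full.mask(rng.random(500) < 0.3)
mean_filled = with_gaps.fillna(with_gaps.mean())

# Compare the spread of the observed values with the spread after imputation.
print(round(with_gaps.std(), 2), round(mean_filled.std(), 2))
```

Every filled value sits exactly at the mean, so the standard deviation after imputation is smaller than that of the observed values, which can make downstream estimates look more certain than they are.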

Top 5 AI Tools for Note-Taking

  1. Notta:

    • Supports 58 languages.

    • Real-time transcription and AI-powered summaries.

    • Integrates with Zoom, Google Meet, and Teams.

    • Free plan; Pro at $9/month.

  2. Otter:

    • Speaker identification and AI-based summaries.

    • Only supports English.

    • Free version available; Pro at $8.33/month.

  3. Rev:

    • 99% accuracy with AI and human transcription.

    • Complies with FCC/ADA regulations.

    • Free for 45 minutes; manual transcription at $1.50/minute.

  4. Gong:

    • Best for sales coaching and customer insights.

    • Integrates with Zoom and Teams.

    • No fixed pricing; customized rates.

  5. Avoma:

    • Designed for sales and customer success teams.

    • Integrates with CRM systems like Salesforce and HubSpot.

    • Free plan; paid plans start at $24 per user per month.

If you are interested in contributing to the newsletter, respond to this email. We are looking for contributions from you, our readers, to keep the community alive and going.