Missing Value Imputation, Explained
Google and Microsoft underreport emissions by 662%
Welcome to the learning edition of the Data Pragmatist, your dose of all things data science and AI.
📖 Estimated Reading Time: 5 minutes. Missed our previous editions?
🌍 Google and Microsoft underreport emissions by 662%
According to The Guardian, actual greenhouse gas emissions from data centers owned by Google, Microsoft, Meta, and Apple were about 662% higher than the companies officially reported between 2020 and 2022.
Amazon was identified as the largest emitter among these tech firms in 2022, with its emissions being more than twice those of Apple.
Combined, the five companies (Amazon plus the four above) would rank as the 33rd highest-emitting nation, according to industry experts.
👀 Elon Musk and Larry Ellison begged Nvidia CEO Jensen Huang for AI GPUs
Oracle co-founder Larry Ellison revealed he and Elon Musk begged Nvidia CEO Jensen Huang for more AI chips during a dinner meeting, citing high demand for graphics processing units.
Speaking at an Oracle investor event last week, Ellison said he stressed the urgency by repeatedly asking Nvidia to take more money, and that the plea worked.
Ellison's fortune has grown significantly as Oracle's cloud business thrives, with the company's shares rising almost 61% this year and indicating increased demand for Nvidia GPU clusters to support AI models.
🧠 Missing Value Imputation, Explained
The problem of missing values in data is one every data scientist or analyst will encounter. Let’s explore six methods to tackle this issue, showing how different approaches can be applied to impute missing values based on the nature of your data.
What Are Missing Values?
Missing values, often represented as NaN or NULL, mark the absence of data in a dataset. They arise for many reasons, such as data entry errors, sensor malfunctions, and survey non-responses.
Why Do Missing Values Occur?
Here are some common reasons:
Data Entry Errors: Someone might forget to input a value.
Sensor Malfunctions: In scientific experiments, sensors may fail.
Survey Non-Response: Respondents skip questions they don’t wish to answer.
Merged Datasets: Missing entries may occur when combining multiple data sources.
Data Corruption: Values can be corrupted during transfer.
Sampling Issues: Some data points might be missed during collection.
Types of Missing Data
MCAR (Missing Completely at Random): Missingness is unrelated to other data.
MAR (Missing at Random): Missingness depends on other observed variables.
MNAR (Missing Not at Random): The missingness depends on the value of the missing data itself.
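To make the three mechanisms concrete, here is a rough simulation sketch. The toy columns, the thresholds (27 degrees, 0.7 crowdedness), and the missingness probabilities are all made-up assumptions for illustration, not values from the golf dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 1000
toy = pd.DataFrame({
    "Temp": rng.normal(25, 3, n),         # always observed
    "Crowdedness": rng.uniform(0, 1, n),  # the column we punch holes in
})

# MCAR: every row has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (hot days are logged less often).
mar_mask = rng.random(n) < np.where(toy["Temp"] > 27, 0.4, 0.1)

# MNAR: missingness depends on the hidden value itself (busy days go unrecorded).
mnar_mask = rng.random(n) < np.where(toy["Crowdedness"] > 0.7, 0.5, 0.05)

full_mean = toy["Crowdedness"].mean()
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = toy.loc[~mask, "Crowdedness"].mean()
    print(f"{name}: {mask.mean():.0%} missing, "
          f"observed mean {observed_mean:.3f} vs full mean {full_mean:.3f}")
```

Under MNAR the observed mean drifts below the true mean, because the high values are the ones that vanish. That is why MNAR is generally the hardest case to repair with imputation alone.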
The Dataset
We’re working with a golf course dataset with columns such as Date, Weekday, Holiday, Temp, Humidity, Wind, Outlook, and Crowdedness. It contains several missing values, making it ideal for exploring multiple imputation methods.
import pandas as pd
import numpy as np
data = {
'Date': ['08-01', '08-02', '08-03', ...],
'Weekday': [0, 1, 2, ...],
'Holiday': [0.0, 0.0, 0.0, ...],
'Temp': [25.1, 26.4, np.nan, ...],
'Humidity': [99.0, np.nan, 96.0, ...],
'Wind': [0.0, 0.0, 0.0, ...],
'Outlook': ['rainy', 'sunny', 'rainy', ...],
'Crowdedness': [0.14, np.nan, 0.21, ...]
}
df = pd.DataFrame(data)
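Before picking a method, it usually helps to see where the gaps are. The sketch below uses a hypothetical five-row frame (the values are invented for illustration, since the full dataset is elided above) and counts missing entries per column and per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the newsletter's dataset.
toy = pd.DataFrame({
    "Temp": [25.1, 26.4, np.nan, 24.1, np.nan],
    "Humidity": [99.0, np.nan, 96.0, 68.0, 98.0],
    "Outlook": ["rainy", "sunny", "rainy", None, "overcast"],
})

missing_per_column = toy.isnull().sum()       # count of gaps in each column
missing_share = toy.isnull().mean()           # fraction of gaps in each column
missing_per_row = toy.isnull().sum(axis=1)    # count of gaps in each row

print(missing_per_column)
print(missing_share.round(2))
print(missing_per_row.tolist())
```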
Six Imputation Methods
Listwise Deletion
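Removes rows that contain missing data, which is reasonable when missingness is minimal and MCAR. Strict listwise deletion drops every row with any missing value (df.dropna()). The variant below is more lenient: it drops only rows with four or more missing values, so the remaining gaps can still be imputed by the methods that follow.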
df_clean = df[df.isnull().sum(axis=1) < 4].copy()
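The difference between strict deletion and a threshold is easy to miss, so here is a small comparison on a hypothetical four-row frame (values invented for illustration).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Temp": [25.1, np.nan, np.nan, 24.1],
    "Humidity": [99.0, np.nan, 96.0, 68.0],
    "Wind": [0.0, np.nan, 0.0, 1.0],
    "Crowdedness": [0.14, np.nan, 0.21, 0.30],
})

# Strict listwise deletion: keep only fully observed rows.
strict = toy.dropna()

# Lenient variant: drop a row only if it has four or more gaps.
lenient = toy[toy.isnull().sum(axis=1) < 4]

# Same idea via dropna: keep rows with at least three non-missing values.
at_least_three = toy.dropna(thresh=3)

print(strict.index.tolist())          # rows kept by strict deletion
print(lenient.index.tolist())         # rows kept by the lenient filter
print(at_least_three.index.tolist())  # rows kept by the thresh filter
```

Strict deletion keeps two rows here, while the lenient filters keep three and leave one Temp gap to be imputed later.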
Simple Imputation (Mean & Mode)
Mean Imputation: Used for numerical variables.
Mode Imputation: Used for categorical variables.
df_clean['Humidity'] = df_clean['Humidity'].fillna(df_clean['Humidity'].mean())
df_clean['Holiday'] = df_clean['Holiday'].fillna(df_clean['Holiday'].mode()[0])
Linear Interpolation
Estimates missing values by assuming a linear relationship between known data points.
df_clean['Temp'] = df_clean['Temp'].interpolate(method='linear')
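A tiny worked example, with made-up temperatures, shows what the interpolation does. With method='linear' pandas treats the rows as equally spaced, so two gaps between 25.0 and 28.0 are filled in equal steps. Passing limit_area='inside' is an optional choice made in this sketch so that gaps at the edges, which have no neighbor on one side, are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a leading gap, two interior gaps, and a trailing gap.
temps = pd.Series([np.nan, 25.0, np.nan, np.nan, 28.0, np.nan], name="Temp")

# Fill only the gaps that sit between two known values.
inside_only = temps.interpolate(method="linear", limit_area="inside")

print(inside_only.tolist())
```

The interior gaps become 26.0 and 27.0. The edge gaps stay missing and need another method, such as a forward or backward fill.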
Forward/Backward Fill
Fills missing values based on neighboring values (last known or next).
df_clean['Outlook'] = df_clean['Outlook'].ffill().bfill()
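The order of the two fills matters. Here is a short sketch on an invented Outlook series. The limit=1 variant is an optional extra, shown in case you want to avoid carrying a value across long gaps.

```python
import pandas as pd

# Hypothetical series with a leading gap, a two-row gap, and a trailing gap.
outlook = pd.Series([None, "rainy", None, None, "sunny", None], name="Outlook")

forward = outlook.ffill()         # carry the last known value forward
both = outlook.ffill().bfill()    # then back-fill what is still missing at the start
limited = outlook.ffill(limit=1)  # carry forward at most one row per gap

print(forward.tolist())
print(both.tolist())
print(limited.tolist())
```

The forward fill alone cannot fix the first row, because nothing comes before it. The backward fill covers that case.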
Constant Value Imputation
Fills missing values with a specified constant (e.g., -1 for missing wind).
df_clean['Wind'] = df_clean['Wind'].fillna(-1)
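One caveat with a sentinel like -1: a model may read it as a real measurement. A common workaround, sketched here with an invented Wind column and a hypothetical Wind_missing flag, is to record which rows were missing before filling them.

```python
import numpy as np
import pandas as pd

wind = pd.DataFrame({"Wind": [0.0, np.nan, 1.0, np.nan]})

# Record which rows were missing BEFORE overwriting them.
wind["Wind_missing"] = wind["Wind"].isna().astype(int)

# Then fill with the constant.
wind["Wind"] = wind["Wind"].fillna(-1)

print(wind)
```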
KNN Imputation
Estimates missing values based on similar samples using K-Nearest Neighbors (KNN).
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=3)
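# KNN finds similar rows by comparing the other features, so impute the numeric columns together;
# a single column gives it nothing to compare rows on
num_cols = ['Temp', 'Humidity', 'Wind', 'Crowdedness']
df_clean[num_cols] = knn_imputer.fit_transform(df_clean[num_cols])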
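To see what "similar samples" means in practice, here is a minimal sketch on an invented five-row frame. The last row is missing Crowdedness, and its Temp and Humidity sit close to the first two rows, so with n_neighbors=2 the imputed value is the average of those two neighbors.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: two cool, humid days, two hot, dry days, one unknown.
toy = pd.DataFrame({
    "Temp": [20.0, 21.0, 30.0, 31.0, 20.5],
    "Humidity": [90.0, 88.0, 60.0, 62.0, 89.0],
    "Crowdedness": [0.10, 0.20, 0.80, 0.90, np.nan],
})

imputer = KNNImputer(n_neighbors=2)  # default uniform weights: plain neighbor average
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# The two nearest rows (by Temp and Humidity) have Crowdedness 0.10 and 0.20.
print(round(imputed.loc[4, "Crowdedness"], 2))
```

Distances are computed on the raw values, so a feature with a large range can dominate. Scaling the features first is often worth considering.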
Different imputation methods serve different purposes, and choosing the right method requires understanding your data. No method is perfect, and each comes with trade-offs, such as potential biases or data distortion.
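As one example of such a trade-off, mean imputation keeps the column average intact but shrinks its spread, since every filled value sits exactly at the center. This is a rough sketch on simulated humidity readings. The distribution and the 30% missing rate are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
humidity = pd.Series(rng.normal(80, 10, 500))  # simulated, fully observed

# Knock out roughly 30% of the values completely at random.
with_gaps = humidity.copy()
with_gaps[rng.random(500) < 0.3] = np.nan

mean_filled = with_gaps.fillna(with_gaps.mean())

print(f"mean before: {with_gaps.mean():.2f}, after: {mean_filled.mean():.2f}")
print(f"std  before: {with_gaps.std():.2f}, after: {mean_filled.std():.2f}")
```

The standard deviation drops after filling, which can make downstream estimates look more certain than the data justifies.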
Top 5 AI Tools for Note-Taking
Notta:
Supports 58 languages.
Real-time transcription and AI-powered summaries.
Integrates with Zoom, Google Meet, and Teams.
Free plan; Pro at $9/month.
Otter:
Speaker identification and AI-based summaries.
Only supports English.
Free version available; Pro at $8.33/month.
Rev:
99% accuracy with AI and human transcription.
Complies with FCC/ADA regulations.
Free for 45 minutes; manual transcription at $1.50/minute.
Gong:
Best for sales coaching and customer insights.
Integrates with Zoom, Teams.
No fixed pricing; customized rates.
Avoma:
Designed for sales and customer success teams.
Integrates with CRM systems like Salesforce and HubSpot.
Free plan; Paid plans start at $24/user per month.
If you are interested in contributing to the newsletter, respond to this email. We are looking for contributions from you, our readers, to keep the community alive and going.