Duplicate Detection with Generative AI

Apple rumored to launch AI-powered home device

July 08, 2024

Welcome to the learning edition of the Data Pragmatist, your dose of all things data science and AI.

📖 Estimated Reading Time: 5 minutes.

🍎 Apple rumored to launch AI-powered home device

Apple is rumored to be developing a new home device that merges the functionalities of the HomePod and Apple TV, supported by "Apple Intelligence" and potentially featuring the upcoming A18 chip, according to recent code discoveries.
Identified as "HomeAccessory17,1," this device is expected to include a speaker and LCD screen, positioning it to compete with Amazon's Echo Show and Google's Nest series.
The smart device is anticipated to serve as a smart home hub, allowing users to control HomeKit devices, and it may integrate advanced AI features announced for iOS 18, iPadOS 18, and macOS Sequoia, including capabilities powered by OpenAI's GPT-4 to enhance Siri's responses.

💥 Google considered blocking Safari users from accessing its new AI features

Google considered limiting access to its new AI Overviews feature on Safari but ultimately decided not to follow through with the plan, according to a report by The Information.
The ongoing Justice Department investigation into Google's dominance in search highlights the company's arrangement with Apple, where Google pays around $20 billion annually to be the default search engine on iPhones.
Google has been trying to reduce its dependency on Safari by encouraging iPhone users to switch to its own apps, but the company has faced challenges because Safari comes pre-installed on Apple devices.

🧠 Duplicate Detection with Generative AI

Customer data in CRM systems often contains duplicate records, which complicate downstream processes. Traditional NLP-based Entity Matching (EM) techniques address this by comparing records pairwise, but advances in Large Language Models (LLMs) and Generative AI can significantly improve duplicate detection and resolution. My research on benchmark datasets showed an improvement in de-duplication accuracy from 30% with traditional NLP to nearly 60% using my proposed method.

Traditional Approach

Data Preparation: This step involves cleaning data by removing non-ASCII characters, capitalizing, and tokenizing text to prepare it for NLP algorithms.
Candidate Generation: Candidate pairs are generated by pairing every record with every other record, excluding self-comparisons. To reduce the number of candidates, blocking is employed.
Blocking: Blocking eliminates candidate pairs that cannot be duplicates because they differ in specific columns, such as city names in customer records.
Matching: Candidate records are compared using traditional NLP similarity metrics to identify potential matches.
Clustering: Matched records are grouped into clusters for easier identification.
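
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the research: the sample records, the field names, the 0.7 threshold, and the use of the standard library's difflib.SequenceMatcher as the similarity metric are all choices made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "name": "John Hartley Smith", "address": "20 Main Street", "city": "London"},
    {"id": 2, "name": "Jon H. Smith", "address": "20 Main St", "city": "London"},
    {"id": 3, "name": "Mary Jones", "address": "5 High Road", "city": "Leeds"},
    {"id": 4, "name": "John Hartley Smith", "address": "20 Main Street", "city": "Leeds"},
]

def prepare(text):
    """Data preparation: drop non-ASCII characters, upper-case, tokenize."""
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    return ascii_only.upper().split()

def similarity(a, b):
    """Matching: character-level similarity ratio between two prepared records."""
    text_a = " ".join(prepare(a["name"] + " " + a["address"]))
    text_b = " ".join(prepare(b["name"] + " " + b["address"]))
    return SequenceMatcher(None, text_a, text_b).ratio()

def find_duplicates(records, block_key="city", threshold=0.7):
    # Candidate generation with blocking: pair records only within a block.
    candidates = [
        (a, b) for a, b in combinations(records, 2)
        if a[block_key] == b[block_key]
    ]
    # Clustering: union-find merges every matched pair into one group.
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in candidates:
        if similarity(a, b) >= threshold:
            parent[find(b["id"])] = find(a["id"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(clusters.values())

clusters = find_duplicates(records)
print(clusters)
```

Records 1 and 2 end up in the same cluster. Record 4 is never compared with record 1 because its city differs, which shows both the saving that blocking brings and the matches it can hide when the blocked column itself is wrong.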

Proposed Method

Create Match Sentences: Attributes of interest are concatenated into a single sentence, e.g., "John Hartley Smith 20 Main Street London."
Create Embedding Vectors: Match sentences are encoded into vector space using embedding models like all-mpnet-base-v2, resulting in 768-dimensional vectors.
Clustering: Embedding vectors are clustered using DBSCAN with a distance metric such as the L2 norm or cosine distance and an epsilon threshold. A cluster groups records that lie within the epsilon distance of one another and share the same blocked column value.
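
The three steps can be sketched as follows. To keep the example self-contained, it makes several assumptions that are not part of the method described above: a character-trigram count vector stands in for the all-mpnet-base-v2 embedding, DBSCAN is written out by hand with cosine distance, blocking is applied by clustering each block separately, and the sample records, eps=0.4, and min_samples=2 are illustrative values. In a real pipeline one would more likely encode the sentences with the sentence-transformers library and cluster with scikit-learn's DBSCAN.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "name": "John Hartley Smith", "address": "20 Main Street", "city": "London"},
    {"id": 2, "name": "Jon H. Smith", "address": "20 Main St", "city": "London"},
    {"id": 3, "name": "Mary Jones", "address": "5 High Road", "city": "London"},
    {"id": 4, "name": "John Hartley Smith", "address": "20 Main Street", "city": "Leeds"},
]

def match_sentence(record, fields=("name", "address", "city")):
    """Concatenate the attributes of interest into one sentence."""
    return " ".join(str(record[f]) for f in fields)

def embed(sentence):
    """Stand-in embedding (assumption): sparse character-trigram counts.
    A real pipeline would call a sentence-embedding model here."""
    text = sentence.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_distance(u, v):
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return 1.0 - dot / norm if norm else 1.0

def dbscan(vectors, eps, min_samples=2):
    """Minimal DBSCAN; returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        return [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise unless a later cluster claims it
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
        cluster += 1
    return labels

def cluster_by_block(records, block_key="city", eps=0.4):
    """Cluster each block separately so clusters never span block values."""
    blocks = defaultdict(list)
    for r in records:
        blocks[r[block_key]].append(r)
    result, offset = {}, 0
    for members in blocks.values():
        labels = dbscan([embed(match_sentence(r)) for r in members], eps)
        for r, label in zip(members, labels):
            result[r["id"]] = label if label == -1 else label + offset
        offset += max(labels) + 1  # keep cluster ids unique across blocks
    return result

labels = cluster_by_block(records)
print(labels)
```

Records that share a non-negative label are treated as duplicates of one another, and a label of -1 marks a record with no neighbor inside epsilon. The value of epsilon depends heavily on the embedding model and the distance metric, so it would need tuning on labeled data.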

Experiments and Results

Testing the proposed method on customer data and the MusicBrainz 200K benchmark dataset showed that it outperformed the traditional approach. Visualizations of the clusters, produced with UMAP dimensionality reduction, supported the effectiveness of this approach.

Top Data Influencers to Follow in 2024

Barr Moses
- Role: CEO and co-founder of Monte Carlo
- Specialty: Data reliability and data observability
- Notable Contributions: Co-author of "Data Quality Fundamentals", coined the term “data downtime”
- Platforms: Twitter
Ben Rogojan
- Alias: Seattle Data Guy
- Role: Data engineering solutions architect, consultant
- Specialty: Data architecture and statistics
- Background: Ex-Meta data engineer
- Platforms: LinkedIn (44K followers), YouTube (50K subscribers), Medium, Twitter
- Content: Tips on data engineering, industry trends
Benn Stancil
- Role: Co-founder and Chief Analytics Officer at Mode
- Specialty: Collaborative business intelligence and interactive data science
- Background: Ex-Microsoft senior analyst
- Platforms: Substack, Twitter
- Contributions: Industry discussions, particularly on the metrics layer in the modern data stack
Bruno Aziza
- Role: Head of Data and Analytics at Google Cloud
- Specialty: Big data analytics, leadership in tech
- Background: Leadership roles at Microsoft, Oracle, and Google
- Platforms: YouTube, Medium, Twitter
- Content: AI, data, and analytics news and trends
Cassie Kozyrkov
- Role: Chief Data Scientist at Google
- Specialty: Data science, decision intelligence
- Contributions: Founder of Decision Intelligence Engineering at Google, trained over 20K Googlers
- Recognition: Widely recognized as a top advisor for AI and data-driven initiatives

If you are interested in contributing to the newsletter, respond to this email. We are looking for contributions from you, our readers, to keep the community alive and going.
