Duplicate Detection with Generative AI
Apple rumored to launch AI-powered home device
Welcome to the learning edition of the Data Pragmatist, your dose of all things data science and AI.
📖 Estimated Reading Time: 5 minutes. Missed our previous editions?
🍎 Apple rumored to launch AI-powered home device
Apple is rumored to be developing a new home device that merges the functionalities of the HomePod and Apple TV, supported by "Apple Intelligence" and potentially featuring the upcoming A18 chip, according to recent code discoveries.
Identified as "HomeAccessory17,1," this device is expected to include a speaker and LCD screen, positioning it to compete with Amazon's Echo Show and Google's Nest series.
The smart device is anticipated to serve as a smart home hub, allowing users to control HomeKit devices, and it may integrate advanced AI features announced for iOS 18, iPadOS 18, and macOS Sequoia, including capabilities powered by OpenAI's GPT-4 to enhance Siri's responses.
💥 Google considered blocking Safari users from accessing its new AI features
Google considered limiting access to its new AI Overviews feature on Safari but ultimately decided not to follow through with the plan, according to a report by The Information.
The ongoing Justice Department investigation into Google's dominance in search highlights the company's arrangement with Apple, where Google pays around $20 billion annually to be the default search engine on iPhones.
Google has been trying to reduce its dependency on Safari by encouraging iPhone users to switch to its own apps, but the company has faced challenges because Safari comes pre-installed on Apple devices.
🧠 Duplicate Detection with Generative AI
Customer data in CRM systems often contains duplicate records, which complicates downstream processes. Traditional NLP-based Entity Matching (EM) techniques address this by comparing records pairwise, but advances in Large Language Models (LLMs) and Generative AI can significantly improve duplicate detection and resolution. My research on benchmark datasets showed an improvement in de-duplication accuracy from 30% with traditional NLP to nearly 60% using my proposed method.
Traditional Approach
Data Preparation: This step involves cleaning data by removing non-ASCII characters, capitalizing, and tokenizing text to prepare it for NLP algorithms.
Candidate Generation: Candidate pairs are generated by pairing every record with every other record, excluding self-comparisons. Because the number of pairs grows quadratically with the number of records, blocking is used to reduce it.
Blocking: Blocking eliminates candidate pairs that cannot be duplicates because they differ in a specific column, such as the city name in customer records.
Matching: Candidate records are compared using traditional NLP similarity metrics to identify potential matches.
Clustering: Matched records are grouped into clusters for easier identification.
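The five steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the implementation behind the reported results: the field names, the sample records, the Jaccard token overlap used as the similarity metric, the 0.6 threshold, and the union-find clustering are all choices made for this example.

```python
import itertools
import re
from collections import defaultdict


def prepare(text):
    """Data preparation: drop non-ASCII characters, uppercase, tokenize."""
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    return re.findall(r"[A-Z0-9]+", ascii_only.upper())


def jaccard(a, b):
    """Token-set overlap in [0, 1]; a stand-in for any NLP similarity metric."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def find_duplicates(records, block_key, fields, threshold=0.6):
    # Blocking: only records with the same normalized block_key are compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[" ".join(prepare(rec[block_key]))].append(idx)

    tokens = [prepare(" ".join(rec[f] for f in fields)) for rec in records]

    # Union-find structure used for the clustering step.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Candidate generation (all pairs inside a block) and matching.
    for members in blocks.values():
        for i, j in itertools.combinations(members, 2):
            if jaccard(tokens[i], tokens[j]) >= threshold:
                parent[find(i)] = find(j)

    # Clustering: group matched records, keep clusters with 2+ members.
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(c for c in clusters.values() if len(c) > 1)


# Hypothetical sample data, not taken from any benchmark.
records = [
    {"name": "John Hartley Smith", "street": "20 Main Street", "city": "London"},
    {"name": "John H. Smith", "street": "20 Main Street", "city": "london"},
    {"name": "John Hartley Smith", "street": "20 Main Street", "city": "Leeds"},
    {"name": "Zoë Müller", "street": "5 High Road", "city": "London"},
]
duplicates = find_duplicates(records, "city", ["name", "street"])
print(duplicates)
```

Here the two London records for John Smith are grouped, while the Leeds record is never compared with them because blocking on city keeps it in a separate block.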
Proposed Method
Create Match Sentences: Attributes of interest are concatenated into a single sentence, e.g., "John Hartley Smith 20 Main Street London."
Create Embedding Vectors: Match sentences are encoded into vector space using embedding models like all-mpnet-base-v2, resulting in 768-dimensional vectors.
Clustering: Embedding vectors are clustered using DBSCAN, with a distance metric such as L2 (Euclidean) distance or cosine distance and an epsilon threshold. Clusters are formed by grouping records that lie within the epsilon distance of each other and share the same blocked column value.
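A minimal sketch of these three steps follows, again in standard-library Python so it runs without extra packages. Several things here are assumptions made for illustration: the field names and sample records, the epsilon of 0.05, the min_samples value of 2, and above all the three-dimensional toy vectors, which stand in for the 768-dimensional embeddings a model like all-mpnet-base-v2 would produce. The small DBSCAN routine is a plain textbook version written for this example; in practice a library implementation would likely be used instead.

```python
import math
from collections import defaultdict


def match_sentence(record, fields):
    """Concatenate the attributes of interest into one sentence."""
    return " ".join(str(record[f]) for f in fields)


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dbscan(vectors, eps, min_samples=2):
    """Plain DBSCAN; returns one label per vector, with -1 meaning noise."""
    n = len(vectors)
    # Each point counts as its own neighbour (distance 0).
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: expand
    return labels


def cluster_duplicates(records, vectors, block_key, eps=0.05):
    """Run DBSCAN separately inside each block (same block_key value)."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[block_key]].append(idx)
    clusters = []
    for members in blocks.values():
        labels = dbscan([vectors[i] for i in members], eps)
        grouped = defaultdict(list)
        for idx, label in zip(members, labels):
            if label != -1:
                grouped[label].append(idx)
        clusters.extend(grouped.values())
    return sorted(clusters)


# Hypothetical sample data.
records = [
    {"name": "John Hartley Smith", "street": "20 Main Street", "city": "London"},
    {"name": "John H. Smith", "street": "20 Main Street", "city": "London"},
    {"name": "John Hartley Smith", "street": "20 Main Street", "city": "Leeds"},
    {"name": "Zoe Muller", "street": "5 High Road", "city": "London"},
]
sentences = [match_sentence(r, ["name", "street", "city"]) for r in records]

# Toy vectors standing in for model output (assumption for this sketch).
vectors = [
    [0.90, 0.10, 0.05],
    [0.88, 0.12, 0.06],
    [0.89, 0.11, 0.05],
    [0.05, 0.20, 0.95],
]
clusters = cluster_duplicates(records, vectors, "city")
print(sentences[0])
print(clusters)
```

With these toy vectors, running DBSCAN without blocking would also pull the Leeds record into the John Smith cluster, which is why the grouping is restricted to records that share the blocked column value.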
Experiments and Results
Tests on customer data and on the MusicBrainz 200K benchmark dataset showed the proposed method outperforming the traditional approach. Visualizations of the clusters, produced with UMAP dimensionality reduction, supported this result.
Top Data Influencers to Follow in 2024
Barr Moses
Role: CEO and co-founder of Monte Carlo
Specialty: Data reliability and data observability
Notable Contributions: Co-author of "Data Quality Fundamentals", coined the term “data downtime”
Platforms: Twitter
Predictions for 2023: Available here
Ben Rogojan
Alias: Seattle Data Guy
Role: Data engineering solutions architect, consultant
Specialty: Data architecture and statistics
Background: Ex-Meta data engineer
Platforms: LinkedIn (44K followers), YouTube (50K subscribers), Medium, Twitter
Content: Tips on data engineering, industry trends
Benn Stancil
Role: Co-founder and Chief Analytics Officer at Mode
Specialty: Collaborative business intelligence and interactive data science
Background: Ex-Microsoft senior analyst
Platforms: Substack, Twitter
Contributions: Industry discussions, particularly on the metrics layer in the modern data stack
Bruno Aziza
Role: Head of Data and Analytics at Google Cloud
Specialty: Big data analytics, leadership in tech
Background: Leadership roles at Microsoft, Oracle, and Google
Platforms: YouTube, Medium, Twitter
Content: AI, data, and analytics news and trends
Cassie Kozyrkov
Role: Chief Data Scientist at Google
Specialty: Data science, decision intelligence
Contributions: Founder of Decision Intelligence Engineering at Google, trained over 20K Googlers
Recognition: Widely recognized as a top advisor for AI and data-driven initiatives
If you are interested in contributing to the newsletter, respond to this email. We are looking for contributions from you, our readers, to keep the community alive and going.