The Problem with ‘Dirty Data’
Nvidia releases free LLMs that match GPT-4 benchmarks


Welcome to the learning edition of the Data Pragmatist, your dose of all things data science and AI.
📖 Estimated Reading Time: 5 minutes.
🤔 OpenAI considers for-profit status
- OpenAI CEO Sam Altman revealed that the company is considering becoming a for-profit benefit corporation, potentially facilitating an IPO and granting Altman a stake in the company. 
- Investors have been urging OpenAI to adopt this new status, which legally mandates balancing profit with societal and environmental responsibilities, potentially boosting the company's growth and market presence. 
- Switching to benefit corporation status could ease OpenAI's path to an IPO and put it on a similar footing to organizations such as Anthropic and xAI.
🔮 Nvidia releases free LLMs that match GPT-4 benchmarks
- Nvidia has unveiled Nemotron-4 340B, an open-source pipeline for generating synthetic data that aims to help developers create high-quality datasets for training large language models for commercial purposes. 
- The Nemotron-4 340B family includes base, instruction, and reward models, and it has shown performance comparable to or better than GPT-4 in various benchmarks, including MT-Bench, MMLU, and HumanEval. 
- The models are optimized for inference using Nvidia NeMo and the Nvidia TensorRT-LLM library, and they are available under an open license for commercial use with all data accessible on Hugging Face.
🧠 The Problem with ‘Dirty Data’ — How Data Quality Can Impact Life Science AI Adoption
Data Quality Issues
Life science AI models often underperform because of poor-quality data. Despite the vast amount of data available, much of it remains unstructured and of varying quality. This includes patient data from diverse online sources and genomic data, which has surged to over 40 exabytes but lacks the quality needed for effective AI training.

Regulatory and Compliance Hurdles
Data access is hindered by stringent regulations like GDPR and CCPA, limiting sharing and use in AI model training. Compliance ambiguity further complicates data utilization across different regions, resulting in incomplete datasets for model development.
Unclean and Inconsistent Data
A significant portion of life science data is 'dirty' (inaccurate, incomplete, or inconsistent) and requires extensive cleaning before use. Examples include unstructured reports and electronic medical records in varying formats, which must be converted to standardized formats before they can be used for AI training.
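As a rough illustration of what standardizing formats can involve, the sketch below maps a few records onto one schema and flags missing fields. The records, field names, date formats, and unit conversions are all hypothetical and chosen only for this example; real clinical data would need far more careful handling.

```python
from datetime import datetime

# Hypothetical raw records, invented for illustration only.
RAW_RECORDS = [
    {"patient_id": "P001", "sex": "F", "visit_date": "2024-03-05", "weight": "70 kg"},
    {"patient_id": "P002", "sex": "female", "visit_date": "05/03/2024", "weight": "154 lb"},
    {"patient_id": "P003", "sex": "", "visit_date": "2024-03-07", "weight": "68kg"},
]

SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats present in the data
LB_TO_KG = 0.45359237


def parse_date(text):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_weight_kg(text):
    """Convert strings such as '70 kg' or '154 lb' to kilograms, else None."""
    value = text.strip().lower()
    for unit, factor in (("kg", 1.0), ("lb", LB_TO_KG)):
        if value.endswith(unit):
            try:
                return round(float(value[: -len(unit)]) * factor, 1)
            except ValueError:
                return None
    return None


def harmonize(record):
    """Map one raw record onto a single schema and list its missing fields."""
    clean = {
        "patient_id": record.get("patient_id") or None,
        "sex": SEX_CODES.get(record.get("sex", "").strip().lower()),
        "visit_date": parse_date(record.get("visit_date", "")),
        "weight_kg": parse_weight_kg(record.get("weight", "")),
    }
    # Incomplete records are flagged rather than silently dropped.
    clean["missing"] = sorted(k for k, v in clean.items() if v is None)
    return clean


cleaned = [harmonize(r) for r in RAW_RECORDS]
for row in cleaned:
    print(row)
```

Flagging incomplete records instead of discarding them keeps the decision about what to exclude visible to whoever trains the model.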

Bias in AI Models
Biases in training data carry over into AI model outcomes. Machine learning models trained on biased datasets can perpetuate or even amplify those biases, undermining objectivity in critical life science applications.
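One simple, partial check for dataset bias is to compare how often each group appears in the training data against a reference population. The sketch below assumes hypothetical group labels, reference shares, and a 10-percentage-point tolerance; it only measures representation and says nothing about whether a trained model behaves fairly.

```python
from collections import Counter


def representation_gaps(groups, reference_shares, tolerance=0.10):
    """Return groups whose share of the data differs from the reference
    share by more than `tolerance`, as {group: (observed, expected)}."""
    counts = Counter(groups)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = (round(observed, 2), expected)
    return gaps


# Hypothetical group labels and reference shares, for illustration only.
sample_groups = ["A"] * 70 + ["B"] * 25 + ["C"] * 5
reference = {"A": 0.50, "B": 0.30, "C": 0.20}

gaps = representation_gaps(sample_groups, reference)
print(gaps)
```

In this made-up sample, group A is over-represented and group C is under-represented relative to the assumed reference, while group B falls within the tolerance.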
Path to Effective AI Adoption
Successful AI adoption in life science demands a robust data strategy focused on quality over quantity. Companies must prioritize data cleaning and harmonization efforts to ensure AI models are built on reliable foundations. This approach mitigates risks associated with biased decision-making and regulatory non-compliance, paving the way for AI to deliver substantial business value in enhancing patient outcomes and operational efficiency.
Top 5 Data Analytics Tools
- ThoughtSpot
  - AI-powered natural language search
  - Interactive Liveboards and reporting
  - Seamless integration with cloud data sources
  - Strong security and governance controls
- Mode
  - SQL, Python, and R Notebooks
  - Visual Explorer for interactive visualizations
  - Connectivity with multiple databases
  - AI-based assistance
- Looker
  - Semantic layer capabilities
  - Data visualization and embedded analytics
  - Integration with Google Workspace
  - Multiple database support
- Tableau
  - Advanced data visualization
  - Drag-and-drop interface
  - Native data connectors
  - Data preparation and exploration features
- Sisense
  - Traditional dashboard-based reporting
  - AI-powered exploration and explanations
  - Integration with cloud data sources
  - Low-code interface with APIs and SDKs

If you are interested in contributing to the newsletter, respond to this email. We are looking for contributions from you, our readers, to keep the community alive and going.