Explore and Transform Your Data: A PandasAI + Datatune Tutorial
Heart Disease Risk Analysis with PandasAI and Datatune
Data analysis workflows have changed significantly with the introduction of Large Language Models. Instead of writing complex pandas operations, we can now use natural language to explore data and perform intelligent transformations. This tutorial demonstrates how to build a heart disease risk analyzer using PandasAI for conversational data exploration and Datatune for AI-powered data enrichment.
PandasAI Github: https://github.com/sinaptik-ai/pandas-ai/
Datatune Github: https://github.com/vitalops/datatune
What We'll Build
Our heart disease analyzer will:
Load medical data and explore it using natural language queries
Use AI to generate risk assessments and patient categories
Filter high-risk patients for targeted analysis
Save structured insights for healthcare teams
Prerequisites and Setup
Install the required packages:
pip install pandas pandasai datatune dask python-dotenv
Create a .env file with your Azure OpenAI credentials:
AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_API_BASE=https://your-resource.openai.azure.com/
AZURE_API_VERSION=2025-01-01-preview
Building the Heart Disease Analyzer
1. Initial Setup and Configuration
#!/usr/bin/env python3
import pandas as pd
import pandasai as pai
import datatune as dt
import dask.dataframe as dd
from datatune.llm.llm import Azure
from pandasai_openai import AzureOpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()


class HeartDiseaseAnalyzer:
    def __init__(self):
        api_key = os.getenv("AZURE_OPENAI_API_KEY")
        api_base = os.getenv("AZURE_API_BASE")
        api_version = os.getenv("AZURE_API_VERSION", "2025-01-01-preview")

        self.llm = Azure(
            model_name="gpt-4o-mini",
            api_key=api_key,
            api_base=api_base,
            api_version=api_version,
        )

        pandas_llm = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=api_base,
            api_version=api_version,
            deployment_name="gpt-4o-mini",
        )
        pai.config.set({"llm": pandas_llm})

        self.df = None
        self.pai_df = None
        self.enriched_df = None
        self.high_risk_df = None
        self.agent = None
This setup creates two separate LLM clients: one for Datatune transformations and one for PandasAI exploration. Both point at the same Azure OpenAI service, but each library expects its own wrapper class.
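Both clients will silently accept a missing API key and only fail later, mid-request. A small fail-fast check on the environment avoids that; this is a minimal sketch (the helper name `missing_azure_vars` is ours, not part of either library), using the same variable names as the .env file above:

```python
import os

def missing_azure_vars(required=("AZURE_OPENAI_API_KEY", "AZURE_API_BASE")):
    """Return the names of required settings that are unset or empty."""
    return [name for name in required if not os.getenv(name)]

# Example: check before building any LLM client.
problems = missing_azure_vars()
if problems:
    print("Set these before running:", ", ".join(problems))
```

Calling this at the top of `__init__` turns a cryptic authentication error into an immediate, readable message.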
2. Data Loading with PandasAI Integration
def load_data(self, filepath="heart.csv"):
    self.df = pd.read_csv(filepath)
    self.pai_df = pai.DataFrame(self.df)
    self.agent = pai.Agent(self.pai_df)
    return self.df
We load data into both regular pandas and PandasAI DataFrames. The pai.Agent creates a conversational interface that allows us to ask questions about our data in plain English while maintaining access to traditional pandas operations.
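This tutorial assumes the common Kaggle/UCI heart-disease layout for heart.csv (columns such as age, sex, cp, trestbps, chol, target). A quick structural check after loading catches a mismatched file early; this is a sketch using a tiny synthetic frame in place of the real CSV, and `validate_schema` is an illustrative helper of ours:

```python
import pandas as pd

# Columns this tutorial assumes (Kaggle/UCI heart disease layout).
EXPECTED = {"age", "sex", "cp", "trestbps", "chol", "target"}

def validate_schema(df: pd.DataFrame, expected=EXPECTED):
    """Return any expected columns missing from the loaded file."""
    return sorted(expected - set(df.columns))

# Tiny synthetic frame standing in for heart.csv.
demo = pd.DataFrame({
    "age": [52, 61], "sex": [1, 0], "cp": [0, 2],
    "trestbps": [125, 140], "chol": [212, 289], "target": [0, 1],
})
print(validate_schema(demo))  # → [] when every expected column is present
```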
3. Conversational Data Exploration
def explore_data(self):
    questions = [
        "What's the distribution of heart disease cases?",
        "What's the average age of patients with and without heart disease?",
        "Which chest pain types are most common in heart disease patients?",
        "What's the correlation between cholesterol levels and heart disease?"
    ]
    insights = {}
    for question in questions:
        response = self.pai_df.chat(question)
        insights[question] = str(response)
    with open("exploration_insights.json", "w") as f:
        json.dump(insights, f, indent=2)
    return insights
Instead of writing pandas code by hand, we ask direct questions about our data. PandasAI uses the LLM to interpret each question in context and generate the appropriate statistical analysis. The results are saved as JSON for later review and reporting.
You can ask questions like:
"What percentage of patients have high blood pressure?"
"Which features correlate most strongly with heart disease?"
4. Intelligent Data Enrichment with Datatune
def enrich_data(self):
    dask_df = dd.from_pandas(self.df, npartitions=4)

    risk_profiled = dt.Map(
        prompt=(
            "For each patient, analyze their health metrics and assign:\n"
            "- `severity_score`: calculate a numeric score from 1-10 based on their combined risk factors (age, cholesterol, BP, etc.)\n"
            "- `age_category`: categorize age into [young (<40), middle-aged (40–60), senior (>60)]\n"
            "- `risk_profile`: based on their overall health indicators, assign [low-risk, moderate-risk, high-risk]\n"
            "Consider all available health metrics for comprehensive assessment."
        ),
        output_fields=["severity_score", "age_category", "risk_profile"]
    )(self.llm, dask_df)

    high_risk_patients = dt.Filter(
        prompt=(
            "Keep only patients who have either high severity scores (7+) "
            "or are classified as high-risk profile and are middle-aged."
        )
    )(self.llm, risk_profiled)

    self.enriched_df = dt.finalize(risk_profiled.compute())
    self.high_risk_df = dt.finalize(high_risk_patients.compute())
    return self.enriched_df
This is where Datatune shows its strength. The Map operation creates new columns (and can also replace existing ones) by analyzing multiple health metrics simultaneously; the AI understands context and generates risk assessments grounded in medical terminology. The Filter operation uses natural language to apply filtering logic that would otherwise require contextual understanding of the data, not just simple boolean conditions.
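Because LLM-generated scores can drift between runs, a deterministic rule-based baseline is useful for sanity-checking columns like `severity_score` and `age_category`. The sketch below uses made-up thresholds for illustration only (they are not clinical rules), but the age buckets mirror the prompt above:

```python
import pandas as pd

def baseline_age_category(age: int) -> str:
    """Mirror the prompt's buckets: young (<40), middle-aged (40-60), senior (>60)."""
    if age < 40:
        return "young"
    if age <= 60:
        return "middle-aged"
    return "senior"

def baseline_severity(row) -> int:
    """Crude 1-10 score from a few risk factors; thresholds are illustrative only."""
    score = 1
    score += 2 if row["age"] > 60 else 0
    score += 2 if row["chol"] > 240 else 0
    score += 2 if row["trestbps"] > 140 else 0
    return min(score, 10)

df = pd.DataFrame({"age": [35, 67], "chol": [180, 260], "trestbps": [120, 150]})
df["age_category"] = df["age"].map(baseline_age_category)
df["baseline_severity"] = df.apply(baseline_severity, axis=1)
print(df)
```

Comparing `baseline_severity` against the LLM's `severity_score` (for example, flagging rows that differ by more than a few points) gives a cheap signal when the model's output needs a second look.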
The use of Dask allows us to process datasets larger than memory by splitting them into partitions that can be processed in parallel.
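Conceptually, `dd.from_pandas(self.df, npartitions=4)` just slices the frame into row blocks that are then handled independently. A pure-pandas sketch of that split (Dask's actual partitioning is index-based and more sophisticated, so treat this as an approximation):

```python
import pandas as pd

def split_into_partitions(df: pd.DataFrame, npartitions: int):
    """Split a DataFrame into roughly equal row blocks, like Dask partitions."""
    size, extra = divmod(len(df), npartitions)
    parts, start = [], 0
    for i in range(npartitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(df.iloc[start:end])
        start = end
    return parts

df = pd.DataFrame({"x": range(10)})
parts = split_into_partitions(df, 4)
print([len(p) for p in parts])  # → [3, 3, 2, 2]
```

More partitions mean smaller per-call batches sent to the LLM (more requests, less data per request), so `npartitions` is a throughput knob as well as a memory one.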
5. Advanced Analysis of Enriched Data
def analyze_enriched(self):
    enriched_pai_df = pai.DataFrame(self.enriched_df)
    questions = [
        "What's the distribution of severity scores across the dataset?",
        "How do risk profiles correlate with actual heart disease cases?",
        "Which age category has the highest proportion of high-risk profiles?"
    ]
    insights = {}
    for question in questions:
        response = enriched_pai_df.chat(question)
        insights[question] = str(response)
    with open("enriched_analysis_insights.json", "w") as f:
        json.dump(insights, f, indent=2)
    return insights
After enriching our data, we can explore the new features to validate our AI-generated insights. This step helps us understand whether our enrichment strategy is working effectively and reveals new patterns in the data.
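One concrete validation is comparing the AI-assigned `risk_profile` against the actual `target` label. A sketch on synthetic enriched data (the column names follow the enrichment step above; real values would come from `dt.Map`):

```python
import pandas as pd

# Synthetic stand-in for the enriched frame.
enriched = pd.DataFrame({
    "risk_profile": ["low-risk", "high-risk", "high-risk", "moderate-risk"],
    "target": [0, 1, 1, 0],
})

# Share of actual heart-disease cases (target == 1) within each profile.
agreement = enriched.groupby("risk_profile")["target"].mean()
print(agreement)
```

If the high-risk group's disease rate is not clearly above the low-risk group's, the enrichment prompt likely needs rework before the filtered patient list is trusted.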
6. Results
def save_results(self):
    print("Saving results...")
    self.enriched_df.to_csv("enriched_heart_data.csv", index=False)
    self.high_risk_df.to_csv("high_risk_patients.csv", index=False)
We save both the enriched dataset and the filtered high-risk patients for further analysis or use in downstream applications.
7. Putting It All Together
def main():
    analyzer = HeartDiseaseAnalyzer()
    analyzer.load_data()
    analyzer.explore_data()
    analyzer.enrich_data()
    analyzer.analyze_enriched()
    analyzer.save_results()


if __name__ == "__main__":
    main()
Key Benefits of This Approach
PandasAI Advantages:
Non-technical stakeholders can explore data using natural language
Rapid prototyping without complex pandas code
AI can identify patterns that might be missed
Automatic visualization generation
Datatune Advantages:
AI understands domain context for better feature engineering
Natural language expressions for complex data operations
Built on Dask for scalable processing
Reproducible transformations with clear prompts
Conclusion
The combination of PandasAI and Datatune represents a shift toward more intuitive data analysis: technical and non-technical teams can explore the same data in plain language, and insights arrive faster and with richer context.
This approach is particularly valuable in domains such as healthcare and finance, where AI models can incorporate domain knowledge that would be difficult to encode in traditional rule-based systems.