
Explore and Transform Your Data: A PandasAI + Datatune Tutorial


Heart Disease Risk Analysis with PandasAI and Datatune

Data analysis workflows have changed significantly with the introduction of Large Language Models. Instead of writing complex pandas operations, we can now use natural language to explore data and perform intelligent transformations. This tutorial demonstrates how to build a heart disease risk analyzer using PandasAI for conversational data exploration and Datatune for AI-powered data enrichment.

PandasAI Github: https://github.com/sinaptik-ai/pandas-ai/

Datatune Github: https://github.com/vitalops/datatune

What We'll Build

Our heart disease analyzer will:

  • Load medical data and explore it using natural language queries

  • Use AI to generate risk assessments and patient categories

  • Filter high-risk patients for targeted analysis

  • Save structured insights for healthcare teams

Prerequisites and Setup

Install the required packages:

pip install pandas pandasai datatune dask python-dotenv

Create a .env file with your Azure OpenAI credentials:

AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_API_BASE=https://your-resource.openai.azure.com/
AZURE_API_VERSION=2025-01-01-preview

Building the Heart Disease Analyzer

1. Initial Setup and Configuration

#!/usr/bin/env python3
import pandas as pd
import pandasai as pai
import datatune as dt
import dask.dataframe as dd
from datatune.llm.llm import Azure
from pandasai_openai import AzureOpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()

class HeartDiseaseAnalyzer:
    def __init__(self):
        api_key = os.getenv("AZURE_OPENAI_API_KEY")
        api_base = os.getenv("AZURE_API_BASE")
        api_version = os.getenv("AZURE_API_VERSION", "2025-01-01-preview")

        self.llm = Azure(
            model_name="gpt-4o-mini",
            api_key=api_key,
            api_base=api_base,
            api_version=api_version,
        )

        pandas_llm = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=api_base,
            api_version=api_version,
            deployment_name='gpt-4o-mini'
        )

        pai.config.set({"llm": pandas_llm})

        self.df = None
        self.pai_df = None
        self.enriched_df = None
        self.high_risk_df = None
        self.agent = None

This setup creates two separate LLM instances: one for Datatune transformations and another for PandasAI exploration. Here we use Azure OpenAI for both tools.

2. Data Loading with PandasAI Integration

def load_data(self, filepath="heart.csv"):
    self.df = pd.read_csv(filepath)
    self.pai_df = pai.DataFrame(self.df)
    self.agent = pai.Agent(self.pai_df)
    return self.df

We load data into both regular pandas and PandasAI DataFrames. The pai.Agent creates a conversational interface that allows us to ask questions about our data in plain English while maintaining access to traditional pandas operations.
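To see what a conversational query replaces, here is the same question answered in traditional pandas. The data below is a small synthetic sample (column names follow the UCI heart dataset convention of `age`, `chol`, and `target`; your CSV may use different names):

```python
import pandas as pd

# Synthetic sample for illustration; 1 = heart disease present
df = pd.DataFrame({
    "age":    [63, 45, 58, 39, 70, 52],
    "chol":   [233, 180, 250, 199, 286, 210],
    "target": [1, 0, 1, 0, 1, 0],
})

# Plain-pandas equivalent of asking the agent:
# "What's the average age of patients with and without heart disease?"
avg_age = df.groupby("target")["age"].mean()
print(avg_age)
```

With PandasAI, the same result comes from `self.pai_df.chat("What's the average age of patients with and without heart disease?")`, with the groupby generated for you.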

3. Conversational Data Exploration

def explore_data(self):
    questions = [
        "What's the distribution of heart disease cases?",
        "What's the average age of patients with and without heart disease?",
        "Which chest pain types are most common in heart disease patients?",
        "What's the correlation between cholesterol levels and heart disease?"
    ]

    insights = {}
    for question in questions:
        response = self.pai_df.chat(question)
        insights[question] = str(response)

    with open("exploration_insights.json", "w") as f:
        json.dump(insights, f, indent=2)

    return insights

Instead of writing complex pandas code, we ask direct questions about our data. PandasAI uses the LLM to understand the context and generate the appropriate statistical analysis. The results are saved as JSON for later review and reporting.

You can ask questions like:

  • "What percentage of patients have high blood pressure?"

  • "Which features correlate most strongly with heart disease?"
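Under the hood, a question like the second one maps to a correlation computation. A minimal pandas sketch of that logic, on synthetic data (column names `chol` and `trestbps` are assumptions borrowed from the UCI heart dataset):

```python
import pandas as pd

# Synthetic sample for illustration
df = pd.DataFrame({
    "chol":     [233, 180, 250, 199, 286, 210],
    "trestbps": [145, 120, 140, 118, 160, 125],
    "target":   [1, 0, 1, 0, 1, 0],
})

# "Which features correlate most strongly with heart disease?"
# Absolute Pearson correlation of each feature with the label, strongest first.
corrs = df.drop(columns="target").corrwith(df["target"]).abs().sort_values(ascending=False)
print(corrs)
```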

4. Intelligent Data Enrichment with Datatune

def enrich_data(self):
    dask_df = dd.from_pandas(self.df, npartitions=4)

    risk_profiled = dt.Map(
        prompt=(
            "For each patient, analyze their health metrics and assign:\n"
            "- `severity_score`: calculate a numeric score from 1-10 based on their combined risk factors (age, cholesterol, BP, etc.)\n"
            "- `age_category`: categorize age into [young (<40), middle-aged (40–60), senior (>60)]\n"
            "- `risk_profile`: based on their overall health indicators, assign [low-risk, moderate-risk, high-risk]\n"
            "Consider all available health metrics for comprehensive assessment."
        ),
        output_fields=["severity_score", "age_category", "risk_profile"]
    )(self.llm, dask_df)

    high_risk_patients = dt.Filter(
        prompt=(
            "Keep only patients who have either high severity scores (7+) or are classified as high-risk profile and are middle-aged."
        )
    )(self.llm, risk_profiled)

    self.enriched_df = dt.finalize(risk_profiled.compute())
    self.high_risk_df = dt.finalize(high_risk_patients.compute())
    return self.enriched_df

This is where Datatune shows its strength. The Map operation creates new columns (and can also replace existing ones) by analyzing multiple health metrics simultaneously; the AI understands medical terminology and generates appropriate risk assessments. The Filter operation expresses, in natural language, filtering logic that would otherwise require contextual understanding of the data to encode.

The use of Dask allows us to process larger datasets efficiently through distributed computing.
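One practical caveat: LLM-generated fields often come back as strings, so it is worth coercing and validating them before downstream analysis. A minimal sketch, assuming the column names from the Map prompt above and some plausible raw outputs:

```python
import pandas as pd

# Hypothetical raw output from the Map step: numbers as strings,
# inconsistent casing, and an occasional unparseable value.
enriched = pd.DataFrame({
    "severity_score": ["8", "3", "not sure", "6"],
    "risk_profile":   ["high-risk", "low-risk", "moderate-risk", "HIGH-RISK"],
})

# Coerce scores to numeric (unparseable values become NaN) and normalize casing
enriched["severity_score"] = pd.to_numeric(enriched["severity_score"], errors="coerce")
enriched["risk_profile"] = enriched["risk_profile"].str.lower()

# Keep only rows with a valid 1-10 score
valid = enriched["severity_score"].between(1, 10)
print(enriched[valid])
```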

5. Advanced Analysis of Enriched Data

def analyze_enriched(self):
    enriched_pai_df = pai.DataFrame(self.enriched_df)

    questions = [
        "What's the distribution of severity scores across the dataset?",
        "How do risk profiles correlate with actual heart disease cases?",
        "Which age category has the highest proportion of high-risk profiles?"
    ]

    insights = {}
    for question in questions:
        response = enriched_pai_df.chat(question)
        insights[question] = str(response)

    with open("enriched_analysis_insights.json", "w") as f:
        json.dump(insights, f, indent=2)

    return insights

After enriching our data, we can explore the new features to validate our AI-generated insights. This step helps us understand whether our enrichment strategy is working effectively and reveals new patterns in the data.
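A simple way to check the enrichment, alongside the conversational queries, is to cross-tabulate the AI-generated `risk_profile` against the actual label. A sketch on a hypothetical enriched sample (the `target` column name is an assumption):

```python
import pandas as pd

# Hypothetical enriched sample: does risk_profile line up with the label?
enriched = pd.DataFrame({
    "risk_profile": ["high-risk", "low-risk", "high-risk", "moderate-risk", "low-risk"],
    "target":       [1, 0, 1, 1, 0],
})

ct = pd.crosstab(enriched["risk_profile"], enriched["target"])
print(ct)

# Share of high-risk-labeled patients who actually have heart disease
precision = ct.loc["high-risk", 1] / ct.loc["high-risk"].sum()
```

On real data, a low value here would suggest the Map prompt needs tightening before the risk profiles are shared with healthcare teams.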

6. Results

def save_results(self):
    print('saving results')
    self.enriched_df.to_csv("enriched_heart_data.csv", index=False)
    self.high_risk_df.to_csv("high_risk_patients.csv", index=False)

We save both the enriched dataset and the filtered high-risk patients for further analysis or use in downstream applications.

7. Putting It All Together

def main():
    analyzer = HeartDiseaseAnalyzer()
    analyzer.load_data()
    analyzer.explore_data()
    analyzer.enrich_data()
    analyzer.analyze_enriched()
    analyzer.save_results()

if __name__ == "__main__":
    main()

Key Benefits of This Approach

PandasAI Advantages:

  • Non-technical stakeholders can explore data using natural language

  • Rapid prototyping without complex pandas code

  • AI can identify patterns that might be missed

  • Automatic visualization generation

Datatune Advantages:

  • AI understands domain context for better feature engineering

  • Natural language expressions for complex data operations

  • Built on Dask for scalable processing

  • Reproducible transformations with clear prompts

Conclusion

The combination of PandasAI and Datatune represents a shift toward more intuitive data analysis: it opens data exploration to both technical and non-technical teams while helping generate insights faster and in greater depth.

This approach is particularly valuable in domains such as healthcare and finance, where AI models can incorporate domain knowledge that would be difficult to encode in traditional rule-based systems.