<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Datatune]]></title><description><![CDATA[Datatune]]></description><link>https://blog.datatune.ai</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1749613359401/877870dc-1d60-4a8f-82de-b68f59921bfb.png</url><title>Datatune</title><link>https://blog.datatune.ai</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 21:14:19 GMT</lastBuildDate><atom:link href="https://blog.datatune.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How to make LLMs work on large amounts of data]]></title><description><![CDATA[Text to SQL tools have largely dominated the market of applying Intelligence over large amounts of data. However, with the advent of LLMs, this became a task dominated by several other tech, including RAG, Coding/SQL agents, etc.
One major issue with...]]></description><link>https://blog.datatune.ai/how-to-make-llms-work-on-large-amounts-of-data</link><guid isPermaLink="true">https://blog.datatune.ai/how-to-make-llms-work-on-large-amounts-of-data</guid><category><![CDATA[cursor]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[llm]]></category><category><![CDATA[AI]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Sat, 17 Jan 2026 01:24:25 GMT</pubDate><content:encoded><![CDATA[<p>Text-to-SQL tools long dominated the market for applying intelligence over large amounts of data. With the advent of LLMs, however, the task is now shared by several other technologies, including RAG and coding/SQL agents.</p>
<p>One major issue with this is that LLMs cannot actually see the data; they only receive a rough abstraction of it, such as summaries, samples, schema descriptions, or partial slices generated by another system.</p>
<p>What happens when you have a large number of rows that need to be processed and fed into an LLM?</p>
<p>Let's see how we can tackle this using Datatune: <a target="_blank" href="https://github.com/vitalops/datatune">https://github.com/vitalops/datatune</a></p>
<p><strong>The Context Length Problem</strong></p>
<p>LLM context windows keep growing. However, even at the current pace, a 100M-token context model is no match for the data in an average user's database.</p>
<p>This means the data that needs to be transformed can be several orders of magnitude larger than an LLM's context length.</p>
<p>Consider a mid-sized enterprise with the following, fairly typical setup:</p>
<ul>
<li><p>10 million rows in a transactional table</p>
</li>
<li><p>20 columns per row</p>
</li>
<li><p>Average 50 characters per column (IDs, text, timestamps, codes)</p>
</li>
</ul>
<p>That’s:</p>
<p>10,000,000 rows × 20 columns × 50 characters = 10,000,000,000 characters</p>
<p>Even with aggressive tokenization (≈ 4 characters per token):</p>
<p>10,000,000,000 ÷ 4 ≈ 2.5 billion tokens</p>
<p>Now compare this with an extremely optimistic LLM context window of 100 million tokens.</p>
<p>That single table alone is 25× larger than the model’s entire context.</p>
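<p>The arithmetic above is easy to sanity-check with a few lines of Python (the figures are the same illustrative estimates used here, not measurements):</p>
<pre><code class="lang-python"># Back-of-envelope estimate of the table's size in tokens
rows = 10_000_000
cols = 20
chars_per_cell = 50

total_chars = rows * cols * chars_per_cell   # 10 billion characters
total_tokens = total_chars // 4              # ~4 characters per token
context_window = 100_000_000                 # optimistic 100M-token context

print(total_tokens)                    # 2.5 billion tokens
print(total_tokens // context_window)  # 25x the context window
</code></pre>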
<p><strong>Solving Large Scale Data processing using Datatune</strong></p>
<p>With Datatune, users can give LLMs full access to the data, with the help of batch processing.</p>
<p>Each row of data is combined with the input prompt and sent to the LLM in a batch, and this continues until all batches have been processed. Datatune uses Dask's parallel processing to split the data into partitions and dispatch batches to the LLM in parallel.</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t1k3dnzw5qdjbqsxbxim.png" alt="Datatune Batch Processing" /></p>
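<p>Conceptually, the batching loop can be sketched in a few lines of plain Python. This is illustrative only, not Datatune's actual implementation; in Datatune, Dask partitions are what get dispatched in parallel:</p>
<pre><code class="lang-python">def make_batches(rows, batch_size):
    """Split rows into fixed-size batches so each request fits the context window."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def process(rows, prompt, batch_size=100):
    results = []
    for batch in make_batches(rows, batch_size):
        # Placeholder for the LLM call: each batch is sent together with the prompt
        results.extend({"input": row, "prompt": prompt} for row in batch)
    return results

out = process([{"id": i} for i in range(250)], "transform this row")
# 250 rows processed across 3 batches of up to 100 rows each
</code></pre>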
<p><strong>Understanding Data Transformation Operations</strong></p>
<p>There are 4 first-order data transformation functions (also known as primitives): MAP, FILTER, EXPAND, and REDUCE.</p>
<p>Datatune is built on top of these primitives, and each of them can be performed with natural-language operations.</p>
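<p>In plain Python, the four primitives correspond to familiar operations. Here is a deterministic illustration with no LLM involved:</p>
<pre><code class="lang-python">rows = [{"name": "Laptop", "price": 1200}, {"name": "Mouse", "price": 25}]

# MAP: one row in, one row out (derive new fields)
mapped = [{**r, "name_upper": r["name"].upper()} for r in rows]

# FILTER: keep or drop whole rows based on a condition
filtered = [r for r in mapped if r["name_upper"].startswith("L")]

# EXPAND: one row in, several rows out
expanded = [{"name": r["name"], "letter": ch} for r in rows for ch in r["name"]]

# REDUCE: many rows in, one value out
total_price = sum(r["price"] for r in rows)
</code></pre>
<p>Datatune's contribution is letting an LLM supply the per-row logic in each of these, expressed as a prompt instead of a function.</p>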
<p>For example:</p>
<pre><code class="lang-python">mapped = dt.map(
    prompt="Extract categories from the description and name of the product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)
</code></pre>
<p>In the above example, a MAP operation uses a prompt to derive the output fields <code>Category</code> and <code>Subcategory</code> from the input fields <code>Description</code> and <code>Name</code>.</p>
<p>Datatune can also be used to chain multiple transformations together.</p>
<p>Here's another example where a MAP and a FILTER are used together:</p>
<pre><code class="lang-python"># First, extract sentiment and keywords from each review (MAP)
mapped = dt.map(
    prompt="Classify the sentiment and extract key topics from the review text.",
    input_fields=["review_text"],
    output_fields=["sentiment", "topics"]
)(llm, df)

# Then, keep only negative reviews for further analysis (FILTER)
filtered = dt.filter(
    prompt="Keep only rows where sentiment is negative."
)(llm, mapped)
</code></pre>
<p><strong>Datatune Agents</strong></p>
<p>Datatune has Agents, which let users run prompts without having to know which primitives to use. They are also helpful when a query is complex and requires multiple transformations chained together.</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rtmwbqrlwn0mcsxbtw6t.png" alt="Datatune Agents" /></p>
<p>Here's an example where a chained MAP and FILTER, like the pair above, is solved with just a single prompt to an Agent:</p>
<pre><code class="lang-python">df = agent.do(
    """
    From product name and description, extract Category and Subcategory.
    Then keep only products that belong to the Electronics category
    and have a price greater than 100.
    """,
    df
)
</code></pre>
<p>The Agent can also execute Python code alongside the row-level primitives (Map, Filter, etc.). This is especially useful for prompts that don't require row-level intelligence (for example, arithmetic on numerical columns), as the Agent can use Datatune's code generation capabilities to work on the data directly.</p>
<p><strong>Data Sources</strong></p>
<p>Datatune is designed to work with a wide variety of data sources, including DataFrames and databases. Through its Ibis integration, Datatune extends connectivity to databases such as DuckDB, Postgres, MySQL, etc.</p>
<p><strong>Contributing</strong></p>
<p>We're building Datatune in open source, and we would love your contributions!</p>
<p>Check out the GitHub repository here:</p>
<p>Repo URL: <a target="_blank" href="https://github.com/vitalops/datatune">https://github.com/vitalops/datatune</a></p>
]]></content:encoded></item><item><title><![CDATA[Simplify the job for your Data Teams using Datatune]]></title><description><![CDATA[Data Engineering is hard, and finding the right data engineers for building your data pipelines is even harder.
In every modern company, data flows from dozens of directions: product analytics, billing systems, internal APIs, marketing platforms, clo...]]></description><link>https://blog.datatune.ai/simplify-the-job-for-your-data-teams-using-datatune</link><guid isPermaLink="true">https://blog.datatune.ai/simplify-the-job-for-your-data-teams-using-datatune</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Thu, 20 Nov 2025 03:25:19 GMT</pubDate><content:encoded><![CDATA[<p>Data Engineering is hard, and finding the right data engineers for building your data pipelines is even harder.</p>
<p>In every modern company, data flows from dozens of directions: product analytics, billing systems, internal APIs, marketing platforms, cloud storage, user logs, and more.<br />All of this data is supposed to power dashboards, decision-making, and AI workflows, where non-technical teams query the data and view dashboards, and technical teams wrangle with data transformations.</p>
<p>There are tons of tools that attempt to solve this using Agents that <a target="_blank" href="https://www.k2view.com/blog/sql-agent-llm/">generate code, or SQL</a> that runs on your data. Another way to access the data would be to implement a <a target="_blank" href="https://www.datamanagementblog.com/query-rag-a-new-way-to-ground-llms-with-facts-and-create-powerful-data-agents/">RAG pipeline</a> on top of your data.</p>
<p>However, there is one problem: these solutions won’t fully understand the data.</p>
<p>Here’s an example:</p>
<p>Check out the following data with columns such as: Index, Customer ID, First Name, Last Name, Company, City, Country, Phone 1, Phone 2, Email, Subscription Date, Website</p>
<pre><code class="lang-plaintext">Index,Customer Id,First Name,Last Name,Company,City,Country,Phone 1,Phone 2,Email,Subscription Date,Website
1,DD37Cf93aecA6Dc,Sheryl,Baxter,Rasmussen Group,East Leonard,Chile,229.077.5154,397.884.0519x718,zunigavanessa@smith.info,2020-08-24,http://www.stephenson.com/
2,1Ef7b82A4CAAD10,Preston,Lozano,Vega-Gentry,East Jimmychester,Djibouti,5153435776,686-620-1820x944,vmata@colon.com,2021-04-23,http://www.hobbs.com/
3,6F94879bDAfE5a6,Roy,Berry,Murillo-Perry,Isabelborough,Antigua and Barbuda,+1-539-402-0259,(496)978-3969x58947,beckycarr@hogan.com,2020-03-25,http://www.lawrence.com/
4,5Cef8BFA16c5e3c,Linda,Olsen,"Dominguez, Mcmillan and Donovan",Bensonview,Dominican Republic,001-808-617-6467x12895,+1-813-324-8756,stanleyblackwell@benson.org,2020-06-02,http://www.good-lyons.com/
5,053d585Ab6b3159,Joanna,Bender,"Martin, Lang and Andrade",West Priscilla,Slovakia (Slovak Republic),001-234-203-0635x76146,001-199-446-3860x3486,colinalvarado@miles.net,2021-04-17,https://goodwin-ingram.com/
6,2d08FB17EE273F4,Aimee,Downs,Steele Group,Chavezborough,Bosnia and Herzegovina,(283)437-3886x88321,999-728-1637,louis27@gilbert.com,2020-02-25,http://www.berger.net/
7,EA4d384DfDbBf77,Darren,Peck,"Lester, Woodard and Mitchell",Lake Ana,Pitcairn Islands,(496)452-6181x3291,+1-247-266-0963x4995,tgates@cantrell.com,2021-08-24,https://www.le.com/
8,0e04AFde9f225dE,Brett,Mullen,"Sanford, Davenport and Giles",Kimport,Bulgaria,001-583-352-7197x297,001-333-145-0369,asnow@colon.com,2021-04-12,https://hammond-ramsey.com/
9,C2dE4dEEc489ae0,Sheryl,Meyers,Browning-Simon,Robersonstad,Cyprus,854-138-4911x5772,+1-448-910-2276x729,mariokhan@ryan-pope.org,2020-01-13,https://www.bullock.net/
10,8C2811a503C7c5a,Michelle,Gallagher,Beck-Hendrix,Elaineberg,Timor-Leste,739.218.2516x459,001-054-401-0347x617,mdyer@escobar.net,2021-11-08,https://arias.com/
</code></pre>
<p>We will try to apply some filters and also anonymise the personally identifiable information of only women in the data.</p>
<pre><code class="lang-python">prompt = '''
Filter location for the American Continent.
Anonymise personally identifiable information of only women in the data by marking them as anonymised
'''
</code></pre>
<p>And this is what the data should look like after the two operations mentioned in the prompt:</p>
<pre><code class="lang-plaintext">Index,Customer Id,First Name,Last Name,Company,City,Country,Phone 1,Phone 2,Email,Subscription Date,Website,Is_American
0,1,DD37Cf93aecA6Dc,ANONYMIZED,ANONYMIZED,Rasmussen Group,East Leonard,Chile,ANONYMIZED,ANONYMIZED,ANONYMIZED,2020-08-24,http://www.stephenson.com/,True
2,3,6F94879bDAfE5a6,Roy,Berry,Murillo-Perry,Isabelborough,Antigua and Barbuda,+1-539-402-0259,(496)978-3969x58947,beckycarr@hogan.com,2020-03-25,http://www.lawrence.com/,True
3,4,5Cef8BFA16c5e3c,ANONYMIZED,ANONYMIZED,"Dominguez, Mcmillan and Donovan",Bensonview,Dominican Republic,ANONYMIZED,ANONYMIZED,ANONYMIZED,2020-06-02,http://www.good-lyons.com/,True
</code></pre>
<p>Both operations here require understanding each row of the data: filtering by country requires the pipeline to look at the address fields, and anonymising female names requires classifying which names in the data are female.</p>
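<p>To make the row-level logic concrete, here is a deterministic stand-in in plain Python. The hardcoded continent and name lookups are illustrative stand-ins for the judgment an LLM supplies, and the sample rows are trimmed to a few fields:</p>
<pre><code class="lang-python">AMERICAS = {"Chile", "Antigua and Barbuda", "Dominican Republic"}
FEMALE_NAMES = {"Sheryl", "Linda", "Joanna"}   # an LLM would classify names instead
PII_FIELDS = ["First Name", "Last Name", "Phone 1", "Phone 2", "Email"]

def transform(rows):
    out = []
    for row in rows:
        # FILTER: keep only rows from the American continent
        if row["Country"] not in AMERICAS:
            continue
        row = dict(row)
        # MAP: anonymise PII for rows classified as female
        if row["First Name"] in FEMALE_NAMES:
            for field in PII_FIELDS:
                row[field] = "ANONYMIZED"
        out.append(row)
    return out

rows = [
    {"First Name": "Sheryl", "Last Name": "Baxter", "Country": "Chile",
     "Phone 1": "229.077.5154", "Phone 2": "397.884.0519", "Email": "zunigavanessa@smith.info"},
    {"First Name": "Roy", "Last Name": "Berry", "Country": "Antigua and Barbuda",
     "Phone 1": "+1-539-402-0259", "Phone 2": "(496)978-3969", "Email": "beckycarr@hogan.com"},
    {"First Name": "Preston", "Last Name": "Lozano", "Country": "Djibouti",
     "Phone 1": "5153435776", "Phone 2": "686-620-1820", "Email": "vmata@colon.com"},
]
result = transform(rows)
</code></pre>
<p>An LLM-driven pipeline replaces both hardcoded sets with per-row reasoning supplied by the model.</p>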
<p>Traditional Code/SQL agents will not understand each row of the data, and RAG is not explicitly equipped to work with such data structures and respond with the same structure.</p>
<p>This is where Datatune can help you better.</p>
<p>Datatune’s agents automatically pick the right operations for the transformation, feed the LLM the full data, and handle rate limits and context-window issues for you, all in a scalable fashion!</p>
<p>This means a regular software engineer who is new to data engineering can easily take on this job, using Datatune Agents to build agentic software.</p>
<p>Here’s the full code example to set up Datatune Agents and perform the example above:  </p>
<p>Example: <a target="_blank" href="https://github.com/vitalops/datatune/blob/main/examples/data_anonymization.ipynb">https://github.com/vitalops/datatune/blob/main/examples/data_anonymization.ipynb</a></p>
]]></content:encoded></item><item><title><![CDATA[ScrapeGraphAI + Datatune, scrape the web, transform the data with Datatune]]></title><description><![CDATA[I was browsing Amazon for a new laptop, getting overwhelmed by the hundreds of options and confusing specifications. After spending two hours manually copying specs into a spreadsheet, I decided there had to be a better approach. That's when I rememb...]]></description><link>https://blog.datatune.ai/scrapegraphai-datatune-scrape-the-web-transform-the-data-with-datatune</link><guid isPermaLink="true">https://blog.datatune.ai/scrapegraphai-datatune-scrape-the-web-transform-the-data-with-datatune</guid><category><![CDATA[web scraping]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[llm]]></category><category><![CDATA[data extraction]]></category><category><![CDATA[data transformation]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Sun, 20 Jul 2025 23:12:52 GMT</pubDate><content:encoded><![CDATA[<p>I was browsing Amazon for a new laptop, getting overwhelmed by the hundreds of options and confusing specifications. After spending two hours manually copying specs into a spreadsheet, I decided there had to be a better approach. That's when I remembered reading about ScrapeGraphAI and thought it might be worth trying.</p>
<h2 id="heading-the-problem-with-manual-research">The Problem with Manual Research</h2>
<p>Shopping for laptops online presents several challenges:</p>
<ul>
<li><p>Hundreds of models with varying specifications</p>
</li>
<li><p>Prices that change frequently</p>
</li>
<li><p>Reviews are scattered across different sections</p>
</li>
<li><p>Difficult comparison between similar models</p>
</li>
</ul>
<p>Like many people, I started by opening multiple browser tabs and manually collecting information. This process was time-consuming and prone to errors.</p>
<h2 id="heading-what-is-scrapegraphai">What is ScrapeGraphAI?</h2>
<p><a target="_blank" href="https://scrapegraphai.com/">ScrapeGraphAI</a> is a Python library that uses LLMs to extract data from websites and documents. Instead of writing complex scraping code, you describe what information you want in plain language, and the library handles the extraction.</p>
<p>Key features include:</p>
<ul>
<li><p><strong>Natural language instructions</strong>: No need for complex CSS selectors or XPath</p>
</li>
<li><p><strong>Multiple LLM support</strong>: Works with GPT, Gemini, Groq, Azure, and local models via Ollama</p>
</li>
<li><p><strong>Flexible formats</strong>: Handles XML, HTML, JSON, and Markdown documents</p>
</li>
</ul>
<h2 id="heading-using-scrapegraphai-for-laptop-research">Using ScrapeGraphAI for Laptop Research</h2>
<p>Here's how I used it to extract laptop information from Amazon:</p>
<h3 id="heading-setup">Setup</h3>
<pre><code class="lang-bash">pip install scrapegraphai
playwright install
</code></pre>
<h3 id="heading-basic-implementation">Basic Implementation</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> scrapegraphai.graphs <span class="hljs-keyword">import</span> SmartScraperGraph

graph_config = {
    <span class="hljs-string">"llm"</span>: {
        <span class="hljs-string">"model"</span>: <span class="hljs-string">"gpt-3.5-turbo"</span>,
        <span class="hljs-string">"api_key"</span>: <span class="hljs-string">"YOUR_API_KEY"</span>
    },
    <span class="hljs-string">"verbose"</span>: <span class="hljs-literal">True</span>,
    <span class="hljs-string">"headless"</span>: <span class="hljs-literal">False</span>
}

smart_scraper = SmartScraperGraph(
    prompt=<span class="hljs-string">"Extract laptop name, price, rating, key specifications, and availability from this Amazon search page"</span>,
    source=<span class="hljs-string">"https://amazon.com/s?k=laptops+under+1500"</span>,
    config=graph_config
)

results = smart_scraper.run()
</code></pre>
<h3 id="heading-processing-the-output-with-datatune">Processing the Output with Datatune</h3>
<p>Once ScrapeGraphAI extracts the data, you can use <a target="_blank" href="https://github.com/vitalops/datatune">datatune</a> to clean and filter the results.</p>
<p>Install datatune using pip:</p>
<pre><code class="lang-bash">pip install datatune
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> dask.dataframe <span class="hljs-keyword">as</span> dd
<span class="hljs-keyword">import</span> datatune <span class="hljs-keyword">as</span> dt
<span class="hljs-keyword">from</span> datatune.llm.llm <span class="hljs-keyword">import</span> OpenAI
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

os.environ[<span class="hljs-string">"OPENAI_API_KEY"</span>] = <span class="hljs-string">"your-openai-api-key"</span>
llm = OpenAI(model_name=<span class="hljs-string">"gpt-3.5-turbo"</span>, tpm=<span class="hljs-number">200000</span>, rpm=<span class="hljs-number">50</span>)

<span class="hljs-comment"># Convert ScrapeGraphAI JSON output to DataFrame</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'smartscraper-2025-07-19.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
    scrape_data = json.load(f)

<span class="hljs-comment"># Extract laptops array and convert to DataFrame</span>
laptops_df = pd.DataFrame(scrape_data[<span class="hljs-string">'result'</span>][<span class="hljs-string">'laptops'</span>])
df = dd.from_pandas(laptops_df, npartitions=<span class="hljs-number">2</span>)

<span class="hljs-comment"># Map operation to standardize and enrich data</span>
mapped = dt.Map(
    prompt=<span class="hljs-string">"Standardize the processor name and categorize it as budget, mid-range, or high-end based on performance"</span>,
    output_fields=[<span class="hljs-string">"processor_category"</span>, <span class="hljs-string">"standardized_processor"</span>],
    input_fields=[<span class="hljs-string">"processor"</span>]
)(llm, df)

<span class="hljs-comment"># Filter for laptops with 16GB RAM and reasonable pricing</span>
filtered = dt.Filter(
    prompt=<span class="hljs-string">"Keep only laptops with 16GB RAM, price between $800-$2000, and battery life over 8 hours"</span>,
    input_fields=[<span class="hljs-string">"ram"</span>, <span class="hljs-string">"price"</span>, <span class="hljs-string">"batteryLife"</span>]
)(llm, mapped)

<span class="hljs-comment"># Additional mapping for value analysis</span>
value_mapped = dt.Map(
    prompt=<span class="hljs-string">"Calculate value score based on price, specs, and battery life. Rate as excellent, good, fair, or poor value"</span>,
    output_fields=[<span class="hljs-string">"value_rating"</span>, <span class="hljs-string">"value_score"</span>],
    input_fields=[<span class="hljs-string">"price"</span>, <span class="hljs-string">"ram"</span>, <span class="hljs-string">"ssd"</span>, <span class="hljs-string">"processor"</span>, <span class="hljs-string">"batteryLife"</span>, <span class="hljs-string">"weight"</span>]
)(llm, filtered)

<span class="hljs-comment"># Final cleanup and export</span>
result = dt.finalize(value_mapped)
final_df = result.compute()
final_df.to_csv(<span class="hljs-string">"filtered_laptops.csv"</span>, index=<span class="hljs-literal">False</span>)

print(<span class="hljs-string">f"Found <span class="hljs-subst">{len(final_df)}</span> laptops matching criteria"</span>)
print(final_df[[<span class="hljs-string">'brand'</span>, <span class="hljs-string">'model'</span>, <span class="hljs-string">'price'</span>, <span class="hljs-string">'value_rating'</span>]].head())
</code></pre>
<h3 id="heading-alternative-processing-approach">Alternative Processing Approach</h3>
<p>For more specific filtering based on use cases:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Filter for different use cases</span>
productivity_laptops = dt.Filter(
    prompt=<span class="hljs-string">"Keep laptops suitable for productivity work: good battery life, lightweight, reliable processor"</span>,
    input_fields=[<span class="hljs-string">"batteryLife"</span>, <span class="hljs-string">"weight"</span>, <span class="hljs-string">"processor"</span>, <span class="hljs-string">"ram"</span>]
)(llm, df)

gaming_laptops = dt.Filter(
    prompt=<span class="hljs-string">"Keep gaming laptops: powerful processor, dedicated graphics implied by model name, adequate RAM"</span>,
    input_fields=[<span class="hljs-string">"processor"</span>, <span class="hljs-string">"model"</span>, <span class="hljs-string">"ram"</span>, <span class="hljs-string">"price"</span>]
)(llm, df)

budget_laptops = dt.Filter(
    prompt=<span class="hljs-string">"Keep budget-friendly laptops under $1000 with decent specifications"</span>,
    input_fields=[<span class="hljs-string">"price"</span>, <span class="hljs-string">"ram"</span>, <span class="hljs-string">"processor"</span>, <span class="hljs-string">"ssd"</span>]
)(llm, df)

<span class="hljs-comment"># Process each category</span>
categories = {
    <span class="hljs-string">"productivity"</span>: productivity_laptops,
    <span class="hljs-string">"gaming"</span>: gaming_laptops, 
    <span class="hljs-string">"budget"</span>: budget_laptops
}

<span class="hljs-keyword">for</span> category, filtered_df <span class="hljs-keyword">in</span> categories.items():
    <span class="hljs-comment"># Add category information</span>
    categorized = dt.Map(
        prompt=<span class="hljs-string">f"Add category '<span class="hljs-subst">{category}</span>' and recommend this laptop with brief reasoning"</span>,
        output_fields=[<span class="hljs-string">"category"</span>, <span class="hljs-string">"recommendation_reason"</span>],
        input_fields=[<span class="hljs-string">"brand"</span>, <span class="hljs-string">"model"</span>, <span class="hljs-string">"price"</span>, <span class="hljs-string">"processor"</span>, <span class="hljs-string">"ram"</span>]
    )(llm, filtered_df)

    final_result = dt.finalize(categorized)
    final_result.compute().to_csv(<span class="hljs-string">f"<span class="hljs-subst">{category}</span>_laptops.csv"</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<h2 id="heading-final-output">Final Output</h2>
<p>After running the complete pipeline (ScrapeGraphAI extraction + datatune processing), here's what the final dataset looks like:</p>
<pre><code class="lang-plaintext">brand,model,price,ram,ssd,processor,screenSize,batteryLife,weight,releaseYear,processor_category,standardized_processor,value_rating,value_score,category,recommendation_reason
Apple,MacBook Pro,1299,16,512,Apple M1,13.3,18,1.4,2020,high-end,Apple M1 8-core,excellent,9.2,productivity,"Outstanding battery life and performance for professional work, lightweight design ideal for mobile productivity"
Lenovo,ThinkPad X1 Carbon,1399,16,512,Intel Core i7,14,15,1.1,2021,high-end,Intel Core i7-1165G7,good,8.1,productivity,"Ultra-lightweight business laptop with excellent build quality and long battery life, perfect for professionals"
ASUS,TUF Gaming A15,999,8,512,AMD Ryzen 5,15.6,10,2.2,2021,mid-range,AMD Ryzen 5 4600H,good,7.8,gaming,"Solid gaming performance at affordable price point, good processor and adequate RAM for modern games"
Acer,Aspire 5,599,8,256,Intel Core i5,15.6,9,1.75,2021,mid-range,Intel Core i5-1135G7,excellent,8.9,budget,"Outstanding value for money with solid specs for everyday computing, good balance of price and performance"
Microsoft,Surface Laptop 4,999,8,256,AMD Ryzen 5,13.5,19,1.27,2021,mid-range,AMD Ryzen 5 4680U,good,8.0,productivity,"Premium build quality with exceptional battery life, ideal for students and professionals who value portability"
</code></pre>
<h2 id="heading-available-pipeline-types">Available Pipeline Types</h2>
<p>ScrapeGraphAI offers several scraping approaches:</p>
<ul>
<li><p><strong>SmartScraperGraph</strong>: Single-page extraction with user prompts</p>
</li>
<li><p><strong>SearchGraph</strong>: Multi-page scraping from search results</p>
</li>
<li><p><strong>SpeechGraph</strong>: Converts web content to audio files</p>
</li>
<li><p><strong>ScriptCreatorGraph</strong>: Generates Python scripts for extracted data</p>
</li>
<li><p><strong>SmartScraperMultiGraph</strong>: Handles multiple pages simultaneously</p>
</li>
</ul>
<h2 id="heading-practical-applications">Practical Applications</h2>
<p>Beyond laptop shopping, this approach works for:</p>
<p><strong>E-commerce and Retail</strong></p>
<ul>
<li><p>Price monitoring across competitors</p>
</li>
<li><p>Product availability tracking</p>
</li>
<li><p>Review analysis</p>
</li>
<li><p>Market trend identification</p>
</li>
</ul>
<p><strong>Business Intelligence</strong></p>
<ul>
<li><p>Lead generation from directories</p>
</li>
<li><p>Competitor analysis</p>
</li>
<li><p>Market research automation</p>
</li>
<li><p>Social media monitoring</p>
</li>
</ul>
<p><strong>Research and Academia</strong></p>
<ul>
<li><p>Academic paper data extraction</p>
</li>
<li><p>Survey data collection</p>
</li>
<li><p>Content analysis</p>
</li>
<li><p>Dataset creation</p>
</li>
</ul>
<p><strong>Financial Services</strong></p>
<ul>
<li><p>Stock price tracking</p>
</li>
<li><p>Financial news monitoring</p>
</li>
<li><p>Economic indicator collection</p>
</li>
<li><p>Risk assessment data</p>
</li>
</ul>
<h2 id="heading-getting-started">Getting Started</h2>
<p>To try ScrapeGraphAI yourself:</p>
<ol>
<li><strong>Installation</strong></li>
</ol>
<pre><code class="lang-bash">pip install scrapegraphai
playwright install
</code></pre>
<ol start="2">
<li><strong>Basic Usage</strong></li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> scrapegraphai.graphs <span class="hljs-keyword">import</span> SmartScraperGraph

config = {
    <span class="hljs-string">"llm"</span>: {
        <span class="hljs-string">"model"</span>: <span class="hljs-string">"gpt-3.5-turbo"</span>, 
        <span class="hljs-string">"api_key"</span>: <span class="hljs-string">"your-api-key"</span>
    }
}

scraper = SmartScraperGraph(
    prompt=<span class="hljs-string">"Extract product names and prices"</span>,
    source=<span class="hljs-string">"https://example-shop.com"</span>,
    config=config
)

data = scraper.run()
print(data)
</code></pre>
<ol start="3">
<li><strong>Advanced Features</strong></li>
</ol>
<ul>
<li><p>Use local models with Ollama for privacy</p>
</li>
<li><p>Handle JavaScript-heavy sites with headless browsers</p>
</li>
<li><p>Process multiple pages simultaneously</p>
</li>
<li><p>Generate audio summaries of web content</p>
</li>
</ul>
<h2 id="heading-results-and-considerations">Results and Considerations</h2>
<p>In my laptop research, ScrapeGraphAI reduced data collection time from hours to minutes. The structured JSON output made it easy to compare options and identify the best value laptops in my budget range.  </p>
<p>You can check out more ScrapeGraphAI examples here: <a target="_blank" href="https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py/examples/sync">https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py/examples/sync</a></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>ScrapeGraphAI offers a practical solution for automated data extraction without requiring extensive programming knowledge. It significantly reduces the manual effort involved in gathering structured data from websites.</p>
<p>For anyone regularly collecting data from websites, whether for research, business intelligence, or personal projects, it's worth exploring. The natural language interface makes it accessible to non-programmers, while the flexibility supports more complex use cases.</p>
<p>The combination with datatune for post-processing provides a complete pipeline from raw web data to cleaned, categorized datasets ready for analysis.</p>
<h2 id="heading-give-us-a-star">Give us a star!</h2>
<p>Datatune: <a target="_blank" href="https://github.com/vitalops/datatune">https://github.com/vitalops/datatune</a></p>
<p>ScrapeGraphAI: <a target="_blank" href="https://github.com/ScrapeGraphAI/scrapegraph-sdk">https://github.com/ScrapeGraphAI/scrapegraph-sdk</a></p>
]]></content:encoded></item><item><title><![CDATA[Explore and Transform Your Data: A PandasAI + Datatune Tutorial]]></title><description><![CDATA[Heart Disease Risk Analysis with PandasAI and Datatune
Data analysis workflows have changed significantly with the introduction of Large Language Models. Instead of writing complex pandas operations, we can now use natural language to explore data an...]]></description><link>https://blog.datatune.ai/explore-and-transform-your-data-a-pandasai-datatune-tutorial</link><guid isPermaLink="true">https://blog.datatune.ai/explore-and-transform-your-data-a-pandasai-datatune-tutorial</guid><category><![CDATA[Data analysis, Data transformation, Natural language, LLM, AI]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Tue, 08 Jul 2025 12:45:54 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-heart-disease-risk-analysis-with-pandasai-and-datatune">Heart Disease Risk Analysis with PandasAI and Datatune</h1>
<p>Data analysis workflows have changed significantly with the introduction of Large Language Models. Instead of writing complex pandas operations, we can now use natural language to explore data and perform intelligent transformations. This tutorial demonstrates how to build a heart disease risk analyzer using PandasAI for conversational data exploration and Datatune for AI-powered data enrichment.  </p>
<p>PandasAI Github: <a target="_blank" href="https://github.com/sinaptik-ai/pandas-ai/">https://github.com/sinaptik-ai/pandas-ai/</a></p>
<p>Datatune Github: <a target="_blank" href="https://github.com/vitalops/datatune">https://github.com/vitalops/datatune</a></p>
<h2 id="heading-what-well-build">What We'll Build</h2>
<p>Our heart disease analyzer will:</p>
<ul>
<li><p>Load medical data and explore it using natural language queries</p>
</li>
<li><p>Use AI to generate risk assessments and patient categories</p>
</li>
<li><p>Filter high-risk patients for targeted analysis</p>
</li>
<li><p>Save structured insights for healthcare teams</p>
</li>
</ul>
<h2 id="heading-prerequisites-and-setup">Prerequisites and Setup</h2>
<p>Install the required packages:</p>
<pre><code class="lang-bash">pip install pandas pandasai datatune dask python-dotenv
</code></pre>
<p>Create a <code>.env</code> file with your Azure OpenAI credentials:</p>
<pre><code class="lang-plaintext">AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_API_BASE=https://your-resource.openai.azure.com/
AZURE_API_VERSION=2025-01-01-preview
</code></pre>
<h2 id="heading-building-the-heart-disease-analyzer">Building the Heart Disease Analyzer</h2>
<h3 id="heading-1-initial-setup-and-configuration">1. Initial Setup and Configuration</h3>
<pre><code class="lang-python"><span class="hljs-comment">#!/usr/bin/env python3</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> pandasai <span class="hljs-keyword">as</span> pai
<span class="hljs-keyword">import</span> datatune <span class="hljs-keyword">as</span> dt
<span class="hljs-keyword">import</span> dask.dataframe <span class="hljs-keyword">as</span> dd
<span class="hljs-keyword">from</span> datatune.llm.llm <span class="hljs-keyword">import</span> Azure
<span class="hljs-keyword">from</span> pandasai_openai <span class="hljs-keyword">import</span> AzureOpenAI
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json

load_dotenv()

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">HeartDiseaseAnalyzer</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        api_key = os.getenv(<span class="hljs-string">"AZURE_OPENAI_API_KEY"</span>)
        api_base = os.getenv(<span class="hljs-string">"AZURE_API_BASE"</span>)
        api_version = os.getenv(<span class="hljs-string">"AZURE_API_VERSION"</span>, <span class="hljs-string">"2025-01-01-preview"</span>)

        self.llm = Azure(
            model_name=<span class="hljs-string">"gpt-4o-mini"</span>,
            api_key=api_key,
            api_base=api_base,
            api_version=api_version,
        )

        pandas_llm = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=api_base,
            api_version=api_version,
            deployment_name=<span class="hljs-string">'gpt-4o-mini'</span>
        )

        pai.config.set({<span class="hljs-string">"llm"</span>: pandas_llm})

        self.df = <span class="hljs-literal">None</span>
        self.pai_df = <span class="hljs-literal">None</span>
        self.enriched_df = <span class="hljs-literal">None</span>
        self.high_risk_df = <span class="hljs-literal">None</span>
        self.agent = <span class="hljs-literal">None</span>
</code></pre>
<p>This setup creates two separate LLM instances: one for Datatune transformations and another for PandasAI exploration. We are using Azure OpenAI instances for both tools.</p>
<h3 id="heading-2-data-loading-with-pandasai-integration">2. Data Loading with PandasAI Integration</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_data</span>(<span class="hljs-params">self, filepath=<span class="hljs-string">"heart.csv"</span></span>):</span>
    self.df = pd.read_csv(filepath)
    self.pai_df = pai.DataFrame(self.df)
    self.agent = pai.Agent(self.pai_df)
    <span class="hljs-keyword">return</span> self.df
</code></pre>
<p>We load data into both regular pandas and PandasAI DataFrames. The <code>pai.Agent</code> creates a conversational interface that allows us to ask questions about our data in plain English while maintaining access to traditional pandas operations.</p>
<h3 id="heading-3-conversational-data-exploration">3. Conversational Data Exploration</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">explore_data</span>(<span class="hljs-params">self</span>):</span>
    questions = [
        <span class="hljs-string">"What's the distribution of heart disease cases?"</span>,
        <span class="hljs-string">"What's the average age of patients with and without heart disease?"</span>,
        <span class="hljs-string">"Which chest pain types are most common in heart disease patients?"</span>,
        <span class="hljs-string">"What's the correlation between cholesterol levels and heart disease?"</span>
    ]

    insights = {}
    <span class="hljs-keyword">for</span> question <span class="hljs-keyword">in</span> questions:
        response = self.pai_df.chat(question)
        insights[question] = str(response)

    <span class="hljs-keyword">with</span> open(<span class="hljs-string">"exploration_insights.json"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
        json.dump(insights, f, indent=<span class="hljs-number">2</span>)

    <span class="hljs-keyword">return</span> insights
</code></pre>
<p>Instead of writing complex Python code, we ask direct questions about our data. PandasAI uses LLMs to understand the context and generate the appropriate statistical analysis. The results are saved as JSON for later review and reporting.</p>
<p>You can ask questions like:</p>
<ul>
<li><p>"What percentage of patients have high blood pressure?"</p>
</li>
<li><p>"Which features correlate most strongly with heart disease?"</p>
</li>
</ul>
<h3 id="heading-4-intelligent-data-enrichment-with-datatune">4. Intelligent Data Enrichment with Datatune</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">enrich_data</span>(<span class="hljs-params">self</span>):</span>
    dask_df = dd.from_pandas(self.df, npartitions=<span class="hljs-number">4</span>)

    risk_profiled = dt.Map(
        prompt=(
            <span class="hljs-string">"For each patient, analyze their health metrics and assign:\n"</span>
            <span class="hljs-string">"- `severity_score`: calculate a numeric score from 1-10 based on their combined risk factors (age, cholesterol, BP, etc.)\n"</span>
            <span class="hljs-string">"- `age_category`: categorize age into [young (&lt;40), middle-aged (40–60), senior (&gt;60)]\n"</span>
            <span class="hljs-string">"- `risk_profile`: based on their overall health indicators, assign [low-risk, moderate-risk, high-risk]\n"</span>
            <span class="hljs-string">"Consider all available health metrics for comprehensive assessment."</span>
        ),
        output_fields=[<span class="hljs-string">"severity_score"</span>, <span class="hljs-string">"age_category"</span>, <span class="hljs-string">"risk_profile"</span>]
    )(self.llm, dask_df)

    high_risk_patients = dt.Filter(
        prompt=(
            <span class="hljs-string">"Keep only patients who have either high severity scores (7+) or are classified as high-risk profile and are middle-aged."</span>
        )
    )(self.llm, risk_profiled)

    self.enriched_df = dt.finalize(risk_profiled.compute())
    self.high_risk_df = dt.finalize(high_risk_patients.compute())
    <span class="hljs-keyword">return</span> self.enriched_df
</code></pre>
<p>This is where Datatune shows its strength. The <code>Map</code> operation creates new columns (and can also replace existing columns) by analyzing multiple health metrics simultaneously. The AI understands context and generates appropriate risk assessments based on medical terminologies. The <code>Filter</code> operation uses natural language to apply complex filtering logic that would require contextual understanding of the data.</p>
<p>The use of Dask allows us to process larger datasets efficiently through distributed computing.</p>
<h3 id="heading-5-advanced-analysis-of-enriched-data">5. Advanced Analysis of Enriched Data</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">analyze_enriched</span>(<span class="hljs-params">self</span>):</span>
    enriched_pai_df = pai.DataFrame(self.enriched_df)

    questions = [
        <span class="hljs-string">"What's the distribution of severity scores across the dataset?"</span>,
        <span class="hljs-string">"How do risk profiles correlate with actual heart disease cases?"</span>,
        <span class="hljs-string">"Which age category has the highest proportion of high-risk profiles?"</span>
    ]

    insights = {}
    <span class="hljs-keyword">for</span> question <span class="hljs-keyword">in</span> questions:
        response = enriched_pai_df.chat(question)
        insights[question] = str(response)

    <span class="hljs-keyword">with</span> open(<span class="hljs-string">"enriched_analysis_insights.json"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
        json.dump(insights, f, indent=<span class="hljs-number">2</span>)

    <span class="hljs-keyword">return</span> insights
</code></pre>
<p>After enriching our data, we can explore the new features to validate our AI-generated insights. This step helps us understand whether our enrichment strategy is working effectively and reveals new patterns in the data.</p>
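<p>Beyond the chat-based questions, a plain-pandas cross-tabulation is a quick way to check whether the AI-assigned profiles line up with the ground-truth label. A minimal sketch on synthetic data (the real label column name depends on your dataset; the UCI heart data commonly uses <code>target</code>):</p>

```python
import pandas as pd

# Synthetic stand-in for the enriched DataFrame.
enriched = pd.DataFrame({
    "risk_profile": ["high-risk", "low-risk", "high-risk", "moderate-risk", "low-risk"],
    "target": [1, 0, 1, 0, 0],
})

# Rows: AI-assigned profile; columns: actual heart-disease label (0/1).
table = pd.crosstab(enriched["risk_profile"], enriched["target"])
print(table)
```

<p>If the enrichment is working, the high-risk rows should concentrate in the positive label column.</p>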
<h3 id="heading-6-results">6. Results</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">save_results</span>(<span class="hljs-params">self</span>):</span>
    print(<span class="hljs-string">'saving results'</span>)
    self.enriched_df.to_csv(<span class="hljs-string">"enriched_heart_data.csv"</span>, index=<span class="hljs-literal">False</span>)
    self.high_risk_df.to_csv(<span class="hljs-string">"high_risk_patients.csv"</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<p>We save both the enriched dataset and the filtered high-risk patients for further analysis or use in downstream applications.</p>
<h3 id="heading-7-putting-it-all-together">7. Putting It All Together</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    analyzer = HeartDiseaseAnalyzer()
    analyzer.load_data()
    analyzer.explore_data()
    analyzer.enrich_data()
    analyzer.analyze_enriched()
    analyzer.save_results()

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<h2 id="heading-key-benefits-of-this-approach">Key Benefits of This Approach</h2>
<p><strong>PandasAI Advantages:</strong></p>
<ul>
<li><p>Non-technical stakeholders can explore data using natural language</p>
</li>
<li><p>Rapid prototyping without complex pandas code</p>
</li>
<li><p>AI can identify patterns that might be missed</p>
</li>
<li><p>Automatic visualization generation</p>
</li>
</ul>
<p><strong>Datatune Advantages:</strong></p>
<ul>
<li><p>AI understands domain context for better feature engineering</p>
</li>
<li><p>Natural language expressions for complex data operations</p>
</li>
<li><p>Built on Dask for scalable processing</p>
</li>
<li><p>Reproducible transformations with clear prompts</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The combination of PandasAI and Datatune represents a shift toward more intuitive data analysis: it opens data exploration to both technical and non-technical teams while producing insights faster and in greater depth.</p>
<p>This approach is particularly valuable in domains such as healthcare and finance, where AI models can incorporate domain knowledge that would be difficult to encode in traditional rule-based systems.</p>
]]></content:encoded></item><item><title><![CDATA[Solving Complex Data pipelines with Composio + Datatune]]></title><description><![CDATA[Table of Contents:

The Data Integration Challenge

Composio : Integration Platform  for AI Agents & LLMs

Datatune: Perform transformations on your data with natural language

Real-World Example: Analyzing GitHub Issues

Summary


The Data Integrati...]]></description><link>https://blog.datatune.ai/solving-complex-data-pipelines-with-composio-datatune</link><guid isPermaLink="true">https://blog.datatune.ai/solving-complex-data-pipelines-with-composio-datatune</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Wed, 11 Jun 2025 03:34:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749614067713/7071d5b8-fafb-4a8f-8bc5-83dfe2126fe6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-table-of-contents"><strong>Table of Contents:</strong></h1>
<ul>
<li><p><em>The Data Integration Challenge</em></p>
</li>
<li><p><em>Composio: Integration Platform for AI Agents &amp; LLMs</em></p>
</li>
<li><p><em>Datatune: Perform transformations on your data with natural language</em></p>
</li>
<li><p><em>Real-World Example: Analyzing GitHub Issues</em></p>
</li>
<li><p><em>Summary</em></p>
</li>
</ul>
<h1 id="heading-the-data-integration-challenge"><strong>The Data Integration Challenge</strong></h1>
<p>Modern businesses rely on data scattered across dozens of platforms such as CRMs, project management tools, communication platforms, databases, and APIs. The traditional approach involves:</p>
<ul>
<li><p>Complex API integrations with different authentication methods</p>
</li>
<li><p>Constant maintenance as APIs change and evolve</p>
</li>
<li><p>Complicated code for data transformations that require semantic understanding of the context</p>
</li>
</ul>
<p>What if you could connect to any external service, pull data, and transform it using nothing but natural language? Let’s see how we can engineer a data pipeline using Composio and Datatune.</p>
<h1 id="heading-composio-integration-platform-for-ai-agents-amp-llms">Composio: Integration Platform for AI Agents &amp; LLMs</h1>
<p><a target="_blank" href="https://composio.dev/">Composio</a> eliminates integration complexity by providing:</p>
<ul>
<li><p>200+ pre-built integrations across every major platform (e.g. Salesforce, GitHub, Slack, Google Sheets, Notion)</p>
</li>
<li><p>One-click authentication handling OAuth, API keys, and complex flows</p>
</li>
<li><p>Unified interface that abstracts away API differences</p>
</li>
<li><p>Built for AI workflows with structured, consistent outputs.</p>
</li>
</ul>
<p>Composio also lets you connect MCPs to your AI agents, so in very few steps you can avoid painful API orchestration, redundant boilerplate code, and platform-specific edge cases.</p>
<h1 id="heading-datatune-perform-transformations-on-your-data-with-natural-language"><strong>Datatune: Perform transformations on your data with natural language</strong></h1>
<p>One of the major complexities of data pipelines is transforming messy tabular data into clean, usable formats, especially when the transformation requires understanding the semantic meaning of the data and the task at hand.</p>
<p>Consider a sales spreadsheet with product names like “iPhone 15 Pro Max 256GB Blue”. Extracting just the color would normally require complex regex patterns to handle every variation. With <a target="_blank" href="https://github.com/vitalops/datatune">Datatune</a>, you simply say “Extract the color from product name” and it understands context automatically. You can then chain operations naturally: first extract colors and categories with Map, then Filter to “Keep only blue electronics,” and finally clean up the results.</p>
<p>This approach is powerful because each step builds on the previous one, letting you transform millions of rows by describing what you want rather than writing lengthy pandas or regex code to do the same job.</p>
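<p>For contrast, a hand-rolled version of the color extraction might look like the sketch below. It only handles colors you enumerate up front, which is exactly the maintenance burden the natural-language prompt avoids:</p>

```python
import pandas as pd

# Brittle regex alternative: every new color or naming variation needs a
# pattern update; the Datatune prompt handles such variation implicitly.
KNOWN_COLORS = ["Blue", "Black", "Silver", "Gold", "Titanium"]
pattern = r"\b(" + "|".join(KNOWN_COLORS) + r")\b"

products = pd.DataFrame({"product_name": [
    "iPhone 15 Pro Max 256GB Blue",
    "Galaxy S24 Ultra 512GB Black",
    "Pixel 9 128GB Obsidian",  # color missing from our list -> not extracted
]})
products["color"] = products["product_name"].str.extract(pattern, expand=False)
print(products["color"].tolist())  # ['Blue', 'Black', nan]
```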
<h1 id="heading-real-world-example-analyzing-github-issues"><strong>Real-World Example: Analyzing GitHub Issues</strong></h1>
<p>Let’s walk through a practical example where we analyze GitHub issues to help maintainers prioritize their work. We will fetch issues from <a target="_blank" href="https://github.com/pytorch/pytorch">pytorch</a>’s GitHub repository using Composio, then process the data with Datatune to find issues that could be “good first contributions for new developers”. Here’s how simple it becomes with Composio + Datatune:</p>
<h2 id="heading-install-dependencies"><strong>Install dependencies</strong></h2>
<p>Install both libraries (dask will be automatically installed with datatune) and dotenv for loading your environment variables:</p>
<pre><code class="lang-plaintext">pip install composio datatune python-dotenv
</code></pre>
<h2 id="heading-setup-and-configuration"><strong>Setup and Configuration</strong></h2>
<p>We need an API key to interact with Composio. Go to <a target="_blank" href="https://app.composio.dev/">https://app.composio.dev</a>, log in, and get your API key.</p>
<p>For using LLMs with Datatune, you can use OpenAI, local models via Ollama, or any other API provider such as Azure. For more info on using different providers, refer to this link: <a target="_blank" href="https://docs.datatune.ai/LLM.html">https://docs.datatune.ai/LLM.html</a></p>
<p>For the sake of this article, we will use Azure OpenAI as the provider.</p>
<p>Once you’re ready with all the credentials, create a .env file and add your environment variables like this:</p>
<pre><code class="lang-plaintext">COMPOSIO_API_KEY=your-composio-key
AZURE_OPENAI_API_KEY=your-key
AZURE_API_BASE=https://your-endpoint.openai.azure.com/
AZURE_API_VERSION=2024-02-01
</code></pre>
<p>Let’s import the libraries:</p>
<pre><code class="lang-plaintext">
import os
import pandas as pd
import dask.dataframe as dd
import datatune as dt
from composio import ComposioToolSet, App, Action
from datatune.core.map import Map
from datatune.core.filter import Filter
from datatune.llm.llm import Azure
from dotenv import load_dotenv
load_dotenv()


COMPOSIO_API_KEY = os.getenv("COMPOSIO_API_KEY")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
api_base = os.getenv("AZURE_API_BASE")
api_version = os.getenv("AZURE_API_VERSION", "2024-02-01")
</code></pre>
<h2 id="heading-connect-to-github-with-composio"><strong>Connect to GitHub with Composio</strong></h2>
<p>We will use Composio’s ComposioToolSet to connect to the GitHub repository of <a target="_blank" href="https://github.com/pytorch/pytorch">PyTorch</a>. Composio provides several actions for each integration; in our case, we use the GITHUB_LIST_REPOSITORY_ISSUES action, which returns the issue data we need via the following function.</p>
<p>Let’s get the issues from <a target="_blank" href="https://github.com/pytorch/pytorch">https://github.com/pytorch/pytorch</a>, so set repo_owner and repo_name both to ‘pytorch’.</p>
<pre><code class="lang-plaintext">def fetch_github_issues(toolset, repo_owner="pytorch", repo_name="pytorch", limit=30):
    result = toolset.execute_action(
        action=Action.GITHUB_LIST_REPOSITORY_ISSUES,
        params={
            "owner": repo_owner,
            "repo": repo_name,
            "state": "open",
            "per_page": limit
        }
    )

    # Extract issues data from result
    issues_data = []

    if isinstance(result, dict) and result.get('successful'):
        data = result.get('data', {})

        if isinstance(data, list):
            issues_data = data
        elif isinstance(data, dict):
            # Look for issues in common response patterns
            for key in ['details', 'items', 'data', 'issues', 'results']:
                if key in data and isinstance(data[key], list):
                    issues_data = data[key]
                    break

            # Check if it's a single issue object
            if not issues_data and 'number' in data and 'title' in data:
                issues_data = [data]

    elif isinstance(result, list):
        issues_data = result

    if not isinstance(issues_data, list):
        return pd.DataFrame()

    # Process issues into DataFrame
    processed_issues = []
    for i, issue in enumerate(issues_data):
        if i &gt;= limit:
            break

        if isinstance(issue, dict):
            processed_issues.append({
                "issue_number": issue.get("number"),
                "title": issue.get("title", ""),
                "issue_body": issue.get("body", "")[:500] if issue.get("body") else "",
                "state": issue.get("state", ""),
                "comments_count": issue.get("comments", 0),
                "labels": [label.get("name", "") for label in issue.get("labels", [])] if issue.get("labels") else [],
                "created_at": issue.get("created_at", ""),
                "updated_at": issue.get("updated_at", ""),
                "html_url": issue.get("html_url", "")
            })

    return pd.DataFrame(processed_issues)
</code></pre>
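<p>The unwrapping branches above can be exercised offline by factoring them into a helper and feeding it a mocked response (the payload shape below is hypothetical, mirroring the patterns the function checks):</p>

```python
def extract_issues(result):
    # Same unwrapping logic as fetch_github_issues, isolated so it can be
    # tested without a live Composio call.
    issues_data = []
    if isinstance(result, dict) and result.get("successful"):
        data = result.get("data", {})
        if isinstance(data, list):
            issues_data = data
        elif isinstance(data, dict):
            # Look for issues in common response patterns.
            for key in ["details", "items", "data", "issues", "results"]:
                if key in data and isinstance(data[key], list):
                    issues_data = data[key]
                    break
            # A single issue object comes back wrapped in a list.
            if not issues_data and "number" in data and "title" in data:
                issues_data = [data]
    elif isinstance(result, list):
        issues_data = result
    return issues_data if isinstance(issues_data, list) else []

# Mocked Composio-style response (hypothetical shape).
mock = {"successful": True, "data": {"details": [{"number": 1, "title": "Fix typo"}]}}
print(len(extract_issues(mock)))  # 1
```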
<h2 id="heading-transform-data-with-natural-language-using-datatune"><strong>Transform Data with Natural Language using Datatune</strong></h2>
<p>The result data from the above function contains the following columns: issue_number, title, issue_body, state, comments_count, labels, created_at, updated_at, html_url</p>
<p>Instead of writing complicated Python code to edit this data, we will simply use Datatune.</p>
<p>We will perform two major operations chained together.</p>
<ol>
<li><strong>Map Operation: To Replace values or Add new columns based on existing data</strong></li>
</ol>
<p>In our case, we will use the map operation to classify issues by severity level, estimated effort, and issue type (bug, feature, documentation, or other), and output the results into new columns.</p>
<p><strong>2. Filter Operation: Remove Specific Rows</strong></p>
<p>We will remove the rows that are not good first issues.</p>
<p>Let’s see how to write Datatune prompts for these operations and chain them. We will use gpt-4.1-mini as the LLM for both operations. Since Datatune uses Dask under the hood, we call the .compute() method on the Dask dataframe to trigger the data transformation. Finally, we apply dt.finalize() to clear the internal metadata created during the process.</p>
<pre><code class="lang-plaintext">def analyze_with_datatune(df):
    if df.empty:
        return pd.DataFrame()

    dask_df = dd.from_pandas(df, npartitions=1)

    llm = Azure(
        model_name="gpt-4.1-mini",
        api_key=api_key,
        api_base=api_base,
        api_version=api_version,
    )


    # Map operation: Analyze each issue
    mapped = Map(
        prompt="Based on the issue title, description, and labels, determine: 1) severity (high/medium/low) - consider critical bugs, memory leaks, crashes as high; 2) estimated effort to fix (high/medium/low); 3) issue type (bug/feature/documentation/other)",
        output_fields=["severity", "estimated_effort", "issue_type"]
    )(llm, dask_df)

    # Filter operation: Find good first issues
    good_first_issues = Filter(
        prompt="Keep issues that look like they could be good first contributions for new developers"
    )(llm, mapped)


    return dt.finalize(good_first_issues.compute())
</code></pre>
<p>Let’s wrap everything up and take a look at the full code:</p>
<pre><code class="lang-plaintext">import os
import pandas as pd
import dask.dataframe as dd
from composio import ComposioToolSet, App, Action
from datatune.core.map import Map
from datatune.core.filter import Filter
from datatune.llm.llm import Azure
import datatune as dt
from dotenv import load_dotenv
load_dotenv()

# Configuration
COMPOSIO_API_KEY = os.getenv("COMPOSIO_API_KEY")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
api_base = os.getenv("AZURE_API_BASE")
api_version = os.getenv("AZURE_API_VERSION", "2024-02-01")

def setup_composio():
    toolset = ComposioToolSet(api_key=COMPOSIO_API_KEY)
    return toolset

def fetch_github_issues(toolset, repo_owner="pytorch", repo_name="pytorch", limit=30):
    result = toolset.execute_action(
        action=Action.GITHUB_LIST_REPOSITORY_ISSUES,
        params={
            "owner": repo_owner,
            "repo": repo_name,
            "state": "open",
            "per_page": limit
        }
    )

    # Extract issues data from result
    issues_data = []

    if isinstance(result, dict) and result.get('successful'):
        data = result.get('data', {})

        if isinstance(data, list):
            issues_data = data
        elif isinstance(data, dict):
            # Look for issues in common response patterns
            for key in ['details', 'items', 'data', 'issues', 'results']:
                if key in data and isinstance(data[key], list):
                    issues_data = data[key]
                    break

            # Check if it's a single issue object
            if not issues_data and 'number' in data and 'title' in data:
                issues_data = [data]

    elif isinstance(result, list):
        issues_data = result

    if not isinstance(issues_data, list):
        return pd.DataFrame()

    # Process issues into DataFrame
    processed_issues = []
    for i, issue in enumerate(issues_data):
        if i &gt;= limit:
            break

        if isinstance(issue, dict):
            processed_issues.append({
                "issue_number": issue.get("number"),
                "title": issue.get("title", ""),
                "issue_body": issue.get("body", "")[:500] if issue.get("body") else "",
                "state": issue.get("state", ""),
                "comments_count": issue.get("comments", 0),
                "labels": [label.get("name", "") for label in issue.get("labels", [])] if issue.get("labels") else [],
                "created_at": issue.get("created_at", ""),
                "updated_at": issue.get("updated_at", ""),
                "html_url": issue.get("html_url", "")
            })

    return pd.DataFrame(processed_issues)

def analyze_with_datatune(df):
    if df.empty:
        return pd.DataFrame()

    dask_df = dd.from_pandas(df, npartitions=1)

    llm = Azure(
        model_name="gpt-4.1-mini",
        api_key=api_key,
        api_base=api_base,
        api_version=api_version,
    )


    # Map operation: Analyze each issue
    mapped = Map(
        prompt="Based on the issue title, description, and labels, determine: 1) severity (high/medium/low) - consider critical bugs, memory leaks, crashes as high; 2) estimated effort to fix (high/medium/low); 3) issue type (bug/feature/documentation/other)",
        output_fields=["severity", "estimated_effort", "issue_type"]
    )(llm, dask_df)

    # Filter operation: Find good first issues
    good_first_issues = Filter(
        prompt="Keep issues that look like they could be good first contributions for new developers"
    )(llm, mapped)

    final_df = good_first_issues.compute()
    return dt.finalize(final_df)


def main():
    toolset = setup_composio()
    issues_df = fetch_github_issues(toolset)

    good_first_issues = analyze_with_datatune(issues_df)
    if not good_first_issues.empty:
        good_first_issues.to_csv("good_first_issues.csv", index=False)
        print(f"  - good_first_issues.csv ({len(good_first_issues)} issues)")

if __name__ == "__main__":
    main()
</code></pre>
<p>The results in the good_first_issues.csv should look something like this:</p>
<pre><code class="lang-plaintext">issue_number,title,issue_body,state,comments_count,labels,created_at,updated_at,html_url,severity,estimated_effort,issue_type
45123,"Fix typo in torch.nn.functional documentation","There's a typo in the documentation for F.relu where 'activation' is misspelled as 'activaton'. This should be a simple fix...",open,2,"['good first issue', 'module: docs']",2025-01-15T14:22:31Z,2025-01-15T16:45:12Z,https://github.com/pytorch/pytorch/issues/45123,low,low,documentation
45067,"Add unit test for DataLoader pin_memory","The pin_memory functionality in DataLoader is missing unit tests. We need to add tests that verify tensors are properly pinned...",open,4,"['good first issue', 'module: tests', 'module: dataloader']",2025-01-14T09:18:55Z,2025-01-16T08:30:22Z,https://github.com/pytorch/pytorch/issues/45067,low,low,other
44982,"Update error message for mismatched tensor sizes","When tensors have mismatched sizes in operations, the error message could be clearer. Currently shows indices, but should show actual shapes...",open,1,"['good first issue', 'module: error messages']",2025-01-12T11:45:33Z,2025-01-13T10:12:44Z,https://github.com/pytorch/pytorch/issues/44982,low,low,feature
</code></pre>
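<p>One practical note when loading this CSV back: the labels column round-trips as the string representation of a Python list, so something like the sketch below (assuming the format shown above) recovers the actual lists:</p>

```python
import ast
import io

import pandas as pd

# Inline sample mimicking one row of good_first_issues.csv.
sample = (
    "issue_number,title,labels\n"
    "45123,Fix typo in docs,\"['good first issue', 'module: docs']\"\n"
)
df = pd.read_csv(io.StringIO(sample))

# CSV stored the list as its repr; ast.literal_eval parses it back to a list.
df["labels"] = df["labels"].apply(ast.literal_eval)
print(df["labels"].iloc[0])  # ['good first issue', 'module: docs']
```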
<h1 id="heading-summary"><strong>Summary</strong></h1>
<p>Using Composio and Datatune saves countless hours when engineering data pipelines: Composio abstracts away the integration architecture, while Datatune brings semantic understanding to the transformations performed on the data.</p>
]]></content:encoded></item></channel></rss>