<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Datatune]]></title><description><![CDATA[Datatune]]></description><link>https://blog.datatune.ai</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1749613359401/877870dc-1d60-4a8f-82de-b68f59921bfb.png</url><title>Datatune</title><link>https://blog.datatune.ai</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 21:14:19 GMT</lastBuildDate><atom:link href="https://blog.datatune.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How to make LLMs work on large amounts of data]]></title><description><![CDATA[Text to SQL tools have largely dominated the market of applying Intelligence over large amounts of data. However, with the advent of LLMs, this became a task dominated by several other tech, including RAG, Coding/SQL agents, etc.
One major issue with...]]></description><link>https://blog.datatune.ai/how-to-make-llms-work-on-large-amounts-of-data</link><guid isPermaLink="true">https://blog.datatune.ai/how-to-make-llms-work-on-large-amounts-of-data</guid><category><![CDATA[cursor]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[llm]]></category><category><![CDATA[AI]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Sat, 17 Jan 2026 01:24:25 GMT</pubDate><content:encoded><![CDATA[<p>Text-to-SQL tools long dominated the market for applying intelligence over large amounts of data. With the advent of LLMs, however, the task is now shared by several other technologies, including RAG and coding/SQL agents.</p>
<p>One major issue with this is that LLMs cannot actually see the data; they only receive a rough abstraction of it, such as summaries, samples, schema descriptions, or partial slices generated by another system.</p>
<p>What happens when you have a large number of rows that need to be processed and fed into an LLM?</p>
<p>Let's see how we can tackle this using Datatune: <a target="_blank" href="https://github.com/vitalops/datatune">https://github.com/vitalops/datatune</a></p>
<p><strong>The Context Length Problem</strong></p>
<p>LLM context windows keep growing. However, even at the current pace, a 100M-token context model is no match for the data in an average user's database.</p>
<p>This means the data that needs to be transformed can be several orders of magnitude larger than an LLM's context length.</p>
<p>Consider a mid-sized enterprise with the following, fairly typical setup:</p>
<ul>
<li><p>10 million rows in a transactional table</p>
</li>
<li><p>20 columns per row</p>
</li>
<li><p>Average 50 characters per column (IDs, text, timestamps, codes)</p>
</li>
</ul>
<p>That’s:</p>
<p>10,000,000 rows × 20 columns × 50 characters = 10,000,000,000 characters</p>
<p>Even with aggressive tokenization (≈ 4 characters per token):</p>
<p>10,000,000,000 ÷ 4 ≈ 2.5 billion tokens</p>
<p>Now compare this with an extremely optimistic LLM context window of 100 million tokens.</p>
<p>That single table alone is 25× larger than the model’s entire context.</p>
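<p>The arithmetic above is easy to sanity-check with a few lines of Python (the figures are the same illustrative estimates used here, not measurements):</p>
<pre><code class="lang-python"># Back-of-envelope estimate of the table's size in tokens
rows = 10_000_000
cols = 20
chars_per_cell = 50

total_chars = rows * cols * chars_per_cell   # 10 billion characters
total_tokens = total_chars // 4              # ~4 characters per token
context_window = 100_000_000                 # optimistic 100M-token context

print(total_tokens)                    # 2.5 billion tokens
print(total_tokens // context_window)  # 25x the context window
</code></pre>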
<p><strong>Solving Large Scale Data processing using Datatune</strong></p>
<p>With Datatune, users can give LLMs full access to the data, with the help of batch processing.</p>
<p>Each row of data is combined with the input prompt and sent to the LLM in a batch, and this continues until all batches have been processed. Datatune uses Dask's parallel processing to split the data into partitions and dispatch batches to the LLM in parallel.</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t1k3dnzw5qdjbqsxbxim.png" alt="Datatune Batch Processing" /></p>
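<p>Conceptually, the batching loop can be sketched in a few lines of plain Python. This is illustrative only, not Datatune's actual implementation; in Datatune, Dask partitions are what get dispatched in parallel:</p>
<pre><code class="lang-python">def make_batches(rows, batch_size):
    """Split rows into fixed-size batches so each request fits the context window."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def process(rows, prompt, batch_size=100):
    results = []
    for batch in make_batches(rows, batch_size):
        # Placeholder for the LLM call: each batch is sent together with the prompt
        results.extend({"input": row, "prompt": prompt} for row in batch)
    return results

out = process([{"id": i} for i in range(250)], "transform this row")
# 250 rows processed across 3 batches of up to 100 rows each
</code></pre>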
<p><strong>Understanding Data Transformation Operations</strong></p>
<p>There are 4 first-order data transformation functions (also known as primitives): MAP, FILTER, EXPAND, and REDUCE.</p>
<p>Datatune is built on top of these primitives, and each of them can be performed with natural-language operations.</p>
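<p>In plain Python, the four primitives correspond to familiar operations. Here is a deterministic illustration with no LLM involved:</p>
<pre><code class="lang-python">rows = [{"name": "Laptop", "price": 1200}, {"name": "Mouse", "price": 25}]

# MAP: one row in, one row out (derive new fields)
mapped = [{**r, "name_upper": r["name"].upper()} for r in rows]

# FILTER: keep or drop whole rows based on a condition
filtered = [r for r in mapped if r["name_upper"].startswith("L")]

# EXPAND: one row in, several rows out
expanded = [{"name": r["name"], "letter": ch} for r in rows for ch in r["name"]]

# REDUCE: many rows in, one value out
total_price = sum(r["price"] for r in rows)
</code></pre>
<p>Datatune's contribution is letting an LLM supply the per-row logic in each of these, expressed as a prompt instead of a function.</p>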
<p>For example:</p>
<pre><code class="lang-python">mapped = dt.map(
    prompt="Extract categories from the description and name of the product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)
</code></pre>
<p>In the above example, a MAP operation uses a prompt to derive the output fields <code>Category</code> and <code>Subcategory</code> from the input fields <code>Description</code> and <code>Name</code>.</p>
<p>Datatune can also be used to chain multiple transformations together.</p>
<p>Here's another example where a MAP and a FILTER are used together:</p>
<pre><code class="lang-python"># First, extract sentiment and keywords from each review (MAP)
mapped = dt.map(
    prompt="Classify the sentiment and extract key topics from the review text.",
    input_fields=["review_text"],
    output_fields=["sentiment", "topics"]
)(llm, df)

# Then, keep only negative reviews for further analysis (FILTER)
filtered = dt.filter(
    prompt="Keep only rows where sentiment is negative."
)(llm, mapped)
</code></pre>
<p><strong>Datatune Agents</strong></p>
<p>Datatune has Agents, which let users run prompts without having to know which primitives to use. They are also helpful when a query is complex and requires multiple transformations chained together.</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rtmwbqrlwn0mcsxbtw6t.png" alt="Datatune Agents" /></p>
<p>Here's an example where a chained MAP and FILTER, like the pair above, is solved with just a single prompt to an Agent:</p>
<pre><code class="lang-python">df = agent.do(
    """
    From product name and description, extract Category and Subcategory.
    Then keep only products that belong to the Electronics category
    and have a price greater than 100.
    """,
    df
)
</code></pre>
<p>The Agent can also execute Python code alongside the row-level primitives (Map, Filter, etc.). This is especially useful for prompts that don't require row-level intelligence (for example, arithmetic on numerical columns), as the Agent can use Datatune's code generation capabilities to work on the data directly.</p>
<p><strong>Data Sources</strong></p>
<p>Datatune is designed to work with a wide variety of data sources, including DataFrames and databases. Through its Ibis integration, Datatune extends connectivity to databases such as DuckDB, Postgres, MySQL, etc.</p>
<p><strong>Contributing</strong></p>
<p>We're building Datatune in open source, and we would love your contributions!</p>
<p>Check out the GitHub repository here:</p>
<p>Repo URL: <a target="_blank" href="https://github.com/vitalops/datatune">https://github.com/vitalops/datatune</a></p>
]]></content:encoded></item><item><title><![CDATA[Simplify the job for your Data Teams using Datatune]]></title><description><![CDATA[Data Engineering is hard, and finding the right data engineers for building your data pipelines is even harder.
In every modern company, data flows from dozens of directions: product analytics, billing systems, internal APIs, marketing platforms, clo...]]></description><link>https://blog.datatune.ai/simplify-the-job-for-your-data-teams-using-datatune</link><guid isPermaLink="true">https://blog.datatune.ai/simplify-the-job-for-your-data-teams-using-datatune</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Thu, 20 Nov 2025 03:25:19 GMT</pubDate><content:encoded><![CDATA[<p>Data Engineering is hard, and finding the right data engineers for building your data pipelines is even harder.</p>
<p>In every modern company, data flows from dozens of directions: product analytics, billing systems, internal APIs, marketing platforms, cloud storage, user logs, and more.<br />All of this data is supposed to power dashboards, decision-making, and AI workflows, where non-technical teams query the data and view dashboards, and technical teams wrangle with data transformations.</p>
<p>There are tons of tools that attempt to solve this using Agents that <a target="_blank" href="https://www.k2view.com/blog/sql-agent-llm/">generate code, or SQL</a> that runs on your data. Another way to access the data would be to implement a <a target="_blank" href="https://www.datamanagementblog.com/query-rag-a-new-way-to-ground-llms-with-facts-and-create-powerful-data-agents/">RAG pipeline</a> on top of your data.</p>
<p>However, there is one problem: these solutions won’t fully understand the data.</p>
<p>Here’s an example:</p>
<p>Check out the following data with columns such as: Index, Customer ID, First Name, Last Name, Company, City, Country, Phone 1, Phone 2, Email, Subscription Date, Website</p>
<pre><code class="lang-plaintext">Index,Customer Id,First Name,Last Name,Company,City,Country,Phone 1,Phone 2,Email,Subscription Date,Website
1,DD37Cf93aecA6Dc,Sheryl,Baxter,Rasmussen Group,East Leonard,Chile,229.077.5154,397.884.0519x718,zunigavanessa@smith.info,2020-08-24,http://www.stephenson.com/
2,1Ef7b82A4CAAD10,Preston,Lozano,Vega-Gentry,East Jimmychester,Djibouti,5153435776,686-620-1820x944,vmata@colon.com,2021-04-23,http://www.hobbs.com/
3,6F94879bDAfE5a6,Roy,Berry,Murillo-Perry,Isabelborough,Antigua and Barbuda,+1-539-402-0259,(496)978-3969x58947,beckycarr@hogan.com,2020-03-25,http://www.lawrence.com/
4,5Cef8BFA16c5e3c,Linda,Olsen,"Dominguez, Mcmillan and Donovan",Bensonview,Dominican Republic,001-808-617-6467x12895,+1-813-324-8756,stanleyblackwell@benson.org,2020-06-02,http://www.good-lyons.com/
5,053d585Ab6b3159,Joanna,Bender,"Martin, Lang and Andrade",West Priscilla,Slovakia (Slovak Republic),001-234-203-0635x76146,001-199-446-3860x3486,colinalvarado@miles.net,2021-04-17,https://goodwin-ingram.com/
6,2d08FB17EE273F4,Aimee,Downs,Steele Group,Chavezborough,Bosnia and Herzegovina,(283)437-3886x88321,999-728-1637,louis27@gilbert.com,2020-02-25,http://www.berger.net/
7,EA4d384DfDbBf77,Darren,Peck,"Lester, Woodard and Mitchell",Lake Ana,Pitcairn Islands,(496)452-6181x3291,+1-247-266-0963x4995,tgates@cantrell.com,2021-08-24,https://www.le.com/
8,0e04AFde9f225dE,Brett,Mullen,"Sanford, Davenport and Giles",Kimport,Bulgaria,001-583-352-7197x297,001-333-145-0369,asnow@colon.com,2021-04-12,https://hammond-ramsey.com/
9,C2dE4dEEc489ae0,Sheryl,Meyers,Browning-Simon,Robersonstad,Cyprus,854-138-4911x5772,+1-448-910-2276x729,mariokhan@ryan-pope.org,2020-01-13,https://www.bullock.net/
10,8C2811a503C7c5a,Michelle,Gallagher,Beck-Hendrix,Elaineberg,Timor-Leste,739.218.2516x459,001-054-401-0347x617,mdyer@escobar.net,2021-11-08,https://arias.com/
</code></pre>
<p>We will try to apply some filters and also anonymise the personally identifiable information of only women in the data.</p>
<pre><code class="lang-python">prompt = '''
Filter location for the American Continent.
Anonymise personally identifiable information of only women in the data by marking them as anonymised
'''
</code></pre>
<p>And this is what the data should look like after the two operations mentioned in the prompt:</p>
<pre><code class="lang-plaintext">Index,Customer Id,First Name,Last Name,Company,City,Country,Phone 1,Phone 2,Email,Subscription Date,Website,Is_American
0,1,DD37Cf93aecA6Dc,ANONYMIZED,ANONYMIZED,Rasmussen Group,East Leonard,Chile,ANONYMIZED,ANONYMIZED,ANONYMIZED,2020-08-24,http://www.stephenson.com/,True
2,3,6F94879bDAfE5a6,Roy,Berry,Murillo-Perry,Isabelborough,Antigua and Barbuda,+1-539-402-0259,(496)978-3969x58947,beckycarr@hogan.com,2020-03-25,http://www.lawrence.com/,True
3,4,5Cef8BFA16c5e3c,ANONYMIZED,ANONYMIZED,"Dominguez, Mcmillan and Donovan",Bensonview,Dominican Republic,ANONYMIZED,ANONYMIZED,ANONYMIZED,2020-06-02,http://www.good-lyons.com/,True
</code></pre>
<p>Both operations here require understanding each row of the data: filtering by country requires the pipeline to look at the address fields, and anonymising female names requires classifying which names in the data are female.</p>
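<p>To make the row-level logic concrete, here is a deterministic stand-in in plain Python. The hardcoded continent and name lookups are illustrative stand-ins for the judgment an LLM supplies, and the sample rows are trimmed to a few fields:</p>
<pre><code class="lang-python">AMERICAS = {"Chile", "Antigua and Barbuda", "Dominican Republic"}
FEMALE_NAMES = {"Sheryl", "Linda", "Joanna"}   # an LLM would classify names instead
PII_FIELDS = ["First Name", "Last Name", "Phone 1", "Phone 2", "Email"]

def transform(rows):
    out = []
    for row in rows:
        # FILTER: keep only rows from the American continent
        if row["Country"] not in AMERICAS:
            continue
        row = dict(row)
        # MAP: anonymise PII for rows classified as female
        if row["First Name"] in FEMALE_NAMES:
            for field in PII_FIELDS:
                row[field] = "ANONYMIZED"
        out.append(row)
    return out

rows = [
    {"First Name": "Sheryl", "Last Name": "Baxter", "Country": "Chile",
     "Phone 1": "229.077.5154", "Phone 2": "397.884.0519", "Email": "zunigavanessa@smith.info"},
    {"First Name": "Roy", "Last Name": "Berry", "Country": "Antigua and Barbuda",
     "Phone 1": "+1-539-402-0259", "Phone 2": "(496)978-3969", "Email": "beckycarr@hogan.com"},
    {"First Name": "Preston", "Last Name": "Lozano", "Country": "Djibouti",
     "Phone 1": "5153435776", "Phone 2": "686-620-1820", "Email": "vmata@colon.com"},
]
result = transform(rows)
</code></pre>
<p>An LLM-driven pipeline replaces both hardcoded sets with per-row reasoning supplied by the model.</p>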
<p>Traditional Code/SQL agents will not understand each row of the data, and RAG is not explicitly equipped to work with such data structures and respond with the same structure.</p>
<p>This is where Datatune can help you better.</p>
<p>Datatune’s agents automatically pick the right operations for the transformation, feed the LLM the full data, and handle rate limits and context-window issues for you, all in a scalable fashion!</p>
<p>This means a regular software engineer who is new to data engineering can easily take on this job, using Datatune Agents to build agentic software.</p>
<p>Here’s the full code example to set up Datatune Agents and perform the example above:  </p>
<p>Example: <a target="_blank" href="https://github.com/vitalops/datatune/blob/main/examples/data_anonymization.ipynb">https://github.com/vitalops/datatune/blob/main/examples/data_anonymization.ipynb</a></p>
]]></content:encoded></item><item><title><![CDATA[ScrapeGraphAI + Datatune, scrape the web, transform the data with Datatune]]></title><description><![CDATA[I was browsing Amazon for a new laptop, getting overwhelmed by the hundreds of options and confusing specifications. After spending two hours manually copying specs into a spreadsheet, I decided there had to be a better approach. That's when I rememb...]]></description><link>https://blog.datatune.ai/scrapegraphai-datatune-scrape-the-web-transform-the-data-with-datatune</link><guid isPermaLink="true">https://blog.datatune.ai/scrapegraphai-datatune-scrape-the-web-transform-the-data-with-datatune</guid><category><![CDATA[web scraping]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[llm]]></category><category><![CDATA[data extraction]]></category><category><![CDATA[data transformation]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Sun, 20 Jul 2025 23:12:52 GMT</pubDate><content:encoded><![CDATA[<p>I was browsing Amazon for a new laptop, getting overwhelmed by the hundreds of options and confusing specifications. After spending two hours manually copying specs into a spreadsheet, I decided there had to be a better approach. That's when I remembered reading about ScrapeGraphAI and thought it might be worth trying.</p>
<h2 id="heading-the-problem-with-manual-research">The Problem with Manual Research</h2>
<p>Shopping for laptops online presents several challenges:</p>
<ul>
<li><p>Hundreds of models with varying specifications</p>
</li>
<li><p>Prices that change frequently</p>
</li>
<li><p>Reviews are scattered across different sections</p>
</li>
<li><p>Difficult comparison between similar models</p>
</li>
</ul>
<p>Like many people, I started by opening multiple browser tabs and manually collecting information. This process was time-consuming and prone to errors.</p>
<h2 id="heading-what-is-scrapegraphai">What is ScrapeGraphAI?</h2>
<p><a target="_blank" href="https://scrapegraphai.com/">ScrapeGraphAI</a> is a Python library that uses LLMs to extract data from websites and documents. Instead of writing complex scraping code, you describe what information you want in plain language, and the library handles the extraction.</p>
<p>Key features include:</p>
<ul>
<li><p><strong>Natural language instructions</strong>: No need for complex CSS selectors or XPath</p>
</li>
<li><p><strong>Multiple LLM support</strong>: Works with GPT, Gemini, Groq, Azure, and local models via Ollama</p>
</li>
<li><p><strong>Flexible formats</strong>: Handles XML, HTML, JSON, and Markdown documents</p>
</li>
</ul>
<h2 id="heading-using-scrapegraphai-for-laptop-research">Using ScrapeGraphAI for Laptop Research</h2>
<p>Here's how I used it to extract laptop information from Amazon:</p>
<h3 id="heading-setup">Setup</h3>
<pre><code class="lang-bash">pip install scrapegraphai
playwright install
</code></pre>
<h3 id="heading-basic-implementation">Basic Implementation</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> scrapegraphai.graphs <span class="hljs-keyword">import</span> SmartScraperGraph

graph_config = {
    <span class="hljs-string">"llm"</span>: {
        <span class="hljs-string">"model"</span>: <span class="hljs-string">"gpt-3.5-turbo"</span>,
        <span class="hljs-string">"api_key"</span>: <span class="hljs-string">"YOUR_API_KEY"</span>
    },
    <span class="hljs-string">"verbose"</span>: <span class="hljs-literal">True</span>,
    <span class="hljs-string">"headless"</span>: <span class="hljs-literal">False</span>
}

smart_scraper = SmartScraperGraph(
    prompt=<span class="hljs-string">"Extract laptop name, price, rating, key specifications, and availability from this Amazon search page"</span>,
    source=<span class="hljs-string">"https://amazon.com/s?k=laptops+under+1500"</span>,
    config=graph_config
)

results = smart_scraper.run()
</code></pre>
<h3 id="heading-processing-the-output-with-datatune">Processing the Output with Datatune</h3>
<p>Once ScrapeGraphAI extracts the data, you can use <a target="_blank" href="https://github.com/vitalops/datatune">datatune</a> to clean and filter the results.</p>
<p>Install datatune using pip:</p>
<pre><code class="lang-bash">pip install datatune
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> dask.dataframe <span class="hljs-keyword">as</span> dd
<span class="hljs-keyword">import</span> datatune <span class="hljs-keyword">as</span> dt
<span class="hljs-keyword">from</span> datatune.llm.llm <span class="hljs-keyword">import</span> OpenAI
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

os.environ[<span class="hljs-string">"OPENAI_API_KEY"</span>] = <span class="hljs-string">"your-openai-api-key"</span>
llm = OpenAI(model_name=<span class="hljs-string">"gpt-3.5-turbo"</span>, tpm=<span class="hljs-number">200000</span>, rpm=<span class="hljs-number">50</span>)

<span class="hljs-comment"># Convert ScrapeGraphAI JSON output to DataFrame</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'smartscraper-2025-07-19.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
    scrape_data = json.load(f)

<span class="hljs-comment"># Extract laptops array and convert to DataFrame</span>
laptops_df = pd.DataFrame(scrape_data[<span class="hljs-string">'result'</span>][<span class="hljs-string">'laptops'</span>])
df = dd.from_pandas(laptops_df, npartitions=<span class="hljs-number">2</span>)

<span class="hljs-comment"># Map operation to standardize and enrich data</span>
mapped = dt.Map(
    prompt=<span class="hljs-string">"Standardize the processor name and categorize it as budget, mid-range, or high-end based on performance"</span>,
    output_fields=[<span class="hljs-string">"processor_category"</span>, <span class="hljs-string">"standardized_processor"</span>],
    input_fields=[<span class="hljs-string">"processor"</span>]
)(llm, df)

<span class="hljs-comment"># Filter for laptops with 16GB RAM and reasonable pricing</span>
filtered = dt.Filter(
    prompt=<span class="hljs-string">"Keep only laptops with 16GB RAM, price between $800-$2000, and battery life over 8 hours"</span>,
    input_fields=[<span class="hljs-string">"ram"</span>, <span class="hljs-string">"price"</span>, <span class="hljs-string">"batteryLife"</span>]
)(llm, mapped)

<span class="hljs-comment"># Additional mapping for value analysis</span>
value_mapped = dt.Map(
    prompt=<span class="hljs-string">"Calculate value score based on price, specs, and battery life. Rate as excellent, good, fair, or poor value"</span>,
    output_fields=[<span class="hljs-string">"value_rating"</span>, <span class="hljs-string">"value_score"</span>],
    input_fields=[<span class="hljs-string">"price"</span>, <span class="hljs-string">"ram"</span>, <span class="hljs-string">"ssd"</span>, <span class="hljs-string">"processor"</span>, <span class="hljs-string">"batteryLife"</span>, <span class="hljs-string">"weight"</span>]
)(llm, filtered)

<span class="hljs-comment"># Final cleanup and export</span>
result = dt.finalize(value_mapped)
final_df = result.compute()
final_df.to_csv(<span class="hljs-string">"filtered_laptops.csv"</span>, index=<span class="hljs-literal">False</span>)

print(<span class="hljs-string">f"Found <span class="hljs-subst">{len(final_df)}</span> laptops matching criteria"</span>)
print(final_df[[<span class="hljs-string">'brand'</span>, <span class="hljs-string">'model'</span>, <span class="hljs-string">'price'</span>, <span class="hljs-string">'value_rating'</span>]].head())
</code></pre>
<h3 id="heading-alternative-processing-approach">Alternative Processing Approach</h3>
<p>For more specific filtering based on use cases:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Filter for different use cases</span>
productivity_laptops = dt.Filter(
    prompt=<span class="hljs-string">"Keep laptops suitable for productivity work: good battery life, lightweight, reliable processor"</span>,
    input_fields=[<span class="hljs-string">"batteryLife"</span>, <span class="hljs-string">"weight"</span>, <span class="hljs-string">"processor"</span>, <span class="hljs-string">"ram"</span>]
)(llm, df)

gaming_laptops = dt.Filter(
    prompt=<span class="hljs-string">"Keep gaming laptops: powerful processor, dedicated graphics implied by model name, adequate RAM"</span>,
    input_fields=[<span class="hljs-string">"processor"</span>, <span class="hljs-string">"model"</span>, <span class="hljs-string">"ram"</span>, <span class="hljs-string">"price"</span>]
)(llm, df)

budget_laptops = dt.Filter(
    prompt=<span class="hljs-string">"Keep budget-friendly laptops under $1000 with decent specifications"</span>,
    input_fields=[<span class="hljs-string">"price"</span>, <span class="hljs-string">"ram"</span>, <span class="hljs-string">"processor"</span>, <span class="hljs-string">"ssd"</span>]
)(llm, df)

<span class="hljs-comment"># Process each category</span>
categories = {
    <span class="hljs-string">"productivity"</span>: productivity_laptops,
    <span class="hljs-string">"gaming"</span>: gaming_laptops, 
    <span class="hljs-string">"budget"</span>: budget_laptops
}

<span class="hljs-keyword">for</span> category, filtered_df <span class="hljs-keyword">in</span> categories.items():
    <span class="hljs-comment"># Add category information</span>
    categorized = dt.Map(
        prompt=<span class="hljs-string">f"Add category '<span class="hljs-subst">{category}</span>' and recommend this laptop with brief reasoning"</span>,
        output_fields=[<span class="hljs-string">"category"</span>, <span class="hljs-string">"recommendation_reason"</span>],
        input_fields=[<span class="hljs-string">"brand"</span>, <span class="hljs-string">"model"</span>, <span class="hljs-string">"price"</span>, <span class="hljs-string">"processor"</span>, <span class="hljs-string">"ram"</span>]
    )(llm, filtered_df)

    final_result = dt.finalize(categorized)
    final_result.compute().to_csv(<span class="hljs-string">f"<span class="hljs-subst">{category}</span>_laptops.csv"</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<h2 id="heading-final-output">Final Output</h2>
<p>After running the complete pipeline (ScrapeGraphAI extraction + datatune processing), here's what the final dataset looks like:</p>
<pre><code class="lang-plaintext">brand,model,price,ram,ssd,processor,screenSize,batteryLife,weight,releaseYear,processor_category,standardized_processor,value_rating,value_score,category,recommendation_reason
Apple,MacBook Pro,1299,16,512,Apple M1,13.3,18,1.4,2020,high-end,Apple M1 8-core,excellent,9.2,productivity,"Outstanding battery life and performance for professional work, lightweight design ideal for mobile productivity"
Lenovo,ThinkPad X1 Carbon,1399,16,512,Intel Core i7,14,15,1.1,2021,high-end,Intel Core i7-1165G7,good,8.1,productivity,"Ultra-lightweight business laptop with excellent build quality and long battery life, perfect for professionals"
ASUS,TUF Gaming A15,999,8,512,AMD Ryzen 5,15.6,10,2.2,2021,mid-range,AMD Ryzen 5 4600H,good,7.8,gaming,"Solid gaming performance at affordable price point, good processor and adequate RAM for modern games"
Acer,Aspire 5,599,8,256,Intel Core i5,15.6,9,1.75,2021,mid-range,Intel Core i5-1135G7,excellent,8.9,budget,"Outstanding value for money with solid specs for everyday computing, good balance of price and performance"
Microsoft,Surface Laptop 4,999,8,256,AMD Ryzen 5,13.5,19,1.27,2021,mid-range,AMD Ryzen 5 4680U,good,8.0,productivity,"Premium build quality with exceptional battery life, ideal for students and professionals who value portability"
</code></pre>
<h2 id="heading-available-pipeline-types">Available Pipeline Types</h2>
<p>ScrapeGraphAI offers several scraping approaches:</p>
<ul>
<li><p><strong>SmartScraperGraph</strong>: Single-page extraction with user prompts</p>
</li>
<li><p><strong>SearchGraph</strong>: Multi-page scraping from search results</p>
</li>
<li><p><strong>SpeechGraph</strong>: Converts web content to audio files</p>
</li>
<li><p><strong>ScriptCreatorGraph</strong>: Generates Python scripts for extracted data</p>
</li>
<li><p><strong>SmartScraperMultiGraph</strong>: Handles multiple pages simultaneously</p>
</li>
</ul>
<h2 id="heading-practical-applications">Practical Applications</h2>
<p>Beyond laptop shopping, this approach works for:</p>
<p><strong>E-commerce and Retail</strong></p>
<ul>
<li><p>Price monitoring across competitors</p>
</li>
<li><p>Product availability tracking</p>
</li>
<li><p>Review analysis</p>
</li>
<li><p>Market trend identification</p>
</li>
</ul>
<p><strong>Business Intelligence</strong></p>
<ul>
<li><p>Lead generation from directories</p>
</li>
<li><p>Competitor analysis</p>
</li>
<li><p>Market research automation</p>
</li>
<li><p>Social media monitoring</p>
</li>
</ul>
<p><strong>Research and Academia</strong></p>
<ul>
<li><p>Academic paper data extraction</p>
</li>
<li><p>Survey data collection</p>
</li>
<li><p>Content analysis</p>
</li>
<li><p>Dataset creation</p>
</li>
</ul>
<p><strong>Financial Services</strong></p>
<ul>
<li><p>Stock price tracking</p>
</li>
<li><p>Financial news monitoring</p>
</li>
<li><p>Economic indicator collection</p>
</li>
<li><p>Risk assessment data</p>
</li>
</ul>
<h2 id="heading-getting-started">Getting Started</h2>
<p>To try ScrapeGraphAI yourself:</p>
<ol>
<li><strong>Installation</strong></li>
</ol>
<pre><code class="lang-bash">pip install scrapegraphai
playwright install
</code></pre>
<ol start="2">
<li><strong>Basic Usage</strong></li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> scrapegraphai.graphs <span class="hljs-keyword">import</span> SmartScraperGraph

config = {
    <span class="hljs-string">"llm"</span>: {
        <span class="hljs-string">"model"</span>: <span class="hljs-string">"gpt-3.5-turbo"</span>, 
        <span class="hljs-string">"api_key"</span>: <span class="hljs-string">"your-api-key"</span>
    }
}

scraper = SmartScraperGraph(
    prompt=<span class="hljs-string">"Extract product names and prices"</span>,
    source=<span class="hljs-string">"https://example-shop.com"</span>,
    config=config
)

data = scraper.run()
print(data)
</code></pre>
<ol start="3">
<li><strong>Advanced Features</strong></li>
</ol>
<ul>
<li><p>Use local models with Ollama for privacy</p>
</li>
<li><p>Handle JavaScript-heavy sites with headless browsers</p>
</li>
<li><p>Process multiple pages simultaneously</p>
</li>
<li><p>Generate audio summaries of web content</p>
</li>
</ul>
<h2 id="heading-results-and-considerations">Results and Considerations</h2>
<p>In my laptop research, ScrapeGraphAI reduced data collection time from hours to minutes. The structured JSON output made it easy to compare options and identify the best value laptops in my budget range.  </p>
<p>You can check out more ScrapeGraphAI examples here: <a target="_blank" href="https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py/examples/sync">https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py/examples/sync</a></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>ScrapeGraphAI offers a practical solution for automated data extraction without requiring extensive programming knowledge. It significantly reduces the manual effort involved in gathering structured data from websites.</p>
<p>For anyone regularly collecting data from websites, whether for research, business intelligence, or personal projects, it's worth exploring. The natural language interface makes it accessible to non-programmers, while the flexibility supports more complex use cases.</p>
<p>The combination with datatune for post-processing provides a complete pipeline from raw web data to cleaned, categorized datasets ready for analysis.</p>
<h2 id="heading-give-us-a-star">Give us a star!</h2>
<p>Datatune: <a target="_blank" href="https://github.com/vitalops/datatune">https://github.com/vitalops/datatune</a></p>
<p>ScrapeGraphAI: <a target="_blank" href="https://github.com/ScrapeGraphAI/scrapegraph-sdk">https://github.com/ScrapeGraphAI/scrapegraph-sdk</a></p>
]]></content:encoded></item><item><title><![CDATA[Explore and Transform Your Data: A PandasAI + Datatune Tutorial]]></title><description><![CDATA[Heart Disease Risk Analysis with PandasAI and Datatune
Data analysis workflows have changed significantly with the introduction of Large Language Models. Instead of writing complex pandas operations, we can now use natural language to explore data an...]]></description><link>https://blog.datatune.ai/explore-and-transform-your-data-a-pandasai-datatune-tutorial</link><guid isPermaLink="true">https://blog.datatune.ai/explore-and-transform-your-data-a-pandasai-datatune-tutorial</guid><category><![CDATA[Data analysis, Data transformation, Natural language, LLM, AI]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Tue, 08 Jul 2025 12:45:54 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-heart-disease-risk-analysis-with-pandasai-and-datatune">Heart Disease Risk Analysis with PandasAI and Datatune</h1>
<p>Data analysis workflows have changed significantly with the introduction of Large Language Models. Instead of writing complex pandas operations, we can now use natural language to explore data and perform intelligent transformations. This tutorial demonstrates how to build a heart disease risk analyzer using PandasAI for conversational data exploration and Datatune for AI-powered data enrichment.  </p>
<p>PandasAI Github: <a target="_blank" href="https://github.com/sinaptik-ai/pandas-ai/">https://github.com/sinaptik-ai/pandas-ai/</a></p>
<p>Datatune Github: <a target="_blank" href="https://github.com/vitalops/datatune">https://github.com/vitalops/datatune</a></p>
<h2 id="heading-what-well-build">What We'll Build</h2>
<p>Our heart disease analyzer will:</p>
<ul>
<li><p>Load medical data and explore it using natural language queries</p>
</li>
<li><p>Use AI to generate risk assessments and patient categories</p>
</li>
<li><p>Filter high-risk patients for targeted analysis</p>
</li>
<li><p>Save structured insights for healthcare teams</p>
</li>
</ul>
<h2 id="heading-prerequisites-and-setup">Prerequisites and Setup</h2>
<p>Install the required packages:</p>
<pre><code class="lang-bash">pip install pandas pandasai datatune dask python-dotenv
</code></pre>
<p>Create a <code>.env</code> file with your Azure OpenAI credentials:</p>
<pre><code class="lang-plaintext">AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_API_BASE=https://your-resource.openai.azure.com/
AZURE_API_VERSION=2025-01-01-preview
</code></pre>
<h2 id="heading-building-the-heart-disease-analyzer">Building the Heart Disease Analyzer</h2>
<h3 id="heading-1-initial-setup-and-configuration">1. Initial Setup and Configuration</h3>
<pre><code class="lang-python"><span class="hljs-comment">#!/usr/bin/env python3</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> pandasai <span class="hljs-keyword">as</span> pai
<span class="hljs-keyword">import</span> datatune <span class="hljs-keyword">as</span> dt
<span class="hljs-keyword">import</span> dask.dataframe <span class="hljs-keyword">as</span> dd
<span class="hljs-keyword">from</span> datatune.llm.llm <span class="hljs-keyword">import</span> Azure
<span class="hljs-keyword">from</span> pandasai_openai <span class="hljs-keyword">import</span> AzureOpenAI
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json

load_dotenv()

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">HeartDiseaseAnalyzer</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        api_key = os.getenv(<span class="hljs-string">"AZURE_OPENAI_API_KEY"</span>)
        api_base = os.getenv(<span class="hljs-string">"AZURE_API_BASE"</span>)
        api_version = os.getenv(<span class="hljs-string">"AZURE_API_VERSION"</span>, <span class="hljs-string">"2025-01-01-preview"</span>)

        self.llm = Azure(
            model_name=<span class="hljs-string">"gpt-4o-mini"</span>,
            api_key=api_key,
            api_base=api_base,
            api_version=api_version,
        )

        pandas_llm = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=api_base,
            api_version=api_version,
            deployment_name=<span class="hljs-string">'gpt-4o-mini'</span>
        )

        pai.config.set({<span class="hljs-string">"llm"</span>: pandas_llm})

        self.df = <span class="hljs-literal">None</span>
        self.pai_df = <span class="hljs-literal">None</span>
        self.enriched_df = <span class="hljs-literal">None</span>
        self.high_risk_df = <span class="hljs-literal">None</span>
        self.agent = <span class="hljs-literal">None</span>
</code></pre>
<p>This setup creates two separate LLM instances: one for Datatune transformations and another for PandasAI exploration. We are using Azure OpenAI instances for both tools.</p>
<h3 id="heading-2-data-loading-with-pandasai-integration">2. Data Loading with PandasAI Integration</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_data</span>(<span class="hljs-params">self, filepath=<span class="hljs-string">"heart.csv"</span></span>):</span>
    self.df = pd.read_csv(filepath)
    self.pai_df = pai.DataFrame(self.df)
    self.agent = pai.Agent(self.pai_df)
    <span class="hljs-keyword">return</span> self.df
</code></pre>
<p>We load data into both regular pandas and PandasAI DataFrames. The <code>pai.Agent</code> creates a conversational interface that allows us to ask questions about our data in plain English while maintaining access to traditional pandas operations.</p>
<h3 id="heading-3-conversational-data-exploration">3. Conversational Data Exploration</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">explore_data</span>(<span class="hljs-params">self</span>):</span>
    questions = [
        <span class="hljs-string">"What's the distribution of heart disease cases?"</span>,
        <span class="hljs-string">"What's the average age of patients with and without heart disease?"</span>,
        <span class="hljs-string">"Which chest pain types are most common in heart disease patients?"</span>,
        <span class="hljs-string">"What's the correlation between cholesterol levels and heart disease?"</span>
    ]

    insights = {}
    <span class="hljs-keyword">for</span> question <span class="hljs-keyword">in</span> questions:
        response = self.pai_df.chat(question)
        insights[question] = str(response)

    <span class="hljs-keyword">with</span> open(<span class="hljs-string">"exploration_insights.json"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
        json.dump(insights, f, indent=<span class="hljs-number">2</span>)

    <span class="hljs-keyword">return</span> insights
</code></pre>
<p>Instead of writing complex Python code, we ask direct questions about our data. PandasAI uses LLMs to understand the context and generate the appropriate statistical analysis. The results are saved as JSON for later review and reporting.</p>
<p>You can ask questions like:</p>
<ul>
<li><p>"What percentage of patients have high blood pressure?"</p>
</li>
<li><p>"Which features correlate most strongly with heart disease?"</p>
</li>
</ul>
<h3 id="heading-4-intelligent-data-enrichment-with-datatune">4. Intelligent Data Enrichment with Datatune</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">enrich_data</span>(<span class="hljs-params">self</span>):</span>
    dask_df = dd.from_pandas(self.df, npartitions=<span class="hljs-number">4</span>)

    risk_profiled = dt.Map(
        prompt=(
            <span class="hljs-string">"For each patient, analyze their health metrics and assign:\n"</span>
            <span class="hljs-string">"- `severity_score`: calculate a numeric score from 1-10 based on their combined risk factors (age, cholesterol, BP, etc.)\n"</span>
            <span class="hljs-string">"- `age_category`: categorize age into [young (&lt;40), middle-aged (40–60), senior (&gt;60)]\n"</span>
            <span class="hljs-string">"- `risk_profile`: based on their overall health indicators, assign [low-risk, moderate-risk, high-risk]\n"</span>
            <span class="hljs-string">"Consider all available health metrics for comprehensive assessment."</span>
        ),
        output_fields=[<span class="hljs-string">"severity_score"</span>, <span class="hljs-string">"age_category"</span>, <span class="hljs-string">"risk_profile"</span>]
    )(self.llm, dask_df)

    high_risk_patients = dt.Filter(
        prompt=(
            <span class="hljs-string">"Keep only patients who have either high severity scores (7+) or are classified as high-risk profile and are middle-aged."</span>
        )
    )(self.llm, risk_profiled)

    self.enriched_df = dt.finalize(risk_profiled.compute())
    self.high_risk_df = dt.finalize(high_risk_patients.compute())
    <span class="hljs-keyword">return</span> self.enriched_df
</code></pre>
<p>This is where Datatune shows its strength. The <code>Map</code> operation creates new columns (and can also replace existing columns) by analyzing multiple health metrics simultaneously. The AI understands context and generates appropriate risk assessments based on medical terminologies. The <code>Filter</code> operation uses natural language to apply complex filtering logic that would require contextual understanding of the data.</p>
<p>The use of Dask allows us to process larger datasets efficiently through distributed computing.</p>
<h3 id="heading-5-advanced-analysis-of-enriched-data">5. Advanced Analysis of Enriched Data</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">analyze_enriched</span>(<span class="hljs-params">self</span>):</span>
    enriched_pai_df = pai.DataFrame(self.enriched_df)

    questions = [
        <span class="hljs-string">"What's the distribution of severity scores across the dataset?"</span>,
        <span class="hljs-string">"How do risk profiles correlate with actual heart disease cases?"</span>,
        <span class="hljs-string">"Which age category has the highest proportion of high-risk profiles?"</span>
    ]

    insights = {}
    <span class="hljs-keyword">for</span> question <span class="hljs-keyword">in</span> questions:
        response = enriched_pai_df.chat(question)
        insights[question] = str(response)

    <span class="hljs-keyword">with</span> open(<span class="hljs-string">"enriched_analysis_insights.json"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
        json.dump(insights, f, indent=<span class="hljs-number">2</span>)

    <span class="hljs-keyword">return</span> insights
</code></pre>
<p>After enriching our data, we can explore the new features to validate our AI-generated insights. This step helps us understand whether our enrichment strategy is working effectively and reveals new patterns in the data.</p>
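<p>Beyond the chat-based questions, a plain-pandas cross-tabulation is a quick way to check whether the AI-assigned profiles line up with the ground-truth label. A minimal sketch on synthetic data (the real label column name depends on your dataset; the UCI heart data commonly uses <code>target</code>):</p>

```python
import pandas as pd

# Synthetic stand-in for the enriched DataFrame.
enriched = pd.DataFrame({
    "risk_profile": ["high-risk", "low-risk", "high-risk", "moderate-risk", "low-risk"],
    "target": [1, 0, 1, 0, 0],
})

# Rows: AI-assigned profile; columns: actual heart-disease label (0/1).
table = pd.crosstab(enriched["risk_profile"], enriched["target"])
print(table)
```

<p>If the enrichment is working, the high-risk rows should concentrate in the positive label column.</p>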
<h3 id="heading-6-results">6. Results</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">save_results</span>(<span class="hljs-params">self</span>):</span>
    print(<span class="hljs-string">'saving results'</span>)
    self.enriched_df.to_csv(<span class="hljs-string">"enriched_heart_data.csv"</span>, index=<span class="hljs-literal">False</span>)
    self.high_risk_df.to_csv(<span class="hljs-string">"high_risk_patients.csv"</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<p>We save both the enriched dataset and the filtered high-risk patients for further analysis or use in downstream applications.</p>
<h3 id="heading-7-putting-it-all-together">7. Putting It All Together</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    analyzer = HeartDiseaseAnalyzer()
    analyzer.load_data()
    analyzer.explore_data()
    analyzer.enrich_data()
    analyzer.analyze_enriched()
    analyzer.save_results()

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<h2 id="heading-key-benefits-of-this-approach">Key Benefits of This Approach</h2>
<p><strong>PandasAI Advantages:</strong></p>
<ul>
<li><p>Non-technical stakeholders can explore data using natural language</p>
</li>
<li><p>Rapid prototyping without complex pandas code</p>
</li>
<li><p>AI can identify patterns that might be missed</p>
</li>
<li><p>Automatic visualization generation</p>
</li>
</ul>
<p><strong>Datatune Advantages:</strong></p>
<ul>
<li><p>AI understands domain context for better feature engineering</p>
</li>
<li><p>Natural language expressions for complex data operations</p>
</li>
<li><p>Built on Dask for scalable processing</p>
</li>
<li><p>Reproducible transformations with clear prompts</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The combination of PandasAI and Datatune represents a shift toward more intuitive data analysis: it opens data exploration to both technical and non-technical teams while producing insights faster and in greater depth.</p>
<p>This approach is particularly valuable in domains such as healthcare and finance, where AI models can incorporate domain knowledge that would be difficult to encode in traditional rule-based systems.</p>
]]></content:encoded></item><item><title><![CDATA[Solving Complex Data pipelines with Composio + Datatune]]></title><description><![CDATA[Table of Contents:

The Data Integration Challenge

Composio : Integration Platform  for AI Agents & LLMs

Datatune: Perform transformations on your data with natural language

Real-World Example: Analyzing GitHub Issues

Summary


The Data Integrati...]]></description><link>https://blog.datatune.ai/solving-complex-data-pipelines-with-composio-datatune</link><guid isPermaLink="true">https://blog.datatune.ai/solving-complex-data-pipelines-with-composio-datatune</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Abhijith Neil Abraham]]></dc:creator><pubDate>Wed, 11 Jun 2025 03:34:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749614067713/7071d5b8-fafb-4a8f-8bc5-83dfe2126fe6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-table-of-contents"><strong>Table of Contents:</strong></h1>
<ul>
<li><p><em>The Data Integration Challenge</em></p>
</li>
<li><p><em>Composio: Integration Platform for AI Agents &amp; LLMs</em></p>
</li>
<li><p><em>Datatune: Perform transformations on your data with natural language</em></p>
</li>
<li><p><em>Real-World Example: Analyzing GitHub Issues</em></p>
</li>
<li><p><em>Summary</em></p>
</li>
</ul>
<h1 id="heading-the-data-integration-challenge"><strong>The Data Integration Challenge</strong></h1>
<p>Modern businesses rely on data scattered across dozens of platforms such as CRMs, project management tools, communication platforms, databases, and APIs. The traditional approach involves:</p>
<ul>
<li><p>Complex API integrations with different authentication methods</p>
</li>
<li><p>Constant maintenance as APIs change and evolve</p>
</li>
<li><p>Complicated code for data transformations that require semantic understanding of the context</p>
</li>
</ul>
<p>What if you could connect to any external service, pull data, and transform it using nothing but natural language? Let’s see how we can engineer a data pipeline using Composio and Datatune.</p>
<h1 id="heading-composio-integration-platform-for-ai-agents-amp-llms">Composio: Integration Platform for AI Agents &amp; LLMs</h1>
<p><a target="_blank" href="https://composio.dev/">Composio</a> eliminates integration complexity by providing:</p>
<ul>
<li><p>200+ pre-built integrations across every major platform (e.g. Salesforce, GitHub, Slack, Google Sheets, Notion)</p>
</li>
<li><p>One-click authentication handling OAuth, API keys, and complex flows</p>
</li>
<li><p>Unified interface that abstracts away API differences</p>
</li>
<li><p>Built for AI workflows with structured, consistent outputs.</p>
</li>
</ul>
<p>Composio also lets you connect MCPs to your AI agents, so in very few steps you can avoid painful API orchestration, redundant boilerplate code, and platform-specific edge cases.</p>
<h1 id="heading-datatune-perform-transformations-on-your-data-with-natural-language"><strong>Datatune: Perform transformations on your data with natural language</strong></h1>
<p>One of the major complexities of data pipelines is transforming messy tabular data into clean, usable formats, especially when the transformation requires understanding the semantic meaning of the data and the task at hand.</p>
<p>Consider a sales spreadsheet with product names like “iPhone 15 Pro Max 256GB Blue”. Extracting just the color would normally require complex regex patterns to handle every variation. With <a target="_blank" href="https://github.com/vitalops/datatune">Datatune</a>, you simply say “Extract the color from product name” and it understands context automatically. You can then chain operations naturally: first extract colors and categories with Map, then Filter to “Keep only blue electronics,” and finally clean up the results.</p>
<p>This approach is powerful because each step builds on the previous one, letting you transform millions of rows by describing what you want rather than writing lengthy pandas or regex code to do the same job.</p>
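<p>For contrast, a hand-rolled version of the color extraction might look like the sketch below. It only handles colors you enumerate up front, which is exactly the maintenance burden the natural-language prompt avoids:</p>

```python
import pandas as pd

# Brittle regex alternative: every new color or naming variation needs a
# pattern update; the Datatune prompt handles such variation implicitly.
KNOWN_COLORS = ["Blue", "Black", "Silver", "Gold", "Titanium"]
pattern = r"\b(" + "|".join(KNOWN_COLORS) + r")\b"

products = pd.DataFrame({"product_name": [
    "iPhone 15 Pro Max 256GB Blue",
    "Galaxy S24 Ultra 512GB Black",
    "Pixel 9 128GB Obsidian",  # color missing from our list -> not extracted
]})
products["color"] = products["product_name"].str.extract(pattern, expand=False)
print(products["color"].tolist())  # ['Blue', 'Black', nan]
```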
<h1 id="heading-real-world-example-analyzing-github-issues"><strong>Real-World Example: Analyzing GitHub Issues</strong></h1>
<p>Let’s walk through a practical example where we analyze GitHub issues to help maintainers prioritize their work. We will fetch issues from <a target="_blank" href="https://github.com/pytorch/pytorch">pytorch</a>’s GitHub repository using Composio, then process the data with Datatune to find issues that could be “good first contributions for new developers”. Here’s how simple it becomes with Composio + Datatune:</p>
<h2 id="heading-install-dependencies"><strong>Install dependencies</strong></h2>
<p>Install both libraries (dask will be automatically installed with datatune) and dotenv for loading your environment variables:</p>
<pre><code class="lang-plaintext">pip install composio datatune python-dotenv
</code></pre>
<h2 id="heading-setup-and-configuration"><strong>Setup and Configuration</strong></h2>
<p>We need an API key to interact with Composio. Go to <a target="_blank" href="https://app.composio.dev/">https://app.composio.dev</a>, log in, and get your API key.</p>
<p>For using LLMs with Datatune, you can use OpenAI, local models via Ollama, or any other API provider such as Azure. For more info on using different providers, refer to this link: <a target="_blank" href="https://docs.datatune.ai/LLM.html">https://docs.datatune.ai/LLM.html</a></p>
<p>For the sake of this article, we will use Azure OpenAI as the provider.</p>
<p>Once you’re ready with all the credentials, create a .env file and add your environment variables like this:</p>
<pre><code class="lang-plaintext">COMPOSIO_API_KEY=your-composio-key
AZURE_OPENAI_API_KEY=your-key
AZURE_API_BASE=https://your-endpoint.openai.azure.com/
AZURE_API_VERSION=2024-02-01
</code></pre>
<p>Let’s import the libraries:</p>
<pre><code class="lang-plaintext">
import os
import pandas as pd
import dask.dataframe as dd
import datatune as dt
from composio import ComposioToolSet, App, Action
from datatune.core.map import Map
from datatune.core.filter import Filter
from datatune.llm.llm import Azure
from dotenv import load_dotenv
load_dotenv()


COMPOSIO_API_KEY = os.getenv("COMPOSIO_API_KEY")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
api_base = os.getenv("AZURE_API_BASE")
api_version = os.getenv("AZURE_API_VERSION", "2024-02-01")
</code></pre>
<h2 id="heading-connect-to-github-with-composio"><strong>Connect to GitHub with Composio</strong></h2>
<p>We will use Composio’s ComposioToolSet to connect to the GitHub repository of <a target="_blank" href="https://github.com/pytorch/pytorch">PyTorch</a>. Composio provides several actions for each integration; in our case, we use the GITHUB_LIST_REPOSITORY_ISSUES action, which returns the issue data we need via the following function.</p>
<p>Let’s get the issues from <a target="_blank" href="https://github.com/pytorch/pytorch">https://github.com/pytorch/pytorch</a>, so set repo_owner and repo_name both to ‘pytorch’.</p>
<pre><code class="lang-plaintext">def fetch_github_issues(toolset, repo_owner="pytorch", repo_name="pytorch", limit=30):
    result = toolset.execute_action(
        action=Action.GITHUB_LIST_REPOSITORY_ISSUES,
        params={
            "owner": repo_owner,
            "repo": repo_name,
            "state": "open",
            "per_page": limit
        }
    )

    # Extract issues data from result
    issues_data = []

    if isinstance(result, dict) and result.get('successful'):
        data = result.get('data', {})

        if isinstance(data, list):
            issues_data = data
        elif isinstance(data, dict):
            # Look for issues in common response patterns
            for key in ['details', 'items', 'data', 'issues', 'results']:
                if key in data and isinstance(data[key], list):
                    issues_data = data[key]
                    break

            # Check if it's a single issue object
            if not issues_data and 'number' in data and 'title' in data:
                issues_data = [data]

    elif isinstance(result, list):
        issues_data = result

    if not isinstance(issues_data, list):
        return pd.DataFrame()

    # Process issues into DataFrame
    processed_issues = []
    for i, issue in enumerate(issues_data):
        if i &gt;= limit:
            break

        if isinstance(issue, dict):
            processed_issues.append({
                "issue_number": issue.get("number"),
                "title": issue.get("title", ""),
                "issue_body": issue.get("body", "")[:500] if issue.get("body") else "",
                "state": issue.get("state", ""),
                "comments_count": issue.get("comments", 0),
                "labels": [label.get("name", "") for label in issue.get("labels", [])] if issue.get("labels") else [],
                "created_at": issue.get("created_at", ""),
                "updated_at": issue.get("updated_at", ""),
                "html_url": issue.get("html_url", "")
            })

    return pd.DataFrame(processed_issues)
</code></pre>
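<p>The unwrapping branches above can be exercised offline by factoring them into a helper and feeding it a mocked response (the payload shape below is hypothetical, mirroring the patterns the function checks):</p>

```python
def extract_issues(result):
    # Same unwrapping logic as fetch_github_issues, isolated so it can be
    # tested without a live Composio call.
    issues_data = []
    if isinstance(result, dict) and result.get("successful"):
        data = result.get("data", {})
        if isinstance(data, list):
            issues_data = data
        elif isinstance(data, dict):
            # Look for issues in common response patterns.
            for key in ["details", "items", "data", "issues", "results"]:
                if key in data and isinstance(data[key], list):
                    issues_data = data[key]
                    break
            # A single issue object comes back wrapped in a list.
            if not issues_data and "number" in data and "title" in data:
                issues_data = [data]
    elif isinstance(result, list):
        issues_data = result
    return issues_data if isinstance(issues_data, list) else []

# Mocked Composio-style response (hypothetical shape).
mock = {"successful": True, "data": {"details": [{"number": 1, "title": "Fix typo"}]}}
print(len(extract_issues(mock)))  # 1
```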
<h2 id="heading-transform-data-with-natural-language-using-datatune"><strong>Transform Data with Natural Language using Datatune</strong></h2>
<p>The result data from the above function contains the following columns: issue_number, title, issue_body, state, comments_count, labels, created_at, updated_at, html_url</p>
<p>Instead of writing complicated Python code to edit this data, we will simply use Datatune.</p>
<p>We will perform two major operations chained together.</p>
<ol>
<li><strong>Map Operation: To Replace values or Add new columns based on existing data</strong></li>
</ol>
<p>In our case, we will use the map operation to classify issues by severity level, estimated effort, and issue type (bug, feature, documentation, or other), and output the results into new columns.</p>
<p><strong>2. Filter Operation: Remove Specific Rows</strong></p>
<p>We will remove the rows that are not good first issues.</p>
<p>Let’s see how to write Datatune prompts for these operations and chain them. We will use gpt-4.1-mini as the LLM for both operations. Since Datatune uses Dask under the hood, we call the .compute() method on the Dask dataframe to trigger the data transformation. Finally, we apply dt.finalize() to clear the internal metadata created during the process.</p>
<pre><code class="lang-plaintext">def analyze_with_datatune(df):
    if df.empty:
        return pd.DataFrame()

    dask_df = dd.from_pandas(df, npartitions=1)

    llm = Azure(
        model_name="gpt-4.1-mini",
        api_key=api_key,
        api_base=api_base,
        api_version=api_version,
    )


    # Map operation: Analyze each issue
    mapped = Map(
        prompt="Based on the issue title, description, and labels, determine: 1) severity (high/medium/low) - consider critical bugs, memory leaks, crashes as high; 2) estimated effort to fix (high/medium/low); 3) issue type (bug/feature/documentation/other)",
        output_fields=["severity", "estimated_effort", "issue_type"]
    )(llm, dask_df)

    # Filter operation: Find good first issues
    good_first_issues = Filter(
        prompt="Keep issues that look like they could be good first contributions for new developers"
    )(llm, mapped)


    return dt.finalize(good_first_issues.compute())
</code></pre>
<p>Let’s wrap everything up and take a look at the full code:</p>
<pre><code class="lang-plaintext">import os
import pandas as pd
import dask.dataframe as dd
from composio import ComposioToolSet, App, Action
from datatune.core.map import Map
from datatune.core.filter import Filter
from datatune.llm.llm import Azure
import datatune as dt
from dotenv import load_dotenv
load_dotenv()

# Configuration
COMPOSIO_API_KEY = os.getenv("COMPOSIO_API_KEY")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
api_base = os.getenv("AZURE_API_BASE")
api_version = os.getenv("AZURE_API_VERSION", "2024-02-01")

def setup_composio():
    toolset = ComposioToolSet(api_key=COMPOSIO_API_KEY)
    return toolset

def fetch_github_issues(toolset, repo_owner="pytorch", repo_name="pytorch", limit=30):
    result = toolset.execute_action(
        action=Action.GITHUB_LIST_REPOSITORY_ISSUES,
        params={
            "owner": repo_owner,
            "repo": repo_name,
            "state": "open",
            "per_page": limit
        }
    )

    # Extract issues data from result
    issues_data = []

    if isinstance(result, dict) and result.get('successful'):
        data = result.get('data', {})

        if isinstance(data, list):
            issues_data = data
        elif isinstance(data, dict):
            # Look for issues in common response patterns
            for key in ['details', 'items', 'data', 'issues', 'results']:
                if key in data and isinstance(data[key], list):
                    issues_data = data[key]
                    break

            # Check if it's a single issue object
            if not issues_data and 'number' in data and 'title' in data:
                issues_data = [data]

    elif isinstance(result, list):
        issues_data = result

    if not isinstance(issues_data, list):
        return pd.DataFrame()

    # Process issues into DataFrame
    processed_issues = []
    for i, issue in enumerate(issues_data):
        if i &gt;= limit:
            break

        if isinstance(issue, dict):
            processed_issues.append({
                "issue_number": issue.get("number"),
                "title": issue.get("title", ""),
                "issue_body": issue.get("body", "")[:500] if issue.get("body") else "",
                "state": issue.get("state", ""),
                "comments_count": issue.get("comments", 0),
                "labels": [label.get("name", "") for label in issue.get("labels", [])] if issue.get("labels") else [],
                "created_at": issue.get("created_at", ""),
                "updated_at": issue.get("updated_at", ""),
                "html_url": issue.get("html_url", "")
            })

    return pd.DataFrame(processed_issues)

def analyze_with_datatune(df):
    if df.empty:
        return pd.DataFrame()

    dask_df = dd.from_pandas(df, npartitions=1)

    llm = Azure(
        model_name="gpt-4.1-mini",
        api_key=api_key,
        api_base=api_base,
        api_version=api_version,
    )


    # Map operation: Analyze each issue
    mapped = Map(
        prompt="Based on the issue title, description, and labels, determine: 1) severity (high/medium/low) - consider critical bugs, memory leaks, crashes as high; 2) estimated effort to fix (high/medium/low); 3) issue type (bug/feature/documentation/other)",
        output_fields=["severity", "estimated_effort", "issue_type"]
    )(llm, dask_df)

    # Filter operation: Find good first issues
    good_first_issues = Filter(
        prompt="Keep issues that look like they could be good first contributions for new developers"
    )(llm, mapped)

    final_df = good_first_issues.compute()
    return dt.finalize(final_df)


def main():
    toolset = setup_composio()
    issues_df = fetch_github_issues(toolset)

    good_first_issues = analyze_with_datatune(issues_df)
    if not good_first_issues.empty:
        good_first_issues.to_csv("good_first_issues.csv", index=False)
        print(f"  - good_first_issues.csv ({len(good_first_issues)} issues)")

if __name__ == "__main__":
    main()
</code></pre>
<p>The results in the good_first_issues.csv should look something like this:</p>
<pre><code class="lang-plaintext">issue_number,title,issue_body,state,comments_count,labels,created_at,updated_at,html_url,severity,estimated_effort,issue_type
45123,"Fix typo in torch.nn.functional documentation","There's a typo in the documentation for F.relu where 'activation' is misspelled as 'activaton'. This should be a simple fix...",open,2,"['good first issue', 'module: docs']",2025-01-15T14:22:31Z,2025-01-15T16:45:12Z,https://github.com/pytorch/pytorch/issues/45123,low,low,documentation
45067,"Add unit test for DataLoader pin_memory","The pin_memory functionality in DataLoader is missing unit tests. We need to add tests that verify tensors are properly pinned...",open,4,"['good first issue', 'module: tests', 'module: dataloader']",2025-01-14T09:18:55Z,2025-01-16T08:30:22Z,https://github.com/pytorch/pytorch/issues/45067,low,low,other
44982,"Update error message for mismatched tensor sizes","When tensors have mismatched sizes in operations, the error message could be clearer. Currently shows indices, but should show actual shapes...",open,1,"['good first issue', 'module: error messages']",2025-01-12T11:45:33Z,2025-01-13T10:12:44Z,https://github.com/pytorch/pytorch/issues/44982,low,low,feature
</code></pre>
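<p>One practical note when loading this CSV back: the labels column round-trips as the string representation of a Python list, so something like the sketch below (assuming the format shown above) recovers the actual lists:</p>

```python
import ast
import io

import pandas as pd

# Inline sample mimicking one row of good_first_issues.csv.
sample = (
    "issue_number,title,labels\n"
    "45123,Fix typo in docs,\"['good first issue', 'module: docs']\"\n"
)
df = pd.read_csv(io.StringIO(sample))

# CSV stored the list as its repr; ast.literal_eval parses it back to a list.
df["labels"] = df["labels"].apply(ast.literal_eval)
print(df["labels"].iloc[0])  # ['good first issue', 'module: docs']
```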
<h1 id="heading-summary"><strong>Summary</strong></h1>
<p>Using Composio and Datatune saves countless hours when engineering data pipelines: Composio abstracts away the integration architecture, while Datatune brings semantic understanding to the transformations performed on the data.</p>
]]></content:encoded></item></channel></rss>