Dataemon: AI-Powered Data Assistant Chatbot (LLM)

Present

Description: Building an LLM-powered data analysis chatbot that lets users upload a spreadsheet (CSV/XLSX), ask questions about their dataset, and receive statistical insights, statistical test results, visualizations, and summary reports. Concept inspired by Doraemon.

Code, Notebook and Results:

  • Github: [in progress]

  • Jupyter Notebook: [in progress]

  • HTML: [in progress]

Motivation & Project Overview

🚀 Goal: Create an LLM data analysis chatbot that lets users interact with their data without writing code.

This project is about creating a Neural Network-based Data Analysis Assistant that allows users to upload a spreadsheet (CSV/XLSX), ask questions about their dataset, and receive statistical insights or analysis.

The assistant will use Natural Language Processing (NLP) + Data Processing Layers to:

  1. Understand what the user wants (e.g., "Find the mean of column X").

  2. Process the request (calculate mean, median, correlation, create graphs/visualizations, etc.).

  3. Format the output in a human-readable way.

  4. Retain context from past user interactions to generate chat summaries and improve query understanding.

Example User Interaction

User: Uploads sales_data.xlsx

👥User: What is the average revenue?

📊Dataemon: The mean revenue for the dataset is $35,000.

👥User: Is there a correlation between revenue and ad spend?

📊Dataemon: There is a strong positive correlation (0.85) between 'Ad Spend' and 'Revenue'. This suggests that as ad spending increases, revenue tends to increase as well.

Here’s a scatter plot visualizing this correlation: [Visualization: Scatter plot of Revenue vs. Ad Spend with trendline]

👥User: Find the highest revenue and in which month it was recorded.

📊Dataemon: The highest revenue was $50,000, recorded in March.

Here’s a bar chart showing monthly revenue trends: [Visualization: Bar chart with Revenue per Month]

👥User: How many customers were acquired each month?

📊Dataemon: The number of customers per month is as follows:

  • January: 320

  • February: 400

  • March: 520

Here’s a line chart showing the customer growth trend: [Visualization: Line chart showing Customer Growth over Months]

👥User: Can you summarize our chat so far?

📊Dataemon: 📑Conversation Summary:

✔ Key Metrics:

  • Average Revenue: $35,000

  • Highest Revenue: $50,000 (March)

  • Customer Growth: 320 → 520 (steady increase)

  • Ad Spend vs. Revenue: Strong positive correlation (0.85)

📊 Generated Visuals:

  • 1️⃣ Scatter Plot: Ad Spend vs. Revenue (Strong correlation)

  • 2️⃣ Bar Chart: Monthly Revenue

  • 3️⃣ Line Chart: Customer Growth

💡 Key Takeaway:

  • Increased Ad Spend has likely contributed to higher Revenue and Customer Growth. March saw the highest performance across all metrics. Further analysis could explore cost-efficiency per customer acquisition.

Would you like a detailed report or additional insights? 😊


Current Implementation Plan

[The project is in progress, so this plan may change.]

1️⃣ NLP Layer – Understanding User Queries

This component interprets natural language queries and converts them into structured commands.

  • Approach: Use a fine-tuned NLP model (e.g., a Transformer such as BERT or T5, or OpenAI's GPT-2).

  • Libraries: spaCy, NLTK, Transformers (Hugging Face), LangChain

  • Steps:

    • Train/fine-tune a model to understand statistical terms and operations (mean, median, correlation, trend, etc.).

    • Use Named Entity Recognition (NER) to extract relevant column names.

    • Convert queries into structured operations (e.g., "Find the mean of column X" → df['X'].mean()); a minimal sketch of this step follows below.
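To make the target concrete, here is a minimal, hypothetical sketch of the structured-operation format this layer could emit. The keyword matcher is only a stand-in for the fine-tuned model, and `parse_query` and the operation names are names I am assuming for illustration:

```python
import re

# Hypothetical structured-operation format: {"operation": ..., "columns": [...]}.
# A fine-tuned Transformer would replace this simple keyword matcher.
OPERATION_KEYWORDS = {
    "mean": ["mean", "average"],
    "median": ["median"],
    "correlation": ["correlation", "correlated", "relationship"],
    "max": ["highest", "maximum", "largest"],
}

def parse_query(query: str, columns: list[str]) -> dict:
    """Map a natural-language question to a structured operation."""
    text = query.lower()
    # Detect the requested statistic via keyword matching.
    operation = next(
        (op for op, words in OPERATION_KEYWORDS.items()
         if any(w in text for w in words)),
        "describe",  # generic summary as a fallback
    )
    # Stand-in for NER: match known column names mentioned in the query.
    matched = [c for c in columns
               if re.search(rf"\b{re.escape(c.lower())}\b", text)]
    return {"operation": operation, "columns": matched}

print(parse_query("What is the average revenue?", ["revenue", "ad spend"]))
# {'operation': 'mean', 'columns': ['revenue']}
```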

2️⃣ Data Processing Layer – Executing Analysis

Once the query is understood, this layer processes the dataset to generate insights.

  • Approach: Use Pandas + NumPy for processing structured data.

  • Libraries: pandas, numpy, scipy, statsmodels

  • Steps:

    • Convert the structured query into a Pandas operation (see the sketch after this list).

    • Perform calculations (mean, median, correlation, regression, etc.).

    • Perform statistical tests and advanced analysis (chi-squared, t-test, clustering, etc.).

    • Optimize for large datasets using Dask or Vaex if needed.
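Continuing the sketch from the NLP layer, the structured operation can be dispatched to Pandas. The `execute_operation` function and the toy data below are illustrative assumptions, not the final design:

```python
import pandas as pd

def execute_operation(df: pd.DataFrame, op: dict):
    """Dispatch a structured operation (from the NLP layer) to Pandas."""
    cols = op["columns"]
    if op["operation"] == "mean":
        return df[cols[0]].mean()
    if op["operation"] == "median":
        return df[cols[0]].median()
    if op["operation"] == "correlation":
        # Pearson correlation between the two referenced columns.
        return df[cols[0]].corr(df[cols[1]])
    if op["operation"] == "max":
        # Return the whole row so context (e.g., the month) is preserved.
        return df.loc[df[cols[0]].idxmax()]
    return df.describe()  # generic fallback summary

sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar"],
                      "revenue": [30000, 35000, 50000]})
print(execute_operation(sales, {"operation": "max", "columns": ["revenue"]}))
# -> the March row: month Mar, revenue 50000
```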

3️⃣ Visualization Engine – Generating Graphs

This layer generates visualizations based on queries.

  • Approach: Matplotlib & Seaborn for rendering charts dynamically.

  • Libraries: matplotlib, seaborn, plotly

  • Steps:

    • Identify the best visualization type for the query (scatter plot for correlation, bar chart for trends, line chart for growth).

    • Generate and serve the plot dynamically.

📌 Example: If the query is "Show me revenue trends", generate a line chart of revenue over time.
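A minimal sketch of that selection logic, reusing the hypothetical structured-operation format from the earlier sketches (the chart-type choices and `plot_for_operation` are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_for_operation(df: pd.DataFrame, op: dict, path: str = "chart.png") -> str:
    """Choose a chart type from the operation and save the rendered figure."""
    x, y = op["columns"][:2]  # assume two referenced columns for this sketch
    if op["operation"] == "correlation":
        sns.regplot(data=df, x=x, y=y)   # scatter plot with fitted trendline
    elif op["operation"] == "trend":
        sns.lineplot(data=df, x=x, y=y)  # line chart for growth over time
    else:
        sns.barplot(data=df, x=x, y=y)   # bar chart as the default comparison
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
    return path

sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar"],
                      "revenue": [30000, 35000, 50000]})
plot_for_operation(sales, {"operation": "trend", "columns": ["month", "revenue"]})
```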

4️⃣ Context & Memory – Chat Summarization & Recall

This layer remembers previous interactions and can generate chat summaries or recall user queries.

  • Approach: Use Vector Databases + LangChain Memory

  • Libraries: ChromaDB, FAISS, Pinecone, LangChain

  • Steps:

    • Store chat history and previous queries in a vector database.

    • Retrieve past interactions when generating summaries.

    • Summarize the conversation using an LLM-powered summarization model.

📌 Example: If the user asks, "Summarize our chat", use an LLM to generate a structured summary from stored interactions.
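As a rough sketch of the storage-and-recall half of this layer (before any LangChain wiring), the snippet below embeds chat turns with sentence-transformers and indexes them in FAISS; the model name and the `remember`/`recall` helpers are assumptions for illustration. The retrieved turns would then be handed to an LLM summarizer:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Encode chat turns into vectors so past interactions can be retrieved later.
model = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.IndexFlatL2(384)  # MiniLM embeddings are 384-dimensional
history: list[str] = []

def remember(turn: str) -> None:
    """Store one chat turn in the vector index."""
    history.append(turn)
    index.add(model.encode([turn]).astype(np.float32))

def recall(query: str, k: int = 3) -> list[str]:
    """Retrieve the k most similar past turns as summarization context."""
    _, ids = index.search(model.encode([query]).astype(np.float32), k)
    return [history[i] for i in ids[0] if i != -1]

remember("User asked for average revenue; answer was $35,000.")
remember("User asked about correlation of ad spend and revenue; r = 0.85.")
print(recall("Summarize our chat"))
```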

5️⃣ Deployment – Making the Chatbot Interactive

Build a backend API to handle requests and a frontend UI for users.

  • Backend: FastAPI or Flask (to process user queries & execute operations)

  • Frontend: Streamlit, Gradio, or a React-based UI for interactivity

  • Database: SQLite/PostgreSQL (for storing chat history & datasets)

📌 Example Deployment Flow:

  1. User uploads a dataset.

  2. Query is sent to NLP Engine → Converted into structured operation.

  3. Data Processing Engine executes the request.

  4. Visualization Engine generates graphs.

  5. Memory Engine stores context for future interactions.
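A minimal FastAPI sketch of steps 1–3 of this flow, assuming the `parse_query` and `execute_operation` functions from the earlier sketches are importable (XLSX uploads would use `pd.read_excel` instead of `pd.read_csv`):

```python
import io

import pandas as pd
from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel

app = FastAPI()
datasets: dict[str, pd.DataFrame] = {}  # in-memory store; a real app would persist this

class Question(BaseModel):
    dataset: str
    text: str

@app.post("/upload")
async def upload(file: UploadFile = File(...)):
    # Step 1: user uploads a dataset, kept keyed by filename.
    datasets[file.filename] = pd.read_csv(io.BytesIO(await file.read()))
    return {"dataset": file.filename, "rows": len(datasets[file.filename])}

@app.post("/ask")
def ask(q: Question):
    # Steps 2-3: parse the question, then execute it against the dataset.
    # parse_query / execute_operation come from the earlier sketches.
    df = datasets[q.dataset]
    op = parse_query(q.text, list(df.columns))
    return {"operation": op, "answer": str(execute_operation(df, op))}
```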

Notes:

  • Building the UI (Part 5) is something I might explore later, but right now, I’m focused on Parts 1-3—developing the neural network, testing different models, and figuring out the best way to turn user questions into Pandas operations.

  • Instead of using a pre-built code generator like CodeGen, which just translates text into Python code, I want my AI to actually understand the data. The goal isn’t to just spit out code—it’s to provide real insights, remember past queries, and suggest the right visualizations. Basically, I’m building something that thinks through the problem, not just writes a script.

  • The project was inspired by Doraemon, my favorite childhood character. He always had these fun, magical gadgets in his pocket to help Nobita whenever he was in trouble. Growing up, I wanted to have my own Doraemon and that inspired this project — creating my own mini AI to help with data analysis, hence the name: Doraemon + data = Dataemon.

Project currently under construction…
