Cesar Salcedo

Smol-Data: Your Autonomous Data Scientist

Jun 24, 2023

A project by Michael Equi, Yi Ding, and myself; presented at the Agents Hackathon organized by AGI House.
This project presents an AI data scientist that can perform generic data analysis tasks automatically. After loading a dataset, one can send a series of prompts for data manipulation, including data visualization. In this article we show a few examples of tasks performed by our assistant.
The assistant resolves each task in a multi-step, hierarchical process. The agent first identifies the task, then reads the dataset (its columns and a few rows) and its metadata, if available. It then generates and executes Python code to accomplish the task. If the code raises an error, the agent uses the error message to correct its output.
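The generate, execute, and correct loop described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `run_task`, `fake_generator`, and the `result` variable convention are hypothetical names introduced here, and `fake_generator` is a stand-in for the language model call so the example runs offline.

```python
import traceback
from typing import Callable, Optional

# Hypothetical signature: (task, dataset context, last error or None) -> Python source
CodeGenerator = Callable[[str, str, Optional[str]], str]


def run_task(task: str, context: str, generate_code: CodeGenerator,
             max_attempts: int = 3) -> dict:
    """Generate code for a task, execute it, and retry with the error on failure."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(task, context, error)
        namespace: dict = {}
        try:
            # Run the generated code in a fresh namespace
            exec(code, namespace)
            return {"ok": True, "attempts": attempt,
                    "result": namespace.get("result")}
        except Exception:
            # Keep the traceback so the next generation can correct itself
            error = traceback.format_exc()
    return {"ok": False, "attempts": max_attempts, "error": error}


def fake_generator(task: str, context: str, error: Optional[str]) -> str:
    """Stand-in for a model call: buggy first, fixed once it sees an error."""
    if error is None:
        return "result = total / count"  # fails with NameError
    return "total, count = 10, 4\nresult = total / count"


outcome = run_task("compute a mean", "columns: total, count", fake_generator)
print(outcome)
```

In a real agent, `generate_code` would build a prompt from the task, the dataset preview, and the previous traceback. Executing model-generated code with `exec` is unsafe outside a sandbox, so this sketch is only suitable for trusted, local experiments.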
As an illustrative example, we use the Adult Income Dataset from Kaggle to test the assistant. We start by downloading the dataset locally; a sample of its contents is shown in the image below.
Adult income dataset.
Once the program is running, we can ask the assistant to perform different tasks. For example, we can prompt it to make a bar chart of the average years of education by workclass. The result is shown below.
Average years of education by workclass.
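For reference, the kind of code the assistant might generate for this prompt could look like the sketch below. It is not the assistant's actual output. The column names `workclass` and `educational-num` are assumptions that may differ between versions of the dataset, and the small inline table is made-up toy data standing in for the real file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows for illustration only; the real data would come from the
# downloaded CSV via pd.read_csv. Column names are assumptions.
df = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", "Self-emp-inc", "State-gov"],
    "educational-num": [9, 13, 13, 14, 10],
})

# Average years of education per workclass, sorted for a readable chart
means = df.groupby("workclass")["educational-num"].mean().sort_values()

fig, ax = plt.subplots()
means.plot.bar(ax=ax)
ax.set_xlabel("workclass")
ax.set_ylabel("Average years of education")
fig.tight_layout()
fig.savefig("education_by_workclass.png")
```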
Another example is asking for the distribution of workclass vs. level of education. The result is shown below.
Distribution of workclass vs. level of education.
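A distribution like this one can be computed as a cross-tabulation of the two categorical columns. The sketch below shows one way to do it with pandas. The column names `workclass` and `education` are assumed, and the rows are toy data for illustration, not values from the real dataset.

```python
import pandas as pd

# Toy rows for illustration only; column names are assumptions
df = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", "Private", "State-gov"],
    "education": ["HS-grad", "Bachelors", "Bachelors", "HS-grad", "Masters"],
})

# Count how many rows fall into each (workclass, education) pair
counts = pd.crosstab(df["workclass"], df["education"])
print(counts)
```

Calling `counts.plot.bar(stacked=True)` on the resulting table is one option for turning it into a stacked bar chart, with one bar per workclass and one segment per education level.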
