With a few clicks, you can complete complex data analysis; the MIT team introduc

A new tool allows database users to perform complex statistical analysis on tabular data more easily without needing to understand the underlying mechanisms.

GenSQL, a generative AI system for databases, can help users make predictions, detect anomalies, impute missing values, correct errors, or generate synthetic data with just a few clicks.

For example, if the system is used to analyze the medical data of a patient whose blood pressure has always been high, it can catch a reading that is low for that particular patient even though it would otherwise fall within the normal range.
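The blood pressure example can be made concrete with a small sketch. This is not how GenSQL works internally; it is a plain z-score check, and the population range, the threshold, and the function names are all assumptions chosen for illustration (the numbers are not medical guidance).

```python
from statistics import mean, stdev

# Hypothetical population-wide "normal" range for systolic pressure (mmHg).
# Illustrative assumption only.
POPULATION_NORMAL = (90.0, 120.0)


def in_population_range(reading, normal=POPULATION_NORMAL):
    """True if the reading looks normal when judged against everyone."""
    low, high = normal
    return low <= reading <= high


def is_unusual_for_patient(history, reading, threshold=3.0):
    """True if the reading is far from this patient's own baseline.

    Uses a simple z-score against the patient's past readings; the
    threshold of 3 standard deviations is an arbitrary choice.
    """
    baseline = mean(history)
    spread = stdev(history)
    z = (reading - baseline) / spread
    return abs(z) > threshold


# A patient whose readings are consistently high.
history = [150.0, 148.0, 152.0, 151.0, 149.0, 150.0]
reading = 115.0

normal_for_population = in_population_range(reading)
unusual_for_patient = is_unusual_for_patient(history, reading)
print(normal_for_population, unusual_for_patient)
```

A check against the population range alone would pass this reading; only the comparison with the patient's own history flags it.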

GenSQL automatically integrates a tabular dataset with a generative probabilistic AI model, which can account for uncertainty and adjust its decisions based on new data.

In addition, GenSQL can be used to generate and analyze synthetic data that mimics the real data in a database. This is particularly useful when sensitive data (such as patient health records) cannot be shared or when real data is sparse.

The new tool is built on SQL, a programming language for creating and manipulating databases that was introduced in the late 1970s and is used by millions of developers worldwide.

"Historically, SQL taught the business world what computers can do. They didn't have to write custom programs; they could just ask questions of the database in a high-level language. We believe that as we shift from merely querying data to asking questions of models and data, we will need an analogous language that teaches people the coherent questions they can ask a computer that has a probabilistic model of the data," said Vikash Mansinghka, head of the Probabilistic Computing Project in the MIT Department of Brain and Cognitive Sciences and senior author of the paper introducing GenSQL.

When the researchers compared GenSQL with popular AI-based approaches to data analysis, they found that it was not only faster but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are interpretable, so users can read and edit them.

The paper's lead author, Mathieu Huot, a researcher in the Department of Brain and Cognitive Sciences and a member of the Probabilistic Computing Project, added: "Looking at the data and trying to find meaningful patterns using only some simple statistical rules may miss important interactions. What you really want to do is capture the correlations and dependencies between variables in a model, which can be quite complex. With GenSQL, we want to enable a large number of users to query their data and models without having to understand all the details."

The paper's other co-authors are MIT graduate students Matin Ghavami and Alexander Lew; researcher Cameron Freer; Ulrich Schaechtle and Zane Shelby of Digital Garage; Martin Rinard, a professor in the Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.

Integrating Models with Databases

SQL (Structured Query Language) is a programming language for storing and manipulating information in databases. With SQL, people can ask questions about data using keywords, for example to summarize, filter, or group database records.
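As a minimal illustration of those three operations, the sketch below runs one ordinary SQL query through Python's built-in `sqlite3` module. The `developers` table and its rows are invented for this example; this is standard SQL, not GenSQL syntax.

```python
import sqlite3

# In-memory database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE developers (name TEXT, city TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO developers VALUES (?, ?, ?)",
    [
        ("Ana", "Seattle", 120000.0),
        ("Ben", "Seattle", 100000.0),
        ("Chloe", "Boston", 90000.0),
        ("Dev", "Boston", 110000.0),
        ("Eli", "Austin", 60000.0),
    ],
)

# WHERE filters rows, GROUP BY groups them, COUNT/AVG summarize each group.
rows = conn.execute(
    """
    SELECT city, COUNT(*) AS n, AVG(salary) AS avg_salary
    FROM developers
    WHERE salary >= 80000
    GROUP BY city
    ORDER BY city
    """
).fetchall()

for city, n, avg_salary in rows:
    print(city, n, avg_salary)
```

The query answers a question about the records themselves: how many developers in each city earn at least a given amount, and what they earn on average.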

However, querying a model can provide deeper insights, because a model can capture what the data implies for a specific individual. For example, a female developer who wonders whether she is underpaid is likely more interested in what the salary data means for her personally than in trends across the database records.

The researchers noted that SQL does not provide an effective way to incorporate probabilistic AI models, while approaches that use probabilistic models to make inferences do not support complex database queries.

They built GenSQL to fill this gap, allowing users to query both a dataset and a probabilistic model with a straightforward yet powerful formal programming language.

Users upload their data and a probabilistic model to GenSQL, and the system integrates the two automatically. They can then run queries whose answers also draw on the probabilistic model running in the background. This not only allows for more complex queries but also provides more accurate answers.

For example, a query in GenSQL might be: "What is the likelihood that developers in Seattle are familiar with the Rust programming language?" If one were to only look at the correlations between columns in the database, subtle dependencies might be overlooked. However, integrating probabilistic models can capture more complex interactions.
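To show what such a question looks like as a computation, here is a deliberately naive sketch that estimates the probability from raw counts in a table. It is the kind of column-level lookup the paragraph above warns can miss subtle dependencies, not GenSQL's method; the rows, the column names, and the `conditional_probability` helper are hypothetical.

```python
def conditional_probability(rows, given, target, alpha=1.0):
    """Estimate P(target | given) from row counts.

    `given` and `target` map column names to required values.
    `alpha` is add-alpha (Laplace) smoothing over the two outcomes
    "target holds" / "target does not hold", so that an empty or tiny
    group does not yield a hard 0 or 1.
    """
    matching = [r for r in rows if all(r[k] == v for k, v in given.items())]
    hits = sum(1 for r in matching if all(r[k] == v for k, v in target.items()))
    return (hits + alpha) / (len(matching) + 2 * alpha)


# Made-up survey rows.
developers = [
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": False},
    {"city": "Boston", "knows_rust": False},
    {"city": "Boston", "knows_rust": True},
]

p = conditional_probability(
    developers, given={"city": "Seattle"}, target={"knows_rust": True}
)
print(round(p, 3))
```

A probabilistic model would stand in for this counting function and could also take other columns, such as experience or employer, into account when answering the same question.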

Moreover, the probabilistic models used by GenSQL are auditable, allowing people to see the data used by the models for decision-making. Additionally, these models provide a measure of calibrated uncertainty for each answer.
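One simple way to see why an answer should carry its own uncertainty is a Beta-Binomial estimate of a success rate. The sketch below is an assumption-laden illustration (uniform prior, invented counts, hypothetical `beta_posterior` helper), not the model GenSQL uses: two groups have the same observed rate, but the smaller group yields a much wider posterior.

```python
from math import sqrt


def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior mean and standard deviation of a success rate.

    Assumes a Beta(prior_a, prior_b) prior and binomial observations,
    which gives a Beta(a, b) posterior in closed form.
    """
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


# Same observed rate (70%), very different amounts of evidence.
large_mean, large_sd = beta_posterior(successes=700, failures=300)
small_mean, small_sd = beta_posterior(successes=7, failures=3)

print(f"well-represented group:  {large_mean:.3f} +/- {large_sd:.3f}")
print(f"underrepresented group:  {small_mean:.3f} +/- {small_sd:.3f}")
```

Reporting the spread alongside the estimate makes it visible when an answer rests on only a handful of rows.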

For instance, with this calibrated uncertainty, if a user asks the model to predict the outcome of cancer treatment for a minority group (underrepresented in the dataset), GenSQL will inform the user of its level of uncertainty, rather than overconfidently recommending an incorrect treatment method.

Faster and More Accurate Results

To evaluate GenSQL, the researchers compared their system with popular baseline methods that use neural networks. GenSQL was between 1.7 and 6.8 times faster than these approaches, executing most queries within a few milliseconds while providing more accurate results.

They also applied GenSQL in two case studies: in one, the system identified mislabeled clinical trial data; in the other, it generated accurate synthetic data that captured complex relationships in genomics.

Next, the researchers hope to apply GenSQL more broadly to large-scale modeling of human populations. With GenSQL, they can generate synthetic data to draw inferences about things like health and salary while controlling what information is used in the analysis.

They also want to make GenSQL easier to use and more powerful by adding new optimizations and automation. In the long run, the researchers hope to let users ask questions of GenSQL in natural language. Their goal is to eventually develop a ChatGPT-like AI expert that users can talk to about any database and that grounds its answers in GenSQL queries.

The research was funded in part by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Foundation.