Last Updated: 2025-12-08

This tutorial, Utilizing SQL-based AI functions with Starburst, assumes you are conceptually familiar with LLMs, vector embeddings, and RAG. The following video is a good primer for this tutorial.
In this tutorial, you will learn by example how Starburst integrates with popular AI models to leverage the power of LLMs with SQL. You will learn a variety of specialized tasks and implement the RAG pattern. You will use Starburst Galaxy as the environment for this exploration.
Once you've completed this tutorial, you will be able to:
As an option, you can watch a video version of this tutorial.
Log into your Starburst Galaxy environment with the following configuration.
Navigate to Query > Query editor from the left nav, open a new editor tab, name it AI functions, and choose the aws-us-east-1-free cluster from the first pulldown menu.

Run the following SQL to create a new schema and set the editor to use it.
CREATE SCHEMA tmp_cat.ai_functions;
USE tmp_cat.ai_functions;
Notice the final pulldown shows that the ai_functions schema is selected, which allows us to reference tables without the fully qualified catalog_name.schema_name prefix.

Select AI models from the left nav and then click on Connect to an external model.

This presents a simple input form.

Select Open AI & Open AI API Compatible for Model integration and Embedding for Model type. For Model ID enter a valid model such as text-embedding-3-small.

Set the Model alias to ai_functions_embedding and optionally add a Description.

Leave the OpenAI endpoint to its default value and enter your OpenAI API key value.

Press the Test connection button at the lower right of the configuration screen to see the following confirmation message.

Press the Create button to then see the newly created model in the AI models list.

Using the Connect to an external model button again, then select Open AI & Open AI API Compatible for Model integration and LLM for Model type. For Model ID enter a valid model such as gpt-5-nano-2025-08-07. Enter ai_functions_gpt5 for the Model alias and paste in your OpenAI API key value.

Press the Test connection button and after the confirmation message is presented, press the Create button to then see the newly created model in the AI models list.

As you can see from Starburst | AI functions, the functions are broken into three categories.
prompt() does what it sounds like: you send in an input prompt and get a response back from an LLM (think ChatGPT) with the answer. Let's ask a very important question.
SELECT starburst.ai.prompt(
'What underlying tech does Starburst use?',
'ai_functions_gpt5');

Since Trino / Starburst users are familiar with SQL and may not see themselves as 'prompt engineers', a library of built-in, pre-tested functions that work with any configured LLM was created. Let's explore them.
Verify a very obvious positive statement.
SELECT starburst.ai.analyze_sentiment(
'I love Starburst',
'ai_functions_gpt5');

Let's be a bit more wordy to make sure it is doing a good job.
SELECT starburst.ai.analyze_sentiment(
'Trino/Presto/Athena whatever the branding is, is the most impressive data processing technology I have seen in my career. The only downside is that SQL is its only interface, but honestly I love it so much. I have been lucky enough to meet some of engineers that worked on it and they are some of the best engineers I have met in my career. If I was going to a Greenfield project, I would 100% want this as the main part of our stack.',
'ai_functions_gpt5');

If this were sent to me in an email, would my spam detector flag it?
SELECT starburst.ai.classify(
'Trino/Presto/Athena whatever the branding is, is the most impressive data processing technology I have seen in my career. The only downside is that SQL is its only interface, but honestly I love it so much. I have been lucky enough to meet some of engineers that worked on it and they are some of the best engineers I have met in my career. If I was going to a Greenfield project, I would 100% want this as the main part of our stack.',
ARRAY['spam', 'not spam'],
'ai_functions_gpt5');

You can try other classification labels and vary the number of options. What does the LLM think about the verbosity of this writer?
SELECT starburst.ai.classify(
'Trino/Presto/Athena whatever the branding is, is the most impressive data processing technology I have seen in my career. The only downside is that SQL is its only interface, but honestly I love it so much. I have been lucky enough to meet some of engineers that worked on it and they are some of the best engineers I have met in my career. If I was going to a Greenfield project, I would 100% want this as the main part of our stack.',
ARRAY['cursory', 'average', 'exhaustive'],
'ai_functions_gpt5');

Imagine a column with some free-form text that you might want to clean up a bit.
SELECT starburst.ai.fix_grammar(
'Trino/Presto/Athena whatever the branding is, is the most impressive data processing technology I have seen in my career. The only downside is that SQL is its only interface, but honestly I love it so much. I have been lucky enough to meet some of engineers that worked on it and they are some of the best engineers I have met in my career. If I was going to a Greenfield project, I would 100% want this as the main part of our stack.',
'ai_functions_gpt5');

Convert the text to Chinese as used in Taiwan.
SELECT starburst.ai.translate(
'Trino/Presto/Athena whatever the branding is, is the most impressive data processing technology I have seen in my career. The only downside is that SQL is its only interface, but honestly I love it so much. I have been lucky enough to meet some of engineers that worked on it and they are some of the best engineers I have met in my career. If I was going to a Greenfield project, I would 100% want this as the main part of our stack.',
'zh-TW', 'ai_functions_gpt5');

You could even chain functions together to do something like correcting the grammar from the source text and then converting that to Spanish.
SELECT starburst.ai.translate(starburst.ai.fix_grammar(
'Trino/Presto/Athena whatever the branding is, is the most impressive data processing technology I have seen in my career. The only downside is that SQL is its only interface, but honestly I love it so much. I have been lucky enough to meet some of engineers that worked on it and they are some of the best engineers I have met in my career. If I was going to a Greenfield project, I would 100% want this as the main part of our stack.',
'ai_functions_gpt5'), 'es', 'ai_functions_gpt5');

As this would likely call the LLM twice, you would almost certainly get faster performance at lower cost by customizing the prompt directly. Here's one simple 'prompt engineering' approach.
SELECT starburst.ai.prompt(concat(
'Correct the grammar and spelling in the following text, and then translate the corrected text into Spanish. Here is the text: ',
'Trino/Presto/Athena whatever the branding is, is the most impressive data processing technology I have seen in my career. The only downside is that SQL is its only interface, but honestly I love it so much. I have been lucky enough to meet some of engineers that worked on it and they are some of the best engineers I have met in my career. If I was going to a Greenfield project, I would 100% want this as the main part of our stack.'),
'ai_functions_gpt5');

Starburst's built-in access controls already include the ability to define column masks, but this is only useful on structured data. The AI functions come to the rescue when you need to apply masks to data in free-form text.
Let's try it with some text containing an address and phone number.
SELECT starburst.ai.mask(
'Contact me at 555-1234 or visit us at 123 Main St.',
ARRAY['phone', 'address'],
'ai_functions_gpt5');

Imagine if you will that the cartoon character Fred Flintstone was keeping a personal diary.

If we asked a public LLM about Fred's interests or experiences that are only in this diary, we would not get an accurate answer. For example, I happen to know that Fred, like me, is an active Postcrossing member and sends/receives postcards all over the world.
Let's see what happens if we ask an LLM if he likes "snail mail".
SELECT starburst.ai.prompt(
'Does Fred Flintstone like to send and receive mail?',
'ai_functions_gpt5');

It doesn't seem that the LLM knows about Fred's personal thoughts on the matter.
Fred uses his stone-aged mobile device (hey, he's a fictional character so why not give him cellular service and internet access!) to record his journal entries. The good news for this tutorial is that we can just pick up at the point where there is an Apache Iceberg table that holds all the entries (was there an ice-age in the middle of the stone-age?).
-- create a table for diary entries
CREATE TABLE diary_entries (
owner varchar,
entry_date timestamp,
entry_text varchar
);
SHOW CREATE TABLE diary_entries;

Let's populate the table with some sample journal entries that we can use for testing.
-- add initial entries
INSERT INTO diary_entries (owner, entry_date, entry_text)
VALUES
('Fred Flintstone', date_add('day', -99, current_timestamp), 'It is a new year and today I decided to start keeping a diary. Some may imagine that life is pretty boring here in Bedrock, but my family and friends make sure life is NEVER boring! As today is a Monday, I''m off to work at the quarry. I work there everyday until 5pm. Despite it being physical work, I do enjoy it. I was happy to play with Dino after work.'),
('Fred Flintstone', date_add('day', -98, current_timestamp), 'Ok, keeping a diary is something I''ll need to get used to. It seems I missed yesterday, but I''m still here. It is mid-week and I''m finally jotting down my notes this Wednesday evening. Barney and I just got back from our Loyal Order of Water Buffaloes meeting. Barney is running for Vice President this year and I fully support him.'),
('Fred Flintstone', date_add('day', -97, current_timestamp), 'Ahh... the weekend is finally here. I''ve been so busy (and well... I guess I simply forgot as well) to enter my thoughts into my daily journal. Off to the drive in movie tonight.'),
('Fred Flintstone', date_add('day', -96, current_timestamp), 'The weekends always go so fast. Family & friends continue to be the thing I spend my time on. I''m thinking I need a new hobby, too.'),
('Fred Flintstone', date_add('day', -95, current_timestamp), 'I have decided to try sending & receiving postcards as a new hobby. I found this website called Postcrossings that looks pretty interesting. The site says that over 800,000 people enjoy this hobby. Postcrossers, that''s what these hobbyists are called, have sent over 78 million postcards so far.'),
('Fred Flintstone', date_add('day', -94, current_timestamp), 'I was able to send 3 postcards today to people the system provided addresses to me for. This really is the hobby for me.'),
('Fred Flintstone', date_add('day', -93, current_timestamp), 'Fun Thursday night getting 2 more postcards in the mail. I am loving this hobby even though Wilma thinks I''m nuts.'),
('Fred Flintstone', date_add('day', -92, current_timestamp), 'I went hiking over the weekend for one of my existing hobbies; waterfall viewing. I saw two of the largest ones in Bedrock; Big Falls and Prettybig Falls. I was even able to buy a few postcards at the gift shop.'),
('Fred Flintstone', date_add('day', -91, current_timestamp), 'I had so much fun looking at waterfalls that I visited the artificial one we have at the mall; Shopping Shoal. Silly name, but I did find out that shoals were usually rocks in rough water that paddlers enjoy traversing through.'),
('Fred Flintstone', date_add('day', -90, current_timestamp), 'This was a second journal entry for Jan 15. It shows some extra spacing that can be cleaned up.'),
('Fred Flintstone', date_add('day', -89, current_timestamp), 'Woke up, fell out of bed. Dragged a comb across my head. Found my way downstairs and drank a cup and looking up, I noticed I was late. Found my coat and grabbed my hat. Made the bus in seconds flat. Found my way upstairs and had a smoke and somebody spoke and I went into a dream.'),
('Fred Flintstone', date_add('day', -88, current_timestamp), 'Woke up, fell out of bed. Dragged a comb across my head. Found my way downstairs and drank a cup and looking up, I noticed I was late. Found my coat and grabbed my hat. Made the bus in seconds flat. Found my way upstairs and had a smoke and somebody spoke and I went into a dream.'),
('Fred Flintstone', date_add('day', -87, current_timestamp), 'These last days have been a complete blur. So busy just life that I haven''t been able to focus on my new hobby of mailing postcards.');
SELECT * FROM diary_entries;

A large component of many comprehensive RAG applications is the ETL pipeline that prepares data to be retrieved that can be then used to augment an LLM prompt. As unstructured docs in ai (the wild west) points out, these pipelines can get rather complicated, rather quickly, depending on the input data.
Fortunately, our simple use case didn't need to parse any unstructured data and the values in the entry_text column are appropriately chunked for the RAG workflow we are going to build. We just need to create vector embeddings of these journal entries and persist them somewhere.
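Our journal entries are already bite-sized, but longer unstructured documents would need an explicit chunking step before embedding. The following Python sketch illustrates the idea with a naive fixed-size chunker with overlap; the sizes are arbitrary values for the example, not recommendations.

```python
# Naive illustration of the "chunking" step a fuller RAG pipeline would
# need: split a long document into overlapping, roughly fixed-size pieces
# so each piece can be embedded separately.
def chunk_text(text, chunk_size=100, overlap=20):
    """Return chunks of `text`, each up to `chunk_size` characters,
    with `overlap` characters repeated between neighbors for context."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = 'x' * 250  # stand-in for a long unstructured document
chunks = chunk_text(doc)
print(len(chunks), [len(c) for c in chunks])  # 4 chunks of 100, 100, 90, 10 chars
```

In a real pipeline the chunk boundaries would usually respect sentences or paragraphs rather than raw character counts, but the overlap idea carries over.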
If you are confused about any of the RAG terminology used above, please watch the following video.
We need a place to store the vector embeddings. You could always use a purpose-built vector database, but for this tutorial we will store them "lakeside". This means as a column in our Iceberg table. Let's add a column for them to be persisted.
-- prep to store vector embeddings
-- 'lakeside' right in the Iceberg table
ALTER TABLE diary_entries
ADD COLUMN entry_text_embeddings array(double);
Now we can leverage generate_embedding() to create the vector embeddings and then store them in the new column.
-- create and persist the embeddings
UPDATE diary_entries
SET entry_text_embeddings =
starburst.ai.generate_embedding(
entry_text, 'ai_functions_embedding'
)
WHERE entry_text_embeddings IS NULL;
SELECT * FROM diary_entries;

Taking a deeper look into one of the entry_text_embeddings column's values, you'll notice it is just a very long array of numbers ranging from -1 to 1. Again, the video at the top of this step gives a very high-level understanding of what these values represent.

In a nutshell, this array of values from -1 to 1 is a multi-dimensional mathematical representation of the contents in the corresponding entry_text column for each row. These can be used in our RAG workflow in the next step when trying to find chunks of text (journal entries in our case) that are similar to a request that will be asked about Fred Flintstone.
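To build intuition for how these arrays get compared in the next step, here is a small Python sketch of the cosine similarity calculation. The three-dimensional vectors are made-up toy values for illustration only; real embeddings from a model like text-embedding-3-small have 1,536 dimensions.

```python
# Minimal sketch of what a cosine similarity comparison computes,
# using only the Python standard library and toy 3-D "embeddings".
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes;
    values near 1 mean similar direction, near 0 or negative mean dissimilar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

postcards = [0.12, -0.45, 0.88]   # toy embedding for a postcard journal entry
mail_query = [0.10, -0.40, 0.90]  # toy embedding for the "mail" question
quarry = [-0.70, 0.60, 0.05]      # toy embedding for an unrelated entry

print(cosine_similarity(postcards, mail_query))  # close to 1.0: similar
print(cosine_similarity(quarry, mail_query))     # negative: dissimilar
```

This is conceptually what Trino's built-in cosine_similarity() function does over the much longer arrays stored in entry_text_embeddings.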
It is worth noticing two cool things we just did.
Now that we have our data prepared, we can focus on the SQL-based RAG workflow we are interested in.
RAG starts with "R" so let's see if we can use Starburst's AI functions to RETRIEVE journal entries that align with the earlier use case description.
-- Retrieve the top 5 most relevant journal
-- entries based on semantic similarity
WITH vector_search AS (
SELECT
owner, entry_date, entry_text,
cosine_similarity(
starburst.ai.generate_embedding(
'Does Fred Flintstone like to send and receive mail?',
'ai_functions_embedding'
),
entry_text_embeddings
) AS similarity_score
FROM diary_entries
ORDER BY similarity_score DESC
LIMIT 5
)
SELECT * FROM vector_search;
Notice that the entry_text values below with the highest similarity_score values reference postcards. Our question didn't mention postcards specifically, but that's the power of similarity searching based on embeddings rather than a simple LIKE operator. The cosine_similarity() function already existed in Trino.

Let's get ready for the AUGMENTATION part of RAG by preparing the journal entries as a JSON collection that the LLM can more easily consume. We are just adding a second Common Table Expression (CTE) that leverages some more existing Trino functions.
-- Retrieve the top 5 most relevant journal
-- entries based on semantic similarity
WITH vector_search AS (
SELECT
owner, entry_date, entry_text,
cosine_similarity(
starburst.ai.generate_embedding(
'Does Fred Flintstone like to send and receive mail?',
'ai_functions_embedding'
),
entry_text_embeddings
) AS similarity_score
FROM diary_entries
ORDER BY similarity_score DESC
LIMIT 5
),
-- Augment the results by converting
-- them into a JSON object
json_results AS (
SELECT CAST(map_agg(to_iso8601(entry_date), json_object(
KEY 'journal entry date' VALUE entry_date,
KEY 'journal entry text' VALUE entry_text)) AS JSON) AS json_data
FROM
vector_search
)
SELECT * FROM json_results;

That looks like a bunch of gobbledygook, but if you format it in your favorite editor it looks a bit more understandable.
{
"2025-08-28T22:42:53.688000": "{\"journal entry date\":\"2025-08-28 22:42:53.688000\",\"journal entry text\":\"It is a new year and today I decided to start keeping a diary. Some may imagine that life is pretty boring here in Bedrock, but my family and friends make sure life is NEVER boring! As today is a Monday, I'm off to work at the quarry. I work there everyday until 5pm. Despite it being physical work, I do enjoy it. I was happy to play with Dino after work.\"}",
"2025-09-01T22:42:53.688000": "{\"journal entry date\":\"2025-09-01 22:42:53.688000\",\"journal entry text\":\"I have decided to try sending & receiving postcards as a new hobby. I found this website called Postcrossings that looks pretty interesting. The site says that over 800,000 people enjoy this hobby. Postcrossers, that's what these hobbyists are called, have sent over 78 million postcards so far.\"}",
"2025-09-02T22:42:53.688000": "{\"journal entry date\":\"2025-09-02 22:42:53.688000\",\"journal entry text\":\"I was able to send 3 postcards today to people the system provided addresses to me for. This really is the hobby for me.\"}",
"2025-09-03T22:42:53.688000": "{\"journal entry date\":\"2025-09-03 22:42:53.688000\",\"journal entry text\":\"Fun Thursday night getting 2 more postcards in the mail. I am loving this hobby even though Wilma thinks I'm nuts.\"}",
"2025-09-09T22:42:53.688000": "{\"journal entry date\":\"2025-09-09 22:42:53.688000\",\"journal entry text\":\"These last days have been a complete blur. So busy just life that I haven't been able to focus on my new hobby of mailing postcards.\"}"
}
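If you'd rather script it than paste into an editor, the nested structure can be decoded with Python's standard json module. Note that each map value is itself a JSON-encoded string, so it needs a second decode; the shortened sample string below is a stand-in for the real query output, not actual results.

```python
# Decode the doubly-encoded json_data value: the outer object maps ISO
# timestamps to strings that are themselves JSON-encoded journal entries.
import json

raw = ('{"2025-09-02T22:42:53.688000": '
       '"{\\"journal entry date\\":\\"2025-09-02 22:42:53.688000\\",'
       '\\"journal entry text\\":\\"I was able to send 3 postcards today.\\"}"}')

outer = json.loads(raw)
for entry_date, inner_json in outer.items():
    # each map value is a JSON-encoded string, so decode it a second time
    entry = json.loads(inner_json)
    print(entry_date, '->', entry['journal entry text'])
```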
Finally, you can use the json_results CTE generated in the last step to AUGMENT our prompt() function call and GENERATE a more appropriate answer.
-- Retrieve the top 5 most relevant journal
-- entries based on semantic similarity
WITH vector_search AS (
SELECT
owner, entry_date, entry_text,
cosine_similarity(
starburst.ai.generate_embedding(
'Does Fred Flintstone like to send and receive mail?',
'ai_functions_embedding'
),
entry_text_embeddings
) AS similarity_score
FROM diary_entries
ORDER BY similarity_score DESC
LIMIT 5
),
-- Augment the results by converting
-- them into a JSON object
json_results AS (
SELECT CAST(map_agg(to_iso8601(entry_date), json_object(
KEY 'journal entry date' VALUE entry_date,
KEY 'journal entry text' VALUE entry_text)) AS JSON) AS json_data
FROM
vector_search
)
-- Generate an augmented response
-- using the LLM
SELECT
starburst.ai.prompt(concat(
'Using the list of journal entries provided in JSON, ',
'Does Fred Flintstone like to send and receive mail?',
json_format(json_data)), 'ai_functions_gpt5')
FROM json_results;
Check out the results this time!

I thought he did!
Let's try asking some more questions with the simple SELECT statement leveraging prompt() as well as the more lengthy RAG workflow statement using CTEs.
How does Fred Flintstone feel about public transportation?


What hobbies or activities is Fred Flintstone interested in?


Better prompt engineering would merge these answers together in the RAG response.
How often does Fred Flintstone go to the movies?

It is kind of funny that the animated theme song does show him going to the movies!

It is fair to say that even with this very primitive RAG workflow (again, written using SQL!), the responses are better after being augmented with non-public information.
Run the following SQL to remove the schema and all the objects within it. This will also clear up the object store files that were created during this tutorial.
DROP SCHEMA tmp_cat.ai_functions CASCADE;
Click on Admin > Clusters from the left nav.

Find the aws-us-east-1-free cluster in the presented list and click on the vertical ellipsis menu icon on the far right of this row. In the menu that surfaces, click on Edit cluster.

In the Edit cluster configuration screen, click on the Catalogs pulldown menu.

Uncheck the box for the tmp_cat (us-east-1) item in the list that surfaces, then click on Save changes.

Navigate to Data > Catalogs from the left nav to see the list of Catalogs. Type tmp_cat in the search box to trim the list to only this catalog. Click on the catalog name link.

Click on the vertical ellipsis menu icon and choose Delete catalog.

Follow the on-screen instructions to enable, and click, the Yes, delete button.

Congratulations, you've completed the Utilizing SQL-based AI functions with Starburst tutorial!
Check out some of these codelabs...