Use the Power of LLMs with SQL!


In recent years, the costs associated with running large language models (LLMs) have decreased significantly, making advanced natural language processing techniques more accessible than ever before. The advent of small language models (SLMs) like gpt-4o-mini has brought another order-of-magnitude cost reduction for very capable language models.

This democratization of AI has reached a stage where integrating small language models (SLMs) like OpenAI’s gpt-4o-mini directly into a scalar SQL function has become practical from both cost and performance perspectives.

Therefore, we’re thrilled to announce the prompt() function, which is now available in Preview on MotherDuck. This new SQL function simplifies using LLMs and SLMs with text to generate, summarize, and extract structured data without the need for separate infrastructure.

Prompt Function Overview

The prompt() function currently supports OpenAI’s gpt-4o-mini and gpt-4o models to provide some flexibility in terms of cost-effectiveness and performance.

In our Preview release, we allow gpt-4o-mini-based prompts to be applied to all rows in a table, which unlocks use cases like bulk text summarization and structured data extraction. Furthermore, we allow single-row and constant inputs with gpt-4o to enable high-quality responses, for example in retrieval-augmented generation (RAG) use cases.

The optional named parameter (model:=) determines which model to use for inference, e.g.:

SELECT prompt('Write a poem about ducks', 'gpt-4o') AS response;

The prompt function also supports returning structured output, using the struct and struct_descr parameters. More on that later in the post.

Future updates may include additional models to broaden functionality and meet diverse user needs.

Use Case: Text Summarization

The prompt() function is a straightforward and intuitive scalar function.

Consider the following use case, where we want to get summaries of a text column from our hacker_news example dataset. The query below returns a 5-word summary for each text in the table.

SELECT by, text, timestamp, 
       prompt('summarize the comment in 5 words: ' || text) AS summary 
FROM hn.hacker_news
LIMIT 100

Note that we’re applying the prompt function to 100 rows, and the processing time is about 2.8s. We run up to 256 requests to the model provider concurrently, which significantly speeds up the processing compared to calling the model in an unparallelized Python loop.

The runtime scales linearly from here – expect 10k rows to take between 5-10 minutes of processing time and to consume ~10 compute units. This might seem slow relative to other SQL functions; however, looping over the same data in Python without concurrency would take about 5 hours instead.

Use Case: Unstructured to Structured Data Conversion

The prompt() function can also generate structured outputs, using the struct and struct_descr parameters. This allows users to specify a struct of typed return values for the output, facilitating the integration of LLM-generated data into analytical workflows. Adherence to the provided struct schema is guaranteed, as we leverage OpenAI’s structured model outputs, which use constrained decoding to restrict the model’s output to only valid tokens.

Below is an example that leverages this functionality to extract structured information, like topic, sentiment, and a list of mentioned technologies, from each comment in our sample of the hacker_news table. The result is stored as a STRUCT type, which makes it easy to access each individual field in SQL.

SELECT by, text, timestamp,
prompt(text,
  struct:={topic: 'VARCHAR', sentiment: 'INTEGER', technologies: 'VARCHAR[]'},
  struct_descr:={topic: 'topic of the comment, one word',
                 sentiment: 'sentiment of the post on a scale from 1 (neg) to 5 (pos)',
                 technologies: 'technologies mentioned in the comment'}) as my_output
FROM hn.hacker_news
LIMIT 100

In this query, the prompt function is applied to the text column from the dataset without contextualizing it in a prompt. Instead, it uses the struct and struct_descr parameters as follows:

  • struct:={...}: Specifies the structure of the output, which includes:
    • topic: A string (VARCHAR) representing the main topic of the comment.
    • sentiment: An integer indicating the sentiment of the comment on a scale from 1 (negative) to 5 (positive).
    • technologies: An array of strings listing any technologies mentioned in the comment.
  • struct_descr:={...}: While the model infers meaning from the struct field names above, struct_descr can optionally be used to provide more detailed field descriptions and steer the model in the right direction.

The final result includes the comment’s main topic, sentiment score (ranging from 1 to 5), and any mentioned technologies. The resulting column can subsequently be unfolded very easily into individual columns.

SELECT by, text, timestamp, my_output.* FROM my_struct_hn_table

For more advanced users who want full control over the JSON-Schema used to constrain the output, we provide the json_schema parameter, which results in JSON-typed rather than STRUCT-typed results.
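As a rough sketch of what that could look like (a hypothetical example – the exact JSON-Schema string accepted by json_schema may differ, so check the documentation for the authoritative syntax):

SELECT prompt(text,
  json_schema:='{"type": "object",
                 "properties": {"topic": {"type": "string"},
                                "sentiment": {"type": "integer"}},
                 "required": ["topic", "sentiment"]}') AS my_json_output
FROM hn.hacker_news
LIMIT 100

The JSON-typed result can then be queried with DuckDB’s JSON operators, e.g. my_json_output->>'topic'.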

Practical Considerations

Integrating LLMs with SQL using prompt() enables many possible use cases. However, effective usage can require careful consideration of tradeoffs. Therefore, we recommend testing prompt-based use cases on small samples first.

Cases like this should also be considered: for extracting email addresses from a text, using DuckDB’s regexp_extract function is faster, more cost-effective, and more reliable than using an LLM or SLM.
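For instance, a plain-SQL extraction along these lines (a minimal sketch – the regular expression is a simplified illustration, not a complete email matcher) runs locally without any model calls:

SELECT regexp_extract(text, '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') AS email
FROM hn.hacker_news
LIMIT 100

Since regexp_extract is an ordinary scalar function, it costs no LLM quota and gives deterministic results.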

We are actively involved in research on bridging the gap between the convenience of prompt-based data wrangling and the efficiency and reliability of SQL-based text operations, leveraging all the amazing functionality that DuckDB provides. If you want to learn more about this, take a look at our SIGMOD publication from June this year.

Start Exploring Today

The prompt() function is now available in Preview for MotherDuck users on a Free Trial or the Standard Plan. To get started, check out our documentation to try it out.

Since running the prompt() function over a large table can incur higher compute costs than other analytical queries, we limit the usage to the following quotas by default:

  • Free Trial users: 40 compute unit hours per day (~40k prompts with gpt-4o-mini)
  • Standard Plan users: same as Free Trial; can be raised upon request

Please refer to our Pricing Details page for a full breakdown.

As you investigate the possibilities, we invite you to share your experiences and feedback with us through our Slack channel. Let us know how you’re using this new functionality and connect with us to discuss your use cases.

Happy exploring!
