How to Use Scikit-LLM for Text Analysis With Large Language Models

Scikit-LLM is a Python package that helps integrate large language models (LLMs) into the scikit-learn framework. It helps in accomplishing text analysis tasks. If you are familiar with scikit-learn, it will be easier for you to work with Scikit-LLM.

Note that Scikit-LLM does not replace scikit-learn. scikit-learn is a general-purpose machine learning library, whereas Scikit-LLM is designed specifically for text analysis tasks.

Getting Started With Scikit-LLM

To get started with Scikit-LLM, you’ll need to install the library and configure your API key. To install the library, open your IDE and create a new virtual environment. This will help prevent any potential library version conflicts. Then, run the following command in the terminal:

    pip install scikit-llm

This command will install Scikit-LLM and its required dependencies.

To configure your API key, you need to acquire one from your LLM provider. To obtain the OpenAI API key, follow these steps:

Proceed to the OpenAI API page. Then click on your profile located in the upper-right corner of the window. Select View API keys. This will take you to the API keys page.

On the API keys page, click on the Create new secret key button.

Name your API key and click on the Create secret key button to generate the key. After generation, copy the key and store it in a safe place, as OpenAI will not display the key again. If you lose it, you will need to generate a new one.

The full source code is available in a GitHub repository.

Now that you have your API key, open your IDE and import the SKLLMConfig class from the Scikit-LLM library. This class allows you to set configuration options related to the usage of large language models.

This class expects you to set your OpenAI API key and organization details.
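A minimal sketch of this configuration step follows. It assumes the key and organization ID are stored in two environment variables, OPENAI_API_KEY and OPENAI_ORG_ID (names chosen here for illustration), so the secret never appears in the source code. The two helper functions are hypothetical; set_openai_key and set_openai_org are the setters on SKLLMConfig.

```python
import os


def load_openai_credentials(env=None):
    """Read the API key and organization ID from environment variables.

    The variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    org = env.get("OPENAI_ORG_ID")
    if not key or not org:
        raise RuntimeError("Set OPENAI_API_KEY and OPENAI_ORG_ID first.")
    return key, org


def configure_scikit_llm():
    """Hand the credentials to Scikit-LLM (requires scikit-llm installed)."""
    # Imported lazily so the helper above works without the library.
    from skllm.config import SKLLMConfig

    key, org = load_openai_credentials()
    SKLLMConfig.set_openai_key(key)
    SKLLMConfig.set_openai_org(org)
```

Calling configure_scikit_llm() once at the start of a script is enough; the classes used later pick up the stored credentials.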

The organization ID and the organization name are not the same. The organization ID is a unique identifier of your organization. To obtain your organization ID, proceed to the OpenAI Organization settings page and copy it. You have now established a connection between Scikit-LLM and the large language model.

Scikit-LLM requires you to have a pay-as-you-go plan. This is because the free trial OpenAI account has a rate limit of three requests per minute, which is not sufficient for Scikit-LLM.
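A rough calculation shows why three requests per minute is restrictive. The sketch below assumes one API request per text sample, which is a simplification, and the function name is hypothetical.

```python
import math


def minute_windows_needed(num_samples, requests_per_minute):
    """Number of one-minute rate-limit windows needed for all samples,
    assuming every sample costs exactly one request."""
    return math.ceil(num_samples / requests_per_minute)


# 100 samples at the free-trial limit of 3 requests per minute.
print(minute_windows_needed(100, 3))  # 34
```

Under that assumption, even a 100-sample experiment is spread over roughly half an hour of rate-limit windows.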

Trying to use the free trial account will lead to a rate limit error while performing text analysis.

To learn more about rate limits, proceed to the OpenAI rate limits page.

The LLM provider is not limited to only OpenAI. You can use other LLM providers as well.

Importing the Required Libraries and Loading the Dataset

Import pandas, which you will use to load the dataset. Also import the required classes from Scikit-LLM and scikit-learn.

Next, load the dataset you want to perform text analysis on. This code uses the IMDB movies dataset. You can, however, tweak it to use your own dataset.

Using only the first 100 rows of the dataset is not mandatory. You can use your entire dataset.

Next, extract the features and label columns. Then split your dataset into train and test sets.

The Genre column contains the labels you want to predict.
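The loading and splitting steps can be sketched as follows. A small in-memory DataFrame stands in for the IMDB CSV file so the example runs on its own. The Description column name, the 80/20 split, and the random seed are assumptions made for illustration; only the Genre column is named in this article.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the IMDB movies dataset; with a real file you would load it
# with pd.read_csv instead. "Description" is an assumed column name.
df = pd.DataFrame(
    {
        "Description": [f"Plot summary of movie {i}" for i in range(10)],
        "Genre": ["Action", "Drama"] * 5,
    }
)

# Optionally keep only the first rows to limit the number of API calls.
df = df.head(100)

X = df["Description"]  # features: the text to analyze
y = df["Genre"]        # labels: the column to predict

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```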

Zero-Shot Text Classification With Scikit-LLM

Zero-shot text classification is a feature offered by large language models. It classifies text into predefined categories without the need for explicit training on labeled data. This capability is very useful when dealing with tasks where you need to classify text into categories you didn’t anticipate during model training.

To perform zero-shot text classification using Scikit-LLM, use the ZeroShotGPTClassifier class.
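A hedged sketch of that usage is shown below. The import location of ZeroShotGPTClassifier has changed between Scikit-LLM releases, so the sketch tries the older path first and falls back to the newer one; which one applies depends on your installed version. The two function names are hypothetical, and the classifier is created with its default settings.

```python
def build_zero_shot_classifier():
    """Create the classifier; needs scikit-llm and a configured API key."""
    try:
        # Import path used by older releases of scikit-llm.
        from skllm import ZeroShotGPTClassifier
    except ImportError:
        # Import path used by newer releases (assumed here to be 1.0+).
        from skllm.models.gpt.classification.zero_shot import (
            ZeroShotGPTClassifier,
        )
    return ZeroShotGPTClassifier()


def zero_shot_predict(clf, X_train, y_train, X_test):
    """Fit on the labels, then predict one label per test text."""
    # In zero-shot mode, fit is expected only to record the candidate labels.
    clf.fit(X_train, y_train)
    return list(clf.predict(X_test))
```

With credentials configured, zero_shot_predict(build_zero_shot_classifier(), X_train, y_train, X_test) returns the predicted labels. Because the classifier follows the scikit-learn estimator interface, the second function works with any object that offers fit and predict.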

Comparing the predictions with the test labels produces a classification report, which provides metrics for each label that the model is trying to predict.
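To illustrate what such a report contains without calling an LLM, the sketch below runs scikit-learn's classification_report on a handful of made-up true and predicted genres. The labels are hypothetical and are not results from the IMDB dataset.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted genres for four test samples.
y_true = ["Action", "Drama", "Comedy", "Drama"]
y_pred = ["Action", "Drama", "Drama", "Drama"]

# zero_division=0 reports 0.0 (instead of warning) for labels never predicted.
print(classification_report(y_true, y_pred, zero_division=0))

# The same metrics as a dictionary, convenient for programmatic checks.
report = classification_report(
    y_true, y_pred, zero_division=0, output_dict=True
)
print(report["accuracy"])  # 0.75
```

Each row of the printed report lists precision, recall, F1-score, and support for one label, followed by overall averages.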

Multi-Label Zero-Shot Text Classification With Scikit-LLM

In some scenarios, a single text may belong to multiple categories simultaneously. Traditional single-label classification models struggle with this. Scikit-LLM, on the other hand, makes this classification possible. Multi-label zero-shot text classification is crucial in assigning multiple descriptive labels to a single text sample.

Use MultiLabelZeroShotGPTClassifier to predict which labels are appropriate for each text sample.

You define the candidate labels that your text might belong to and pass them to the classifier.
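A minimal sketch of this step follows. As with the single-label classifier, the import path depends on the installed Scikit-LLM version, so both known locations are tried. The function names and the limit of two labels per text are illustrative assumptions.

```python
def build_multi_label_classifier(max_labels=2):
    """Create the classifier; needs scikit-llm and a configured API key."""
    try:
        # Import path used by older releases of scikit-llm.
        from skllm import MultiLabelZeroShotGPTClassifier
    except ImportError:
        # Import path used by newer releases (assumed here to be 1.0+).
        from skllm.models.gpt.classification.zero_shot import (
            MultiLabelZeroShotGPTClassifier,
        )
    return MultiLabelZeroShotGPTClassifier(max_labels=max_labels)


def multi_label_predict(clf, candidate_labels, texts):
    """Return a list of predicted labels for every text."""
    # No labeled training data is needed: X is None and y wraps the label list.
    clf.fit(None, [candidate_labels])
    return [list(labels) for labels in clf.predict(texts)]
```

For example, multi_label_predict(build_multi_label_classifier(), ["Action", "Comedy", "Drama"], X_test) would return up to two genres per description. The exact shape of the prediction output may differ between library versions.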

The resulting classification report helps you understand how well your model is performing for each label in multi-label classification.

Text Vectorization With Scikit-LLM

In text vectorization, textual data is converted into a numerical format that machine learning models can understand. Scikit-LLM offers the GPTVectorizer for this. It allows you to transform text into fixed-dimensional vectors using GPT models.
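A hedged sketch of GPTVectorizer usage is shown below. The import path again depends on the installed Scikit-LLM version, and the function names are hypothetical. The vectorizer is used like any scikit-learn transformer, through fit_transform.

```python
def build_gpt_vectorizer():
    """Create the vectorizer; needs scikit-llm and a configured API key."""
    try:
        # Import path used by older releases of scikit-llm.
        from skllm.preprocessing import GPTVectorizer
    except ImportError:
        # Import path used by newer releases (assumed here to be 1.0+).
        from skllm.models.gpt.vectorization import GPTVectorizer
    return GPTVectorizer()


def embed_texts(vectorizer, texts):
    """Return one fixed-length vector per input text."""
    return vectorizer.fit_transform(texts)
```

Calling embed_texts(build_gpt_vectorizer(), X_train) sends the texts to the embedding API, so it consumes API quota.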

You can also vectorize text using Term Frequency-Inverse Document Frequency (TF-IDF).
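As a minimal sketch of the TF-IDF approach, the example below uses scikit-learn's TfidfVectorizer on a toy corpus that stands in for the movie descriptions. The texts and the max_features value are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the movie descriptions.
texts = [
    "a detective hunts a killer",
    "two friends take a road trip",
    "a robot learns to feel",
    "a family moves into a haunted house",
    "a boxer trains for one last fight",
    "a chef opens a small restaurant",
]

# max_features caps the vocabulary size; 1000 is an illustrative choice.
vectorizer = TfidfVectorizer(max_features=1000)
features = vectorizer.fit_transform(texts)  # sparse matrix: one row per text

# Rows for the first 5 samples.
print(features[:5])
```

Unlike GPTVectorizer, this runs locally and makes no API calls, but the vectors reflect word frequencies rather than meaning learned by a language model.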

Printing the result shows the TF-IDF vectorized features for the first 5 samples in the dataset.

Text Summarization With Scikit-LLM

Text summarization helps in condensing a piece of text while preserving its most critical information. Scikit-LLM offers the GPTSummarizer, which uses the GPT models to generate concise summaries of text.
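The sketch below shows one way to call GPTSummarizer, under the same assumptions as the earlier examples: the import path varies by Scikit-LLM version, the function names are hypothetical, and the 15-word limit is an illustrative value.

```python
def build_summarizer(max_words=15):
    """Create the summarizer; needs scikit-llm and a configured API key."""
    try:
        # Import path used by older releases of scikit-llm.
        from skllm.preprocessing import GPTSummarizer
    except ImportError:
        # Import path used by newer releases (assumed here to be 1.0+).
        from skllm.models.gpt.text2text.summarization import GPTSummarizer
    # max_words is a target length; actual summaries may run slightly over.
    return GPTSummarizer(max_words=max_words)


def summarize(summarizer, texts):
    """Return one summary string per input text."""
    return list(summarizer.fit_transform(texts))
```

With credentials configured, summarize(build_summarizer(), X_test) returns one short summary for each description in the test set.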

The output is a summary of the test data.

Build Applications on Top of LLMs

Scikit-LLM opens up a world of possibilities for text analysis with large language models. Understanding the technology behind large language models is crucial. It will help you understand their strengths and weaknesses, which in turn helps you build efficient applications on top of this technology.
