Wednesday, June 3, 2026

Text Summarization with Scikit-LLM – MachineLearningMastery.com

In this article, you'll learn how to use scikit-LLM's text summarization feature to handle large volumes of text in machine learning pipelines.

Topics we'll cover include:

  • How to build a custom scikit-learn-compatible transformer that wraps a Hugging Face summarization model.
  • How to integrate LLM-driven text summarization into a scikit-learn Pipeline for data preprocessing.
  • How to chain summarization, TF-IDF vectorization, and a classifier into a single end-to-end pipeline.


Introduction

In a previous post, we introduced scikit-LLM, a library that bridges the gap between traditional machine learning models and modern large language models (LLMs). Specifically, we showcased how to implement zero-shot and few-shot classification use cases with scikit-LLM.

Now, we attempt to answer the question: What if our downstream machine learning use case is hampered by massive amounts of text? To address this issue, we'll explore and use summarizers: another powerful feature of this library that distills long texts into succinct summaries. Let's see how, by implementing a data preparation pipeline that incorporates this process!

Initial Setup

The first step is to make sure you have scikit-LLM installed by running pip install scikit-llm (replace "pip" with "!pip" if you are working in a cloud notebook environment).

Note that by default, scikit-LLM relies on OpenAI language models, which can be expensive to run repeatedly, and whose usage may be very limited under a free OpenAI account. Alternatively, you can use free Hugging Face pre-trained models for summarization, like sshleifer/distilbart-cnn-12-6. In that case, make sure you also install Hugging Face's Transformers library (pip install transformers) so that you can load Hugging Face models in your program.

LLM-Driven Text Summarization Pipeline

The following class definition encapsulates the logic to load a pre-trained model (fit()) and run inference with it, i.e. summarize input texts (transform()):
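A minimal sketch of such a class is shown here. The class name HuggingFaceSummarizer, the parameter names, and the length defaults are illustrative assumptions rather than the article's exact code. It also assumes a Transformers version that provides the "summarization" pipeline task, together with a backend such as PyTorch.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class HuggingFaceSummarizer(BaseEstimator, TransformerMixin):
    """Scikit-learn transformer that summarizes each input text.

    Name, parameters, and defaults are illustrative assumptions.
    """

    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6",
                 max_length=60, min_length=10):
        self.model_name = model_name
        self.max_length = max_length
        self.min_length = min_length

    def fit(self, X, y=None):
        # Imported lazily so the class can be defined without Transformers loaded.
        from transformers import pipeline

        # Load the pre-trained summarization model once, at fit time.
        self.summarizer_ = pipeline("summarization", model=self.model_name)
        return self

    def transform(self, X):
        # The summarization pipeline returns one dict per input text,
        # each holding the generated summary under the "summary_text" key.
        outputs = self.summarizer_(
            list(X),
            max_length=self.max_length,
            min_length=self.min_length,
            truncation=True,  # cut inputs that exceed the model's context size
        )
        return [out["summary_text"] for out in outputs]
```

Nothing is learned from the data in fit(); it only loads the model, so the labels y are accepted purely to match the scikit-learn calling convention.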

Importantly, the class we defined inherits from the base classes scikit-learn provides for custom transformers: a necessary step to ensure Hugging Face models integrate smoothly with scikit-learn preprocessing and modeling tools.
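To make the benefit of that inheritance concrete, here is a small sketch using a hypothetical toy transformer (FirstSentenceSummarizer is invented for illustration and involves no LLM). It shows what scikit-learn's BaseEstimator and TransformerMixin contribute on top of hand-written fit() and transform() methods.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class FirstSentenceSummarizer(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: keeps only the first sentence of each text."""

    def __init__(self, separator="."):
        self.separator = separator

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [text.split(self.separator)[0].strip() for text in X]


toy = FirstSentenceSummarizer()

# TransformerMixin supplies fit_transform() on top of fit() and transform().
short_texts = toy.fit_transform(
    ["Great phone. Battery lasts two days.", "Too slow. Returned it."]
)

# BaseEstimator supplies get_params()/set_params(), which clone() and
# model selection tools such as GridSearchCV rely on.
params = toy.get_params()
copy_of_toy = clone(toy)
```

Because a Pipeline calls these same methods on every step, any class that follows this pattern can be dropped in as a preprocessing stage.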

For simplicity, say we'll only summarize two text reviews that are part of a larger dataset for text classification. The two "long" texts (features) and the reviews' sentiments (labels) might look like:
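As an illustration, the two reviews could be stored as plain Python lists. The review texts here are made up for this sketch, and the encoding of 1 for positive and 0 for negative is an assumption.

```python
# Two hypothetical long reviews (features).
X = [
    "I ordered this espresso machine after reading dozens of glowing reviews, and it "
    "has exceeded my expectations in almost every way. Setup took less than ten "
    "minutes, the instructions were clear, and the first shot I pulled tasted better "
    "than what I get at my local coffee shop. The steam wand is powerful, cleaning is "
    "simple, and after three months of daily use it still works flawlessly.",
    "I really wanted to like these wireless headphones, but they have been a constant "
    "source of frustration. The Bluetooth connection drops every few minutes, the "
    "battery barely lasts two hours despite the advertised twelve, and the ear cushions "
    "started peeling within a month. Customer support took weeks to reply and then "
    "refused a refund, so I cannot recommend this product to anyone.",
]

# Sentiment labels, assumed to be 1 = positive and 0 = negative.
y = [1, 0]
```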

The real magic happens next. We define a pipeline that brings together our data preprocessing (namely, LLM-driven summarization) and the training of a classifier. In a real scenario, you would need far more than two training examples to build a proper classifier, of course, but the point here is to illustrate how text summarization can reduce the dimensionality of text data:
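One way to sketch this pipeline is a small factory function that accepts any summarizing transformer. The function name build_pipeline and the step names are hypothetical. LogisticRegression is also an assumption, since the text only calls for "a classifier".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(summarizer):
    """Chain a summarizer, TF-IDF vectorization, and a classifier."""
    return Pipeline([
        ("summarizer", summarizer),            # long texts -> short summaries
        ("tfidf", TfidfVectorizer()),          # summaries -> sparse TF-IDF features
        ("classifier", LogisticRegression()),  # features -> sentiment label
    ])
```

Passing an instance of the Hugging Face summarization transformer as `summarizer` yields the end-to-end pipeline described here. Because the vectorizer only sees the summaries, its vocabulary (and so the feature space) is typically much smaller than it would be on the full texts.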

Once the pipeline has been defined, here's how to run it:
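The sketch here shows the fit and predict calls end to end. To keep it fast and runnable offline, it uses a stand-in summarizer that simply keeps the first 12 words of each text. That stand-in, the shortened review texts, and the LogisticRegression classifier are all assumptions for illustration. In practice the first step would be the Hugging Face summarization transformer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Stand-in for the LLM summarizer: keeps the first 12 words of each text.
truncate = FunctionTransformer(
    lambda texts: [" ".join(t.split()[:12]) for t in texts]
)

pipe = Pipeline([
    ("summarizer", truncate),
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])

# Hypothetical reviews and labels (assumed 1 = positive, 0 = negative).
X = [
    "This espresso machine exceeded my expectations: setup was quick, the coffee "
    "tastes great, and it still works flawlessly after three months of daily use.",
    "These headphones were a constant frustration: the connection drops, the battery "
    "barely lasts two hours, and support refused to give me a refund.",
]
y = [1, 0]

pipe.fit(X, y)           # summarize -> vectorize -> train the classifier
preds = pipe.predict(X)  # summarize -> vectorize -> predict labels
print(preds)
```

With only two training examples the predictions say nothing about real-world accuracy. The point is that a single fit() call drives every stage of the pipeline.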

That's all! Try adapting the code above to a real, labeled text dataset for binary sentiment classification, and see how it works in practice.

Before we wrap up, if you are curious about what the summarized texts look like, you can inspect the output directly:
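A possible way to do this is to pull the summarization step out of the fitted pipeline through its named_steps attribute and call transform() on it alone. The helper name show_summaries and the step name "summarizer" are hypothetical. The demo uses a stand-in summarizer that keeps the first five words of each text, so that it runs without downloading a model.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def show_summaries(pipe, texts, step_name="summarizer"):
    """Print and return the output of the summarization step alone."""
    summaries = pipe.named_steps[step_name].transform(texts)
    for i, summary in enumerate(summaries, start=1):
        print(f"Summary {i}: {summary}")
    return summaries


# Offline demo: a stand-in summarizer that keeps the first five words.
first_five_words = FunctionTransformer(
    lambda texts: [" ".join(t.split()[:5]) for t in texts]
)
demo_pipe = Pipeline([("summarizer", first_five_words)])

demo_texts = [
    "The delivery was late and the box arrived damaged.",
    "Setup was quick and the sound quality is excellent.",
]
demo_pipe.fit(demo_texts)
demo_summaries = show_summaries(demo_pipe, demo_texts)
```

The same helper works on a full pipeline, since named_steps exposes each stage by the name given when the pipeline was built.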

The summaries are, of course, far from the quality you'd get from ChatGPT or Google Gemini; the model we used is a free, lightweight pre-trained model, after all. That said, choosing a larger, more capable summarization model will generally yield better results, at the cost of slower inference.

Summary

We bridged the gap between classic machine learning modeling and advanced text processing via pre-trained large language models, thanks to scikit-LLM: a library that leverages the best of both worlds.
