When working with Sentence-BERT models to generate embeddings for a large corpus of text, a common challenge arises: the encoding process can be incredibly time-consuming. If you have a dataset with hundreds of thousands of sentences, re-generating these embeddings every time you run your script is highly inefficient. This guide will walk you through how to save these valuable vectors to a file and load them back when needed, saving you significant processing time.
The Problem: Time-Consuming Re-computation
Imagine you’re using Sentence-BERT to get embeddings for a vocabulary list, perhaps for tasks like semantic similarity. Your initial code might look something like this:
Running `model.encode(Vocabs)` on a large dataset can take a considerable amount of time, and every run repeats the exact same work. The goal is to avoid this repetitive computation.
The Solution: Saving and Loading Embeddings
The most straightforward way to address this is to save the computed embeddings to a file the first time you generate them. Then, for subsequent runs, you can simply load these pre-computed embeddings.
Method 1: Using pickle
Python’s built-in `pickle` module is a convenient way to serialize Python objects, including your embedding vectors (which are typically NumPy arrays).
Saving Embeddings:
You can save your sentences and their corresponding embeddings in a dictionary to a `.pkl` file.
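A sketch of this caching pattern follows. The file path `embeddings.pkl` and the `compute_embeddings` helper are assumptions for illustration; the helper stands in for a real `model.encode(...)` call so the caching logic itself is clear:

```python
import os
import pickle

import numpy as np

EMB_PATH = "embeddings.pkl"  # assumed cache-file name
sentences = ["first sentence", "second sentence", "third sentence"]

def compute_embeddings(texts):
    # Stand-in for the expensive call; with sentence-transformers you would use:
    #   model = SentenceTransformer("all-MiniLM-L6-v2")
    #   return model.encode(texts)
    return np.random.rand(len(texts), 384).astype(np.float32)

if os.path.exists(EMB_PATH):
    # Cheap path: load the pre-computed embeddings from disk
    with open(EMB_PATH, "rb") as f:
        data = pickle.load(f)
    sentences, embeddings = data["sentences"], data["embeddings"]
else:
    # Expensive path: encode once, then persist for all future runs
    embeddings = compute_embeddings(sentences)
    with open(EMB_PATH, "wb") as f:
        pickle.dump({"sentences": sentences, "embeddings": embeddings}, f)

print(embeddings.shape)
```

Storing the sentences alongside the embeddings in one dictionary keeps the two aligned, so you always know which vector belongs to which sentence.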
In this approach, the code first checks if an embedding file already exists. If not, it encodes the sentences and saves the embeddings. Otherwise, it loads the existing embeddings directly from the file.
Method 2: Using blosc2 for Optimized NumPy Array Storage
For potentially better compression and speed, especially with large NumPy arrays, the `blosc2` library is an excellent alternative.
First, you’ll need to install it: `pip install blosc2`
Saving Embeddings with blosc2:
This method involves saving the NumPy array directly using `blosc2.save_array()` and loading it back with `blosc2.load_array()`. This can be particularly beneficial for very large embedding sets due to `blosc2`’s efficient handling of numerical data.
By implementing either of these methods, you can significantly speed up your development and deployment cycles by avoiding the costly step of re-encoding your text data every single time. Choose the method that best fits your needs regarding ease of use and performance for your specific dataset size.