When working with Sentence-BERT models to generate embeddings for a large corpus of text, a common challenge arises: the encoding process can be incredibly time-consuming. If you have a dataset with hundreds of thousands of sentences, re-generating these embeddings every time you run your script is highly inefficient. This guide will walk you through how to save these valuable vectors to a file and load them back when needed, saving you significant processing time.

The Problem: Time-Consuming Re-computation

Imagine you’re using Sentence-BERT to get embeddings for a vocabulary list, perhaps for tasks like semantic similarity. Your initial code might look something like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-large-uncased-whole-word-masking')

# A very large list of words/sentences
Vocabs = ["Winter flooding", "Cholesterol diet", "Machine learning ethics", ...]  # Potentially 500,000+ items

# This line can take a very long time for large Vocabs
Vocabs_embeddings = model.encode(Vocabs)

Running model.encode(Vocabs) on a large dataset can take a considerable amount of time. The core issue is the need to avoid this repetitive computation.
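To get a feel for how much time is at stake, you can wrap the encoding call with a simple timer. The sketch below uses a hypothetical placeholder function in place of model.encode so it runs without downloading a model; in your own script you would time the real call instead:

```python
import time

def encode_placeholder(sentences):
    # Stand-in for model.encode(sentences); swap in the real call.
    return [[0.0] * 768 for _ in sentences]

corpus = [f"sentence {i}" for i in range(10_000)]

start = time.perf_counter()
embeddings = encode_placeholder(corpus)
elapsed = time.perf_counter() - start
print(f"Encoded {len(embeddings)} sentences in {elapsed:.2f}s")
```

With a real transformer model and hundreds of thousands of sentences, the elapsed time can easily run to hours, which is exactly the cost caching avoids.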

The Solution: Saving and Loading Embeddings

The most straightforward way to address this is to save the computed embeddings to a file the first time you generate them. Then, for subsequent runs, you can simply load these pre-computed embeddings.

Method 1: Using pickle

Python’s built-in pickle module is a convenient way to serialize Python objects, including your embedding vectors (which are typically NumPy arrays).

Saving Embeddings:
You can save your sentences and their corresponding embeddings in a dictionary to a .pkl file.

import pickle
from sentence_transformers import SentenceTransformer
import os # For checking file existence

# Assuming 'model' is your loaded SentenceTransformer model
# and 'Vocabs' is your list of sentences.
embedding_cache_path = "vocabs-embeddings.pkl"

if not os.path.exists(embedding_cache_path):
    print("Encoding the corpus. This might take a while...")
    # Vocabs = ["Your", "list", "of", "many", "sentences"] # Define your Vocabs list
    Vocabs_embeddings = model.encode(Vocabs, show_progress_bar=True, convert_to_numpy=True)

    print("Storing embeddings on disc...")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump({'sentences': Vocabs, 'embeddings': Vocabs_embeddings}, fOut)
    print(f"Embeddings saved to {embedding_cache_path}")
else:
    print(f"Loading pre-computed embeddings from {embedding_cache_path}...")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        Vocabs = cache_data['sentences']
        Vocabs_embeddings = cache_data['embeddings']
    print("Embeddings loaded.")

# Now you can use Vocabs_embeddings

In this approach, the code first checks if an embedding file already exists. If not, it encodes the sentences and saves the embeddings. Otherwise, it loads the existing embeddings directly from the file.
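Once loaded (from cache or freshly computed), the embeddings behave like any other NumPy array. As a sketch of what typically comes next, here is a plain-NumPy cosine-similarity lookup against a query vector; the toy vectors below are illustrative stand-ins, not real model output, and sentence_transformers also offers utilities for this if you prefer to stay inside the library:

```python
import numpy as np

def most_similar(query_emb, corpus_embs, top_k=3):
    # Cosine similarity between one query vector and every corpus vector.
    corpus_norm = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    query_norm = query_emb / np.linalg.norm(query_emb)
    scores = corpus_norm @ query_norm
    top = np.argsort(scores)[::-1][:top_k]
    return top, scores[top]

# Toy stand-ins for Vocabs_embeddings and a query embedding.
corpus_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, scores = most_similar(np.array([1.0, 0.1]), corpus_embs)
print(idx, scores)  # indices sorted by similarity, highest first
```

In practice you would pass the cached Vocabs_embeddings as corpus_embs and a freshly encoded query sentence as query_emb.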

Method 2: Using blosc2 for Optimized NumPy Array Storage

For potentially better compression and speed, especially with large NumPy arrays, the blosc2 library is an excellent alternative.

First, you’ll need to install it:
pip install blosc2

Saving Embeddings with blosc2:

import blosc2
import numpy as np # Assuming embeddings are NumPy arrays
from sentence_transformers import SentenceTransformer
import os

# model = SentenceTransformer('bert-large-uncased-whole-word-masking')
# Vocabs = [ "Winter flooding", "Cholesterol diet", ...]
outfp_embedding = "vocabs-embeddings.bl2"

if not os.path.exists(outfp_embedding):
    print("Encoding the corpus with blosc2. This might take a while...")
    Vocabs_embeddings = model.encode(Vocabs, show_progress_bar=True, convert_to_numpy=True)

    print(f"Saving embeddings to {outfp_embedding} using blosc2...")
    blosc2.save_array(Vocabs_embeddings, outfp_embedding)
    print("Embeddings saved.")
else:
    print(f"Loading pre-computed embeddings from {outfp_embedding} using blosc2...")
    Vocabs_embeddings = blosc2.load_array(outfp_embedding)
    print("Embeddings loaded.")

# Now Vocabs_embeddings is ready for use

This method involves saving the NumPy array directly using blosc2.save_array() and loading it back with blosc2.load_array(). This can be particularly beneficial for very large embedding sets due to blosc2's efficient handling of numerical data.
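Whichever format you pick, it is worth a one-time sanity check that the array you load back is identical to the one you saved. A minimal sketch of that check using pickle and a small random array as a stand-in for real embeddings (the same idea applies to a blosc2 file):

```python
import pickle
import numpy as np

# Stand-in for real embeddings: a small random float32 array.
original = np.random.rand(5, 8).astype(np.float32)

with open("embeddings-check.pkl", "wb") as f:
    pickle.dump(original, f)
with open("embeddings-check.pkl", "rb") as f:
    loaded = pickle.load(f)

assert loaded.shape == original.shape
assert np.allclose(loaded, original)
print("Round-trip OK:", loaded.shape)
```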

By implementing either of these methods, you can significantly speed up your development and deployment cycles by avoiding the costly step of re-encoding your text data every single time. Choose the method that best fits your needs regarding ease of use and performance for your specific dataset size.