Embedding Search
Introduction
See the Embedding Search section for a primer on the use case and the underlying technology.
Our Model
The premier benchmark for open-source model performance is the Hugging Face MTEB leaderboard, which evaluates embedding models across a wide variety of tasks and datasets and is a useful resource for finding the best model for a specific use case.
The model we use for embedding search is BAAI/bge-base-en-v1.5. As of September 2024 it is the third most downloaded feature-extraction model on Hugging Face and the most popular model of its size. It is a reliable, well-tested model produced by the Beijing Academy of Artificial Intelligence (BAAI). With 109 million parameters it strikes a good balance between performance and speed, and it produces 768-dimensional embeddings.
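For illustration, the snippet below shows one way to load this model and produce 768-dimensional embeddings using the sentence-transformers library. It is a minimal sketch, not the platform's internal implementation, and the example texts are placeholders.

```python
# Minimal sketch: generating embeddings with BAAI/bge-base-en-v1.5
# via the sentence-transformers library (illustrative only).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

texts = [
    "Quarterly revenue grew 12% year over year.",
    "Customer churn decreased after the pricing change.",
]

# Normalising the embeddings makes cosine similarity a simple dot product.
embeddings = model.encode(texts, normalize_embeddings=True)

print(embeddings.shape)  # (2, 768) -- one 768-dimensional vector per input text
```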
Performance
The TrueState platform generates embeddings at approximately 500 records per second on our multi-GPU infrastructure. Generation speed is unaffected by the number of columns in the dataset, because each embedding is calculated from a single text column.
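As a rough sketch of what single-column embedding generation looks like, the example below encodes one text column of a pandas DataFrame in batches. The DataFrame, the column name "description", and the batch size are hypothetical; other columns are simply ignored, which is why column count does not affect throughput.

```python
# Illustrative sketch: embeddings are generated from a single text column only.
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

df = pd.DataFrame({
    "description": ["Invoice overdue by 30 days", "Shipment delivered on time"],
    "other_column": [1, 2],  # additional columns do not affect embedding speed
})

# Only the chosen column is encoded; batching keeps GPU throughput high.
vectors = model.encode(
    df["description"].tolist(),
    batch_size=256,
    normalize_embeddings=True,
)
```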
At inference time, the platform can search 1 million rows and return the 100 most similar records in approximately 5 seconds.
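The search step itself reduces to a nearest-neighbour lookup over the stored vectors. The sketch below, using random placeholder embeddings, shows one straightforward way to score a corpus of normalised vectors against a query and keep the 100 best matches; it illustrates the idea rather than the platform's actual search implementation.

```python
# Illustrative sketch of the search step: find the 100 most similar records to a
# query by cosine similarity (vectors are L2-normalised, so a dot product
# equals cosine similarity). Corpus size here is a placeholder; the same code
# applies to 1 million rows.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.random((100_000, 768), dtype=np.float32)  # placeholder embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.random(768, dtype=np.float32)
query /= np.linalg.norm(query)

scores = corpus @ query                          # cosine similarity per record
top_100 = np.argpartition(-scores, 100)[:100]    # indices of the 100 best matches
top_100 = top_100[np.argsort(-scores[top_100])]  # sorted by similarity, best first
```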