Embedding Search
Introduction
See the Embedding Search section for a primer on the use case and the underlying technology.
Our Model
The premier benchmark for open-source model performance is the Hugging Face MTEB leaderboard, which evaluates embedding models across a wide variety of tasks and datasets and is a useful resource for finding the best model for a specific use case.
The model we use for embedding search is BAAI/bge-base-en-v1.5. As of September 2024 it is the third most downloaded feature-extraction model on Hugging Face and the most popular model of its size. It is a reliable, well-tested model produced by the Beijing Academy of Artificial Intelligence (BAAI). With 109 million parameters it strikes a good balance between performance and speed, and it produces 768-dimensional embeddings.
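For illustration, the snippet below shows one way to load this model and produce 768-dimensional embeddings using the sentence-transformers library. It is a minimal sketch, not the platform's internal implementation, and the example texts are placeholders.

```python
# Minimal sketch: generating embeddings with BAAI/bge-base-en-v1.5
# via the sentence-transformers library (illustrative only).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

texts = [
    "Quarterly revenue grew 12% year over year.",
    "Customer churn decreased after the pricing change.",
]

# Normalising the embeddings makes cosine similarity a simple dot product.
embeddings = model.encode(texts, normalize_embeddings=True)

print(embeddings.shape)  # (2, 768) -- one 768-dimensional vector per input text
```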
Performance
The TrueState platform generates embeddings at approximately 500 records per second on our multi-GPU infrastructure. Generation speed is unaffected by the number of columns in the dataset, because each embedding is calculated from a single text column.
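As a rough sketch of what single-column embedding generation looks like, the example below encodes one text column of a pandas DataFrame in batches. The DataFrame, the column name "description", and the batch size are hypothetical; other columns are simply ignored, which is why column count does not affect throughput.

```python
# Illustrative sketch: embeddings are generated from a single text column only.
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

df = pd.DataFrame({
    "description": ["Invoice overdue by 30 days", "Shipment delivered on time"],
    "other_column": [1, 2],  # additional columns do not affect embedding speed
})

# Only the chosen column is encoded; batching keeps GPU throughput high.
vectors = model.encode(
    df["description"].tolist(),
    batch_size=256,
    normalize_embeddings=True,
)
```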
At inference time, the platform can search 1 million rows and return the 100 most similar records in approximately 5 seconds.
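The search step itself reduces to a nearest-neighbour lookup over the stored vectors. The sketch below, using random placeholder embeddings, shows one straightforward way to score a corpus of normalised vectors against a query and keep the 100 best matches; it illustrates the idea rather than the platform's actual search implementation.

```python
# Illustrative sketch of the search step: find the 100 most similar records to a
# query by cosine similarity (vectors are L2-normalised, so a dot product
# equals cosine similarity). Corpus size here is a placeholder; the same code
# applies to 1 million rows.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.random((100_000, 768), dtype=np.float32)  # placeholder embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.random(768, dtype=np.float32)
query /= np.linalg.norm(query)

scores = corpus @ query                          # cosine similarity per record
top_100 = np.argpartition(-scores, 100)[:100]    # indices of the 100 best matches
top_100 = top_100[np.argsort(-scores[top_100])]  # sorted by similarity, best first
```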