
Introduction to Embedding Search

A Brief History

In the early days of search engines, simple methods such as exact keyword matching and word-frequency counting were commonly used. Because these systems relied on exact matches between user queries and documents, they often missed relevant results when the wording didn’t align perfectly. For example, a search for "car" might not retrieve documents that only used the word "automobile." These methods were efficient but struggled to capture deeper semantic relationships between words.

As search technology evolved, machine learning models like Word2Vec [1] introduced a more sophisticated approach by embedding words as vectors in a multi-dimensional space. This allowed search systems to recognize semantic similarities between different terms, such as understanding that "car" and "automobile" are related, even if the exact words don’t match. This shift marked the beginning of vector search, where queries and documents are matched based on meaning, rather than relying solely on keywords.

[Figure: Word2Vec]

Further advancements in deep learning, particularly transformer models like BERT [2], brought even greater sophistication. BERT generates context-aware embeddings, so the representation of a word depends on the text around it, which improves search relevance. With models like Sentence-BERT [3], which fine-tunes BERT to produce sentence-level embeddings, search engines can match queries with documents at a conceptual level, making vector search even more accurate and effective.

Embedding search is grounded in the mathematical concept of representing words, sentences, or documents as vectors in a multi-dimensional space. This is typically done using machine learning models like Word2Vec, BERT, or Sentence-BERT, which transform textual data into fixed-length vector representations. Each word, sentence, or document is embedded in a high-dimensional space, where semantically similar text appears close together. These vectors are numerical arrays: each element represents a dimension in the space, and the values capture learned features of the text and relationships between words. By representing text in this way, embedding search allows comparisons between vectors that capture not just the literal words but their contextual meaning as well.
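For example, a sentence can be turned into a fixed-length vector with an off-the-shelf embedding model. The sketch below assumes the open-source sentence-transformers library and its "all-MiniLM-L6-v2" model; it is illustrative only, not the specific model or tooling used by TrueState.

```python
# Minimal sketch: generate fixed-length sentence embeddings.
# Assumes the open-source sentence-transformers library and the
# "all-MiniLM-L6-v2" model (illustrative choices, not TrueState's own stack).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "my phone isn't charging properly",
    "troubleshooting battery and charging issues",
    "how to reset your password",
]

# encode() returns one fixed-length vector per sentence;
# this particular model produces 384-dimensional vectors.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)
```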

[Figure: Embedding vector]

The most common method for comparing vectors in embedding search is cosine similarity. Cosine similarity measures the cosine of the angle between two vectors, indicating their orientation in the multi-dimensional space. Mathematically, this is computed by taking the dot product of the two vectors and dividing it by the product of their magnitudes (or Euclidean norms). The result is a value between -1 and 1, where 1 means the vectors are identical (pointing in the same direction), 0 means they are orthogonal (no similarity), and -1 means they are diametrically opposed. In the context of embedding search, a higher cosine similarity score between two vectors indicates greater semantic similarity, which allows the system to rank and retrieve documents that are conceptually closest to the query.
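Concretely, cosine similarity is the dot product of the two vectors divided by the product of their norms. The sketch below uses NumPy and tiny 3-dimensional vectors for readability; real embedding vectors typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||), a value between -1 and 1
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors for illustration only.
query_vec = np.array([0.2, 0.7, 0.1])
doc_vec = np.array([0.25, 0.6, 0.05])

print(cosine_similarity(query_vec, doc_vec))  # ~0.99, i.e. highly similar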

On the TrueState platform, we represent the cosine similarity as a metric called 'Similarity Score', which provides a measure of how similar the query is to a record. The higher the similarity score, the more similar the query and the record.

[Figure: Cosine similarity]

Modern applications of embedding search are widespread, leveraging the ability to retrieve results based on semantic meaning rather than simple keyword matches. This approach is widely used in natural language search, where search engines, chatbots, and virtual assistants interpret user queries more accurately by understanding the context and meaning behind words. In recommendation systems, embedding search plays a key role by analyzing user preferences and suggesting products, movies, or articles that align with their interests, even when the exact terms are not mentioned. This semantic capability allows systems to deliver highly personalized results, enhancing the overall user experience across platforms.

One specific use case of embedding search is in semantic search for customer support systems. In these systems, users submit queries or problems in natural language, and embedding search helps match their queries with the most relevant answers from a knowledge base. For instance, a user might describe an issue with a product in their own words, such as "my phone isn’t charging properly," and the system, powered by embeddings, can understand the underlying meaning and match it to support articles that discuss solutions for charging issues, even if the wording is different. This reduces the dependency on exact keyword matching and allows users to get more accurate, relevant responses faster. Embedding-based search also enables customer support systems to handle a variety of related queries by clustering semantically similar content, enhancing the efficiency and quality of automated help systems.
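A minimal sketch of this workflow, under the same illustrative assumptions as above (the sentence-transformers library and a hypothetical set of knowledge-base article titles): embed the articles, embed the user's query, and rank the articles by cosine similarity.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical knowledge-base article titles, for illustration only.
articles = [
    "How to fix charging problems on your phone",
    "Resetting your account password",
    "Setting up parental controls",
]
article_vecs = model.encode(articles, normalize_embeddings=True)

query = "my phone isn't charging properly"
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With unit-normalized vectors, the dot product equals cosine similarity.
scores = article_vecs @ query_vec
for article, score in sorted(zip(articles, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {article}")
```

The charging-related article scores highest even though it shares few exact words with the query, which is the behavior the paragraph above describes.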

Using Embedding Search in TrueState

The TrueState platform provides a framework for quickly generating embeddings for a dataset and then searching that dataset using natural language.

Embeddings can be generated using the Embeddings Workflow Template; see the corresponding Tutorial for further detail. To search the dataset using natural language, simply create a Dashboard for the dataset, and the embedding search function will already be available.

References

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186).

  3. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (pp. 3982-3992).