Retrieval Metrics
Performing a query on a VectorDB returns one or more embeddings. The total number of returned embeddings is referred to as "k". In addition, each returned embedding has a k value determined by the order in which it was returned in the query response - e.g. first, second, third, and so on. The choice of k, and the k value of each returned embedding, directly impact several key aspects of the system:
Performance: Identifying the number of embeddings to return from the VectorDB for each query is directly related to the performance of the system. If your testing metrics indicate that your best response is always within k = 3, then why return more? This is often difficult to determine without a broad set of test queries to validate the embeddings. This leads to the question of query efficiency.
Efficiency: A low "k" value ensures faster querying and lower resource consumption, but might miss out on relevant results. Conversely, a high "k" offers more comprehensive retrieval but can be slower and more computationally expensive. Developers need to balance speed and relevance when choosing "k" for their specific use case.
Ranking & User Experience: The order of retrieved items within the "k" results is crucial for user experience. Developers can leverage ranking algorithms to prioritize the most relevant items within the limited "k" space, ensuring users see the most pertinent information first.
Error Analysis & Optimization: Analyzing retrieved items, both relevant and irrelevant, within the "k" set is invaluable for debugging and refining the retrieval model. Developers can learn which features contribute to misinterpretations and use this knowledge to improve the model's accuracy and precision.
Therefore, understanding and optimizing "k" is key for developers to build efficient, accurate, and user-friendly AI retrieval systems.
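As a rough illustration of how you might pick k, the sketch below records the rank at which each test query's known-best embedding is returned and reports the smallest k that covers most queries. It is not tied to any particular SDK: `query_vector_db`, `test_queries`, and `best_match_id` are hypothetical placeholders you would replace with your own query function and labeled test data.

```python
# Hypothetical sketch: find the smallest k that captures the known-best result
# for most test queries. `query_vector_db` and the test data are placeholders,
# not calls from any specific SDK.

def rank_of_best_match(query_vector_db, query_text, best_match_id, max_k=20):
    """Return the 1-based rank of the known-best embedding, or None if it is missed."""
    results = query_vector_db(query_text, top_k=max_k)  # assumed to return IDs, best first
    for rank, result_id in enumerate(results, start=1):
        if result_id == best_match_id:
            return rank
    return None

def smallest_k_for_coverage(ranks, coverage=0.95):
    """Smallest k such that `coverage` of queries find their best match within the top k."""
    found = sorted(r for r in ranks if r is not None)
    total = len(ranks)
    for covered, rank in enumerate(found, start=1):
        if covered / total >= coverage:
            return rank
    return None  # the observed ranks never reach the requested coverage

# ranks = [rank_of_best_match(query_vector_db, q, best_id) for q, best_id in test_queries]
# print(smallest_k_for_coverage(ranks))  # e.g. 3 -> returning more than 3 mainly adds cost
```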
Row Level Metrics
Row level metrics provide granular insights into the performance of each query in your retrieval evaluation. You can now see the accuracy, precision, recall, MAP, MRR, and NDCG for individual queries, enabling you to pinpoint areas for improvement and analyze query-specific performance.
In addition to the overall metrics, we offer a detailed view that shows the results returned for each query in the evaluation. This allows you to examine the specific queries and their corresponding results, facilitating a deeper understanding of your retrieval system's behavior.
To further enhance your analysis, we have implemented filtering capabilities based on row level metrics. You can set threshold values or ranges for a chosen metric, such as accuracy or recall, and see the percentage of queries that meet the specified criteria. This feature helps you identify queries that perform exceptionally well or those that require further optimization.
Moreover, our search functionality enables you to focus on a specific subset of queries. By entering relevant keywords or phrases, you can narrow down the query results and assess the performance of queries containing those terms. The percentage of queries meeting the set threshold within the search results is displayed, providing a targeted view of query performance.
Accuracy
Accuracy is an overall score, calculated from precision and recall, that indicates the likelihood of a correct response. High accuracy does not mean that the ideal response was necessarily returned; it does indicate that the system is predictably likely to return relevant items.
- Interpretation
  - High accuracy indicates the system mostly retrieves relevant items.
  - Low accuracy signifies many irrelevant items are returned.
- Limitations
  - Sensitive to the definition of "relevant."
  - Not always the best metric for ranking quality.
Precision
The percentage of retrieved items that are relevant, regardless of how many relevant items exist in total.
If k = 4 and there are 2 relevant embeddings amongst the 4 returned, then the precision is 50%. The recall may be significantly lower if there were more possible relevant items.
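A minimal sketch of this calculation, assuming you already know which of the returned embedding IDs are relevant (the IDs below are made up for illustration):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / k

# Example from the text: k = 4, 2 of the 4 returned embeddings are relevant -> 0.5
print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "x", "y"}, k=4))  # 0.5
```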
- Interpretation
  - High precision indicates the system mostly avoids retrieving irrelevant items.
  - Low precision signifies many irrelevant items are included in the results.
- Limitations
  - Focuses on retrieved items only, ignoring potentially relevant items not retrieved.
  - Can be misleading with small result sets.
Recall
The percentage of relevant items retrieved by the system, out of the universe of all relevant items that could have been retrieved.
If 2 relevant embeddings are returned out of a universe of 10 possible correct embeddings, then the recall is 20%.
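The same example, sketched under the same assumptions (made-up IDs, known relevance labels):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / len(relevant_ids)

# Example from the text: 2 relevant embeddings returned out of a universe of 10 -> 0.2
relevant = {f"doc{i}" for i in range(10)}          # universe of 10 relevant embeddings
retrieved = ["doc0", "other1", "doc5", "other2"]   # only 2 of them are retrieved
print(recall_at_k(retrieved, relevant, k=4))       # 0.2
```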
- Interpretation
  - High recall indicates the system retrieves most relevant items.
  - Low recall signifies many relevant items are missed.
- Limitations
  - Can be maximized by returning all items, losing ranking information.
  - Not as informative as other metrics when dealing with large result sets.
NDCG (Normalized Discounted Cumulative Gain)
NDCG is an important metric when evaluating result ranking. It measures the ranking quality of retrieved items, considering their relevance and position in the result list.
NDCG scores are lower when the most relevant items appear later in the result set (i.e. at higher k positions). If the first returned value (k = 1) was the ideal embedding, then the NDCG score would be 1.0.
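A minimal sketch of one common NDCG formulation, assuming binary relevance scores and a log2 discount; exact weighting varies between implementations, so treat this as illustrative rather than as the formula Okareo uses internally:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevance scores listed in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Single relevant embedding returned first (k = 1) -> NDCG of 1.0
print(ndcg([1, 0, 0]))   # 1.0
# Same embedding returned third -> lower NDCG
print(ndcg([0, 0, 1]))   # 0.5
```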
- Interpretation
  - Higher NDCG values indicate better ordering of relevant items, with higher ranked items being more relevant.
  - Lower NDCG values suggest poorly ordered results.
- Advantages
  - Accounts for both relevance and ranking.
  - Penalizes irrelevant items based on their position.
MRR (Mean Reciprocal Rank)
Average reciprocal rank of the first relevant item in the result list.
MRR is an important metric when evaluating result ranking. The score is calculated by taking the mean, across queries, of the reciprocal of the k value at which the ideal embedding appears. For example, if the ideal embedding occurs second, its reciprocal rank is 1/2 or 0.5. If instead it came first, it would be 1.0, and if it came third, it would be 1/3 or 0.33.
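A minimal sketch of that calculation, assuming each query has a known set of relevant IDs (made up here for illustration):

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is returned."""
    for rank, item in enumerate(retrieved_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results, relevant_sets):
    """Mean of the reciprocal ranks across a set of queries."""
    return sum(
        reciprocal_rank(r, rel) for r, rel in zip(results, relevant_sets)
    ) / len(results)

# Ideal embedding second, first, and third across three queries -> (0.5 + 1.0 + 0.33) / 3
print(mean_reciprocal_rank(
    [["x", "ideal"], ["ideal", "x"], ["x", "y", "ideal"]],
    [{"ideal"}, {"ideal"}, {"ideal"}],
))  # ~0.61
```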
- Interpretation
  - Higher MRR values indicate relevant items are found at higher positions on average.
  - Lower MRR values suggest relevant items are buried deeper in the list.
- Advantages
  - Simple and easy to understand.
  - Robust to changes in result list lengths.
MAP (Mean Average Precision)
MAP is an important metric when evaluating result ranking. It is calculated as the average precision across all possible cutoff points in the result list.
Low MAP scores may indicate that there are numerous very similar embeddings, difficult to distinguish from one another, that fit the type of query being used. High MAP scores indicate that the embeddings are well aligned to the query and will consistently return an acceptable answer.
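A minimal sketch of average precision and MAP under the same assumptions as the earlier examples (binary relevance, made-up IDs); this uses the standard formulation that averages precision at the ranks where relevant items appear:

```python
def average_precision(retrieved_ids, relevant_ids):
    """Average of precision@k at each rank k where a relevant item appears."""
    hits, precisions = 0, []
    for rank, item in enumerate(retrieved_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(results, relevant_sets):
    """Mean of average precision across all queries."""
    return sum(
        average_precision(r, rel) for r, rel in zip(results, relevant_sets)
    ) / len(results)

# Two relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2 ≈ 0.83
print(average_precision(["rel1", "x", "rel2", "y"], {"rel1", "rel2"}))
```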
- Interpretation
  - Higher MAP values indicate consistent precision across the entire ranking.
  - Lower MAP values suggest precision fluctuates significantly with rank.
- Advantages
  - Provides a comprehensive overview of ranking quality.
  - Less sensitive to the number of irrelevant items.
Try it ...
To run your first set of retrieval evaluations on Okareo, follow our Retrieval Testing guide.