---
title: Knowledge Base Search Methods and Parameters
description: >-
  This section covers FastGPT's knowledge base architecture, including its QA
  storage format and multi-vector mapping, to help you build better knowledge
  bases. It also explains each search parameter. This guide focuses on
  practical usage rather than in-depth theory.
---
## Understanding Vectors

FastGPT uses an Embedding-based RAG approach for its knowledge base. To use FastGPT effectively, you need a basic understanding of how `Embedding` vectors work and their characteristics.

Human text, images, videos, and other media cannot be directly understood by computers. To determine whether two pieces of text are similar or related, they typically need to be converted into a computer-readable format; vectors are one such representation.

A vector is essentially an array of numbers. The "distance" between two vectors can be calculated mathematically: the smaller the distance, the more similar the vectors. Because text, images, videos, and other media can all be mapped to vectors, this distance serves as a measure of similarity between them. Vector search leverages this principle.

Since text comes in many types with countless combinations, exact matching is hard to guarantee when converting to vectors for similarity comparison. Vector-based knowledge bases therefore typically use `top-k` recall: find the `k` most similar results and pass them to an LLM for further `semantic evaluation`, `logical reasoning`, and `summarization`, enabling knowledge base Q&A. This makes vector search the most critical step in the whole process.

Many factors affect vector search accuracy, including vector model quality, data quality (length, completeness, diversity), and retriever precision (the speed vs. accuracy tradeoff). Search query quality is equally important.

Retriever precision is relatively straightforward to address, while training vector models is complex, so optimizing data and query quality becomes the key focus.
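The distance comparison and `top-k` recall described above can be sketched in a few lines of Python. The toy three-dimensional vectors and the cosine-similarity metric are purely illustrative: real embedding models produce vectors with hundreds or thousands of dimensions, and the metric FastGPT's retriever actually uses may differ.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2):
    """Return the k entries whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

# Toy 3-dimensional "embeddings"; real models embed text into far higher dimensions.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "contact support": [0.0, 0.2, 0.9],
}
print(top_k([0.85, 0.15, 0.05], index, k=2))
```

Note that `top_k` always returns `k` results, however weak the matches are; this is exactly why the recalled chunks are handed to an LLM (or a rerank model) for further judgment.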
### Improving Vector Search Accuracy

1. Better tokenization and chunking: when a text segment is structurally and semantically complete and focused on a single topic, accuracy improves. Many systems optimize their tokenizers to preserve data completeness.
2. Streamline `index` content by reducing vector content length: shorter, more precise `index` content improves search accuracy, though it may narrow the search scope. Best suited for scenarios requiring strict answers.
3. Increase `index` quantity: add multiple `index` entries for the same `chunk` to improve recall.
4. Optimize search queries: in practice, user questions are often vague or incomplete. Refining the query (search term) can significantly improve accuracy.
5. Fine-tune vector models: off-the-shelf vector models are general-purpose and may underperform in specific domains. Fine-tuning can greatly improve domain-specific search results.
## FastGPT Knowledge Base Architecture

### Data Storage Structure

In FastGPT, a knowledge base consists of three parts: libraries, collections, and data entries. A collection can be thought of as a "file." A library can contain multiple collections, and a collection can contain multiple data entries. The smallest searchable unit is the library, so searches span the entire library. Collections exist only to organize and manage data and do not affect search results (at least for now).

![]()
### Vector Storage Structure

FastGPT uses `PostgreSQL`'s `PG Vector` extension as the vector retriever, with `HNSW` indexing. `PostgreSQL` is used solely for vector search (this engine can be swapped for other databases), while `MongoDB` handles all other data storage.

In `MongoDB`'s `dataset.datas` collection, the source data for each vector is stored along with an `indexes` field that records the corresponding vector IDs. Because this field is an array, a single data entry can map to multiple vectors.

In `PostgreSQL`, a `vector` field stores the vectors. During search, vectors are recalled first, then their IDs are used to look up the original data in `MongoDB`. If multiple vectors map to the same source data, the results are merged and the highest vector score is kept.

![]()
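The merge step described above (collapsing multiple recalled vectors onto their shared source entry, keeping the highest score) can be sketched as follows. The field names `data_id` and `score` are illustrative stand-ins, not FastGPT's actual schema.

```python
from collections import defaultdict

# Each hit pairs a recalled vector's score with the id of its source data
# entry in MongoDB. Two hits below share a source ("a"): that entry has
# multiple vectors, and only its best score should survive the merge.
hits = [
    {"data_id": "a", "score": 0.82},
    {"data_id": "b", "score": 0.74},
    {"data_id": "a", "score": 0.91},  # second vector of the same entry
]

def merge_by_source(hits: list[dict]) -> list[dict]:
    """Collapse vector hits onto their source entries, keeping the highest score."""
    best: dict[str, float] = defaultdict(float)
    for h in hits:
        best[h["data_id"]] = max(best[h["data_id"]], h["score"])
    return sorted(
        ({"data_id": d, "score": s} for d, s in best.items()),
        key=lambda h: h["score"],
        reverse=True,
    )

print(merge_by_source(hits))
```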
### Purpose and Usage of Multi-Vector Mapping

In a single vector, content length and semantic richness are often at odds. FastGPT uses multi-vector mapping to associate a single data entry with multiple vectors, preserving both data completeness and semantic richness.

You can add multiple vectors to a longer text so that if any one of them matches during search, the entire data entry is recalled.

This also means you can continuously improve the accuracy of a data chunk through annotation.
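Conceptually, a multi-indexed data entry looks like the sketch below: one chunk, several short index strings, each embedded into its own vector. The structure and strings are illustrative, not FastGPT's actual MongoDB schema.

```python
# One data entry with several indexes. Each index string would be embedded
# separately; matching ANY of the resulting vectors recalls the whole entry.
entry = {
    "q": "How do I reset my password?",
    "a": "Open Settings -> Account -> Reset Password and follow the prompts.",
    "indexes": [
        "How do I reset my password?",   # default index: the question itself
        "forgot password recovery",      # manually added index
        "change account credentials",    # annotation added after a missed search
    ],
}

print(f"{len(entry['indexes'])} vectors map to one entry")
```

Adding another string to `indexes` after a missed search is the annotation workflow mentioned above: it widens recall without touching the stored answer.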
### Search Pipeline

1. Use `Query Optimization` for coreference resolution and query expansion, improving search in multi-turn conversations and enriching query semantics.
2. Use `Concat Query` to improve `Rerank` accuracy during multi-turn conversations.
3. Use `RRF` (Reciprocal Rank Fusion) to merge results from multiple search channels.
4. Use `Rerank` for secondary sorting to improve precision.

![]()
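The `RRF` merge in step 3 can be sketched as below. `k = 60` is the constant commonly used in Reciprocal Rank Fusion; FastGPT's exact constant and any per-channel weighting may differ.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over channels of 1 / (k + rank(d)).

    Documents ranked well by several channels accumulate a higher fused score
    than documents ranked well by only one.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]   # vector search ranking
fulltext = ["doc_b", "doc_d", "doc_a"]   # full-text search ranking
print(rrf([semantic, fulltext]))
```

Note that RRF only looks at ranks, never at the raw scores, which is what lets it combine channels whose scores live on incompatible scales (vector distance vs. full-text relevance).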
## Search Parameters

|  |  |  |
| --- | --- | --- |
|  | ![]() | ![]() |
### Search Modes

#### Semantic Search

Semantic search calculates the vector distance between the user's query and knowledge base content to determine "similarity" in the mathematical sense, not the linguistic one.

Pros:

- Understands similar semantics
- Cross-language understanding (e.g., a Chinese query matching English content)
- Multimodal understanding (text, images, audio/video, etc.)

Cons:

- Depends on the training quality of the model
- Inconsistent accuracy
- Sensitive to keywords and sentence completeness
#### Full-Text Search

Uses traditional full-text search. Best for matching specific terms such as key subjects and predicates.
#### Hybrid Search

Combines vector search and full-text search, merging the results with the RRF formula. It generally produces richer and more accurate results.

Because hybrid search casts a wide net and cannot filter directly by similarity, a rerank model is typically used to re-sort the results and filter them by rerank score.
#### Result Reranking

Uses a `ReRank` model to re-sort search results, which in most cases significantly improves accuracy. Rerank models work best with complete questions (with proper subjects and predicates), so query optimization is usually applied before search and reranking. Reranking produces a score between `0-1` representing the relevance between the search content and the query. This score is typically more accurate than vector similarity scores and can be used for filtering.

FastGPT uses `RRF` to merge the rerank results, vector search results, and full-text search results into the final output.
### Search Filters

#### Reference Limit

The maximum number of `tokens` to reference per search.

Instead of using `top k`, we found that in mixed knowledge bases (Q&A plus documents), `chunk` lengths vary significantly, which makes `top k` results unstable. A `token` limit provides more consistent control.
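The token-budget cutoff described above can be sketched as follows. The token counts are illustrative; a real system would count tokens with the target model's tokenizer rather than hard-code them.

```python
# Results as they come back from search, highest score not necessarily first.
results = [
    {"text": "short QA pair", "tokens": 40, "score": 0.92},
    {"text": "long document chunk", "tokens": 900, "score": 0.88},
    {"text": "medium chunk", "tokens": 300, "score": 0.85},
]

def take_within_budget(results: list[dict], max_tokens: int) -> list[dict]:
    """Keep the highest-scoring results until the token budget is exhausted."""
    picked, used = [], 0
    for r in sorted(results, key=lambda r: r["score"], reverse=True):
        if used + r["tokens"] > max_tokens:
            break
        picked.append(r)
        used += r["tokens"]
    return picked

print([r["text"] for r in take_within_budget(results, max_tokens=1000)])
```

A design choice worth noting: the sketch stops at the first result that would overflow the budget, preserving strict rank order. A variant could `continue` instead of `break` to pack in smaller, lower-ranked chunks at the cost of skipping over a better-ranked one.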
#### Minimum Relevance

A value between `0-1` that filters out low-relevance search results.

This only takes effect when using `Semantic Search` or `Result Reranking`.
### Query Optimization

#### Background

In RAG, we perform an embedding search against the database using the input query to find similar content (i.e., knowledge base search).

During search, especially in multi-turn conversations, follow-up questions often fail to find relevant content because the knowledge base search only uses the "current" question. Consider this example:

![]()

When the user asks "What's the second point?", the system searches the knowledge base for "What's the second point?", which returns nothing useful. The actual query should be "What is the QA structure?". This is why a Query Optimization module is needed to complete the user's current question, so the knowledge base search can find relevant content. Here's the result after optimization:

![]()
#### How It Works

Before performing `data retrieval`, the model first performs `coreference resolution` and `query expansion`. This resolves ambiguous references and enriches the query's semantics. You can view the optimized query in the conversation details after each interaction.
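The input/output contract of this step can be sketched as below. In a real pipeline, `rewrite_query` would call an LLM with the chat history; here it is a hypothetical stub whose hand-written rewrite only illustrates the expected shape, not FastGPT's actual prompt or API.

```python
# Chat history leading up to an ambiguous follow-up question.
history = [
    ("user", "What is the QA structure?"),
    ("assistant", "It stores data as question-answer pairs: 1) ... 2) ..."),
]

def rewrite_query(history: list[tuple[str, str]], question: str) -> str:
    """Resolve references like "the second point" against the chat history.

    Stubbed: returns a hand-written rewrite. A production version would
    prompt an LLM with the history and the current question instead.
    """
    if "second point" in question:
        return "What is the second point of the QA structure?"
    return question

print(rewrite_query(history, "What's the second point?"))
```

The rewritten query is what actually gets embedded and searched, which is why a vague follow-up can still recall the right chunks after this step.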