New document (#5299)

* add new doc (#5175) Co-authored-by: dreamer6680 <146868355@qq.com> * Test docs (#5235) * fix: change the page of doc * chore: add new dependencies, update global styles/layout, optimize docs, add Feishu & GitHub icons, update API examples * fix: docs/index 404 not found * Update environment variable names, optimize styles, add new API routes, fix component styles, adjust documentation, and update GitHub and Feishu icons * update readme * feat: add a linkfastgpt compontent * feat: update new doc * fix:remove unuse page and redirect homepage to docs (#5288) * fix：remove some unuse doc * fix: redirect homepage to doc * git ignore * fix:navbar to index (#5295) * sidbar * fix: navtab unlight (#5298) * doc --------- Co-authored-by: dreamer6680 <1468683855@qq.com> Co-authored-by: dreamer6680 <146868355@qq.com>
2025-10-22 20:04:01 +00:00 · 2025-07-23 21:35:03 +08:00
parent ce9ec1bf57
commit fe7abf22a9
895 changed files with 36297 additions and 56 deletions
--- a/document/content/docs/introduction/development/design/dataset.mdx
+++ b/document/content/docs/introduction/development/design/dataset.mdx
@@ -0,0 +1,21 @@
+---
+title: Dataset
+description: How FastGPT datasets are designed to relate files and data
+---
+
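+## Relationship between files and data
+
+In FastGPT, files are stored in MongoDB's FS, while the data itself is stored in PostgreSQL. Each data row in PG has a file_id column that links it to its file. To stay compatible with older versions, and to cover manually entered and annotated data, file_id also accepts a few special values:
+
+- manual: manually entered data
+- mark: manually annotated data
+
+Note that file_id is written only when a row is inserted; it cannot be changed by later updates.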
+
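+## File import flow
+
+1. Upload the file to MongoDB's FS and obtain a file_id. At this point the file is marked as `unused`.
+2. The browser parses the file and extracts its text and chunks.
+3. Tag each chunk with the file_id.
+4. Click "Upload data": the file's status changes to `used`, and the data is pushed to the `training` collection in MongoDB to await training.
+5. The training thread takes the data from MongoDB, obtains the vectors, and inserts the data into PG.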
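
The five import steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not FastGPT's actual code: `ImportPipeline`, `isSpecialFileId`, the fixed-size chunking, and the `embed` callback are all hypothetical names and simplifications, and plain maps and arrays stand in for MongoDB's FS, the `training` collection, and the PG table.

```typescript
// Hypothetical in-memory model of the import flow; not FastGPT's real code.

type FileStatus = "unused" | "used";

// Special file_id values named in the document.
type SpecialFileId = "manual" | "mark";

function isSpecialFileId(id: string): id is SpecialFileId {
  return id === "manual" || id === "mark";
}

interface Chunk {
  fileId: string; // written once at insert time, never updated
  text: string;
}

interface VectorRow extends Chunk {
  vector: number[];
}

class ImportPipeline {
  private files = new Map<string, FileStatus>(); // stands in for MongoDB's FS
  private trainingQueue: Chunk[] = []; // stands in for the `training` collection
  private pgRows: VectorRow[] = []; // stands in for the PG table
  private nextId = 1;

  // Step 1: store the file and mark it `unused`.
  uploadFile(): string {
    const fileId = `file-${this.nextId++}`;
    this.files.set(fileId, "unused");
    return fileId;
  }

  // Steps 2-3: split the text into chunks and tag each with the file_id.
  // Fixed-size splitting is an assumption made for brevity.
  static chunkText(fileId: string, text: string, size: number): Chunk[] {
    const chunks: Chunk[] = [];
    for (let i = 0; i < text.length; i += size) {
      chunks.push({ fileId, text: text.slice(i, i + size) });
    }
    return chunks;
  }

  // Step 4: mark the file `used` and queue its chunks for training.
  submit(fileId: string, chunks: Chunk[]): void {
    if (!this.files.has(fileId)) {
      throw new Error(`unknown file: ${fileId}`);
    }
    this.files.set(fileId, "used");
    this.trainingQueue.push(...chunks);
  }

  // Step 5: drain the queue, compute a vector per chunk, insert the rows.
  train(embed: (text: string) => number[]): void {
    while (this.trainingQueue.length > 0) {
      const chunk = this.trainingQueue.shift()!;
      this.pgRows.push({ ...chunk, vector: embed(chunk.text) });
    }
  }

  statusOf(fileId: string): FileStatus | undefined {
    return this.files.get(fileId);
  }

  rows(): readonly VectorRow[] {
    return this.pgRows;
  }
}
```

In this sketch a file stays `unused` until `submit` is called, so files that were uploaded but never submitted remain easy to identify by status.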