---
title: Dataset Design
description: FastGPT dataset file and data design
---
## Relationship Between Files and Data
In FastGPT, files are stored using MongoDB's GridFS, while the actual data is stored in PostgreSQL. Each row in PG has a `file_id` column that references the corresponding file. For backward compatibility and to support manual input and annotated data, `file_id` has some special values:
- `manual`: Manually entered data
- `mark`: Manually annotated data
Note: `file_id` is only written at data insertion time and cannot be modified afterward.
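When reading rows back out of PG, the special values can be handled with a simple lookup before resolving `file_id` against GridFS. A minimal sketch in Python (the helper and constant names are illustrative, not FastGPT's actual code):

```python
# Illustrative only: FastGPT does not expose this exact helper.
SPECIAL_FILE_IDS = {
    "manual": "manually entered data",
    "mark": "manually annotated data",
}

def classify_row(file_id: str) -> str:
    """Map a PG row's file_id to its data source."""
    # Any value that is not a special marker is a reference to a GridFS file.
    return SPECIAL_FILE_IDS.get(file_id, "imported from file")
```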
## File Import Process
1. Upload the file to MongoDB GridFS and obtain a `file_id`. The file is marked as `unused` at this point.
2. The browser parses the file to extract text and splits it into chunks.
3. Each chunk is tagged with the `file_id`.
4. When the user clicks upload, the file status changes to `used`, and the data is pushed to the MongoDB `training` collection to await processing.
5. The training thread pulls data from MongoDB, generates vectors, and inserts them into PG.
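The steps above can be sketched end to end. The sketch below is a minimal simulation assuming an in-memory stand-in for GridFS and the training queue; none of the names match FastGPT's actual code, and the asynchronous training step (vector generation and PG insertion) is omitted:

```python
# Minimal simulation of the import flow; all names are illustrative.
class FakeGridFS:
    """In-memory stand-in for MongoDB GridFS."""
    def __init__(self):
        self.files = {}
        self._counter = 0

    def put(self, data, status):
        self._counter += 1
        file_id = f"file-{self._counter}"
        self.files[file_id] = {"data": data, "status": status}
        return file_id

    def set_status(self, file_id, status):
        self.files[file_id]["status"] = status


def import_file(gridfs, training_queue, raw_text, chunk_size=512):
    # Step 1: store the file; it starts out "unused".
    file_id = gridfs.put(raw_text, status="unused")
    # Steps 2-3: split the text into chunks, each tagged with the file_id.
    chunks = [raw_text[i:i + chunk_size] for i in range(0, len(raw_text), chunk_size)]
    records = [{"file_id": file_id, "text": chunk} for chunk in chunks]
    # Step 4: on upload, mark the file "used" and enqueue the chunks for training.
    gridfs.set_status(file_id, "used")
    training_queue.extend(records)
    # Step 5 (vectorization and insertion into PG) runs asynchronously
    # in the training thread and is not simulated here.
    return file_id
```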