FastGPT/packages/service/core/dataset/read.ts
Archer 76d6234de6 V4.14.7 features (#6406)
* Agent features (#6345)

* Test agent (#6220)

* squash: compress all commits into one

* feat: plan response in ui

* response ui

* perf: agent config

* merge

* tool select ux

* perf: chat ui

* perf: agent editform

* tmp code

* feat: save chat

* Complete agent parent  (#6049)

* add role and tools filling

* add: file-upload

---------

Co-authored-by: xxyyh <2289112474@qq>

* perf: top agent code

* top agent (#6062)

Co-authored-by: xxyyh <2289112474@qq>

* fix: ts

* skill editor ui

* ui

* perf: rewrite type with zod

* skill edit ui

* skill agent (#6089)

* cp skill chat

* rebase fdf933d
 and add skill chat

* 1. CRUD for skills
2. Render skill information in the front-end UI

* solve comment

* remove chatid and chatItemId

* skill match

* perf: skill manage

* fix: ts

---------

Co-authored-by: xxyyh <2289112474@qq>
Co-authored-by: archer <545436317@qq.com>

* fix: ts

* fix: loop import

* skill tool config (#6114)

Co-authored-by: xxyyh <2289112474@qq>

* feat: load tool in agent

* skill memory (#6126)

Co-authored-by: xxyyh <2289112474@qq>

* perf: agent skill editor

* perf: helperbot ui

* agent code

* perf: context

* fix: request context

* agent usage

* perf: agent context and pause

* perf: plan response

* Test agent sigle skill (#6184)

* feat:top box fill

* prompt fix

---------

Co-authored-by: xxyyh <2289112474@qq>

* perf: agent chat ui

* Test agent new (#6219)

* have-replan

* agent

---------

Co-authored-by: xxyyh <2289112474@qq>

* fix: ts

---------

Co-authored-by: YeYuheng <57035043+YYH211@users.noreply.github.com>
Co-authored-by: xxyyh <2289112474@qq>

* feat: consolidate agent and MCP improvements

This commit consolidates 17 commits including:
- MCP tools enhancements and fixes
- Agent system improvements and optimizations
- Auth limit and prompt updates
- Tool response compression and error tracking
- Simple app adaptation
- Code quality improvements (TypeScript, ESLint, Zod)
- Version type migration to schema
- Remove deprecated useRequest2
- Add LLM error tracking
- Toolset ID validation fixes

---------

Co-authored-by: YeYuheng <57035043+YYH211@users.noreply.github.com>
Co-authored-by: xxyyh <2289112474@qq>

* fix: transform avatar copy;perf: filter invalid tool

* update llm response storage time

* fix: openapi schema

* update skill desc

* feat: cache hit data

* i18n

* lock

* chat logs support error filter & user search (#6373)

* chat log support searching by user name

* support error filter

* fix

* fix overflow

* optimize

* fix init script

* fix

* perf: get log users

* updat ecomment

* fix: ts

* fix: test

---------

Co-authored-by: archer <545436317@qq.com>

* Fix: agent  (#6376)

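* 1. Add the system prompt from the assisted-generation front end to the context
2. Front-end rendering of MCP tools (icons)
3. Link the file-read tool with file upload
4. Add a retry strategy for when assisted generation returns a malformed response
5. ask no longer appears in plan steps
6. Add avatar and interaction UI for assisted generation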

* fix:read_file

* helperbot ui

* ts error

* helper ui

* delete Unused import

* perf: helper bot

* lock

---------

Co-authored-by: Archer <545436317@qq.com>
Co-authored-by: xxyyh <2289112474@qq>

* fix date variable required & model auth (#6386)

* fix date variable required & model auth

* doc

* feat: add chat id to finish callback

* fix: iphone safari shareId (#6387)

* fix: iphone safari shareId

* fix: mcp file list can't setting

* fix: reason output field

* fix: skip JSON validation for HTTP tool body with variable (#6392)

* fix: skip JSON validation for HTTP tool body with variable

* doc

* workflow fitview

* perf: selecting memory

* perf: cp api

* ui

* perf: toolcall auto adapt

* fix: catch workflow error

* fix: ts

* perf: pagination type

* remove

* ignore

* update doc

* fix: simple app tool select

* add default avatar to logs user

* perf: loading user

* select dataset ui

* rename version

* feat: add global/common test

* perf: packages/global/common test

* feat: package/global/ai,app test

* add global/chat test

* global/core test

* global/core test

* feat: packages/global all test

* perf: test

* add server api test

* perf: init shell

* perf: init4150 shell

* remove invalid code

* update doc

* remove log

* fix: chat effect

* fix: plan fake tool  (#6398)

* 1. Prompt-injection protection
2. Skip plan when no tools are available, to prevent fake tools from being generated

* Agent-dataset

* dataset

* dataset presetInfo

* prefix

* perf: prompt

---------

Co-authored-by: xxyyh <2289112474@qq>
Co-authored-by: archer <545436317@qq.com>

* fix: review

* adapt kimi2.5 think toolcall

* feat: invoke fastgpt user info (#6403)

feat: invoke fastgpt user info

* fix: invoke fastgpt user info return orgs (#6404)

* skill and version

* retry helperbot (#6405)

Co-authored-by: xxyyh <2289112474@qq>

* update template

* remove log

* doc

* update doc

* doc

* perf: internal ip check

* adapt get paginationRecords

* tool call adapt

* fix: test

* doc

* fix: agent initial version

* adapt completions v1

* feat: instrumentation check

* rename skill

* add workflow demo mode tracks (#6407)

* chore: unify the skills directory name to lowercase

Rename .claude/Skills/ to .claude/skills/ to keep naming consistent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* add workflow demo mode tracks

* code

* optimize

* fix: improve workflowDemoTrack based on PR review

- Add comment to empty catch block for maintainability
- Add @param docs to onDemoChange clarifying nodeCount usage
- Replace silent .catch with console.debug for dev debugging
- Handle appId changes by reporting old data before re-init

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: archer <545436317@qq.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* remove repeat skill

* fix(workflow): filter out orphan edges to prevent runtime errors (#6399)

* fix(workflow): filter out orphan edges to prevent runtime errors

Runtime edges that reference non-existent nodes (orphan edges) can cause
unexpected behavior or crashes during workflow dispatch. This change adds
a pre-check to filter out such edges before execution begins, ensuring
system stability even with inconsistent graph data.

* fix(workflow): enhance orphan edge filtering with logging and tests

- Refactor: Extract logic to 'filterOrphanEdges' in utils.ts for better reusability
- Feat: Add performance monitoring (warn if >100ms) and comprehensive logging
- Feat: Support detailed edge inspection in debug mode
- Docs: Add JSDoc explaining causes of orphan edges (migration, manual edits)
- Test: Add unit tests covering edge cases and performance (1000 edges)

Addresses PR review feedback regarding logging, variable naming, and testing."

* move code

* move code

* add more unit test

---------

Co-authored-by: archer <545436317@qq.com>

* test

* perf: test

* add server/common/string test

* fix: resolve $ref references in MCP tool input schemas (#6395) (#6409)

* fix: resolve $ref references in MCP tool input schemas (#6395)

* add test code

---------

Co-authored-by: archer <545436317@qq.com>

* chore(docs): add fastgpt, fastgpt-plugin version choice guide (#6411)

* chore(doc): add fastgpt version description

* doc

* doc

---------

Co-authored-by: archer <545436317@qq.com>

* fix:dataset cite and description info (#6410)

* 1. Add dataset citations (plan steps and direct dataset calls)
2. @dataset tool in the prompt box
3. Change the dataset_search step description in plan to Chinese

* fix: i18n

* prompt

* prompt

---------

Co-authored-by: xxyyh <2289112474@qq>

* fix: tool call

* perf: workflow props

* fix: merge ECharts toolbox options instead of overwriting (#6269) (#6412)

* feat: integrate logtape and otel (#6400)

* fix: deps

* feat(logger): integrate logtape and otel

* wip(log): add basic infras logs

* wip(log): add request id and inject it into context

* wip(log): add basic tx logs

* wip(log): migrate

* wip(log): category

* wip(log): more sub category

* fix: type

* fix: sessionRun

* fix: export getLogger from client.ts

* chore: improve logs

* docs: update signoz and changelog

* change type

* fix: ts

* remove skill.md

* fix: lockfile specifier

* fix: test

---------

Co-authored-by: archer <545436317@qq.com>

* init log

* doc

* remove invalid log

* fix: review

* template

* replace new log

* fix: ts

* remove log

* chore: migrate all addLog to logtape

* move skill

* chore: migrate all addLog to logtape (#6417)

* update skill

* remove log

* fix: tool check

---------

Co-authored-by: YeYuheng <57035043+YYH211@users.noreply.github.com>
Co-authored-by: xxyyh <2289112474@qq>
Co-authored-by: heheer <heheer@sealos.io>
Co-authored-by: Finley Ge <32237950+FinleyGe@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: xuyafei1996 <54217479+xuyafei1996@users.noreply.github.com>
Co-authored-by: ToukoYui <2331631097@qq.com>
Co-authored-by: roy <whoeverimf5@gmail.com>
2026-02-12 16:37:50 +08:00

364 lines
9.1 KiB
TypeScript

import {
  ChunkTriggerConfigTypeEnum,
  DatasetSourceReadTypeEnum
} from '@fastgpt/global/core/dataset/constants';
import { urlsFetch } from '../../common/string/cheerio';
import { type TextSplitProps } from '@fastgpt/global/common/string/textSplitter';
import { axios } from '../../common/api/axios';
import { readFileContentByBuffer } from '../../common/file/read/utils';
import { parseFileExtensionFromUrl } from '@fastgpt/global/common/string/tools';
import { getApiDatasetRequest } from './apiDataset';
import Papa from 'papaparse';
import type { ApiDatasetServerType } from '@fastgpt/global/core/dataset/apiDataset/type';
import { text2Chunks } from '../../worker/function';
import { retryFn } from '@fastgpt/global/common/system/utils';
import { getFileMaxSize } from '../../common/file/utils';
import { UserError } from '@fastgpt/global/common/error/utils';
import { getS3DatasetSource, S3DatasetSource } from '../../common/s3/sources/dataset';
import { getFileS3Key, isS3ObjectKey } from '../../common/s3/utils';
import { getLogger, LogCategories } from '../../common/logger';

const logger = getLogger(LogCategories.MODULE.DATASET.FILE);

export const readFileRawTextByUrl = async ({
  teamId,
  tmbId,
  url,
  customPdfParse,
  getFormatText,
  relatedId,
  datasetId,
  maxFileSize = getFileMaxSize()
}: {
  teamId: string;
  tmbId: string;
  url: string;
  customPdfParse?: boolean;
  getFormatText?: boolean;
  relatedId: string; // externalFileId / apiFileId
  datasetId: string;
  maxFileSize?: number;
}) => {
  const extension = parseFileExtensionFromUrl(url);

  // Check file size with a HEAD request before downloading
  try {
    const headResponse = await axios.head(url, { timeout: 10000 });
    const contentLength = parseInt(headResponse.headers['content-length'] || '0', 10);

    if (contentLength > 0 && contentLength > maxFileSize) {
      return Promise.reject(
        `File too large. Size: ${Math.round(contentLength / 1024 / 1024)}MB, Maximum allowed: ${Math.round(maxFileSize / 1024 / 1024)}MB`
      );
    }
  } catch (error) {
    logger.warn('File HEAD request failed, skip size precheck', { url, error });
  }

  // Use stream response type, avoid double memory usage
  const response = await axios({
    method: 'get',
    url: url,
    responseType: 'stream',
    maxContentLength: maxFileSize,
    timeout: 30000
  });

  // Optimization: convert the stream directly into a buffer, avoiding the intermediate arraybuffer step
  const chunks: Buffer[] = [];
  let totalLength = 0;

  return new Promise<{ rawText: string }>((resolve, reject) => {
    let isAborted = false;

    const cleanup = () => {
      if (!isAborted) {
        isAborted = true;
        chunks.length = 0; // Free memory
        response.data.destroy();
      }
    };

    // Stream timeout (600000 ms = 10 minutes)
    const timeoutId = setTimeout(() => {
      cleanup();
      reject('File download timeout after 10 minutes');
    }, 600000);

    response.data.on('data', (chunk: Buffer) => {
      if (isAborted) return;

      totalLength += chunk.length;
      if (totalLength > maxFileSize) {
        clearTimeout(timeoutId);
        cleanup();
        return reject(
          `File too large. Maximum size allowed is ${Math.round(maxFileSize / 1024 / 1024)}MB.`
        );
      }

      chunks.push(chunk);
    });

    response.data.on('end', async () => {
      if (isAborted) return;

      clearTimeout(timeoutId);

      try {
        // Merge all chunks into a single buffer
        const buffer = Buffer.concat(chunks);

        // Clear the chunks array immediately to free memory
        chunks.length = 0;

        const { rawText } = await retryFn(() => {
          const { fileParsedPrefix } = getFileS3Key.dataset({
            datasetId,
            filename: 'file'
          });
          return readFileContentByBuffer({
            customPdfParse,
            getFormatText,
            extension,
            teamId,
            tmbId,
            buffer,
            encoding: 'utf-8',
            imageKeyOptions: {
              // TODO: images parsed from links never expire; they also need to be deleted when the dataset is deleted
              prefix: fileParsedPrefix
            }
          });
        });

        resolve({ rawText });
      } catch (error) {
        cleanup();
        reject(error);
      }
    });

    response.data.on('error', (error: Error) => {
      clearTimeout(timeoutId);
      cleanup();
      reject(error);
    });

    response.data.on('close', () => {
      clearTimeout(timeoutId);
      cleanup();
    });
  });
};

/*
  fileLocal - uploaded file, read from S3
  link - fetch the page content by URL
  externalFile/apiFile - read through a request
*/
export const readDatasetSourceRawText = async ({
  teamId,
  tmbId,
  type,
  sourceId,
  selector,
  externalFileId,
  apiDatasetServer,
  customPdfParse,
  getFormatText,
  usageId,
  datasetId
}: {
  teamId: string;
  tmbId: string;
  type: DatasetSourceReadTypeEnum;
  sourceId: string;
  customPdfParse?: boolean;
  getFormatText?: boolean;

  selector?: string; // link selector
  externalFileId?: string; // external file dataset
  apiDatasetServer?: ApiDatasetServerType; // api dataset
  usageId?: string;
  datasetId: string; // For S3 image upload
}): Promise<{
  title?: string;
  rawText: string;
}> => {
  if (type === DatasetSourceReadTypeEnum.fileLocal) {
    if (!datasetId || !isS3ObjectKey(sourceId, 'dataset')) {
      return Promise.reject('datasetId is required for S3 files');
    }

    const { filename, rawText } = await getS3DatasetSource().getDatasetFileRawText({
      teamId,
      tmbId,
      fileId: sourceId,
      getFormatText,
      customPdfParse,
      usageId,
      datasetId
    });

    return {
      title: filename,
      rawText
    };
  } else if (type === DatasetSourceReadTypeEnum.link) {
    const result = await urlsFetch({
      urlList: [sourceId],
      selector
    });

    const { title = sourceId, content = '' } = result[0];
    if (!content || content === 'Cannot fetch internal url') {
      return Promise.reject(content || 'Can not fetch content from link');
    }

    return {
      title,
      rawText: content
    };
  } else if (type === DatasetSourceReadTypeEnum.externalFile) {
    if (!externalFileId) return Promise.reject(new UserError('FileId not found'));

    const { rawText } = await readFileRawTextByUrl({
      teamId,
      tmbId,
      url: sourceId,
      relatedId: externalFileId,
      datasetId,
      customPdfParse
    });

    return {
      rawText
    };
  } else if (type === DatasetSourceReadTypeEnum.apiFile) {
    const { title, rawText } = await readApiServerFileContent({
      apiDatasetServer,
      apiFileId: sourceId,
      teamId,
      tmbId,
      customPdfParse,
      datasetId
    });

    return {
      title,
      rawText
    };
  }

  return {
    title: '',
    rawText: ''
  };
};

export const readApiServerFileContent = async ({
  apiDatasetServer,
  apiFileId,
  teamId,
  tmbId,
  customPdfParse,
  datasetId
}: {
  apiDatasetServer?: ApiDatasetServerType;
  apiFileId: string;
  teamId: string;
  tmbId: string;
  customPdfParse?: boolean;
  datasetId: string;
}): Promise<{
  title?: string;
  rawText: string;
}> => {
  return (await getApiDatasetRequest(apiDatasetServer)).getFileContent({
    teamId,
    tmbId,
    apiFileId,
    customPdfParse,
    datasetId
  });
};

export const rawText2Chunks = async ({
  rawText = '',
  chunkTriggerType = ChunkTriggerConfigTypeEnum.minSize,
  chunkTriggerMinSize = 1000,
  backupParse,
  chunkSize = 512,
  imageIdList,
  ...splitProps
}: {
  rawText: string;
  imageIdList?: string[];

  chunkTriggerType?: ChunkTriggerConfigTypeEnum;
  chunkTriggerMinSize?: number; // maxSize from agent model, not store

  backupParse?: boolean;
  tableParse?: boolean;
} & TextSplitProps): Promise<
  {
    q: string;
    a: string;
    indexes?: string[];
    imageIdList?: string[];
  }[]
> => {
  // Parse a dataset backup CSV: column 0 is q, column 1 is a, the rest are indexes
  const parseDatasetBackup2Chunks = (rawText: string) => {
    const csvArr = Papa.parse(rawText).data as string[][];

    const chunks = csvArr
      .slice(1) // skip the header row
      .map((item) => ({
        q: item[0] || '',
        a: item[1] || '',
        indexes: item.slice(2).filter((index) => index.trim()),
        imageIdList
      }))
      .filter((item) => item.q || item.a);

    return {
      chunks
    };
  };

  if (backupParse) {
    return parseDatasetBackup2Chunks(rawText).chunks;
  }

  // Chunk condition
  // 1. Max-size trigger: chunking is triggered only when the text exceeds the maximum (defaults to the model's max size * 0.7)
  if (chunkTriggerType === ChunkTriggerConfigTypeEnum.maxSize) {
    const textLength = rawText.trim().length;
    const maxSize = splitProps.maxSize ? splitProps.maxSize * 0.7 : 16000;
    if (textLength < maxSize) {
      return [
        {
          q: rawText,
          a: '',
          imageIdList
        }
      ];
    }
  }

  // 2. Min-size trigger: chunking is triggered only when the text exceeds the minimum (set manually)
  if (chunkTriggerType !== ChunkTriggerConfigTypeEnum.forceChunk) {
    const textLength = rawText.trim().length;
    if (textLength < chunkTriggerMinSize) {
      return [{ q: rawText, a: '', imageIdList }];
    }
  }

  const { chunks } = await text2Chunks({
    text: rawText,
    chunkSize,
    ...splitProps
  });

  return chunks.map((item) => ({
    q: item,
    a: '',
    indexes: [],
    imageIdList
  }));
};
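
The chunk-trigger conditions in rawText2Chunks can be read as a single yes/no decision: should the text go to the splitter, or be stored as one chunk? The sketch below isolates that decision as a pure function so it can be checked without the worker or the CSV path. It is an illustration, not code from the repository: `shouldSplit`, `ChunkTrigger`, and `TriggerInput` are hypothetical names, and the string literals stand in for `ChunkTriggerConfigTypeEnum`, whose actual values are not shown in this file.

```typescript
// Hypothetical stand-in for ChunkTriggerConfigTypeEnum (actual enum values are an assumption).
type ChunkTrigger = 'minSize' | 'maxSize' | 'forceChunk';

type TriggerInput = {
  rawText: string;
  chunkTriggerType?: ChunkTrigger;
  chunkTriggerMinSize?: number;
  maxSize?: number; // model max size; optional, as in splitProps.maxSize
};

// Returns true when the text should be passed to the splitter,
// false when it should be kept as a single chunk.
function shouldSplit({
  rawText,
  chunkTriggerType = 'minSize',
  chunkTriggerMinSize = 1000,
  maxSize
}: TriggerInput): boolean {
  const textLength = rawText.trim().length;

  // 1. Max-size trigger: keep as one chunk below 70% of the model max (16000 if unknown).
  if (chunkTriggerType === 'maxSize') {
    const threshold = maxSize ? maxSize * 0.7 : 16000;
    if (textLength < threshold) return false;
  }

  // 2. Every mode except forceChunk also keeps text shorter than the minimum as one chunk.
  if (chunkTriggerType !== 'forceChunk') {
    if (textLength < chunkTriggerMinSize) return false;
  }

  return true;
}

// Usage: a 500-character text stays whole by default, but is split when forced.
const shortText = 'a'.repeat(500);
console.log(shouldSplit({ rawText: shortText }));
console.log(shouldSplit({ rawText: shortText, chunkTriggerType: 'forceChunk' }));
```

Note that the min-size check also runs in the max-size mode, mirroring the order of the two `if` blocks above, so a text that passes the max-size threshold can still be kept whole if `chunkTriggerMinSize` is set higher than that threshold.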