diff --git a/document/content/docs/introduction/development/custom-models/meta.json b/document/content/docs/introduction/development/custom-models/meta.json
index 5bf26cd7d..1746a2b77 100644
--- a/document/content/docs/introduction/development/custom-models/meta.json
+++ b/document/content/docs/introduction/development/custom-models/meta.json
@@ -1,4 +1,4 @@
 {
   "title": "Local Model Usage",
-  "pages": ["marker","xinference","bge-rerank","chatglm2","m3e","chatglm2-m3e","ollama"]
+  "pages": ["marker","mineru","xinference","bge-rerank","chatglm2","m3e","chatglm2-m3e","ollama"]
 }
\ No newline at end of file
diff --git a/document/content/docs/introduction/development/custom-models/mineru.mdx b/document/content/docs/introduction/development/custom-models/mineru.mdx
new file mode 100644
index 000000000..773a24de1
--- /dev/null
+++ b/document/content/docs/introduction/development/custom-models/mineru.mdx
@@ -0,0 +1,83 @@
+---
+title: Integrating MinerU for PDF Document Parsing
+description: Use MinerU to parse PDF documents, with support for image extraction, layout recognition, table recognition, and formula recognition
+---
+
+## Background
+
+PDF is a relatively complex file format. FastGPT's built-in PDF parser relies on the pdfjs library, which parses documents with rule-based logic and cannot effectively understand complex PDF files. As a result, parsing quality is poor when a PDF contains content other than plain text, such as images, tables, or formulas.
+
+Several approaches to PDF parsing are available today. One is [MinerU](https://github.com/opendatalab/MinerU), which uses models such as YOLO, PaddleOCR, and table recognition to parse documents visually, and can effectively extract complex content such as images, tables, and formulas.
+
+Community edition users can add the `systemEnv.customPdfParse` setting to `config.json` to parse PDF files with MinerU. Commercial edition users can simply fill in the form in the Admin console. The tutorial below explains both in detail.
+
+## Tutorial
+
+Hardware requirements: a GPU inference card with 16 GB+ of VRAM; at least 16 GB of system memory, with 32 GB+ recommended. For other requirements, see the [MinerU repository](https://github.com/opendatalab/MinerU).
+
+### 1. Install MinerU
+
+This section covers the quick Docker installation:
+
+Pull the fastgpt-mineru image ---> create a container to start the parsing service ---> add the deployed service URL to the FastGPT configuration file
+
+```bash
+docker pull crpi-h3snc261q1dosroc.cn-hangzhou.personal.cr.aliyuncs.com/fastgpt_ck/mineru:v1
+docker run --gpus all -itd -p 7231:8001 --name mode_pdf_minerU crpi-h3snc261q1dosroc.cn-hangzhou.personal.cr.aliyuncs.com/fastgpt_ck/mineru:v1
+```
+This image runs MinerU in pipeline mode and parallelizes the work inside the container: it creates multiple processes according to the number of GPUs, so uploaded PDFs are processed concurrently.
+
+### 2. Add the FastGPT file configuration
+
+```json
+{
+  xxx
+  "systemEnv": {
+    xxx
+    "customPdfParse": {
+      "url": "http://xxxx.com/v2/parse/file", // custom PDF parsing service URL (MinerU)
+      "key": "", // custom PDF parsing service key
+      "doc2xKey": "", // doc2x service key
+      "price": 0 // PDF parsing service price
+    }
+  }
+}
+```
+
+For the commercial edition, configure it as shown below:
+
+![alt text](/imgs/mineru6.png)
+
+**Note:** a service added through the configuration file only takes effect after the service is restarted.
+
+### 3. Test the result
+
+Upload a PDF file through the knowledge base and check `PDF 增强解析` (enhanced PDF parsing).
+
+![alt text](/imgs/mineru1.png)
+
+After confirming the upload, you should see the following entries in the log (`LOG_LEVEL` must be set to `info` or `debug`):
+
+```
+[Info] 2024-12-05 15:04:42 Parsing files from an external service
+[Info] 2024-12-05 15:07:08 Custom file parsing is complete, time: 1316ms
+```
+
+
+Likewise, in an app, you can check `PDF 增强解析` in the file upload settings.
+
+![alt text](/imgs/mineru2.png)
+
+## Results
+
+Taking Tsinghua University's [ChatDev Communicative Agents for Software Develop.pdf](https://arxiv.org/abs/2307.07924) as an example, here is what MinerU parsing produces:
+
+| | | |
+| ------------------------------- | ------------------------------- | ------------------------------- |
+| ![alt text](/imgs/mineru3-1.png) | ![alt text](/imgs/mineru4-1.png) | ![alt text](/imgs/mineru5-1.png) |
+| ![alt text](/imgs/mineru3.png) | ![alt text](/imgs/mineru4.png) | ![alt text](/imgs/mineru5.png) |
+
+The top row shows the chunked results and the bottom row shows the original PDF. Images, formulas, and handwritten text (via OCR) are all extracted, and the overall quality is acceptable.
+
+Note that [MinerU](https://github.com/opendatalab/MinerU) is released under the `GPL-3.0 license`; make sure your use complies with it.
+
diff --git a/document/content/docs/toc.mdx b/document/content/docs/toc.mdx
index 1fd7237fe..f98b3e944 100644
--- a/document/content/docs/toc.mdx
+++ b/document/content/docs/toc.mdx
@@ -19,6 +19,7 @@ description: FastGPT documentation table of contents
 - [/docs/introduction/development/custom-models/chatglm2-m3e](/docs/introduction/development/custom-models/chatglm2-m3e)
 - [/docs/introduction/development/custom-models/m3e](/docs/introduction/development/custom-models/m3e)
 - [/docs/introduction/development/custom-models/marker](/docs/introduction/development/custom-models/marker)
+- [/docs/introduction/development/custom-models/mineru](/docs/introduction/development/custom-models/mineru)
 - [/docs/introduction/development/custom-models/ollama](/docs/introduction/development/custom-models/ollama)
 - [/docs/introduction/development/custom-models/xinference](/docs/introduction/development/custom-models/xinference)
 - [/docs/introduction/development/design/dataset](/docs/introduction/development/design/dataset)
diff --git a/document/public/imgs/mineru1.png b/document/public/imgs/mineru1.png
new file mode 100644
index 000000000..16087de64
Binary files /dev/null and b/document/public/imgs/mineru1.png differ
diff --git a/document/public/imgs/mineru2.png b/document/public/imgs/mineru2.png
new file mode 100644
index 000000000..1af5b591d
Binary files /dev/null and b/document/public/imgs/mineru2.png differ
diff --git a/document/public/imgs/mineru3-1.png b/document/public/imgs/mineru3-1.png
new file mode 100644
index 000000000..3f6ce2806
Binary files /dev/null and b/document/public/imgs/mineru3-1.png differ
diff --git a/document/public/imgs/mineru3.png b/document/public/imgs/mineru3.png
new file mode 100644
index 000000000..5e635fcf8
Binary files /dev/null and b/document/public/imgs/mineru3.png differ
diff --git a/document/public/imgs/mineru4-1.png b/document/public/imgs/mineru4-1.png
new file mode 100644
index 000000000..19b8d3f2f
Binary files /dev/null and b/document/public/imgs/mineru4-1.png differ
diff --git a/document/public/imgs/mineru4.png b/document/public/imgs/mineru4.png
new file mode 100644
index 000000000..23e1524f0
Binary files /dev/null and b/document/public/imgs/mineru4.png differ
diff --git a/document/public/imgs/mineru5-1.png b/document/public/imgs/mineru5-1.png
new file mode 100644
index 000000000..b0ff98cc2
Binary files /dev/null and b/document/public/imgs/mineru5-1.png differ
diff --git a/document/public/imgs/mineru5.png b/document/public/imgs/mineru5.png
new file mode 100644
index 000000000..b1cab03ec
Binary files /dev/null and b/document/public/imgs/mineru5.png differ
diff --git a/document/public/imgs/mineru6.png b/document/public/imgs/mineru6.png
new file mode 100644
index 000000000..afe656385
Binary files /dev/null and b/document/public/imgs/mineru6.png differ
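
Review note: the `customPdfParse` block documented in `mineru.mdx` can be sanity-checked before restarting the service. The following is a minimal Python sketch, not FastGPT code. The helper name `check_custom_pdf_parse` and its validation rules (an HTTP(S) URL, a non-negative price) are assumptions made for illustration. The sample URL combines the host port from the `docker run` command with the path from the documented example, which is also an assumption about a local deployment.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Keys shown in the documented `customPdfParse` example.
EXPECTED_KEYS = ("url", "key", "doc2xKey", "price")


def check_custom_pdf_parse(config: dict) -> list[str]:
    """Return a list of problems found in config["systemEnv"]["customPdfParse"].

    Hypothetical helper: the rules below are illustrative assumptions,
    not FastGPT's own validation.
    """
    section = config.get("systemEnv", {}).get("customPdfParse")
    if not isinstance(section, dict):
        return ["systemEnv.customPdfParse is missing"]

    problems = []
    # Report every documented key that is absent from the section.
    for name in EXPECTED_KEYS:
        if name not in section:
            problems.append(f"missing key: {name}")

    # The parsing service is reached over HTTP, so require a scheme and a host.
    url = section.get("url", "")
    parsed = urlparse(url) if isinstance(url, str) else None
    if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("url must be an http(s) address")

    # Assumed rule: price is a number and cannot be negative.
    price = section.get("price", 0)
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
        problems.append("price must be a non-negative number")
    return problems


if __name__ == "__main__":
    example = {
        "systemEnv": {
            "customPdfParse": {
                # Assumed local deployment: host port 7231 from `docker run`.
                "url": "http://localhost:7231/v2/parse/file",
                "key": "",
                "doc2xKey": "",
                "price": 0,
            }
        }
    }
    print(check_custom_pdf_parse(example))
```

The helper takes an already parsed dictionary because the documented snippet uses `//` comments and `xxx` placeholders, which a strict JSON parser would reject.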