add sensevoice & cosevoice (#2562)

Signed-off-by: EthanD <EthanD4869@gmail.com> Co-authored-by: EthanD <EthanD4869@gmail.com>
2025-07-27 08:25:07 +00:00 · 2024-09-05 13:36:11 +08:00
parent 3671e55001
commit 5ed89130ef
38 changed files with 2284 additions and 0 deletions
--- a/python/sensevoice/app/iic/SenseVoiceSmall/.mdl
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/.mdl
--- a/python/sensevoice/app/iic/SenseVoiceSmall/.msc
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/.msc
--- a/python/sensevoice/app/iic/SenseVoiceSmall/.mv
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/.mv
@@ -0,0 +1 @@
+Revision:master,CreatedAt:1720157464
--- a/python/sensevoice/app/iic/SenseVoiceSmall/README.md
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/README.md
@@ -0,0 +1,210 @@
+---
+frameworks:
+- Pytorch
+license: Apache License 2.0
+tasks:
+- auto-speech-recognition
+
+#model-type:
+## e.g. gpt, phi, llama, chatglm, baichuan
+#- gpt
+
+#domain:
+## e.g. nlp, cv, audio, multi-modal
+#- nlp
+
+#language:
+## List of language codes: https://help.aliyun.com/document_detail/215387.html?spm=a2c4g.11186623.0.0.9f8d7467kni6Aa
+#- cn
+
+#metrics:
+## e.g. CIDEr, BLEU, ROUGE
+#- CIDEr
+
+#tags:
+## Custom tags, including training methods such as pretrained, fine-tuned, instruction-tuned, RL-tuned, and others
+#- pretrained
+
+#tools:
+## e.g. vllm, fastchat, llamacpp, AdaSeq
+#- vllm
+---
+
+# Highlights
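+**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
+- **Multilingual recognition:** Trained on more than 400,000 hours of data, it supports more than 50 languages and recognizes speech more accurately than the Whisper model.
+- **Rich transcription:**
+  - Strong emotion recognition that matches or exceeds the current best emotion recognition models on test data.
+  - Audio event detection for common human-computer interaction events such as music, applause, laughter, crying, coughing, and sneezing.
+- **Efficient inference:** The SenseVoice-Small model uses a non-autoregressive end-to-end framework with very low inference latency: 10 s of audio takes only 70 ms to process, 15 times faster than Whisper-Large.
+- **Fine-tuning:** Convenient fine-tuning scripts and strategies let users fix long-tail sample problems for their own business scenarios.
+- **Service deployment:** A complete service deployment pipeline that supports concurrent requests, with client languages including Python, C++, HTML, Java, and C#.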
+
+
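+## <strong>[About the SenseVoice open-source project]()</strong>
+The open-source <strong>[SenseVoice]()</strong> model is a multilingual audio understanding model with speech recognition, language identification, speech emotion recognition, and acoustic event detection capabilities.
+
+[**GitHub repository**]()
+| [**What's new**]()
+| [**Installation**]()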
+
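+# Model architecture
+SenseVoice is a multilingual audio understanding model that supports speech recognition, language identification, speech emotion recognition, acoustic event detection, and inverse text normalization. It is trained on hundreds of thousands of hours of industrial-grade labeled audio, which ensures general-purpose recognition quality. The model can be applied to Chinese, Cantonese, English, Japanese, and Korean audio, and outputs rich transcriptions annotated with emotions and events.
+
+<p align="center">
+<img src="fig/sensevoice.png" alt="SenseVoice model architecture"  width="1500" />
+</p>
+
+SenseVoice-Small is a non-autoregressive end-to-end model. To specify the task, we prepend four embeddings to the speech features as input to the encoder:
+- LID: predicts the language label of the audio.
+- SER: predicts the emotion label of the audio.
+- AED: predicts the event labels contained in the audio.
+- ITN: specifies whether inverse text normalization is applied to the recognized output text.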
+
+
+# Usage
+
+## Inference
+
+### Inference with the modelscope pipeline
+```python
+from modelscope.pipelines import pipeline
+from modelscope.utils.constant import Tasks
+
+inference_pipeline = pipeline(
+    task=Tasks.auto_speech_recognition,
+    model='iic/SenseVoiceSmall',
+    model_revision="master")
+
+rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
+print(rec_result)
+```
+
+### Direct inference
+
+```python
+from model import SenseVoiceSmall
+
+model_dir = "iic/SenseVoiceSmall"
+m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir)
+
+
+res = m.inference(
+    data_in="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav",
+    language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=False,
+    **kwargs,
+)
+
+print(res)
+```
+
+### Inference with funasr
+
+```python
+from funasr import AutoModel
+
+model_dir = "iic/SenseVoiceSmall"
+input_file = (
+    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav"
+)
+
+model = AutoModel(model=model_dir,
+                  vad_model="fsmn-vad",
+                  vad_kwargs={"max_single_segment_time": 30000},
+                  trust_remote_code=True, device="cuda:0")
+
+res = model.generate(
+    input=input_file,
+    cache={},
+    language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=False,
+    batch_size_s=0,
+)
+
+print(res)
+```
+
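+The funasr version integrates a VAD model and supports audio input of any length; `batch_size_s` is given in seconds.
+If all inputs are short audio clips and you need batched inference, you can speed things up by removing the VAD model and setting `batch_size`.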
+
+```python
+model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
+
+res = model.generate(
+    input=input_file,
+    cache={},
+    language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
+    use_itn=False,
+    batch_size=64,
+)
+```
+
+For more detailed usage, see the [documentation](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).
+
+## Model download
+
+
+SDK download
+```bash
+# Install ModelScope
+pip install modelscope
+```
+```python
+# Download the model with the SDK
+from modelscope import snapshot_download
+model_dir = snapshot_download('iic/SenseVoiceSmall')
+```
+Git download
+```
+# Download the model with Git
+git clone https://www.modelscope.cn/iic/SenseVoiceSmall.git
+```
+
+## Service deployment
+
+To be added.
+
+# Performance
+
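+## Speech recognition results
+We compared the multilingual speech recognition performance and inference efficiency of SenseVoice and Whisper on open-source benchmark datasets (including AISHELL-1, AISHELL-2, Wenetspeech, Librispeech, and Common Voice). For Chinese and Cantonese recognition, the SenseVoice-Small model has a clear advantage.
+
+<p align="center">
+<img src="fig/asr_results.png" alt="SenseVoice results on open-source test sets"  width="2500" />
+</p>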
+
+
+
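+## Emotion recognition results
+Because widely used test metrics and methods for emotion recognition are currently lacking, we tested on multiple test sets with multiple metrics and compared comprehensively against several results from recent benchmarks. The selected test sets cover both Chinese and English and several styles of data, such as acted performance, film and TV drama, and natural conversation. Without fine-tuning on the target data, SenseVoice matches or exceeds the current best emotion recognition models on the test data.
+
+<p align="center">
+<img src="fig/ser_table.png" alt="SenseVoice SER results 1"  width="1500" />
+</p>
+
+We also compared several open-source emotion recognition models on the test sets. The results show that the SenseVoice-Large model achieves the best results on almost all datasets, while the SenseVoice-Small model also outperforms the other open-source models on most datasets.
+
+<p align="center">
+<img src="fig/ser_figure.png" alt="SenseVoice SER results 2"  width="500" />
+</p>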
+
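+## Event detection results
+
+Although SenseVoice is trained only on speech data, it can still be used on its own as an event detection model. We compared it on the ESC-50 environmental sound classification dataset against the BEATS and PANN models that are widely used in industry. SenseVoice achieves good results on these tasks, but, limited by its training data and training method, its event classification accuracy still lags behind that of dedicated event detection models.
+
+<p align="center">
+<img src="fig/aed_figure.png" alt="SenseVoice AED results"  width="500" />
+</p>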
+
+
+
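+## Inference efficiency
+The SenseVoice-Small model uses a non-autoregressive end-to-end architecture with very low inference latency. With a parameter count comparable to Whisper-Small, it is 7 times faster than Whisper-Small and 17 times faster than Whisper-Large. In addition, the inference time of SenseVoice-Small does not increase noticeably as the audio duration grows.
+
+
+<p align="center">
+<img src="fig/inference.png" alt="SenseVoice inference efficiency"  width="1500" />
+</p>
+
+<p style="color: lightgrey;">If you are a contributor to this model, we invite you to complete the model card promptly, following the <a href="https://modelscope.cn/docs/ModelScope%E6%A8%A1%E5%9E%8B%E6%8E%A5%E5%85%A5%E6%B5%81%E7%A8%8B%E6%A6%82%E8%A7%88" style="color: lightgrey; text-decoration: underline;">model contribution documentation</a>.</p>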
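The README above says SenseVoice returns rich transcriptions that carry language, emotion, and event information alongside the text. As a rough illustration only, the sketch below separates such markers from the transcript. It assumes the raw result wraps each label in `<|...|>` ahead of the text; that format, the sample string, and the helper name `split_rich_text` are assumptions for illustration and do not come from this commit.

```python
import re

# Assumed marker syntax: each label is wrapped as <|label|>.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_rich_text(raw):
    """Split a raw result string into (list of labels, plain transcript)."""
    tags = TAG_PATTERN.findall(raw)
    text = TAG_PATTERN.sub("", raw).strip()
    return tags, text


# Hypothetical sample string, not produced by running the model.
sample = "<|en|><|HAPPY|><|Laughter|><|withitn|>Hello there."
tags, text = split_rich_text(sample)
print(tags)
print(text)
```

Keeping the labels as a plain list leaves it to the caller to decide which position means language, emotion, or event, since that ordering is also an assumption here.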
--- a/python/sensevoice/app/iic/SenseVoiceSmall/am.mvn
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/am.mvn
--- a/python/sensevoice/app/iic/SenseVoiceSmall/chn_jpn_yue_eng_ko_spectok.bpe.model
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/chn_jpn_yue_eng_ko_spectok.bpe.model
--- a/python/sensevoice/app/iic/SenseVoiceSmall/config.yaml
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/config.yaml
@@ -0,0 +1,97 @@
+encoder: SenseVoiceEncoderSmall
+encoder_conf:
+    output_size: 512
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 50
+    tp_blocks: 20
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+    input_layer: pe
+    pos_enc_class: SinusoidalPositionEncoder
+    normalize_before: true
+    kernel_size: 11
+    sanm_shfit: 0
+    selfattention_layer_type: sanm
+
+
+model: SenseVoiceSmall
+model_conf:
+    length_normalized_loss: true
+    sos: 1
+    eos: 2
+    ignore_id: -1
+
+tokenizer: SentencepiecesTokenizer
+tokenizer_conf:
+  bpemodel: null
+  unk_symbol: <unk>
+  split_with_space: true
+
+frontend: WavFrontend
+frontend_conf:
+    fs: 16000
+    window: hamming
+    n_mels: 80
+    frame_length: 25
+    frame_shift: 10
+    lfr_m: 7
+    lfr_n: 6
+    cmvn_file: null
+
+
+dataset: SenseVoiceCTCDataset
+dataset_conf:
+  index_ds: IndexDSJsonl
+  batch_sampler: EspnetStyleBatchSampler
+  data_split_num: 32
+  batch_type: token
+  batch_size: 14000
+  max_token_length: 2000
+  min_token_length: 60
+  max_source_length: 2000
+  min_source_length: 60
+  max_target_length: 200
+  min_target_length: 0
+  shuffle: true
+  num_workers: 4
+  sos: ${model_conf.sos}
+  eos: ${model_conf.eos}
+  IndexDSJsonl: IndexDSJsonl
+  retry: 20
+
+train_conf:
+  accum_grad: 1
+  grad_clip: 5
+  max_epoch: 20
+  keep_nbest_models: 10
+  avg_nbest_model: 10
+  log_interval: 100
+  resume: true
+  validate_interval: 10000
+  save_checkpoint_interval: 10000
+
+optim: adamw
+optim_conf:
+  lr: 0.00002
+scheduler: warmuplr
+scheduler_conf:
+  warmup_steps: 25000
+
+specaug: SpecAugLFR
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
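The `frontend_conf` values above (`n_mels: 80`, `frame_shift: 10`, `lfr_m: 7`, `lfr_n: 6`) point to low-frame-rate stacking: 7 consecutive 80-dimensional frames are concatenated and one stacked frame is kept every 6 input frames, which gives 560-dimensional features at a 60 ms period. The pure-Python sketch below illustrates that idea. The padding scheme (repeat the first frame on the left, the last frame on the right) and the function name `stack_low_frame_rate` are assumptions for illustration, not the frontend's actual code.

```python
import math


def stack_low_frame_rate(frames, lfr_m, lfr_n):
    """Concatenate lfr_m consecutive frames, advancing lfr_n frames per output."""
    if not frames:
        return []
    # Assumed padding: repeat the first frame so the first window is centred.
    left_pad = (lfr_m - 1) // 2
    padded = [frames[0]] * left_pad + list(frames)
    num_out = math.ceil(len(frames) / lfr_n)
    stacked = []
    for i in range(num_out):
        window = padded[i * lfr_n : i * lfr_n + lfr_m]
        # Assumed padding: repeat the last frame if the window runs past the end.
        window += [padded[-1]] * (lfr_m - len(window))
        stacked.append([value for frame in window for value in frame])
    return stacked


# 100 dummy frames of 80 mel bins each (1 second at a 10 ms frame shift).
frames = [[float(t)] * 80 for t in range(100)]
lfr = stack_low_frame_rate(frames, lfr_m=7, lfr_n=6)
print(len(lfr), len(lfr[0]))   # number of stacked frames, stacked dimension
print(10 * 6)                  # frame_shift * lfr_n: period of stacked frames in ms
```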
--- a/python/sensevoice/app/iic/SenseVoiceSmall/configuration.json
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/configuration.json
@@ -0,0 +1,14 @@
+{
+  "framework": "pytorch",
+  "task" : "auto-speech-recognition",
+  "model": {"type" : "funasr"},
+  "pipeline": {"type":"funasr-pipeline"},
+  "model_name_in_hub": {
+    "ms":"", 
+    "hf":""},
+  "file_path_metas": {
+    "init_param":"model.pt", 
+    "config":"config.yaml",
+    "tokenizer_conf": {"bpemodel": "chn_jpn_yue_eng_ko_spectok.bpe.model"},
+    "frontend_conf":{"cmvn_file": "am.mvn"}}
+}
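`file_path_metas` appears to map config keys to files shipped next to the model, which would explain why `bpemodel` and `cmvn_file` are `null` in `config.yaml`. The sketch below shows one way such a mapping could be merged into a loaded config. `resolve_file_paths` and the model directory are hypothetical, and the real loader may behave differently.

```python
import posixpath


def resolve_file_paths(config, metas, model_dir):
    """Return a copy of config where entries named in metas become paths under model_dir."""
    merged = dict(config)
    for key, value in metas.items():
        if isinstance(value, dict):
            # Nested section such as tokenizer_conf: merge key by key.
            merged[key] = resolve_file_paths(merged.get(key) or {}, value, model_dir)
        else:
            merged[key] = posixpath.join(model_dir, value)
    return merged


# Trimmed-down stand-ins for config.yaml and configuration.json.
config = {
    "tokenizer_conf": {"bpemodel": None, "unk_symbol": "<unk>"},
    "frontend_conf": {"fs": 16000, "cmvn_file": None},
}
metas = {
    "init_param": "model.pt",
    "config": "config.yaml",
    "tokenizer_conf": {"bpemodel": "chn_jpn_yue_eng_ko_spectok.bpe.model"},
    "frontend_conf": {"cmvn_file": "am.mvn"},
}

# "/models/SenseVoiceSmall" is a made-up directory for the example.
resolved = resolve_file_paths(config, metas, "/models/SenseVoiceSmall")
print(resolved["frontend_conf"])
print(resolved["tokenizer_conf"])
```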
--- a/python/sensevoice/app/iic/SenseVoiceSmall/fig/aed_figure.png
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/fig/aed_figure.png
--- a/python/sensevoice/app/iic/SenseVoiceSmall/fig/asr_results.png
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/fig/asr_results.png
--- a/python/sensevoice/app/iic/SenseVoiceSmall/fig/inference.png
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/fig/inference.png
--- a/python/sensevoice/app/iic/SenseVoiceSmall/fig/sensevoice.png
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/fig/sensevoice.png
--- a/python/sensevoice/app/iic/SenseVoiceSmall/fig/ser_figure.png
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/fig/ser_figure.png
--- a/python/sensevoice/app/iic/SenseVoiceSmall/fig/ser_table.png
+++ b/python/sensevoice/app/iic/SenseVoiceSmall/fig/ser_table.png
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/.mdl
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/.mdl
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/.msc
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/.msc
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/.mv
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/.mv
@@ -0,0 +1 @@
+Revision:master,CreatedAt:1707184291
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/README.md
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/README.md
@@ -0,0 +1,296 @@
+---
+tasks:
+- voice-activity-detection
+domain:
+- audio
+model-type:
+- VAD model
+frameworks:
+- pytorch
+backbone:
+- fsmn
+metrics:
+- f1_score
+license: Apache License 2.0
+language: 
+- cn
+tags:
+- FunASR
+- FSMN
+- Alibaba
+- Online
+datasets:
+  train:
+  - 20,000 hour industrial Mandarin task
+  test:
+  - 20,000 hour industrial Mandarin task
+widgets:
+  - task: voice-activity-detection
+    model_revision: v2.0.4
+    inputs:
+      - type: audio
+        name: input
+        title: Audio
+    examples:
+      - name: 1
+        title: Example 1
+        inputs:
+          - name: input
+            data: git://example/vad_example.wav
+    inferencespec:
+      cpu: 1 # number of CPUs
+      memory: 4096
+---
+
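+# Introduction to the FSMN-Monophone VAD model
+
+[//]: # (FSMN-Monophone VAD model)
+
+## Highlight
+- General-purpose 16 kHz Chinese VAD model: detects the start and end times of valid speech within long audio.
+  - Used in [Paraformer-large long-audio model](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) scenarios
+  - Built on the [FunASR framework](https://github.com/alibaba-damo-academy/FunASR), which allows ASR, VAD, and [Chinese punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) models to be combined freely
+  - Detects the start and end times of valid speech segments from audio data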
+
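+## <strong>[About the FunASR open-source project](https://github.com/alibaba-damo-academy/FunASR)</strong>
+<strong>[FunASR](https://github.com/alibaba-damo-academy/FunASR)</strong> aims to build a bridge between academic research and industrial applications of speech recognition. By releasing the training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers study and deploy speech recognition models more easily and promotes the growth of the speech recognition ecosystem. Make speech recognition more fun!
+
+[**GitHub repository**](https://github.com/alibaba-damo-academy/FunASR)
+| [**What's new**](https://github.com/alibaba-damo-academy/FunASR#whats-new)
+| [**Installation**](https://github.com/alibaba-damo-academy/FunASR#installation)
+| [**Service deployment**](https://www.funasr.com)
+| [**Model zoo**](https://github.com/alibaba-damo-academy/FunASR/tree/main/model_zoo)
+| [**Contact us**](https://github.com/alibaba-damo-academy/FunASR#contact)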
+
+
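+## How the model works
+
+FSMN-Monophone VAD is an efficient voice activity (endpoint) detection model proposed by the DAMO Academy speech team. It detects the start and end times of valid speech in the input audio and feeds the detected speech segments to the recognition engine, which reduces recognition errors caused by non-speech audio.
+
+<p align="center">
+<img src="fig/struct.png" alt="VAD model architecture"  width="500" />
+</p>
+
+The FSMN-Monophone VAD architecture is shown in the figure above. At the architecture level, FSMN can model contextual information, trains and runs inference quickly, and has controllable latency; the FSMN network structure and the number of right-context (lookahead) frames were also adapted to the VAD model size and the low-latency requirement. At the modeling-unit level, speech carries rich information and a single class has limited capacity to represent it, so the single speech class was upgraded to monophones. Finer-grained modeling units avoid parameter averaging, strengthen abstraction, and improve discriminability.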
+
+## Inference with ModelScope
+
+- The following audio input formats are supported for inference:
+  - Path to a wav file, e.g. data/test/audios/vad_example.wav
+  - URL of a wav file, e.g. https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav
+  - Binary wav data as bytes, e.g. bytes read directly from a file or recorded from a microphone.
+  - Already decoded audio, e.g. audio, rate = soundfile.read("vad_example_zh.wav"), of type numpy.ndarray or torch.Tensor.
+  - A wav.scp file that meets the following requirements:
+
+```sh
+cat wav.scp
+vad_example1  data/test/audios/vad_example1.wav
+vad_example2  data/test/audios/vad_example2.wav
+...
+```
+
+- If the input is a wav file URL, the API can be called as in the following example:
+
+```python
+from modelscope.pipelines import pipeline
+from modelscope.utils.constant import Tasks
+
+inference_pipeline = pipeline(
+    task=Tasks.voice_activity_detection,
+    model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch',
+    model_revision="v2.0.4",
+)
+
+segments_result = inference_pipeline(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav')
+print(segments_result)
+```
+
+- If the input audio is in pcm format, pass the audio sample rate parameter fs when calling the API, for example:
+
+```python
+segments_result = inference_pipeline(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.pcm', fs=16000)
+```
+
+- If the input is a wav.scp file (note: the file name must end with .scp), add the output_dir parameter to write the results to files, for example:
+
+```python
+inference_pipeline(input="wav.scp", output_dir='./output_dir')
+```
+The output directory has the following structure:
+
+```sh
+tree output_dir/
+output_dir/
+└── 1best_recog
+    └── text
+
+1 directory, 1 files
+```
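+text: file containing the start and end times of the speech detected by VAD (unit: ms)
+
+- If the input is already decoded audio, the API can be called as in the following example: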
+
+```python
+import soundfile
+
+waveform, sample_rate = soundfile.read("vad_example_zh.wav")
+segments_result = inference_pipeline(input=waveform)
+print(segments_result)
+```
+
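+- Commonly tuned VAD parameters (see the vad.yaml file):
+  - max_end_silence_time: how long continuous trailing silence must be detected before the end point is declared; range 500 ms to 6000 ms, default 800 ms (too low a value tends to cut speech off early).
+  - speech_noise_thres: a frame is judged as speech when the speech score minus the noise score exceeds this value; range (-1, 1)
+    - The closer the value is to -1, the more likely noise is misclassified as speech, and the higher the false alarm rate (FA)
+    - The closer the value is to +1, the more likely speech is misclassified as noise, and the higher the miss rate (Pmiss)
+    - Typically, the value is chosen as a balance based on the current model's results on a long-audio test set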
+    
+
+
+
+## Inference with FunASR
+
+The following is a quick-start tutorial. Test audio: [Chinese](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav), [English](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav)
+
+### Command line
+Run in a terminal:
+
+```shell
+funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=vad_example.wav
+```
+
+Note: a single audio file is supported, as is a file list in Kaldi-style wav.scp format: `wav_id   wav_path`
+
+### Python examples
+#### Speech recognition (non-streaming)
+```python
+from funasr import AutoModel
+# paraformer-zh is a multi-functional asr model
+# use vad, punc, spk or not as you need
+model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
+                  vad_model="fsmn-vad", vad_model_revision="v2.0.4",
+                  punc_model="ct-punc-c", punc_model_revision="v2.0.4",
+                  # spk_model="cam++", spk_model_revision="v2.0.2",
+                  )
+res = model.generate(input=f"{model.model_path}/example/asr_example.wav", 
+            batch_size_s=300, 
+            hotword='魔搭')
+print(res)
+```
+Note: `model_hub` selects the model repository: `ms` downloads from ModelScope and `hf` downloads from Hugging Face.
+
+#### Speech recognition (streaming)
+
+```python
+from funasr import AutoModel
+
+chunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms
+encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
+decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention
+
+model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")
+
+import soundfile
+import os
+
+wav_file = os.path.join(model.model_path, "example/asr_example.wav")
+speech, sample_rate = soundfile.read(wav_file)
+chunk_stride = chunk_size[1] * 960 # 600ms
+
+cache = {}
+total_chunk_num = int((len(speech)-1)/chunk_stride+1)
+for i in range(total_chunk_num):
+    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
+    is_final = i == total_chunk_num - 1
+    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)
+    print(res)
+```
+
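+Note: `chunk_size` configures the streaming latency. `[0,10,5]` means text is emitted on screen at a granularity of `10*60=600ms`, with `5*60=300ms` of lookahead. Each inference call takes `600ms` of input (`16000*0.6=9600` samples) and outputs the corresponding text. For the last speech chunk, set `is_final=True` to force output of the final word.
+
+#### Voice activity detection (non-streaming)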
+```python
+from funasr import AutoModel
+
+model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
+
+wav_file = f"{model.model_path}/example/asr_example.wav"
+res = model.generate(input=wav_file)
+print(res)
+```
+
+#### Voice activity detection (streaming)
+```python
+from funasr import AutoModel
+
+chunk_size = 200 # ms
+model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
+
+import soundfile
+
+wav_file = f"{model.model_path}/example/vad_example.wav"
+speech, sample_rate = soundfile.read(wav_file)
+chunk_stride = int(chunk_size * sample_rate / 1000)
+
+cache = {}
+total_chunk_num = int((len(speech)-1)/chunk_stride+1)
+for i in range(total_chunk_num):
+    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
+    is_final = i == total_chunk_num - 1
+    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
+    if len(res[0]["value"]):
+        print(res)
+```
+
+#### Punctuation restoration
+```python
+from funasr import AutoModel
+
+model = AutoModel(model="ct-punc", model_revision="v2.0.4")
+
+res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
+print(res)
+```
+
+#### Timestamp prediction
+```python
+from funasr import AutoModel
+
+model = AutoModel(model="fa-zh", model_revision="v2.0.4")
+
+wav_file = f"{model.model_path}/example/asr_example.wav"
+text_file = f"{model.model_path}/example/text.txt"
+res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
+print(res)
+```
+
+For more detailed usage, see the [examples](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining)
+
+
+## Fine-tuning
+
+For detailed usage, see the [examples](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining)
+
+
+
+
+
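+## Usage and scope
+
+Supported platforms
+- Runs on Linux-x86_64, Mac, and Windows.
+
+How to use
+- Direct inference: run directly on long speech data to obtain the start and end times of valid speech segments (unit: ms).
+
+## Related papers and citation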
+
+```BibTeX
+@inproceedings{zhang2018deep,
+  title={Deep-FSMN for large vocabulary continuous speech recognition},
+  author={Zhang, Shiliang and Lei, Ming and Yan, Zhijie and Dai, Lirong},
+  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  pages={5869--5873},
+  year={2018},
+  organization={IEEE}
+}
+```
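The streaming examples in the README above cut the waveform into fixed-size chunks and flag the last one with `is_final`. The sketch below works through that arithmetic without loading any model, which can help when checking chunk counts by hand. `plan_chunks` is a hypothetical helper, and the 16 kHz default is taken from the model name and config.

```python
def plan_chunks(num_samples, chunk_ms, sample_rate=16000):
    """Return (stride, chunks) where chunks is a list of (start, end, is_final)."""
    stride = int(chunk_ms * sample_rate / 1000)
    # Ceiling division written in the same style as the README examples.
    total = int((num_samples - 1) / stride + 1) if num_samples > 0 else 0
    chunks = []
    for i in range(total):
        start = i * stride
        end = min(start + stride, num_samples)
        chunks.append((start, end, i == total - 1))
    return stride, chunks


# 600 ms chunks, as in the streaming ASR example (chunk_size[1] * 960 samples).
asr_stride, asr_chunks = plan_chunks(num_samples=20000, chunk_ms=600)
print(asr_stride, asr_chunks)

# 200 ms chunks, as in the streaming VAD example.
vad_stride, vad_chunks = plan_chunks(num_samples=6400, chunk_ms=200)
print(vad_stride, len(vad_chunks))
```

Only the final tuple carries `is_final=True`, matching the README's advice to flush the last result on the closing chunk.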
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/am.mvn
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/am.mvn
@@ -0,0 +1,8 @@
+<Nnet>
+<Splice> 400 400
+[ 0 ]
+<AddShift> 400 400
+<LearnRateCoef> 0 [ -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 
-13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 ]
+<Rescale> 400 400
+<LearnRateCoef> 0 [ 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 
0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 ]
+</Nnet>
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/config.yaml
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/config.yaml
@@ -0,0 +1,56 @@
+frontend: WavFrontendOnline
+frontend_conf:
+    fs: 16000
+    window: hamming
+    n_mels: 80
+    frame_length: 25
+    frame_shift: 10
+    dither: 0.0
+    lfr_m: 5
+    lfr_n: 1
+
+model: FsmnVADStreaming
+model_conf:
+    sample_rate: 16000
+    detect_mode: 1 
+    snr_mode: 0
+    max_end_silence_time: 800
+    max_start_silence_time: 3000
+    do_start_point_detection: True
+    do_end_point_detection: True
+    window_size_ms: 200
+    sil_to_speech_time_thres: 150
+    speech_to_sil_time_thres: 150
+    speech_2_noise_ratio: 1.0
+    do_extend: 1
+    lookback_time_start_point: 200
+    lookahead_time_end_point: 100
+    max_single_segment_time: 60000
+    snr_thres: -100.0
+    noise_frame_num_used_for_snr: 100
+    decibel_thres: -100.0
+    speech_noise_thres: 0.6
+    fe_prior_thres: 0.0001
+    silence_pdf_num: 1
+    sil_pdf_ids: [0]
+    speech_noise_thresh_low: -0.1
+    speech_noise_thresh_high: 0.3
+    output_frame_probs: False
+    frame_in_ms: 10
+    frame_length_ms: 25
+    
+encoder: FSMN
+encoder_conf:
+    input_dim: 400
+    input_affine_dim: 140
+    fsmn_layers: 4
+    linear_dim: 250
+    proj_dim: 128
+    lorder: 20
+    rorder: 0
+    lstride: 1
+    rstride: 0
+    output_affine_dim: 140
+    output_dim: 248
+
+
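To make `speech_noise_thres` and `max_end_silence_time` concrete, here is a deliberately simplified endpointing sketch: a frame counts as speech when its speech score minus its noise score exceeds the threshold, and a segment closes once enough trailing silence has accumulated. It ignores the other parameters above (`lookback_time_start_point`, `sil_to_speech_time_thres`, and so on) and is not the `FsmnVADStreaming` state machine; the function name and the per-frame score format are assumptions.

```python
def find_segments(scores, thres=0.6, frame_ms=10, max_end_silence_ms=800):
    """scores: list of (speech_score, noise_score) per frame.

    Returns a list of [start_ms, end_ms] pairs.
    """
    segments = []
    start = None      # index of the first frame of the open segment
    silence = 0       # consecutive non-speech frames inside the open segment
    for i, (speech, noise) in enumerate(scores):
        if speech - noise > thres:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence * frame_ms >= max_end_silence_ms:
                end = i - silence + 1   # first silent frame after the speech
                segments.append([start * frame_ms, end * frame_ms])
                start, silence = None, 0
    if start is not None:
        # Audio ended before the silence timeout: close at the last speech frame.
        end = len(scores) - silence
        segments.append([start * frame_ms, end * frame_ms])
    return segments


# Synthetic scores: 0.5 s speech, 1 s silence, 0.3 s speech, 0.1 s silence.
speech_frame, silent_frame = (0.9, 0.1), (0.1, 0.9)
scores = [speech_frame] * 50 + [silent_frame] * 100 + [speech_frame] * 30 + [silent_frame] * 10
print(find_segments(scores))
```

Lowering `max_end_silence_ms` in this toy version closes segments sooner, which mirrors the README's warning that a small value tends to cut speech off early.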
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/configuration.json
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/configuration.json
@@ -0,0 +1,13 @@
+{
+  "framework": "pytorch",
+  "task" : "voice-activity-detection",
+  "pipeline": {"type":"funasr-pipeline"},
+  "model": {"type" : "funasr"},
+  "file_path_metas": {
+    "init_param":"model.pt", 
+    "config":"config.yaml",
+    "frontend_conf":{"cmvn_file": "am.mvn"}},
+  "model_name_in_hub": {
+    "ms":"iic/speech_fsmn_vad_zh-cn-16k-common-pytorch", 
+    "hf":""}
+}
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/example/vad_example.wav
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/example/vad_example.wav
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/fig/struct.png
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/fig/struct.png
--- a/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt
+++ b/python/sensevoice/app/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt