Running the Fun-ASR-Nano-2512 Model Locally

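Model Overview

  • Fun-ASR-Nano-2512 is a large end-to-end speech recognition model trained on tens of millions of hours of real speech data. It supports low-latency real-time transcription and covers 31 languages.

  • Core capabilities:

    • Accurate recognition of far-field speech in high-noise environments.
    • Support for 7 Chinese dialects and 26 regional accents.
    • Support for 31 languages, with free switching between languages and mixed-language recognition.
    • Lyrics recognition over background music.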

Environment Setup

  • Base environment:

    • Ubuntu 24.04.2 LTS
    • NVIDIA-SMI 550.120
    • CUDA Version: 12.4
    • Python 3.12.9
    • ffmpeg
  • Dependency versions:

    • transformers==4.57.3
    • modelscope==1.33.0
    • torch==2.9.1
    • torchaudio==2.9.1
    • torchcodec==0.9.1
    • funasr==1.2.9
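
Before running anything, it can help to confirm that the installed packages match the pinned versions listed above. The following is a minimal sketch using only the standard library; the `PINNED` table and the `find_mismatches` helper are illustrative names, not part of FunASR. It assumes that local build tags (for example a `+cu...` suffix on a torch build) should be ignored when comparing versions.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the dependency list above.
PINNED = {
    "transformers": "4.57.3",
    "modelscope": "1.33.0",
    "torch": "2.9.1",
    "torchaudio": "2.9.1",
    "torchcodec": "0.9.1",
    "funasr": "1.2.9",
}


def find_mismatches(pinned):
    """Return {package: (expected, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        # Assumption: a local build tag such as "+cu124" does not count as a mismatch.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

If the script prints nothing, every package in the table matches its pinned version.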

Steps

  • Install dependencies:

    pip install transformers==4.57.3 modelscope==1.33.0 torch==2.9.1 torchaudio==2.9.1 torchcodec==0.9.1 funasr==1.2.9
  • Download the remote code file, model.py:

    wget https://raw.githubusercontent.com/FunAudioLLM/Fun-ASR/main/model.py
  • Write the run script, run.py:

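    from funasr import AutoModel


    def main():
        # Model ID; the log below shows it being fetched from ModelScope on first run.
        model_dir = "FunAudioLLM/Fun-ASR-Nano-2512"
        wav_path = "别来无恙.m4a"
        model = AutoModel(
            model=model_dir,
            trust_remote_code=True,
            vad_model="fsmn-vad",  # voice activity detection splits long audio into segments
            vad_kwargs={"max_single_segment_time": 30000},  # maximum segment length, in ms
            remote_code="./model.py",  # the model.py downloaded in the previous step
            device="cuda:0",
        )
        res = model.generate(input=[wav_path], cache={}, batch_size=1)
        text = res[0]["text"]
        print(text)


    if __name__ == "__main__":
        main()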
  • Upload an audio file, for example 别来无恙.m4a (the path set in wav_path).

  • Run the script:

    python run.py

    Output:

    funasr version: 1.2.9.
    Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
    You are using the latest version of funasr-1.2.9
    Downloading Model from https://www.modelscope.cn to directory: /home/andy/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512
    2025-12-29 11:43:38,250 - modelscope - INFO - Got 1 files, start to download ...
    Downloading [config.yaml]: 100%|████████████████████████████████████████████████| 3.07k/3.07k [00:00<00:00, 11.6kB/s]
    Processing 1 items: 100%|█████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 3.66it/s]
    2025-12-29 11:43:38,524 - modelscope - INFO - Download model 'FunAudioLLM/Fun-ASR-Nano-2512' successfully.
    WARNING:root:trust_remote_code: True
    Loading remote code successfully: ./model.py
    Downloading Model from https://www.modelscope.cn to directory: /home/andy/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch
    WARNING:root:trust_remote_code: False
    rtf_avg: 0.011: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.54it/s]
    0%| | 0/3 [00:00<?, ?it/s]
    The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
    Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
    The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
    {'load_data': '0.000', 'extract_feat': '0.002', 'forward': '0.340', 'batch_size': '1', 'rtf': '0.113'}, : 33%|▎| 1/3The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
    Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
    {'load_data': '0.000', 'extract_feat': '0.002', 'forward': '0.202', 'batch_size': '1', 'rtf': '0.064'}, : 67%|▋| 2/3The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
    Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
    rtf_avg: 0.062: 100%|██████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.35it/s]
    rtf_avg: 0.061, time_speech: 36.928, time_escape: 2.235: 100%|████████████████████████| 1/1 [00:02<00:00, 2.39s/it]
    千辛万苦,忘不了你的模样,忘不了的遍体鳞伤,成为我的力量,忘不了眼神里的光,常在我心上,当我又回头张望,提醒我坚强,用真遗忘他。 的人常在我。 心伤。
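
The `rtf_avg` figure in the last progress line of the log can be checked by hand. The sketch below assumes that `time_speech` is the duration of the audio in seconds and that `time_escape` is the elapsed processing time in seconds; that reading is an interpretation of the log fields, and `real_time_factor` is an illustrative helper rather than a FunASR function.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Values copied from the final progress line of the log above.
time_speech = 36.928  # assumed: seconds of audio
time_escape = 2.235   # assumed: seconds of elapsed processing time

rtf = real_time_factor(time_escape, time_speech)
print(f"rtf = {rtf:.3f}")            # rtf = 0.061
print(f"speed-up = {1 / rtf:.1f}x")  # speed-up = 16.5x
```

Under that interpretation, the roughly 37 seconds of audio were transcribed in a little over 2 seconds on this machine, which agrees with the `rtf_avg: 0.061` reported in the log.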

References