Using Colab to Auto-Generate SRT Subtitles with OpenAI's Open-Source Whisper

Posted on 2023-04-20 in Tinkering Log

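Models are all the rage right now, and OpenAI's Whisper model has been especially popular lately. Its recognition quality turned out to be excellent, roughly on par with iFlytek's subtitle service. What I like most is that it writes certain terms in their official form, for example MacBook Pro, Type-C, HDMI, USB3 Gen2, rather than macbook pro or usb3. The downside is that it needs a machine with a GPU. I looked at the machines I have on hand... and then thought of Google Colab: wouldn't it be nice if it could produce subtitles with one click? (This article requires access to sites outside mainland China; it does not cover how to get that access.)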

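Preface

I have been trying Whisper for the past few days and it really is good; its accuracy on Chinese is high too. iFlytek's subtitle service is also very good, but... it is paid. As it happens, I have been making subtitles for a friend: whenever he publishes something new, I add subtitles for him, so this is a good chance to put Whisper to use. "Video" here means YouTube videos specifically.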

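I only learned how Whisper compares after reading this article: 这些语音转文字工具哪个才是真正的王者！ - 百度百家号 ("Which of these speech-to-text tools is the real king?", Baidu Baijiahao). It compares the accuracy of several transcription tools (I have not verified the results). For a 30-second video, the similarity scores are as follows (data from that article); Whisper ranks near the top.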

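Speech recognition tool / mode	Similarity to the reference transcript
Feishu Minutes (飞书妙记)	0.9898
Whisper large-v1	0.9865
Jianying (剪映)	0.9865
Whisper large-v2	0.9797
Whisper large	0.9797
Bijian (必剪)	0.9797
Microsoft built-in speech recognition	0.9695
NetEase Jianwai Workbench (网易见外工作台)	0.9662
Whisper medium	0.9625
Whisper small	(value garbled in source)
Whisper base	0.8805
Whisper tiny	0.8454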

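Approach

I want this to take as little manual work as possible.

I provide a video link.
Google Colab downloads it for me automatically; I only change one variable and run one code cell in the Colab notebook.
Then I go do whatever else I need to, leaving the page open.
Colab automatically saves the generated srt file to Google Drive.
Bonus: ideally it also notifies me, so I know the moment the subtitles are done.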

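Working Out the Steps

Step 1: the yt-dlp tool

For this I picked yt-dlp - Github, a tool on GitHub that can parse YouTube pages and download whichever format I specify.

So the first step is clear: fetch the media. I do not actually need the video, only the audio. For testing I picked a video from a well-known uploader (who has no connection to the friend mentioned above): 极客湾Geekerwan - RTX3070TI首发评测：本来是张好显卡，等等吧 - YouTube

The following command lists the formats available for the video:

yt-dlp -F "https://www.youtube.com/watch?v=5wF1YItz78Y"

The output looks like this: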

[you.....] Extracting URL: https://www.you......com/watch?v=*******
[you.....] *******: Downloading webpage
[you.....] *******: Downloading android player API JSON
[you.....] *******: Downloading player 6f20102c
WARNING: [you.....] *******: nsig extraction failed: You may experience throttling for some formats
         Install PhantomJS to workaround the issue. Please download it from https://phantomjs.org/download.html
         n = ...... ; player = .....
[info] Available formats for *******:
ID  EXT   RESOLUTION FPS CH │   FILESIZE   TBR PROTO │ VCODEC        VBR ACODEC      ABR ASR MORE INFO
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
....
140 m4a   audio only      2 │    5.49MiB  129k dash  │ audio only        mp4a.40.2  129k 44k medium, m4a_dash
251 webm  audio only      2 │    4.73MiB  112k dash  │ audio only        opus       112k 48k medium, webm_dash
17  3gp   176x144      8  1 │    2.70MiB   64k https │ mp4v.20.3     64k mp4a.40.2    0k 22k 144p
597 mp4   256x144     15    │  990.76KiB   23k dash  │ avc1.4d400b   23k video only          144p, mp4_dash
....
315 webm  3840x2160   60    │  402.64MiB 9499k dash  │ vp9         9499k video only          2160p60, webm_dash

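The "video only" formats are clearly not what I want, since I only need the audio track, so from the list above I pick format ID 140, which is "audio only".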
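The selection above can also be done programmatically. The following is only a minimal sketch: it assumes the rows have already been captured as text in the layout shown above, and `audio_only_ids` and `SAMPLE_ROWS` are hypothetical names introduced here. The printed table is meant for people rather than programs, so its layout may change between yt-dlp releases.

```python
# Sample rows copied from the listing above (ID, EXT, RESOLUTION, ...).
SAMPLE_ROWS = [
    "140 m4a   audio only      2 │    5.49MiB  129k dash  │ audio only        mp4a.40.2  129k 44k medium, m4a_dash",
    "251 webm  audio only      2 │    4.73MiB  112k dash  │ audio only        opus       112k 48k medium, webm_dash",
    "597 mp4   256x144     15    │  990.76KiB   23k dash  │ avc1.4d400b   23k video only          144p, mp4_dash",
]


def audio_only_ids(rows, ext=None):
    """Return format IDs whose RESOLUTION column reads 'audio only'.

    If ext is given (for example "m4a"), keep only rows with that extension.
    """
    ids = []
    for row in rows:
        parts = row.split()
        if len(parts) < 4:
            continue  # separator lines, "....", and other short lines
        fmt_id, fmt_ext = parts[0], parts[1]
        if parts[2:4] != ["audio", "only"]:
            continue  # the third column is a resolution such as 256x144
        if ext is None or fmt_ext == ext:
            ids.append(fmt_id)
    return ids


print(audio_only_ids(SAMPLE_ROWS))         # ['140', '251']
print(audio_only_ids(SAMPLE_ROWS, "m4a"))  # ['140']
```

For this video the m4a filter leaves only 140, which matches the manual choice.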

Step 2: download the audio

This command downloads the audio. To keep the file name predictable, add -o download.m4a; otherwise the file is named after the video.

yt-dlp -f 140 "https://www.youtube.com/watch?v=5wF1YItz78Y" -o download.m4a
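If you would rather drive this step from Python than from a shell line, the same command can be assembled as an argument list. This is a sketch under assumptions: `build_download_command` is a hypothetical helper, and the defaults simply mirror the values used in this article.

```python
import shlex


def build_download_command(video_id, format_id="140", output="download.m4a"):
    """Build the yt-dlp argument list used in this step.

    A list avoids shell quoting problems with the '?' and '=' in the URL.
    """
    url = f"https://www.youtube.com/watch?v={video_id}"
    return ["yt-dlp", "-f", format_id, url, "-o", output]


cmd = build_download_command("5wF1YItz78Y")
# shlex.join renders the list as a copy-pasteable shell line.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if yt-dlp exits with a non-zero status.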

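Step 3: install the Whisper dependency

This is simple: following the instructions at whisper - Github, one command is enough

pip install openai-whisper

Step 4: generate the subtitles

Going by Whisper's help text, the command is

whisper --model large-v2 --model_dir=./ --output_format srt --output_dir outsrt download.m4a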

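model selects the model to use, here large-v2. The available models are listed below; if the model is not present locally, it is downloaded automatically.

{tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}; the default is small

model_dir is the directory where model files are stored. If you fetch the model from your own server, you can point this option at it to avoid downloading the model a second time; the file must be named after the model above, for example large-v2.pt. I dropped this option when I actually ran the command.

output_format sets the output format, here srt. The available values are

txt,vtt,srt,tsv,json,all

output_dir is the directory to write to. I chose outsrt; the directory is created automatically if it does not exist.

The final argument, download.m4a, is the file to process.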
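The options described above can be collected into a small command builder. This is a sketch, not part of Whisper itself: `build_whisper_command`, `MODELS`, and `FORMATS` are hypothetical names, and the allowed values are just the ones listed in this article, so they may differ in other Whisper releases.

```python
# Values copied from the lists in this article.
MODELS = ("tiny.en", "tiny", "base.en", "base", "small.en", "small",
          "medium.en", "medium", "large-v1", "large-v2", "large")
FORMATS = ("txt", "vtt", "srt", "tsv", "json", "all")


def build_whisper_command(audio, model="large-v2", output_format="srt",
                          output_dir="outsrt", model_dir=None):
    """Build the whisper CLI argument list, rejecting unknown option values."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if output_format not in FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    cmd = ["whisper", "--model", model]
    if model_dir is not None:
        # Optional: reuse model files that are already on disk.
        cmd += ["--model_dir", model_dir]
    cmd += ["--output_format", output_format, "--output_dir", output_dir, audio]
    return cmd


print(" ".join(build_whisper_command("download.m4a")))
```

Validating the model name up front catches a typo before a multi-gigabyte download or a long run gets started.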

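Logic sorted out, time to write the Notebook

That completes the logic. At this point the srt file should have been generated in the output_dir directory, and it takes its name from the m4a file, for example download.srt.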
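The naming rule just described can be written down as a tiny helper, which is handy for the later copy step. This is a sketch of the behavior observed in this article (the run shown further down installed openai-whisper 20230314); `expected_srt_path` is a hypothetical name, and other Whisper releases may name the file differently.

```python
from pathlib import PurePosixPath


def expected_srt_path(audio, output_dir):
    """Return output_dir/<audio name without its extension>.srt."""
    stem = PurePosixPath(audio).stem  # "download.m4a" -> "download"
    return str(PurePosixPath(output_dir) / f"{stem}.srt")


print(expected_srt_path("download.m4a", "outsrt"))  # outsrt/download.srt
```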

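In Colab, a line in a code cell that starts with ! is run as a shell command; without it, the line is Python. Also, every shell line starts in /content and runs in its own independent environment, like this:

! mkdir files # succeeds
! cd files # enters files, but has no lasting effect
! pwd # prints /content
! ls # prints files sample_data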

In other words, pwd only runs inside files when both commands share one line, as in !cd files && pwd.
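The same effect can be reproduced outside Colab with plain Python. The sketch below assumes a POSIX shell is available and models each line as a fresh shell started in the same directory, which is the behavior described above; `run_line` is a hypothetical helper, and a temporary directory stands in for /content.

```python
import os
import subprocess
import tempfile


def run_line(line, cwd):
    """Run one command line in a brand-new shell and return its output."""
    result = subprocess.run(line, shell=True, cwd=cwd,
                            capture_output=True, text=True)
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    base = os.path.realpath(tmp)   # stands in for /content
    run_line("mkdir files", base)
    run_line("cd files", base)     # the cd is lost when this shell exits
    separate = run_line("pwd", base)
    chained = run_line("cd files && pwd", base)

print(separate == base)                         # True: still the base directory
print(chained == os.path.join(base, "files"))   # True: cd and pwd shared a shell
```

The directory change lives only as long as the shell that performed it, which is why chaining with && works.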

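Mount Google Drive

Since I want the result saved to my Google Drive automatically once processing finishes, I mount the drive first. Mounting requires authorization, and I would hate for the authorization prompt to appear only after an hour of processing.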

from google.colab import drive
drive.mount('/content/gdrive')
!mkdir /content/gdrive/MyDrive/srtFiles

Once mounted, the contents of your Drive are found under MyDrive inside the mount directory (here /content/gdrive/MyDrive).

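Download and install yt-dlp

First download the binary and make it executable. I put the video ID in a variable so it is easier to fill in. In my experience the audio-only m4a stream always has format_id == 140, so I hard-code it. Once the download finishes I delete yt-dlp, just in case there is some kind of detection, haha, though that is probably unnecessary.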

!export YT_VIDEO="5wF1YItz78Y" && wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O yt-dlp && chmod +x yt-dlp && ./yt-dlp -f 140 "https://www.youtube.com/watch?v=$YT_VIDEO" -o download.m4a
!rm -rf yt-dlp

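Install dependencies

At this point download.m4a exists. Next, install the Whisper dependency. I also install requests, since I want a notification when processing finishes and need an HTTP library ready.

!pip install openai-whisper requests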

Start processing

I want it to print DONE when it finishes

!whisper --model large-v2 --output_format srt --output_dir outsrt download.m4a && echo "DONE!"

Upload the file

Copy it to gdrive

!cp outsrt/download.srt /content/gdrive/MyDrive/srtFiles
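The copy can also be done in Python, which makes it easy to create the target folder only when it is missing. This is a sketch: `save_to_folder` is a hypothetical helper, and the demo uses a temporary directory in place of the real /content/gdrive/MyDrive/srtFiles path so that it runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def save_to_folder(src, dest_dir):
    """Copy src into dest_dir, creating dest_dir first if needed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return Path(shutil.copy2(src, dest))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for outsrt/download.srt and the mounted Drive folder.
    src = Path(tmp) / "outsrt" / "download.srt"
    src.parent.mkdir()
    src.write_text("1\n00:00:00,840 --> 00:00:02,960\nhello\n", encoding="utf-8")
    target = Path(tmp) / "gdrive" / "MyDrive" / "srtFiles"

    save_to_folder(src, target)            # first run creates the folder
    copied = save_to_folder(src, target)   # a rerun overwrites the file
    copied_name = copied.name
    copied_text = copied.read_text(encoding="utf-8")

print(copied_name)  # download.srt
```

In the notebook, the call would be `save_to_folder("outsrt/download.srt", "/content/gdrive/MyDrive/srtFiles")` after the drive is mounted.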

Test the whole thing

The complete code block looks like this

from google.colab import drive
drive.mount('/content/gdrive')
!mkdir /content/gdrive/MyDrive/srtFiles

!export YT_VIDEO="5wF1YItz78Y" && wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O yt-dlp && chmod +x yt-dlp && ./yt-dlp -f 140 "https://www.youtube.com/watch?v=$YT_VIDEO" -o download.m4a

!pip install openai-whisper requests

!whisper --model large-v2 --output_format srt --output_dir outsrt download.m4a && echo "DONE!"

!cp outsrt/download.srt /content/gdrive/MyDrive/srtFiles

After starting the run, the first thing that pops up is the authorization prompt

Permit this notebook to access your Google Drive files?
This notebook is requesting access to your Google Drive files. Granting access to Google Drive will permit code executed in the notebook to modify files in your Google Drive. Make sure to review notebook code prior to allowing this access.

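After that, just keep the page open (ignore this if you have Pro+) and go have a coffee or play a game. When processing finishes, the file shows up in the designated Drive folder automatically.

[Image: file-in-drive]

Here is the complete Colab output; the whole run took 5 minutes 46 seconds.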

Mounted at /content/gdrive
--2023-04-20 07:36:19--  https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp
Resolving github.com (github.com)... a.b.c.d
Connecting to github.com (github.com)|a.b.c.d|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/yt-dlp/yt-dlp/releases/download/2023.03.04/yt-dlp [following]
--2023-04-20 07:36:19--  https://github.com/yt-dlp/yt-dlp/releases/download/2023.03.04/yt-dlp
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: ...... [following]
--2023-04-20 07:36:19--  https://objects.githubus.....t-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... a1.b1.c1.d1, a1.b1.c1.d1, a1.b1.c1.d1, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|a1.b1.c1.d1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2747279 (2.6M) [application/octet-stream]
Saving to: ‘yt-dlp’

yt-dlp              100%[===================>]   2.62M  --.-KB/s    in 0.04s   

2023-04-20 07:36:19 (70.9 MB/s) - ‘yt-dlp’ saved [2747279/2747279]

[youtube] Extracting URL: https://www.youtube.com/watch?v=5w*****78Y
[youtube] 5w*****78Y: Downloading webpage
[youtube] 5w*****78Y: Downloading android player API JSON
[youtube] 5w*****78Y: Downloading player 6f20102c
WARNING: [youtube] 5w*****78Y: nsig extraction failed: You may experience throttling for some formats
         Install PhantomJS to workaround the issue. Please download it from https://phantomjs.org/download.html
         n = ****** ; player = https://www.youtube.com/s/player/6f20102c/player_ias.vflset/en_US/base.js
[info] 5w*****78Y: Downloading 1 format(s): 140
[dashsegments] Total fragments: 1
[download] Destination: download.m4a
[download] 100% of    5.49MiB in 00:00:00 at 37.43MiB/s
[FixupM4a] Correcting container of "download.m4a"
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai-whisper
.....
Building wheels for collected packages: openai-whisper
  Building wheel for openai-whisper (pyproject.toml) ... done
  Created wheel for openai-whisper: filename=openai_whisper-20230314-py3-none-any.whl size=796926 sha256=25ed5b9392f9e546a02428e155b3b832633eee99f065fa254447a2c17a61f10f
  Stored in directory: /root/.cache/pip/wheels/c4/85/e6/0bb9507b8e4f3f6d9c6dcf318bc3514739430375aa8e9eaf5b
Successfully built openai-whisper
Installing collected packages: ffmpeg-python, tiktoken, openai-whisper
Successfully installed ffmpeg-python-0.2.0 openai-whisper-20230314 tiktoken-0.3.1
100%|██████████████████████████████████████| 2.87G/2.87G [00:24<00:00, 129MiB/s]
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Chinese
[00:00.840 --> 00:02.960] 刚刚测完3080Ti
[00:02.960 --> 00:06.120] 我们又给大家测3070Ti了
[00:06.120 --> 00:09.600] 本来这么多新品扎堆出应该是很让人兴奋的一件事
......
[02:32.800 --> 02:34.440] 尤其是在4K分辨率下
[02:34.440 --> 02:37.360] 3070Ti的表现是会比较受限的
[02:37.360 --> 02:40.520] 另一个游戏赛博朋克2077也是差不多的情况
[02:40.520 --> 02:44.680] 不开光追的时候3070Ti能比3070强出一些
[02:44.680 --> 02:46.760] 但差距没有之前那么大
[02:46.760 --> 02:48.200] 当打开了光追之后
[02:48.200 --> 02:50.600] 3070Ti就和3070拉不开差距了
[02:50.600 --> 02:54.600] 尤其是4K下和3080还差了挺多的
[02:54.600 --> 02:55.480] 最后一个游戏
[02:55.480 --> 02:56.640] 地铁增强版
[02:56.640 --> 02:57.720] 情况也是差不多
[02:57.720 --> 03:00.720] 3070Ti稍强于3070
[03:00.720 --> 03:03.280] 但不管是2K还是4K分辨率
[03:03.280 --> 03:05.360] 差距都不是特别大
.....
[05:32.440 --> 05:34.080] 放个指导价有什么用呢
[05:34.080 --> 05:35.960] 说得好像谁能买到一样
[05:35.960 --> 05:38.400] 所以你要是对3070Ti感兴趣的话
[05:38.400 --> 05:39.520] 就慢慢等吧
[05:39.520 --> 05:42.720] 这个卡跟什么3060之类的不太一样
[05:42.720 --> 05:45.400] 它降到原价之后还是值得考虑的
[05:45.400 --> 05:45.800] 好了
[05:45.800 --> 05:47.840] 以上就是本期节目的全部内容了
[05:47.840 --> 05:50.040] 喜欢的话不妨长按点赞一键三连
[05:50.040 --> 05:51.560] 我们下次再见了
[05:51.560 --> 05:52.040] 拜拜
DONE!
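The console lines above use a bracketed MM:SS.mmm range, while an srt file uses numbered entries with HH:MM:SS,mmm timestamps. Whisper writes the srt file itself, so nothing below is needed in the notebook; it is only an illustrative sketch of how the two layouts correspond, and `to_srt_time` and `console_to_srt` are hypothetical names.

```python
import re

# Matches lines such as "[00:00.840 --> 00:02.960] some text".
LINE = re.compile(r"^\[(?P<start>[\d:.]+) --> (?P<end>[\d:.]+)\]\s*(?P<text>.*)$")


def to_srt_time(stamp):
    """Convert 'MM:SS.mmm' or 'HH:MM:SS.mmm' to srt-style 'HH:MM:SS,mmm'."""
    parts = stamp.split(":")
    if len(parts) == 2:
        parts = ["00"] + parts  # the console omits the hours field here
    hours, minutes, seconds = parts
    return f"{int(hours):02d}:{minutes}:{seconds.replace('.', ',')}"


def console_to_srt(lines):
    """Turn console-style lines into numbered srt entries."""
    blocks = []
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip progress bars, "DONE!", and similar lines
        start = to_srt_time(match["start"])
        end = to_srt_time(match["end"])
        blocks.append(f"{len(blocks) + 1}\n{start} --> {end}\n{match['text']}\n")
    return "\n".join(blocks)


sample = [
    "[00:00.840 --> 00:02.960] 刚刚测完3080Ti",
    "[00:02.960 --> 00:06.120] 我们又给大家测3070Ti了",
    "DONE!",
]
print(console_to_srt(sample))
```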

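Final step: notification

I want to know when the job has finished, for peace of mind. As an iOS user, I use Bark for push notifications.

Search for Bark on your iPhone and install it; the app then gives you a URL that looks like this

https://api.day.app/<your token>/<message text>

So I write a short piece of Python, and the requests package installed earlier finally gets used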


import requests
from urllib.parse import quote as urlencode
requests.get("https://api.day.app/<your token>/" + urlencode("Colab subtitle processing is done!"))
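One detail in the URL construction is worth a closer look. By default `quote` leaves `/` unescaped, so a message containing a slash would add an extra path segment. The sketch below escapes it as well; `bark_url` is a hypothetical helper, "TOKEN" is a placeholder, and how the Bark server treats an escaped slash is an assumption I have not verified.

```python
from urllib.parse import quote, unquote


def bark_url(token, message, base="https://api.day.app"):
    """Build a Bark push URL with every reserved character percent-encoded."""
    # safe="" also escapes "/", which quote() leaves alone by default.
    return f"{base}/{quote(token, safe='')}/{quote(message, safe='')}"


url = bark_url("TOKEN", "done 1/2")
print(url)  # https://api.day.app/TOKEN/done%201%2F2

# Non-ASCII text survives a round trip through the encoding.
encoded = bark_url("TOKEN", "字幕处理已经完成").rsplit("/", 1)[1]
print(unquote(encoded))  # 字幕处理已经完成
```

The resulting string can be passed to `requests.get(url)` exactly as in the snippet above.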

The result looks like this

[Image: bark-recieve]

The final notebook looks like this

from google.colab import drive
drive.mount('/content/gdrive')
!mkdir /content/gdrive/MyDrive/srtFiles

!export YT_VIDEO="5wF1YItz78Y" && wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O yt-dlp && chmod +x yt-dlp && ./yt-dlp -f 140 "https://www.youtube.com/watch?v=$YT_VIDEO" -o download.m4a
!pip install openai-whisper requests
!whisper --model large-v2 --output_format srt --output_dir outsrt download.m4a && echo "DONE!"
!cp outsrt/download.srt /content/gdrive/MyDrive/srtFiles

import requests
from urllib.parse import quote as urlencode
requests.get("https://api.day.app/<your token>/" + urlencode("Colab subtitle processing is done!"))
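As written, the notification cell runs whether or not Whisper succeeded, because each cell line is independent. A possible refinement, sketched here with a hypothetical `completion_message` helper and message texts of my own choosing, is to pick the message based on whether the srt file actually exists.

```python
import tempfile
from pathlib import Path


def completion_message(srt_path):
    """Choose a notification text depending on whether the srt file exists."""
    path = Path(srt_path)
    if path.is_file() and path.stat().st_size > 0:
        return f"Colab subtitles ready: {path.name}"
    return f"Colab subtitle job produced no output: {path.name}"


with tempfile.TemporaryDirectory() as tmp:
    srt = Path(tmp) / "download.srt"   # stand-in for outsrt/download.srt
    missing_message = completion_message(srt)
    srt.write_text("1\n00:00:00,840 --> 00:00:02,960\nhello\n", encoding="utf-8")
    ready_message = completion_message(srt)

print(missing_message)  # Colab subtitle job produced no output: download.srt
print(ready_message)    # Colab subtitles ready: download.srt
```

In the notebook, `completion_message("outsrt/download.srt")` would replace the fixed text passed to `urlencode`.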