I am producing a series of instructional videos on programming education for children, and there is a large amount of English-language video material worth learning from. However, those videos have no subtitles, which may get in the way of students' learning. With English subtitles in hand, tools such as Google Translate can then be used to produce Chinese subtitles. I therefore set out to find a fast way to generate subtitles, and this article records the process. My computer runs 64-bit Windows 10.
Note: your computer must be able to reach Google by some means!
The overall process can be divided into:
1. Install Python 2
2. Download and configure ffmpeg
3. Download and modify autosub
4. Run the command to generate subtitles

Each step is described in turn below.
1. Install Python 2

Python 2 is needed here because autosub, which is called later, is written in Python 2. I tried Python 3, but it would apparently require extensive changes to run, so in the end I installed Python 2 as the autosub instructions suggest. I recommend installing Python 2 through Anaconda2, which saves the trouble of downloading and configuring the dependent packages yourself. Also note that autosub recommends a 32-bit Python; I have not tested whether a 64-bit Python works. Launching the downloaded Anaconda2 (32-bit) installer shows:
Click "Next":
Select "I Agree":
I usually choose "All Users", then click "Next":
Choose an installation folder whose path contains no spaces or Chinese characters, for example "D:\Anaconda2":
I do not recommend ticking the first option, as it can cause confusion between different Python versions; for example, I also have Python 3 installed. Leave the second option ticked, as it is by default:
Click "Next" and the installation is complete:
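Once the installation finishes, a quick sanity check is to confirm that the Anaconda2 interpreter is the one actually being used. The short script below (not part of the later workflow, just a check) can be run from the Anaconda Powershell Prompt (Anaconda2):

# check_python.py - confirm that Python 2.7 from Anaconda2 is the active interpreter
import sys

print("Python version:   " + sys.version.split()[0])
print("Interpreter path: " + sys.executable)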
2. Download and configure ffmpeg

ffmpeg is used here mainly to extract the audio from the video.
(1) From the site "https://ffmpeg.zeranoe.com/builds/", download an ffmpeg build that matches your operating system; I downloaded ffmpeg-3.2-win64-static.
(2) Unzip the downloaded archive, optionally rename the folder to ffmpeg, and copy the whole folder into "D:\Anaconda2" (which directory you copy it to is up to you).
(3) Add the "bin" directory inside the unzipped folder to the system Path environment variable.
First press Win+R to open the Run dialog and enter sysdm.cpl:
Click OK to open the System Properties window, then switch to the Advanced tab:
Click Environment Variables and, in the System variables list, find Path:
Select the Path variable and click Edit. Click New and add the path of the "bin" folder from the unzipped ffmpeg directory into the empty field:
Click OK. This completes the download and configuration of ffmpeg.
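To verify that the Path change took effect, you can type ffmpeg -version in a newly opened command window, or run a small Python check like the one below (it only assumes that ffmpeg.exe is reachable through Path):

# check_ffmpeg.py - confirm that ffmpeg can be launched via the Path environment variable
import subprocess

try:
    output = subprocess.check_output(["ffmpeg", "-version"])
    print(output.splitlines()[0])  # the first line of the output contains the version string
except OSError:
    print("ffmpeg was not found - re-check the Path entry for the bin directory")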
3. Download and modify autosub

autosub is the tool that generates the subtitles automatically; for the speech-to-text part it calls the Google Cloud Speech API.
(1) Install autosub with Anaconda2
After installing Anaconda2 you will find the tool Anaconda Powershell Prompt (Anaconda2). Open it and enter:
pip install autosub

This completes the installation of autosub.
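If you are unsure where pip put autosub, the following short script prints both the package directory and the Scripts directory of the Anaconda2 installation (it assumes the default Anaconda2 layout described above):

# locate_autosub.py - show where the autosub package and its entry script were installed
import os
import sys
import autosub

print("Package directory: " + os.path.dirname(autosub.__file__))
print("Scripts directory: " + os.path.join(sys.prefix, "Scripts"))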
(2) Rename autosub
After installation, the autosub script is located in "D:\Anaconda2\Scripts"; rename it to "autosub_app.py".
(3) Modify the code in autosub_app.py
The key modifications are explained here; the complete autosub_app.py is given at the end of this article.
At line 48 of the code, add ", delete=False" so that the temporary file is not deleted, i.e. change:

temp = tempfile.NamedTemporaryFile(suffix='.flac')

to:

temp = tempfile.NamedTemporaryFile(suffix='.flac', delete=False)
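The reason the default does not work here is that on Windows a NamedTemporaryFile created with the default delete=True generally cannot be opened a second time (in this case by ffmpeg writing to temp.name) while the Python handle is still open; with delete=False the file behaves like a normal file, at the cost of having to remove it yourself. A minimal sketch of the same pattern, assuming ffmpeg is already on the Path:

# temp_demo.py - let ffmpeg write into a named temporary file created with delete=False
import os
import subprocess
import tempfile

temp = tempfile.NamedTemporaryFile(suffix='.flac', delete=False)
temp.close()  # release the Python handle; the file stays on disk because delete=False
# have ffmpeg encode one second of silence into the temporary file
subprocess.check_output(["ffmpeg", "-y", "-f", "lavfi", "-i", "anullsrc",
                         "-t", "1", "-loglevel", "error", temp.name])
print("bytes written: " + str(os.path.getsize(temp.name)))
os.remove(temp.name)  # delete=False means we clean up manually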
At line 127, append ".exe" so that ffmpeg.exe is found correctly, i.e. change:

exe_file = os.path.join(path, program)

to:

exe_file = os.path.join(path, program + ".exe")
Add the proxy information. After the imports, add a global proxy_dict; at this point it is only a dictionary definition:
proxy_dict = {
    'http': 'http://127.0.0.1:8118',
    'https': 'https://127.0.0.1:8118',
    'use': False
}
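The keys 'http' and 'https' are used because this is exactly the shape of the proxies mapping that requests expects. Before going further, you can check that the proxy actually reaches Google with a couple of lines like these (the address 127.0.0.1:8118 is the one used throughout this article and may be different on your machine):

# proxy_test.py - check that Google is reachable through the local proxy
import requests

proxies = {'http': 'http://127.0.0.1:8118',
           'https': 'https://127.0.0.1:8118'}
resp = requests.get("https://www.google.com", proxies=proxies, timeout=10)
print(resp.status_code)  # 200 means the proxy connection works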
Then modify the SpeechRecognizer class: add a proxy parameter to the __init__ method, and add logic to the __call__ method that decides, based on this setting, whether or not to send the POST request through the proxy. In addition, I suggest adding a print statement after requests.exceptions.ConnectionError is caught; otherwise, when the connection to Google's server fails, no error is reported at all and the program simply ends up with a 0-byte .srt subtitle file:
except requests.exceptions.ConnectionError:
    print "ConnectionError\n"
    continue

The modified SpeechRecognizer class looks like this:
class SpeechRecognizer(object):
    def __init__(self, language="en", rate=44100, retries=3, api_key=GOOGLE_SPEECH_API_KEY, proxy=proxy_dict):
        self.language = language
        self.rate = rate
        self.api_key = api_key
        self.retries = retries
        self.proxy = proxy

    def __call__(self, data):
        try:
            for i in range(self.retries):
                url = GOOGLE_SPEECH_API_URL.format(lang=self.language, key=self.api_key)
                headers = {"Content-Type": "audio/x-flac; rate=%d" % self.rate}
                try:
                    if self.proxy['use']:
                        resp = requests.post(url, data=data, headers=headers, proxies=self.proxy)
                    else:
                        resp = requests.post(url, data=data, headers=headers)
                except requests.exceptions.ConnectionError:
                    print "ConnectionError\n"
                    continue

                for line in resp.content.split("\n"):
                    try:
                        line = json.loads(line)
                        line = line['result'][0]['alternative'][0]['transcript']
                        return line[:1].upper() + line[1:]
                    except:
                        # no result
                        continue
        except KeyboardInterrupt:
            return
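For reference only, this is a hypothetical standalone use of the modified class on a single FLAC chunk; the file name sample.flac and the sys.path entry are placeholders and not part of the original workflow:

# recognize_one.py - hypothetical standalone call of the modified SpeechRecognizer
import sys
sys.path.insert(0, r"D:\Anaconda2\Scripts")  # directory that holds autosub_app.py

from autosub_app import SpeechRecognizer, proxy_dict

proxy_dict.update({'http': 'http://127.0.0.1:8118',
                   'https': 'https://127.0.0.1:8118',
                   'use': True})

recognizer = SpeechRecognizer(language="en", rate=44100, proxy=proxy_dict)
with open("sample.flac", "rb") as f:  # sample.flac is a placeholder FLAC chunk
    print(recognizer(f.read()))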
In the main method, add code to parse a proxy argument, so that the proxy can be set from the command line:

parser.add_argument('-P', '--proxy', help="Set proxy server")

args = parser.parse_args()

if args.proxy:
    proxy_dict.update({
        'http': args.proxy,
        'https': args.proxy,
        'use': True
    })
    print("Use proxy " + args.proxy)

This completes the code changes; the subtitles can now be generated by running the program from the command line.
4. Run the command to generate subtitles

(1) Obtain the proxy configuration (if you are on a network with direct access to Google, you can skip this step)
First, find the computer's proxy configuration. On Windows 10, right-click the network icon in the lower-right corner of the desktop, open the "Network & Internet" settings, and click Proxy at the bottom of the left-hand menu:
Copy the script address shown under the automatic proxy setup, paste it into the browser address bar to open it, scroll to the bottom, and find:
var proxy = "PROXY 127.0.0.1:8118; DIRECT;";
var direct = 'DIRECT;';

This reveals the proxy IP and port, in this case 127.0.0.1:8118. (Depending on which tool you use, the IP and port may differ.)
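If you prefer not to read the PAC script by hand, a few lines of Python can pull the "PROXY ip:port" entries out of it; the script address below is a placeholder that you would replace with the address copied from the proxy settings page:

# find_proxy.py - download the PAC script and list the "PROXY ip:port" entries it declares
import re
import requests

pac_url = "http://127.0.0.1:1080/pac"  # placeholder: use the script address from the proxy settings
pac_text = requests.get(pac_url).text
print(re.findall(r"PROXY\s+([0-9.]+:[0-9]+)", pac_text))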
(2) The subtitle extraction command
Open the Anaconda Powershell Prompt (Anaconda2) again and change the working directory to the folder containing the video to be subtitled. For example, if the video "01_HowComputersWork_sm.mp4" is in the root of drive D, first switch the working directory to drive D and then run:
python D:\Anaconda2\Scripts\autosub_app.py -S en -D en -P http://127.0.0.1:8118 .\01_HowComputersWork_sm.mp4

If you are on a network with direct access to Google, omit -P and its value; the command is then:
python D:\Anaconda2\Scripts\autosub_app.py -S en -D en .\01_HowComputersWork_sm.mp4

The program runs as follows:
At the very end of the run a WindowsError may be reported. I have not yet found a fix for this, but it does not affect the result: the subtitles are generated successfully, and "01_HowComputersWork_sm.srt" appears in the root of drive D. Opening the video and loading the subtitles looks like this:
Of course, the automatically generated subtitles still need to be reviewed and corrected.
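A quick way to skim the result before loading it into a player is to print the first few entries of the generated file; the file name below is the one from the example above:

# preview_srt.py - print the first few entries of the generated subtitle file for review
import io

with io.open("01_HowComputersWork_sm.srt", "r", encoding="utf-8") as f:
    blocks = f.read().strip().split("\n\n")  # srt entries are separated by blank lines

for block in blocks[:5]:
    print(block)
    print("-" * 30)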
The complete autosub_app.py:
#!D:\Anaconda2\python.exe
import argparse
import audioop
from googleapiclient.discovery import build
import json
import math
import multiprocessing
import os
import requests
import subprocess
import sys
import tempfile
import wave

from progressbar import ProgressBar, Percentage, Bar, ETA

from autosub.constants import LANGUAGE_CODES, \
    GOOGLE_SPEECH_API_KEY, GOOGLE_SPEECH_API_URL
from autosub.formatters import FORMATTERS

proxy_dict = {
    'http': 'http://127.0.0.1:8118',
    'https': 'https://127.0.0.1:8118',
    'use': False
}


def percentile(arr, percent):
    arr = sorted(arr)
    k = (len(arr) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return arr[int(k)]
    d0 = arr[int(f)] * (c - k)
    d1 = arr[int(c)] * (k - f)
    return d0 + d1


def is_same_language(lang1, lang2):
    return lang1.split("-")[0] == lang2.split("-")[0]


class FLACConverter(object):
    def __init__(self, source_path, include_before=0.25, include_after=0.25):
        self.source_path = source_path
        self.include_before = include_before
        self.include_after = include_after

    def __call__(self, region):
        try:
            start, end = region
            start = max(0, start - self.include_before)
            end += self.include_after
            temp = tempfile.NamedTemporaryFile(suffix='.flac', delete=False)
            command = ["ffmpeg", "-ss", str(start), "-t", str(end - start),
                       "-y", "-i", self.source_path,
                       "-loglevel", "error", temp.name]
            subprocess.check_output(command, stdin=open(os.devnull))
            return temp.read()
        except KeyboardInterrupt:
            return


class SpeechRecognizer(object):
    def __init__(self, language="en", rate=44100, retries=3, api_key=GOOGLE_SPEECH_API_KEY, proxy=proxy_dict):
        self.language = language
        self.rate = rate
        self.api_key = api_key
        self.retries = retries
        self.proxy = proxy

    def __call__(self, data):
        try:
            for i in range(self.retries):
                url = GOOGLE_SPEECH_API_URL.format(lang=self.language, key=self.api_key)
                headers = {"Content-Type": "audio/x-flac; rate=%d" % self.rate}
                try:
                    if self.proxy['use']:
                        resp = requests.post(url, data=data, headers=headers, proxies=self.proxy)
                    else:
                        resp = requests.post(url, data=data, headers=headers)
                except requests.exceptions.ConnectionError:
                    print "ConnectionError\n"
                    continue

                for line in resp.content.split("\n"):
                    try:
                        line = json.loads(line)
                        line = line['result'][0]['alternative'][0]['transcript']
                        return line[:1].upper() + line[1:]
                    except:
                        # no result
                        continue
        except KeyboardInterrupt:
            return


class Translator(object):
    def __init__(self, language, api_key, src, dst):
        self.language = language
        self.api_key = api_key
        self.service = build('translate', 'v2', developerKey=self.api_key)
        self.src = src
        self.dst = dst

    def __call__(self, sentence):
        try:
            if not sentence:
                return
            result = self.service.translations().list(
                source=self.src,
                target=self.dst,
                q=[sentence]
            ).execute()
            if 'translations' in result and len(result['translations']) and \
                    'translatedText' in result['translations'][0]:
                return result['translations'][0]['translatedText']
            return ""
        except KeyboardInterrupt:
            return


def which(program):
    def is_exe(fpath):
        return os.path.isfile(fpath) and os.access(fpath, os.X_OK)
    fpath, fname = os.path.split(program)
    if fpath:
        if is_exe(program):
            return program
    else:
        for path in os.environ["PATH"].split(os.pathsep):
            path = path.strip('"')
            exe_file = os.path.join(path, program + ".exe")
            if is_exe(exe_file):
                return exe_file
    return None


def extract_audio(filename, channels=1, rate=16000):
    temp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
    if not os.path.isfile(filename):
        print "The given file does not exist: {0}".format(filename)
        raise Exception("Invalid filepath: {0}".format(filename))
    if not which("ffmpeg"):
        print "ffmpeg: Executable not found on machine."
        raise Exception("Dependency not found: ffmpeg")
    command = ["ffmpeg", "-y", "-i", filename, "-ac", str(channels), "-ar", str(rate), "-loglevel", "error", temp.name]
    subprocess.check_output(command, stdin=open(os.devnull))
    return temp.name, rate


def find_speech_regions(filename, frame_width=4096, min_region_size=0.5, max_region_size=6):
    reader = wave.open(filename)
    sample_width = reader.getsampwidth()
    rate = reader.getframerate()
    n_channels = reader.getnchannels()

    total_duration = reader.getnframes() / rate
    chunk_duration = float(frame_width) / rate
    n_chunks = int(total_duration / chunk_duration)
    energies = []

    for i in range(n_chunks):
        chunk = reader.readframes(frame_width)
        energies.append(audioop.rms(chunk, sample_width * n_channels))

    threshold = percentile(energies, 0.2)
    elapsed_time = 0
    regions = []
    region_start = None

    for energy in energies:
        is_silence = energy <= threshold
        max_exceeded = region_start and elapsed_time - region_start >= max_region_size

        if (max_exceeded or is_silence) and region_start:
            if elapsed_time - region_start >= min_region_size:
                regions.append((region_start, elapsed_time))
            region_start = None
        elif (not region_start) and (not is_silence):
            region_start = elapsed_time
        elapsed_time += chunk_duration
    return regions


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('source_path', help="Path to the video or audio file to subtitle", nargs='?')
    parser.add_argument('-C', '--concurrency', help="Number of concurrent API requests to make", type=int, default=10)
    parser.add_argument('-o', '--output',
                        help="Output path for subtitles (by default, subtitles are saved in \
                        the same directory and name as the source path)")
    parser.add_argument('-F', '--format', help="Destination subtitle format", default="srt")
    parser.add_argument('-S', '--src-language', help="Language spoken in source file", default="en")
    parser.add_argument('-D', '--dst-language', help="Desired language for the subtitles", default="en")
    parser.add_argument('-K', '--api-key',
                        help="The Google Translate API key to be used. (Required for subtitle translation)")
    parser.add_argument('--list-formats', help="List all available subtitle formats", action='store_true')
    parser.add_argument('--list-languages', help="List all available source/destination languages", action='store_true')
    parser.add_argument('-P', '--proxy', help="Set proxy server")

    args = parser.parse_args()

    if args.proxy:
        proxy_dict.update({
            'http': args.proxy,
            'https': args.proxy,
            'use': True
        })
        print("Use proxy " + args.proxy)

    if args.list_formats:
        print("List of formats:")
        for subtitle_format in FORMATTERS.keys():
            print("{format}".format(format=subtitle_format))
        return 0

    if args.list_languages:
        print("List of all languages:")
        for code, language in sorted(LANGUAGE_CODES.items()):
            print("{code}\t{language}".format(code=code, language=language))
        return 0

    if args.format not in FORMATTERS.keys():
        print("Subtitle format not supported. Run with --list-formats to see all supported formats.")
        return 1

    if args.src_language not in LANGUAGE_CODES.keys():
        print("Source language not supported. Run with --list-languages to see all supported languages.")
        return 1

    if args.dst_language not in LANGUAGE_CODES.keys():
        print("Destination language not supported. Run with --list-languages to see all supported languages.")
        return 1

    if not args.source_path:
        print("Error: You need to specify a source path.")
        return 1

    audio_filename, audio_rate = extract_audio(args.source_path)
    regions = find_speech_regions(audio_filename)

    pool = multiprocessing.Pool(args.concurrency)
    converter = FLACConverter(source_path=audio_filename)
    recognizer = SpeechRecognizer(language=args.src_language, rate=audio_rate,
                                  api_key=GOOGLE_SPEECH_API_KEY, proxy=proxy_dict)

    transcripts = []
    if regions:
        try:
            widgets = ["Converting speech regions to FLAC files: ", Percentage(), ' ', Bar(), ' ', ETA()]
            pbar = ProgressBar(widgets=widgets, maxval=len(regions)).start()
            extracted_regions = []
            for i, extracted_region in enumerate(pool.imap(converter, regions)):
                extracted_regions.append(extracted_region)
                pbar.update(i)
            pbar.finish()

            widgets = ["Performing speech recognition: ", Percentage(), ' ', Bar(), ' ', ETA()]
            pbar = ProgressBar(widgets=widgets, maxval=len(regions)).start()
            for i, transcript in enumerate(pool.imap(recognizer, extracted_regions)):
                transcripts.append(transcript)
                pbar.update(i)
            pbar.finish()

            if not is_same_language(args.src_language, args.dst_language):
                if args.api_key:
                    google_translate_api_key = args.api_key
                    translator = Translator(args.dst_language, google_translate_api_key,
                                            dst=args.dst_language, src=args.src_language)
                    prompt = "Translating from {0} to {1}: ".format(args.src_language, args.dst_language)
                    widgets = [prompt, Percentage(), ' ', Bar(), ' ', ETA()]
                    pbar = ProgressBar(widgets=widgets, maxval=len(regions)).start()
                    translated_transcripts = []
                    for i, transcript in enumerate(pool.imap(translator, transcripts)):
                        translated_transcripts.append(transcript)
                        pbar.update(i)
                    pbar.finish()
                    transcripts = translated_transcripts
                else:
                    print "Error: Subtitle translation requires specified Google Translate API key. \
                    See --help for further information."
                    return 1
        except KeyboardInterrupt:
            pbar.finish()
            pool.terminate()
            pool.join()
            print "Cancelling transcription"
            return 1

    timed_subtitles = [(r, t) for r, t in zip(regions, transcripts) if t]
    formatter = FORMATTERS.get(args.format)
    formatted_subtitles = formatter(timed_subtitles)

    dest = args.output

    if not dest:
        base, ext = os.path.splitext(args.source_path)
        dest = "{base}.{format}".format(base=base, format=args.format)

    with open(dest, 'wb') as f:
        f.write(formatted_subtitles.encode("utf-8"))

    print "Subtitles file created at {}".format(dest)

    os.remove(audio_filename)

    return 0


if __name__ == '__main__':
    sys.exit(main())

Reference links:
https://github.com/agermanidis/autosub/issues/31
https://github.com/qq2225936589/autosub/blob/master/autosub_app.py