James Adam Buckland

video2anki: OCR for onscreen multilanguage audio captions


Pleco is an Android Chinese/English dictionary app. Users can look up and save words to a personal dictionary, and export an XML file containing those words and their definitions.

Anki is a flashcard app (and website) which implements spaced repetition. People use it to learn all sorts of things, but mostly languages.

Last year I wrote a custom pipeline which takes as input Pleco’s exported XML and produces as output a set of flashcards suitable for consumption by Anki. I could now mark a character or phrase in Pleco and expect to review it in Anki some time later that week.

Types of information in Anki

Anki presents information in the form of flash cards: given the front of some card, one is meant to recall the contents on the back. For the Chinese language, the important forms of a word are:

written characters in the original script
romanized characters which phonetically imitate the original pronunciation
usually a definition in English
usually a recording of a human or machine speaking the word or phrase

Types of review in Anki

I configured Anki to help me review words and phrases in each of the following directions:

hanzi+meaning → pinyin+audio
(tests my ability to pronounce a word or phrase)
hanzi+pinyin+audio → meaning
(tests my ability to understand a word or phrase)
pinyin+audio+meaning → hanzi
(tests my ability to recall and write a character)
hanzi → pinyin+meaning+audio
(tests my ability to read, pronounce, and understand a character)
audio → meaning+pinyin+hanzi
(tests my ability to hear and understand a character)
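The five directions above can also be written down as data. This is only an illustration of the mapping, not how Anki stores its card templates; the names `FIELDS`, `DIRECTIONS`, and `is_complete` are hypothetical.

```python
# Hypothetical representation of the review directions listed above.
FIELDS = {"hanzi", "pinyin", "audio", "meaning"}

# Each direction maps the fields shown on the front to the fields recalled on the back.
DIRECTIONS = [
    ({"hanzi", "meaning"}, {"pinyin", "audio"}),
    ({"hanzi", "pinyin", "audio"}, {"meaning"}),
    ({"pinyin", "audio", "meaning"}, {"hanzi"}),
    ({"hanzi"}, {"pinyin", "meaning", "audio"}),
    ({"audio"}, {"meaning", "pinyin", "hanzi"}),
]


def is_complete(front, back):
    """True if front and back are disjoint and together cover every field."""
    return not (front & back) and (front | back) == FIELDS
```

Every direction splits the same four fields between front and back, so a single note with four fields is enough to generate all five cards.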


This process improved my reading and writing abilities for short words and phrases. But you can’t learn grammar from context-free words, and you can’t train your ability to hear and understand a full sentence from single words or phrases. What I want is a new kind of card:

audio → meaning+pinyin+hanzi
(tests my ability to hear and understand a sentence)


Many languages have large amounts of free educational content. Sometimes this free content comes with onscreen subtitles, which many people use while watching to check their understanding. To quiz myself on a piece of television as effectively as Anki does, I would need to hide and show the subtitles while rewinding to the beginning of each sentence fragment. I found that this got in the way of learning, and made it difficult to review a large volume of material easily.


My solution was this:

  1. Import a video containing both Chinese-language audio and onscreen hanzi captions.
  2. Determine the timestamps at which sentence breaks appear.
  3. Extract the hanzi from the on-screen subtitles via OCR.
  4. Extract the audio from the video.
  5. Split the audio and the extracted hanzi by timestamp, creating pairs of audio fragments and written sentence fragments.
  6. Format those pairs and import them into Anki.
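Steps 2 and 5 amount to turning a list of sentence-break timestamps into start/end intervals. A minimal sketch, assuming the breaks are given in seconds from the start of the video; `intervals_from_breaks` is a hypothetical helper, not a function from video2anki:

```python
from typing import List, Tuple


def intervals_from_breaks(breaks: List[float], duration: float) -> List[Tuple[float, float]]:
    """Turn break timestamps into consecutive (start, end) intervals covering the video."""
    # Ignore breaks that fall outside the video, then add the two outer edges.
    inner = sorted(b for b in breaks if 0.0 < b < duration)
    edges = [0.0] + inner + [duration]
    return list(zip(edges[:-1], edges[1:]))
```

Each interval can then be used twice: once to trim the audio, and once to look up the subtitle text recognized within it.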

I used Python because I knew all of the requisite libraries (video and audio manipulation, fast OCR, video analysis for cutscene detection) would be available. This script only needs to run once per video, so performance is not a priority.

Consuming video

I used pytube to download the highest-resolution version of a video from YouTube.

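from pytube import YouTube

yt = YouTube("some_youtube_url")
# Pick the highest-resolution progressive stream (video and audio in one file)
# and download it; download() returns the path of the saved file.
download_path = yt.streams.get_highest_resolution().download()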

Detecting scene transitions

I used pyscenedetect to detect scene transitions: points in time when some percentage of pixels change.

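import scenedetect
from scenedetect.detectors import ContentDetector

video_manager = scenedetect.VideoManager(["some_local_file"])
scene_manager = scenedetect.SceneManager()
# Assumption: the post does not name the detector; ContentDetector is used here.
scene_manager.add_detector(ContentDetector(threshold=1))
video_manager.start()
scene_manager.detect_scenes(frame_source=video_manager)
scenelist = scene_manager.get_scene_list()
imagepaths_by_scenenum = scenedetect.scene_manager.save_images(
    scenelist, video_manager,
    output_dir="some_output_directory", show_progress=True)
with open("some_output_file", "w") as f:
    scenedetect.scene_manager.write_scene_list(f, scenelist)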

I found that a threshold of 1 was necessary. I also found I needed to crop the video before running scene detection, since I didn’t want to detect when the speaker moved their hands or when the stock photo changed.

A low threshold meant I often accidentally inferred a scene transition mid-sentence. We deduplicate these later on.
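The deduplication step is not shown in this post. One plausible approach, assuming that a spurious mid-sentence cut produces two adjacent scenes whose OCR text is identical, is to merge neighbours with matching text. `Scene` and `merge_duplicate_scenes` are hypothetical names used only for this sketch.

```python
from typing import List, NamedTuple


class Scene(NamedTuple):
    start_sec: float
    end_sec: float
    hanzi: str  # subtitle text recognized for this scene


def merge_duplicate_scenes(scenes: List[Scene]) -> List[Scene]:
    """Merge runs of adjacent scenes that carry identical subtitle text."""
    merged: List[Scene] = []
    for scene in scenes:
        if merged and merged[-1].hanzi == scene.hanzi:
            # Same sentence as the previous scene: extend it instead of adding a new one.
            merged[-1] = merged[-1]._replace(end_sec=scene.end_sec)
        else:
            merged.append(scene)
    return merged
```

This only merges exact matches. If OCR output differs slightly between two frames of the same sentence, a fuzzy comparison would be needed instead.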

One side-effect of cutscene detection is a dictionary which maps scene number to an individual midpoint frame from that scene. We can extract the image of the sentence from that frame.

Cropping the video

As mentioned above, I cropped the video before running scene detection.

I used ffmpeg to perform video manipulation. I also used ffplay to preview each manipulation in real time before running the expensive conversion.

The crop geometry is expressed as a string of the form width:height:xoffset:yoffset. video2anki.py accepts a flag for this string, so that I can preview a crop, adjust it, and rerun.

ffplay -i "<video_path>" -vf "crop=$CROP_COMMAND"
ffmpeg -i "<video_path>" -filter:v "crop=$CROP_COMMAND" -c:a copy "<output_path>"
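From Python, the same two commands can be assembled as argument lists. This is a sketch rather than the actual video2anki code; the function names and the example geometry are made up for illustration.

```python
from typing import List


def crop_filter(width: int, height: int, x_offset: int, y_offset: int) -> str:
    """Build the crop filter string in the form crop=width:height:xoffset:yoffset."""
    return f"crop={width}:{height}:{x_offset}:{y_offset}"


def preview_command(video_path: str, crop: str) -> List[str]:
    """ffplay invocation that previews the crop without writing a file."""
    return ["ffplay", "-i", video_path, "-vf", crop]


def convert_command(video_path: str, output_path: str, crop: str) -> List[str]:
    """ffmpeg invocation that writes the cropped video and copies the audio stream."""
    return ["ffmpeg", "-i", video_path, "-filter:v", crop, "-c:a", "copy", output_path]
```

Either list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.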


Reading plaintext out of an image is called OCR (optical character recognition). There are many libraries and toolkits for doing this. I began with pytesseract, but found low-quality results. Small errors with image margin, spacing, and resolution often led to inaccuracies. I eventually found cnocr which had much higher accuracy for recognizing hanzi.

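from itertools import chain
import cnocr

cn = cnocr.cn_ocr.CnOcr()
# Each result is a (characters, other) pair; only the characters are used.
hanzi_obj = cn.ocr(img_fp=FOO)
hanzi_txt = ''.join(chain.from_iterable(chrs for (chrs, _) in hanzi_obj))
# Example: `汉字`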

I decided not to use OCR to consume the pinyin from the image. Pinyin isn’t a real language, so there is no pre-trained OCR model which knows how to read its diacritics. I ended up pulling in another library (pinyin) to convert my extracted hanzi into pinyin.

import pinyin

pinyin_txt = pinyin.get(hanzi_txt, delimiter=" ").strip()
# Example: `hàn zì`

Audio extraction

I needed to convert the downloaded video into an audio file. I have found that FLAC works as expected on every platform, so I encoded all audio clips to FLAC. I also found that ffmpeg required two passes in order to convert m4a video into flac audio.

ffmpeg -i IN_PATH -c copy OUT_M4A_PATH
ffmpeg -i OUT_M4A_PATH -c:a flac OUT_FLAC_PATH

Audio clipping

I then needed to trim the audio file into sentence fragments. I used sox to perform audio trimming and compression.

import sox

tfm = sox.Transformer()
# Keep only the audio between start_sec and end_sec.
tfm.trim(start_sec, end_sec)
tfm.build_file("input_file", "output_file")


Then we write the trimmed files to the Anki media collection, write a CSV with the audio path, the hanzi, the pinyin, and the source video, and import that CSV into Anki.
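A minimal sketch of the CSV step, assuming one row per card and a column order of audio, hanzi, pinyin, source video. The `[sound:...]` wrapper is the syntax Anki uses to play a media file from a field; whether the original script used it is an assumption, and `write_anki_csv` is a hypothetical name.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_anki_csv(rows: Iterable[Tuple[str, str, str, str]], out: TextIO) -> None:
    """Write one CSV line per card: audio field, hanzi, pinyin, source video."""
    writer = csv.writer(out)
    for audio_filename, hanzi, pinyin_txt, source_url in rows:
        # Anki plays the named media file when a field contains [sound:<filename>].
        writer.writerow([f"[sound:{audio_filename}]", hanzi, pinyin_txt, source_url])


# Usage with an in-memory buffer; the file name and URL are placeholders.
buf = io.StringIO()
write_anki_csv([("clip_0001.flac", "汉字", "hàn zì", "some_youtube_url")], buf)
```

When writing to a real file, open it with `newline=""` and `encoding="utf-8"` so the hanzi and tone marks survive the import.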

Here is a video recording in which I review a single card in Anki which was produced with this method. The video shows me opening a card and listening to the autoplayed audio. Then I type in what I think was said. Then I press “Show Answer”, and Anki presents a diff between my answer and the canonical one. Then I press the “Link” button and watch that sentence in the original video on YouTube.
