Pleco is an Android Chinese/English dictionary app. Users can look up and save words to a personal dictionary, and export an XML file containing those words and their definitions.
Last year I wrote a custom pipeline which takes as input Pleco’s exported XML and produces as output a set of flashcards suitable for consumption by Anki. I could now mark a character or phrase in Pleco and expect to review it in Anki some time later that week.
Types of information in Anki⌗
Anki presents information in the form of flash cards: given the front of some card, one is meant to recall the contents on the back. For the Chinese language, the important forms of a word are:
- hanzi: written characters in the original script
- pinyin: romanized characters which phonetically imitate the original pronunciation
- meaning: usually a definition in English
- audio: usually a recording of a human or machine speaking the word or phrase
Types of review in Anki⌗
I configured Anki to help me review words and phrases in each of the following directions:
- hanzi+meaning → pinyin+audio
- (tests my ability to pronounce a word or phrase)
- hanzi+pinyin+audio → meaning
- (tests my ability to understand a word or phrase)
- pinyin+audio+meaning → hanzi
- (tests my ability to recall and write a character)
- hanzi → pinyin+meaning+audio
- (tests my ability to read, pronounce, and understand a character)
- audio → meaning+pinyin+hanzi
- (tests my ability to hear and understand a character)
This process improved my reading and writing abilities for short words and phrases. But you can’t learn grammar from context-free words, and you can’t train your ability to hear and understand a full sentence from single words or phrases. What I want is a new kind of card:
- audio → meaning+pinyin+hanzi
- (tests my ability to hear and understand a sentence)
Many languages have large amounts of free educational content. Sometimes this free content comes with onscreen subtitles, which many people use while watching to check their understanding. To watch a piece of television and quiz myself as effectively as Anki, I would need to hide and show the subtitles while rewinding to the beginning of each sentence fragment. I found that this got in the way of learning, and made it difficult to review a large volume of material easily.
My solution was this:
- Import a video containing both Chinese-language audio and onscreen hanzi captions.
- Determine the timestamps at which sentence breaks appear.
- Extract the hanzi from the on-screen subtitles via OCR.
- Extract the audio from the video.
- Split the audio and the extracted hanzi by timestamp, creating pairs of audio fragments and written sentence fragments.
- Format those pairs and import them into Anki.
I used Python because I knew all of the requisite libraries (video and audio manipulation, fast OCR, video analysis for cutscene detection) would be available. This script only needs to run once per video, so performance is not a priority.
I used pytube to download the highest-resolution version of a video from YouTube.
from pytube import YouTube

yt = YouTube("some_youtube_url")
download_path = (
    yt.streams
    .get_highest_resolution()
    .download(output_path="some_path")
)
Detecting scene transitions⌗
I used pyscenedetect to detect scene transitions: points in time when some percentage of pixels change.
import scenedetect
import scenedetect.detectors
import scenedetect.scene_manager

video_manager = scenedetect.VideoManager(["some_local_file"])
scene_manager = scenedetect.SceneManager()
scene_manager.add_detector(
    scenedetect.detectors.ContentDetector(threshold=1))

# Use the default downscale factor, then scan the whole video
video_manager.set_downscale_factor()
video_manager.start()
scene_manager.detect_scenes(frame_source=video_manager)
scenelist = scene_manager.get_scene_list()

# Save frames from each detected scene, keyed by scene number
imagepaths_by_scenenum = scenedetect.scene_manager.save_images(
    scenelist, video_manager,
    output_dir="some_output_directory",
    show_progress=True)

# Record the scene boundaries for the later steps
with open("some_output_file", "w") as f:
    scenedetect.scene_manager.write_scene_list(f, scenelist)
I found that a threshold of 1 was necessary. I also found I needed to crop the video before running scene detection, since I didn’t want to detect when the speaker moved their hands or when the stock photo changed.
A low threshold meant I often accidentally inferred a scene transition mid-sentence. We deduplicate these later on.
One side-effect of cutscene detection is a dictionary which maps scene number to an individual midpoint frame from that scene. We can extract the image of the sentence from that frame.
Cropping the video⌗
As mentioned above, I cropped the video before running scene detection.
I used ffmpeg to perform video manipulation. I also used ffplay to preview each manipulation in real time before undergoing the expensive conversion process.
Specific crop geometry is expressed as a command of the form width:height:xoffset:yoffset. video2anki.py accepts a flag that sets the crop command, so that I can preview the crop and adjust the command when rerunning.
CROP_COMMAND="in_w:in_h/3:0:1.9*in_h/3"
ffplay -i "<video_path>" -vf "crop=$CROP_COMMAND"
ffmpeg -i "<video_path>" -filter:v "crop=$CROP_COMMAND" -c:a copy "<output_path>"
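To make the crop expression concrete, here is a small sketch that evaluates the same arithmetic in Python. The function name `subtitle_band` and the 1280x720 frame size are illustrative assumptions, not part of video2anki.py.

```python
def subtitle_band(in_w, in_h):
    """Evaluate the expression in_w:in_h/3:0:1.9*in_h/3 for one frame size.

    Hypothetical helper for illustration; ffmpeg evaluates the
    expression itself when the crop filter runs.
    """
    width = in_w                 # keep the full frame width
    height = in_h / 3            # keep a band one third of the frame tall
    x_offset = 0                 # start at the left edge
    y_offset = 1.9 * in_h / 3    # start a little under two thirds of the way down
    return width, height, x_offset, y_offset


# Assumed example resolution; substitute the real video's dimensions.
w, h, x, y = subtitle_band(1280, 720)
print(f"crop={w}:{h:.0f}:{x}:{y:.0f}")
```

For a 1280x720 frame this works out to a full-width band 240 pixels tall that starts 456 pixels from the top, which leaves a thin strip below the band uncropped.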
Reading plaintext out of an image is called OCR (optical character recognition). There are many libraries and toolkits for doing this. I began with pytesseract, but found low-quality results. Small errors with image margin, spacing, and resolution often led to inaccuracies. I eventually found cnocr, which had much higher accuracy for recognizing hanzi.
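import itertools
import cnocr

# One-level flatten: a list of character lists becomes a stream of characters
flatten = itertools.chain.from_iterable

cn = cnocr.cn_ocr.CnOcr()
# FOO stands in for the path of the frame image to read
hanzi_obj = cn.ocr(img_fp=FOO)
# Each recognized line is a (characters, score) pair; keep only the characters
hanzi_txt = ''.join(flatten([chrs for (chrs, _) in hanzi_obj]))
# Example: `汉字`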
I decided not to use OCR to consume the pinyin from the image. Pinyin isn’t a real language, so there is no pre-trained OCR model which knows how to read its diacritics. I ended up pulling in another library (pinyin) to convert my extracted hanzi into pinyin.
import pinyin

pinyin_txt = pinyin.get(hanzi_txt, delimiter=" ").strip()
# Example: `hàn zì`
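The low scene-detection threshold produces spurious mid-sentence transitions, which need to be deduplicated. The deduplication code is not shown here; one plausible approach, sketched below, is to merge consecutive scenes whose OCR text is identical. The `Fragment` type and `merge_duplicate_scenes` function are hypothetical names, and the sample timestamps are made up for illustration.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One detected scene: its time span and the hanzi read from its frame."""
    start_sec: float
    end_sec: float
    hanzi: str


def merge_duplicate_scenes(scenes: List[Fragment]) -> List[Fragment]:
    """Merge runs of adjacent scenes that share the same OCR text."""
    merged: List[Fragment] = []
    for scene in scenes:
        if merged and merged[-1].hanzi == scene.hanzi:
            # Same subtitle as the previous scene: extend its time span
            # instead of starting a new fragment.
            merged[-1] = merged[-1]._replace(end_sec=scene.end_sec)
        else:
            merged.append(scene)
    return merged


# Illustrative input: the second scene is a spurious mid-sentence cut.
scenes = [
    Fragment(0.0, 1.2, "你好"),
    Fragment(1.2, 2.5, "你好"),
    Fragment(2.5, 4.0, "谢谢"),
]
fragments = merge_duplicate_scenes(scenes)
```

This only catches duplicates when OCR returns exactly the same string for both halves of a sentence; a fuzzier comparison would be needed if the recognized text varies between frames.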
I needed to convert the downloaded video into an audio file. I have found that FLAC works as expected on every platform, so I encoded all audio clips to FLAC. I also found that ffmpeg required two passes in order to convert m4a video into FLAC:
ffmpeg -i IN_PATH -c copy OUT_M4A_PATH
ffmpeg -i OUT_M4A_PATH -c:a flac OUT_FLAC_PATH
I then needed to trim the audio file into sentence fragments. I used sox to perform audio trimming and compression.
import sox

tfm = sox.Transformer()
tfm.trim(start_sec, end_sec)  # keep only this sentence fragment
tfm.compand()                 # apply dynamic range compression
tfm.build_file("input_file", "output_file")
Then we write the trimmed files to the Anki media collection, write a CSV with the audio path, the hanzi, the pinyin, and the source video, and import that CSV into Anki.
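A minimal sketch of the CSV step, using only the standard library. The column order, the file name `clip_0001.flac`, and the helper name `write_anki_rows` are assumptions for illustration; the `[sound:...]` wrapper is the syntax Anki uses to reference a media file from a field.

```python
import csv
import io


def write_anki_rows(rows, out_file):
    """Write one CSV line per card: audio, hanzi, pinyin, source video."""
    writer = csv.writer(out_file)
    for audio_name, hanzi, pinyin_txt, source_url in rows:
        # Anki plays the clip when a field contains [sound:<file name>],
        # provided the file sits in the collection's media folder.
        writer.writerow([f"[sound:{audio_name}]", hanzi, pinyin_txt, source_url])


# Illustrative row; a real run would pass one tuple per sentence fragment
# and open a file on disk instead of an in-memory buffer.
buffer = io.StringIO()
write_anki_rows(
    [("clip_0001.flac", "你好", "nǐ hǎo", "some_youtube_url")],
    buffer,
)
csv_text = buffer.getvalue()
```

The field order in the CSV has to match the order of fields in the Anki note type chosen at import time.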
Here is a video recording in which I review a single card in Anki which was produced with this method. The video shows me opening a card and listening to the autoplayed audio. Then I type in what I think was said. Then I press “Show Answer”, and Anki presents a diff between my answer and the canonical one. Then I press the “Link” button and watch that sentence in the original video on YouTube.