Pleco is an Android Chinese/English dictionary app. Users can look up and save words to a personal dictionary, and export an XML file containing those words and their definitions.
Anki is a flashcard app (and website) which implements spaced repetition. People use it to learn all sorts of things, but mostly languages.
Last year I wrote a custom pipeline which takes as input Pleco’s exported XML and produces as output a set of flashcards suitable for consumption by Anki. I could now mark a character or phrase in Pleco and expect to review it in Anki some time later that week.
Anki presents information in the form of flash cards: given the front of some card, one is meant to recall the contents on the back. For the Chinese language, the important forms of a word are:
I configured Anki to help me review words and phrases in each of the following directions:
This process improved my reading and writing abilities for short words and phrases. But you can’t learn grammar from context-free words, and you can’t train your ability to hear and understand a full sentence from single words or phrases. What I want is a new kind of card:
Many languages have large amounts of free educational content. Sometimes this content comes with onscreen subtitles, which many people use while watching to check their understanding. To watch a piece of television and quiz myself as effectively as Anki does, I would need to hide and show the subtitles while rewinding to the beginning of each sentence fragment. I found that this got in the way of learning and made it difficult to review a large volume of material.
My solution was this:
I used Python because I knew all of the requisite libraries (video and audio manipulation, fast OCR, video analysis for cutscene detection) would be available. This script only needs to run once per video, so performance is not a priority.
I used pytube to download the highest-resolution version of a video from YouTube.
    from pytube import YouTube

    yt = YouTube("some_youtube_url")
    download_path = (
        yt.streams
        .get_highest_resolution()
        .download(output_path="some_path")
    )
I used scenedetect to detect scene transitions: points in time when some percentage of pixels change.
    import scenedetect

    video_manager = scenedetect.VideoManager(["some_local_file"])
    scene_manager = scenedetect.SceneManager()
    scene_manager.add_detector(
        scenedetect.detectors.ContentDetector(threshold=1))
    video_manager.set_downscale_factor()
    video_manager.start()
    scene_manager.detect_scenes(frame_source=video_manager)
    scenelist = scene_manager.get_scene_list()

    # Save frames from each scene, keyed by scene number.
    imagepaths_by_scenenum = scenedetect.scene_manager.save_images(
        scenelist, video_manager,
        output_dir="some_output_directory",
        show_progress=True)

    # Write the list of detected scenes to a file.
    with open("some_output_file", "w") as f:
        scenedetect.scene_manager.write_scene_list(f, scenelist)
I found that a threshold of 1 was necessary. I also found I needed to crop the video before running scene detection, since I didn’t want to detect when the speaker moved their hands or when the stock photo changed.
A low threshold meant I often accidentally inferred a scene transition mid-sentence. We deduplicate these later on.
One side-effect of cutscene detection is a dictionary which maps scene number to an individual midpoint frame from that scene. We can extract the image of the sentence from that frame.
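A minimal sketch of picking that frame, assuming imagepaths_by_scenenum maps each scene number to a list of image paths saved in time order (the file names below are hypothetical):

```python
from typing import Dict, List


def midpoint_frames(imagepaths_by_scenenum: Dict[int, List[str]]) -> Dict[int, str]:
    """Pick the middle image saved for each scene."""
    return {
        scene_num: paths[len(paths) // 2]
        for scene_num, paths in imagepaths_by_scenenum.items()
        if paths  # skip scenes for which no image was written
    }


# Hypothetical example data: two scenes with three saved frames each.
example = {
    0: ["s0-01.jpg", "s0-02.jpg", "s0-03.jpg"],
    1: ["s1-01.jpg", "s1-02.jpg", "s1-03.jpg"],
}
frames = midpoint_frames(example)
```

The middle frame is a reasonable choice because frames near a scene boundary are more likely to show a subtitle that is still changing.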
As mentioned above, I cropped the video before running scene detection.
I used ffmpeg to perform video manipulation. I also used ffplay to preview each manipulation in real time before undergoing the expensive conversion process.
The crop geometry is expressed as a string of the form w:h:x:y (output width, output height, and the x and y offsets of the top-left corner).
video2anki.py accepts a flag that sets the crop command, so I can preview the crop and adjust the command when rerunning.
    CROP_COMMAND="in_w:in_h/3:0:1.9*in_h/3"
    ffplay -i "<video_path>" -vf "crop=$CROP_COMMAND"
    ffmpeg -i "<video_path>" -filter:v "crop=$CROP_COMMAND" -c:a copy "<output_path>"
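A rough sketch of how such a flag could be wired up in Python. The flag names --crop and --preview and the example file names are assumptions for illustration, not the actual interface of video2anki.py. The sketch only builds the argument lists, which could then be passed to subprocess.run:

```python
import argparse
import shlex
from typing import List, Optional

DEFAULT_CROP = "in_w:in_h/3:0:1.9*in_h/3"


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="video2anki.py")
    parser.add_argument("video_path")
    parser.add_argument("output_path")
    # Hypothetical flag names; the real script may differ.
    parser.add_argument("--crop", default=DEFAULT_CROP,
                        help="crop geometry in ffmpeg's w:h:x:y form")
    parser.add_argument("--preview", action="store_true",
                        help="preview the crop with ffplay instead of converting")
    return parser.parse_args(argv)


def build_command(args: argparse.Namespace) -> List[str]:
    """Build the ffplay (preview) or ffmpeg (convert) argument list."""
    crop_filter = f"crop={args.crop}"
    if args.preview:
        return ["ffplay", "-i", args.video_path, "-vf", crop_filter]
    return ["ffmpeg", "-i", args.video_path, "-filter:v", crop_filter,
            "-c:a", "copy", args.output_path]


preview_args = parse_args(["talk.mp4", "talk_cropped.mp4",
                           "--crop", "in_w:in_h/4:0:3*in_h/4", "--preview"])
preview_cmd = build_command(preview_args)
convert_cmd = build_command(parse_args(["talk.mp4", "talk_cropped.mp4"]))
print(shlex.join(preview_cmd))
```

Keeping the geometry in a single string means the same value drives both the quick ffplay preview and the slower ffmpeg conversion.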
Reading plaintext out of an image is called OCR (optical character recognition). There are many libraries and toolkits for doing this. I began with pytesseract, but got low-quality results: small errors with image margin, spacing, and resolution often led to inaccuracies. I eventually found cnocr, which had much higher accuracy for hanzi.
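    import cnocr

    cn = cnocr.cn_ocr.CnOcr()
    hanzi_obj = cn.ocr(img_fp=FOO)
    # flatten is a helper, assumed to be defined elsewhere in the script,
    # which flattens a nested list of characters.
    hanzi_txt = ''.join(flatten([chrs for (chrs, _) in hanzi_obj]))  # Example: `汉字`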
I decided not to use OCR to consume the pinyin from the image. Pinyin isn’t a real language, so there is no pre-trained OCR model which knows how to read its diacritics. I ended up pulling in another library (pinyin) to convert my extracted hanzi into pinyin.
    import pinyin

    pinyin_txt = pinyin.get(hanzi_txt, delimiter=" ").strip()  # Example: `hàn zì`
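Once each scene has text, the mid-sentence splits caused by the low detection threshold can be merged. The following is one possible approach, not necessarily the one the script uses: it assumes each scene has been reduced to a start time, an end time, and its OCR'd hanzi, and that both halves of a split sentence produce identical text. The Clip type and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class Clip(NamedTuple):
    start_sec: float
    end_sec: float
    hanzi: str


def merge_adjacent_duplicates(clips: List[Clip]) -> List[Clip]:
    """Merge consecutive clips whose text is identical into one longer clip."""
    merged: List[Clip] = []
    for clip in clips:
        if merged and merged[-1].hanzi == clip.hanzi:
            # Same sentence as the previous clip: extend its end time.
            merged[-1] = merged[-1]._replace(end_sec=clip.end_sec)
        else:
            merged.append(clip)
    return merged


clips = [
    Clip(0.0, 1.5, "你好"),
    Clip(1.5, 2.5, "你好"),  # false scene transition mid-sentence
    Clip(2.5, 4.0, "谢谢"),
]
merged = merge_adjacent_duplicates(clips)
```

Comparing only adjacent clips keeps a sentence that is legitimately repeated later in the video as its own card.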
I needed to convert the downloaded video into an audio file. I have found that FLAC works as expected on every platform, so I encoded all audio clips to FLAC. I also found that ffmpeg required two passes in order to convert m4a video into FLAC:
    ffmpeg -i IN_PATH -c copy OUT_M4A_PATH
    ffmpeg -i OUT_M4A_PATH -c:a flac OUT_FLAC_PATH
I then needed to trim the audio file into sentence fragments. I used sox to perform audio trimming and compression.
    import sox

    tfm = sox.Transformer()
    tfm.trim(start_sec, end_sec)  # keep only this sentence fragment
    tfm.compand()                 # compress the dynamic range
    tfm.build_file("input_file", "output_file")
Then we write the trimmed files to the Anki media collection, write a CSV with the audio path, the hanzi, the pinyin, and the source video, and import that CSV into Anki.
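A minimal sketch of the CSV step using Python's csv module. The column order, the file names, and the use of Anki's [sound:...] media reference in the first field are assumptions about how the note type is set up:

```python
import csv
import io
from typing import Iterable, TextIO, Tuple

# (audio file name in the media collection, hanzi, pinyin, source video URL)
Row = Tuple[str, str, str, str]


def write_anki_csv(rows: Iterable[Row], out: TextIO) -> None:
    """Write one CSV line per card, wrapping the audio file in a sound tag."""
    writer = csv.writer(out)
    for audio_filename, hanzi, pinyin_txt, source_url in rows:
        writer.writerow([f"[sound:{audio_filename}]", hanzi, pinyin_txt, source_url])


# Hypothetical example row written to an in-memory buffer.
buffer = io.StringIO()
write_anki_csv([("clip_0001.flac", "汉字", "hàn zì", "some_youtube_url")], buffer)
csv_text = buffer.getvalue()
```

When writing to a real file, open it with newline="" and encoding="utf-8" so the csv module controls line endings and the hanzi survive the import.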
Here is a video recording in which I review a single card in Anki which was produced with this method. The video shows me opening a card and listening to the autoplayed audio. Then I type in what I think was said. Then I press “Show Answer”, and Anki presents a diff between my answer and the canonical one. Then I press the “Link” button and watch that sentence in the original video on YouTube.