I gave up on training Tesseract because I couldn’t figure out how to make the open-source training tools work, so I downloaded a 45-day trial of the LabVIEW Vision toolkit and had decent success after training the time, velocity, and altitude fonts. The LabVIEW Vision toolkit has a training GUI that’s intuitive and easy to use if you want to train a new character set from images. Training the time and velocity fonts was very effective: for the time font, I trained 112 characters total, with as few as 3 samples for some characters, and the program recognized 100% of the NS11 times and velocities correctly.
The altitude font was more difficult. Its kerning led the OCR program to interpret a digit followed by a comma as a single character, and it reported the combined digit-comma character as unknown. For example, 177,024 would be identified as 17?024. However, the Vision toolkit’s “count objects” function (which I think is called blob detection in image-processing parlance) would correctly identify each individual character, though it reported them out of order: it would count 7 objects in 177,024 but list them as 0 1 7 7 2 4 ,. Since it also reported each detected object’s region of interest, it was a simple matter to sort the objects left to right and then run OCR character by character.
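The sort-then-OCR step can be sketched as follows. This is a minimal Python illustration, not the LabVIEW code: `order_and_read` and the per-character recognizer passed into it are hypothetical stand-ins for the Vision toolkit’s “count objects” output and single-glyph OCR call.

```python
def order_and_read(rois, ocr_char):
    """Sort detected character ROIs left to right, then OCR one at a time.

    rois: list of (x, y, w, h) bounding boxes as reported by blob detection,
          possibly out of order.
    ocr_char: function mapping one ROI to its recognized character
              (a stand-in for the toolkit's single-character OCR).
    """
    ordered = sorted(rois, key=lambda roi: roi[0])  # sort by left edge x
    return "".join(ocr_char(roi) for roi in ordered)

# Toy example: blob detection finds 7 objects in "177,024" out of order.
# Here the recognizer is faked with a lookup table keyed by x position.
fake_glyphs = {0: "1", 10: "7", 20: "7", 28: ",", 34: "0", 44: "2", 54: "4"}
rois = [(34, 0, 8, 12), (0, 0, 8, 12), (10, 0, 8, 12), (20, 0, 8, 12),
        (44, 0, 8, 12), (54, 0, 8, 12), (28, 0, 4, 4)]
print(order_and_read(rois, lambda roi: fake_glyphs[roi[0]]))  # 177,024
```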
The final difficulty with the altitude font was “ghost characters”. Because the altitude changes quickly and my image acquisition process is, crudely, to take a screenshot every second, the altitude images contain characters that are a blend of more than one digit captured mid-update. No amount of training solved this problem: even after extensive training on the altitude font, I could only identify 94% of the NS11 altitudes correctly (every incorrect altitude contained a ghost character). I think an OCR engine with a more sophisticated neural network (like Tesseract) and font library would handle this problem better. Downloading the video and running OCR on the individual frames is another solution, and would deliver higher-resolution data.
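Since every wrong altitude came from a ghost character, a cheap post-processing filter can flag the suspects: with samples one second apart, consecutive altitude readings shouldn’t differ by more than the vehicle can plausibly climb or descend in a second. A minimal sketch, where the rate threshold is my assumed bound, not a value taken from the telemetry:

```python
def flag_ghosts(altitudes, max_rate_ft_per_s=4000):
    """Flag altitude readings that jump implausibly between 1 s samples.

    altitudes: list of parsed altitude values (ft), one per second.
    max_rate_ft_per_s: assumed upper bound on climb/descent rate; a jump
        beyond it is more likely a misread ghost character than real motion.
    Returns indices of suspect readings (the sample after each bad jump).
    """
    suspects = []
    for i in range(1, len(altitudes)):
        if abs(altitudes[i] - altitudes[i - 1]) > max_rate_ft_per_s:
            suspects.append(i)
    return suspects

# Toy example: 177,024 misread with a ghosted second digit jumps ~20,000 ft
# in one second; both the jump up and the return to normal get flagged.
readings = [173500, 175300, 197024, 178700]
print(flag_ghosts(readings))  # [2, 3]
```

A flagged reading can then be re-checked against the screenshot by hand, or simply dropped.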
I downloaded the NS10 webcast and repeated the whole process; the OCR routine recognized 100% of timestamps, 99.1% of velocities, and 95.8% of altitudes “correctly”. Here “correctly” means the identified value wasn’t obviously wrong, and I didn’t go back to manually check each screenshot and fix it. The velocity errors were due to confusion between 6 and 8, and the altitude errors were all ghost characters.