OCR using Machine Learning?

Does this book go into using machine learning to do OCR?

Hi @dexxta, thanks for your interest in the book! :slight_smile: There’s a chance we’ll cover OCR to some degree – do you have a specific use case you’d like to see covered? In case you find this helpful, we have a tutorial on the site about using Google’s Tesseract library to add OCR to apps – you can check it out here: https://www.raywenderlich.com/306-tesseract-ocr-tutorial-for-ios

The final NLP project is still TBD, and we might end up doing something like an app that uses your camera to provide realtime translations of text. However, that’s still up in the air and I can’t promise anything yet, so if OCR is your primary reason for considering the book, then I would recommend holding off until we have a definitive answer.

Please let me know if you have any other questions or concerns about the book!


I second this.
OCR (and other uses built on it, like NLP, auto-tagging, identifying fonts, filling in forms, etc.) is super important.
Apple natively provides text, word, and letter detection, but not OCR.

Tesseract is super awful: try starting a new project and integrating it. If you get it to work, you’re already a champion, because it’s pretty much impossible, and even if you succeed, good luck doing normal debugging and optimization.

Native detection + CoreML recognition would be good.
Real time translation with AR (like Google AR Translate) or video superimposition (with tracking) would be awesome.

Hey @einharch, thanks for the info! I haven’t actually ever used Tesseract – that stinks that it stinks! :frowning_face:

I see Google’s ML Kit apparently works on iOS, so that might be a good choice if you have any pressing needs. I haven’t tried it but I’ll check it out at some point to see if it would make sense as part of an app in the book. I’d prefer to stick with Apple stuff or write it from scratch, but we also have to balance what’s possible to cover in the space available and with a model trainable with a reasonable amount of data and compute.

Of course, I won’t be surprised if whatever we include needs to be completely replaced in a second edition, since I expect Apple to continue adding built-in features each year. :crazy_face:

Frankly, I am a little shocked that Apple built the entire text detection vision system without some way to recognise the actual text. I have found a few examples of how to train and use ML models to recognise letters, so I am in the process of writing my own version to suit my requirements. Just thought it might be something people would be very interested in.

I am sure text recognition is less useful in a business app than determining whether a picture is of a male or a female… Sorry for exploding the sarcasm meter, haha.

Still looking forward to the book though. :slight_smile:


It’s definitely something people would be interested in. And I agree, it seems like an oversight by Apple. They’ll tell you where they found characters, but then you need to figure out what those characters are yourself? I’d be surprised if we don’t see it next year. (Ideally sooner, but you know how Apple likes to save all their updates for WWDC. :wink:)

You can use the Vision API to locate the bounding boxes of characters, and then run those boxes through a model to identify the characters. That approach will work well, especially if it only needs to work in a known environment (e.g. a specific font, a well-defined background, etc.). The more variety it needs to support, though, the more data you’ll need to train it.
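To make the hand-off between the two stages concrete, here’s a minimal Swift sketch of the glue step: Vision reports each detected box in a normalized (0…1) coordinate space with a bottom-left origin, so you have to convert it to pixel coordinates before cropping the region for your character classifier. (On Apple platforms, Vision’s `VNImageRectForNormalizedRect` does essentially this conversion; the function name below is my own.)

```swift
import Foundation

// Vision's text observations use normalized (0...1) coordinates with a
// bottom-left origin. Images use a top-left origin, so flip the y-axis
// and scale by the image's pixel dimensions before cropping.
func pixelRect(fromNormalized box: CGRect,
               imageWidth: CGFloat,
               imageHeight: CGFloat) -> CGRect {
    CGRect(x: box.origin.x * imageWidth,
           y: (1 - box.origin.y - box.size.height) * imageHeight,
           width: box.size.width * imageWidth,
           height: box.size.height * imageHeight)
}
```

Each resulting rect can then be cropped out of the source image and fed to a Core ML character classifier.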

I know it’s something people want, so we’ll figure out if it fits anywhere. Good luck with your project!

@clapollo Thanks for the answer.

Yes, it seems like Tesseract has been kept much like a research project and never really made developer-friendly. Using it, especially in a modern Swift project, will make even the most patient person give up after pulling all their hair out.

ML Kit is really good and multiplatform, a plus for most developers. But being part of Firebase makes it unusable inside a framework (one of my jobs is distributing fully embedded frameworks, and while Core ML models can be embedded, downloaded from a server, or linked to IBM Watson, Firebase is a framework in itself and cannot be embedded and distributed as part of another framework).

I really hope you include some kind of translation app exercise, or maybe form input (like what Apple does to redeem an iTunes card, or to fill in WiFi password settings from a picture).
That kind of text/number detection is easy; you just need to train something like SwiftOCR.

The biggest challenge: training the model to detect all letters in all languages!!!
The English alphabet is easy, and so are the Japanese kana, but then add all the kanji (~60K), plus the fact that many kanji not only look exactly like kana (e.g. the kana “ka”, カ, versus the kanji “chikara”, power: 力) but are also just combinations of other kanji (like “kyou”: 協). This is super confusing to train, and the more languages you add, the less accurate the model becomes.

I am no ML/AI specialist, so I’d love to see some courses on how to solve these kinds of problems, or different approaches (like maybe splitting the models by language, then using NSLinguisticTagger to route the recognition to the correct model???)
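For what it’s worth, that per-language routing idea can be sketched in a few lines of Swift. The recognizer names here are entirely made up; the point is just the structure: detect the dominant language of some nearby text (on Apple platforms, `NSLinguisticTagger.dominantLanguage(for:)` gives you a code like "ja" or "zh"), then dispatch to a recognizer trained on that script alone.

```swift
import Foundation

// Hypothetical per-script recognizers; in a real app each case would
// wrap its own Core ML model trained on just that script.
enum ScriptRecognizer: String {
    case latin = "LatinOCR"
    case japanese = "KanaKanjiOCR"
    case chinese = "HanziOCR"
}

// Route a dominant-language code (e.g. from NSLinguisticTagger) to the
// recognizer trained for that script, falling back to Latin.
func recognizer(forLanguageCode code: String) -> ScriptRecognizer {
    switch code {
    case "ja":
        return .japanese
    case "zh", "zh-Hans", "zh-Hant":
        return .chinese
    default:
        return .latin
    }
}
```

Keeping each model’s character set small this way sidesteps some of the kana-versus-kanji confusion above, at the cost of a wrong guess when the language detector misfires.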

@dexxta I think the reason Apple avoided the text recognition part is probably:

  1. Having to deal with all languages (something that Apple always does better than Google; “most” of their APIs are usable smoothly across languages)
  2. Avoiding stepping on Tesseract, which is backed by Google now

But seeing that Google does that natively in Firebase, I’d love to see Apple at least open up their own tags as well (the ones they use in Photos, including facial recognition, not just detection).
They are tagging billions of photos, and they even have access to things like GPS data, tilt, camera focal data, etc. They could build the best tagging system ever.