Procedure

What’s a good OCR?

  • A good OCR has high accuracy.
  • It may be better to keep the OCR result as separate pages, and have them associated with the original images. This can help in the following ways:
    • helps proofreading, as described here.
    • it may be helpful to include with each entry a link to the page in the book, so that even if the reader/user suspects OCR errors, they can click on the link and see the original page. (I’ve used this feature in the MW dictionary a few times.)
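
The page-wise association described above can be sketched as a small record type. This is only an illustration: the class name `OcrPage`, the `scans/...` paths, and the URL scheme are hypothetical, not taken from any tool mentioned on this page.

```python
from dataclasses import dataclass


@dataclass
class OcrPage:
    page_number: int  # 1-based page number in the scanned book
    image_path: str   # hypothetical relative path to the original page image
    text: str         # OCR output for this page only


def page_link(base_url: str, page: OcrPage) -> str:
    """Build a link to the original scan of a page (URL layout is an assumption)."""
    return f"{base_url.rstrip('/')}/{page.image_path}"


pages = [
    OcrPage(1, "scans/page_0001.png", "first page text"),
    OcrPage(2, "scans/page_0002.png", "second page text"),
]

# A dictionary entry extracted from page 2 could carry this link,
# so a reader who suspects an OCR error can open the original image.
link = page_link("https://example.org/book/", pages[1])
```

Keeping the text per page (rather than as one merged file) is what makes such a link cheap to generate.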

How to OCR?

Software

Desktop/ web based, in roughly decreasing order of popularity

Android

  • Zoho Doc Scanner
  • Textfairy uses Tesseract, and is free and open source. OCR of Sanskrit using the hindI pack is unsatisfactory. Sanskrit support has been requested [github].
  • Google Lens

Scanning

See scanning page.

vflat ocr (paid)

  • Creating PDFs or running OCR on more than 10 images requires the paid app.
  • Supposedly it handles 2-column text well, and exports both PDF and text well. Besides the camera, it seems to import PDFs as well.
  • A South Korean product that depends on Google Vision; it handles several Indian languages.

Libraries

  • doc-curation - Can OCR some PDFs with Google Drive. It automatically splits a PDF into 25-page pieces and OCRs them individually. (Quota: queries per 100 seconds, global: 10,000; per day: 1,000,000,000.) See usage example here, function here.

  • Google Vision

    • script here: you may get an offer of USD 300 in usage credit. Accept it; it may let you OCR a few thousand images without charge.
    • alternative in doc-curation package.
  • ocropy

  • tessIndic

  • tess-parichit

  • tessHindI

  • Comparison

    • 2018 - google ocr vs sanskritocr
    • In 2021, Google Drive OCR occasionally messed up devanAgarI words embedded within English text (SP thread); the misreadings, driven by stroke similarity, coincidentally looked like translations (विधि → fate).
  • Sanskrit OCR guide by dhaval here.
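
The 25-page splitting mentioned for doc-curation above can be illustrated with a small page-range helper. This is a minimal sketch of the idea only; `split_page_ranges` is a hypothetical name and is not doc-curation's actual function.

```python
from typing import List, Tuple


def split_page_ranges(total_pages: int, chunk_size: int = 25) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


ranges = split_page_ranges(60)
# ranges == [(1, 25), (26, 50), (51, 60)]
```

Each range would then be extracted into its own small PDF and submitted for OCR separately, which keeps every request small.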

I am using Microsoft Document AI for OCR, and I think it is best in class. I have repeatedly tried many others, but its OCR is the best, better even than Google's, and it also gives structured output of paragraphs. For Hindi it has 99% accuracy. For Sanskrit it gets trapped when words get complicated, I think because it relies partly on ML for better prediction in OCR; otherwise it is fairly good. A fresh account gets USD 200 of credit for one month, which can OCR around 1.2-1.3 lakh (120,000-130,000) pages, so bulk data can be done in one go.

Claudia doesn’t support Hindi, but GPT is quite effective. After performing OCR, GPT could be used to assign a confidence score to each recognized word. For words with low confidence, GPT could suggest the correct word based on the sentence context. Users could then accept a suggested correction by clicking a simple tick mark next to it. This is likely to work better for Hindi; for Sanskrit I have doubts.

  • BlackNote, 2024
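
The confidence-based review flow proposed in the comment above could look roughly like this. It is a sketch under assumptions: the (word, confidence) pairs, the 0.8 threshold, and both function names are hypothetical, and no particular OCR or GPT API is implied.

```python
from typing import Dict, List, Tuple


def flag_low_confidence(words: List[Tuple[str, float]], threshold: float = 0.8) -> List[int]:
    """Return the indices of words whose confidence is below the threshold."""
    return [i for i, (_, confidence) in enumerate(words) if confidence < threshold]


def apply_accepted(words: List[Tuple[str, float]], accepted: Dict[int, str]) -> str:
    """Rebuild the text, replacing only the words whose suggestion the user accepted."""
    return " ".join(accepted.get(i, word) for i, (word, _) in enumerate(words))


# Made-up OCR output with per-word confidence scores.
ocr_words = [("the", 0.98), ("qnick", 0.41), ("brown", 0.95), ("fox", 0.93)]

to_review = flag_low_confidence(ocr_words)       # positions shown to the user
corrected = apply_accepted(ocr_words, {1: "quick"})  # user ticked the suggestion for word 1
```

Only the flagged positions need human attention, which is the point of the scheme; how reliable the confidence scores are for Sanskrit is exactly the doubt raised above.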

Training data

  • wikisource pages
    • Techniques
      • Process html source (example) to find tags like .
      • Get it from index pages (example).
    • konkaNI vishvakosha index.
    • meghasandesha here.

Other collections

Machine-generated data

  • Also, Tesseract-OCR has a program called text2image which takes unicode text and can create image files in different fonts, as well as apply some degradation to them to simulate scanned pages. The program doesn’t compile/work on Windows, but works on Linux.
  • One can even bootstrap using output of other OCR tools.
  • See repositories of relevant indic OCR projects.
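
A text2image run as described above might be wrapped like this. The flags `--text`, `--outputbase`, `--font` and `--fonts_dir` are text2image options; the file names, the font name, and the fonts directory are placeholder assumptions to adapt to your setup.

```python
import subprocess
from typing import List


def text2image_command(text_file: str, output_base: str, font: str, fonts_dir: str) -> List[str]:
    """Assemble the argument list for one text2image run (does not execute it)."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


def render_training_image(text_file: str, output_base: str, font: str, fonts_dir: str) -> None:
    """Run text2image; needs the Tesseract training tools installed (Linux)."""
    subprocess.run(text2image_command(text_file, output_base, font, fonts_dir), check=True)


# Placeholder inputs: adjust paths and font to what is installed locally.
cmd = text2image_command(
    "sanskrit_sample.txt", "out/sanskrit_sample", "Noto Sans Devanagari", "/usr/share/fonts"
)
```

Looping over several fonts with the same text file is a simple way to get varied synthetic pages from one corpus.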

Online

  • sanskritdictionary - Uses the Google Drive API. Provides a Tesseract alternative as well.
  • akSharamukhA - Uses Tesseract, which runs entirely in the browser!

Post processing

  • Ideally, we would get OCR outputs from multiple sources and then combine them to reduce errors (yet to try this).
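
One untested way to combine several OCR outputs is word-level majority voting. The sketch below aligns every output to the first one with Python's `difflib.SequenceMatcher` and, at each word position, keeps the most frequent candidate. It is an assumption-laden illustration, not a tried method: it ignores insertions and deletions, and the sample strings are made up.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List


def combine_ocr_outputs(outputs: List[str]) -> str:
    """Majority-vote over words, using the first output as the alignment reference."""
    if not outputs:
        return ""
    token_lists = [text.split() for text in outputs]
    reference = token_lists[0]
    # One vote counter per reference word, seeded with the reference's own reading.
    votes = [Counter([token]) for token in reference]
    for other in token_lists[1:]:
        matcher = SequenceMatcher(a=reference, b=other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only count spans that map word-for-word onto the reference.
            if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for offset in range(i2 - i1):
                    votes[i1 + offset][other[j1 + offset]] += 1
    return " ".join(counter.most_common(1)[0][0] for counter in votes)


outputs = [
    "the qnick brown fox",   # engine A misreads the second word
    "the quick brown fox",   # engine B is correct here
    "the quick brown f0x",   # engine C misreads the last word
]
combined = combine_ocr_outputs(outputs)
```

Voting helps only when the engines make different mistakes; with two sources that disagree there is no majority, so at least three are needed for this to be useful.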

Fora

Alternatives to OCR

PDF text extraction: see thread here.