Procedure

What’s a good OCR?

  • A good OCR has high accuracy.
  • It may be better to keep the OCR results as separate pages, associated with the original images. This can help in the following ways:
    • It helps proofreading, as described here.
    • It may be helpful to include with each entry a link to the page in the book, so that even if the reader/user suspects OCR errors, they can click the link and see the original page. (I’ve used this feature in the MW dictionary a few times.)
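
A minimal sketch of such a page-wise layout, assuming sequentially numbered page images; the base URL and file naming are hypothetical:

```python
# Hypothetical sketch: write OCR output page-wise, with each page file
# linking back to its source image so proofreaders can check the original.
from pathlib import Path

IMAGE_BASE_URL = "https://archive.example.org/book/page"  # assumed URL scheme

def write_pages(ocr_pages: list[str], out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for num, text in enumerate(ocr_pages, start=1):
        # Each page file starts with a link back to the scanned image.
        page_file = out / f"page_{num:04d}.md"
        page_file.write_text(f"[source image]({IMAGE_BASE_URL}/{num})\n\n{text}\n")
```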

How to OCR?

Software

Desktop/web-based, in roughly decreasing order of popularity

Android

  • Zoho Doc Scanner
  • Textfairy uses Tesseract, and is open source and free. OCRing Sanskrit using the hindI pack is unsatisfactory; Sanskrit support has been requested [github].
  • Google Lens

Scanning

See scanning page.

vFlat OCR (paid)

  • PDF export and OCR beyond 10 images require the paid app.
  • Supposedly it handles two-column text well, and exports both PDF and plain text well. Besides camera capture, it seems to import PDFs as well.
  • A South Korean product that relies on Google Vision; handles several Indian languages.

Misc apps

OCR Scanner, Google Lens, Pramukh OCR, etc.

PIOCR: PIOCR is very good, especially with multi-column PDF/image files. Using Paint, draw vertical lines to separate the columns, then save and upload the image for best output (a scripted version of this trick is sketched below). Not a free package, but 5 free scans per day.
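
The manual Paint step can be scripted. A sketch with Pillow, where the column x-positions are per-scan guesses:

```python
# Scripted version of the manual Paint trick above: draw vertical
# separator lines at assumed column boundaries before uploading.
from PIL import Image, ImageDraw

def add_column_separators(src: str, dst: str, xs: list[int]) -> None:
    img = Image.open(src).convert("RGB")
    draw = ImageDraw.Draw(img)
    for x in xs:
        # A solid vertical line helps the engine treat the columns
        # as separate text blocks.
        draw.line([(x, 0), (x, img.height)], fill="black", width=3)
    img.save(dst)

# Example: a two-column scan split at x=1200 (adjust per scan).
add_column_separators("scan.png", "scan_marked.png", xs=[1200])
```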

Libraries

  • doc-curation - Can OCR some PDFs with Google Drive. Automatically splits them into n-page chunks and OCRs each individually. (Quota: 10,000 queries per 100 seconds globally; 1,000,000,000 per day.) See usage example here, function here. The underlying Drive trick is sketched after this list.

  • Google Vision

    • script here: You may get an offer of USD 300 in credit. Accept it; it may enable you to OCR a few thousand images without charges.
    • An alternative exists in the doc-curation package.
  • ocropy

  • tessIndic

  • tess-parichit

  • tessHindI

  • Comparison

    • 2018 - Google OCR vs SanskritOCR
    • In 2021, Google Drive OCR occasionally messed up devanAgarI words embedded within English text (SP thread), so that the garbled output coincidentally appeared (based on stroke similarity) to be a translation (विधि → fate).
  • Sanskrit OCR guide by dhaval here.
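
The Google Drive trick that doc-curation automates can also be scripted directly. A hedged sketch, assuming an authorized Drive v3 `service` built with google-api-python-client (this is not doc-curation’s own API):

```python
# Sketch of Google Drive OCR: upload an image with a Google-Docs target
# mimeType (Drive runs OCR during conversion), then export plain text.
from googleapiclient.http import MediaFileUpload

def drive_ocr(service, image_path: str, language: str = "sa") -> str:
    meta = {"name": image_path,
            "mimeType": "application/vnd.google-apps.document"}
    media = MediaFileUpload(image_path, mimetype="image/png")
    # ocrLanguage is a hint for the OCR run on conversion.
    doc = service.files().create(body=meta, media_body=media,
                                 ocrLanguage=language).execute()
    # Export the resulting Google Doc as plain text (returns bytes).
    data = service.files().export(fileId=doc["id"],
                                  mimeType="text/plain").execute()
    return data.decode("utf-8")
```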

I am using Microsoft Document AI for OCR and I think it is best in class. I have continuously tried many, but its OCR is the best, better even than Google’s, and it also gives structured output of paragraphs. For Hindi it has 99% accuracy; for Sanskrit it gets trapped when words get complicated, because I think it uses some ML for better prediction in OCR; otherwise it is fairly good. A fresh account gets $200 credit for one month, which can OCR around 1.2-1.3 lakh pages, so bulk data can be done in one go.

Claude doesn’t support Hindi, but GPT is quite effective. After performing OCR, GPT could be used to assign a confidence score to each recognized word. For words with low confidence, GPT could suggest the correct word based on the sentence context. Users could then easily replace the incorrect words by clicking a simple tick mark next to the suggested corrections. This works better in Hindi; for Sanskrit I have doubts.

  • BlackNote, 2024
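
A minimal sketch of the confidence-scoring idea above, using the OpenAI Python client; the prompt and model name are illustrative assumptions, not a tested recipe:

```python
# Sketch: ask an LLM to flag likely OCR errors with confidence scores
# and context-based suggested corrections.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def suggest_corrections(ocr_text: str) -> str:
    prompt = (
        "The following text comes from OCR of a Hindi document. "
        "List words that look like OCR errors, each with a confidence "
        "score (0-1) and a suggested correction based on context:\n\n"
        + ocr_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```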

Training data

  • wikisource pages
    • Techniques
      • Process the HTML source (example) to find the relevant tags.
      • Get it from index pages (example).
    • konkaNI vishvakosha index.
    • meghasandesha here.
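
A hedged sketch of pulling proofread text from a Wikisource page (to pair with its scanned image as training data); the CSS class below is the standard MediaWiki content wrapper, but inspect the actual page source first:

```python
# Sketch: scrape the proofread text of one Wikisource page.
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Proofread text normally sits inside the main content div;
    # adjust the selector to the page you are scraping.
    content = soup.find("div", class_="mw-parser-output")
    return content.get_text("\n", strip=True) if content else ""
```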

Other collections

Machine-generated data

  • Also, Tesseract-OCR has a program called text2image which takes Unicode text and can create image files in different fonts, as well as apply some degradation to simulate scanned pages (see the sketch after this list). The program doesn’t compile/work on Windows, but works on Linux.
  • One can even bootstrap using the output of other OCR tools.
  • See repositories of relevant indic OCR projects.
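
A sketch of driving text2image from Python, assuming a Linux box with the Tesseract training tools installed; the font name and paths are placeholders, and the flags should be verified against text2image --help:

```python
# Sketch: generate synthetic training pages from Unicode text with
# Tesseract's text2image (Linux only, as noted above).
import subprocess

subprocess.run([
    "text2image",
    "--text=corpus.txt",           # UTF-8 Devanagari source text
    "--outputbase=train/sample",   # writes sample.tif + sample.box
    "--font=Lohit Devanagari",     # any installed Devanagari font
    "--fonts_dir=/usr/share/fonts",
    "--degrade_image=true",        # simulate scanning artifacts
], check=True)
```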

Online

LLM AI

“Would you say that processing PDFs using multi-modal LLMs is always guaranteed to produce more accurate output compared to using plain OCR engines (Tesseract/ Google Vision)?”

While it hedges its bets, as it is wont to do since I asked for confirmation in 100% of the cases, the answer must be read as a definite YES. One of the biggest reasons: LLMs have access to the entire context, while “dumb” OCR engines do not.

“If I gave an LLM a PDF of an English translation of the Ramayana and it encounters a word that reads like ‘Alexander,’ what would it do? Simply add it to the output stream, or try to figure out if the word makes sense in the context?”

The answer is a definite NO to “simply add it to the output stream.” It WILL try to figure out whether the word makes sense in the context of the rest of the document, together with a comparison against the visual evidence.

Prompts and usage elsewhere.
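
A minimal sketch of OCR via a multimodal LLM, assuming the OpenAI Python client; the model name and prompt are illustrative, and the output should still be checked against the scan:

```python
# Sketch: send a page image to a vision-capable chat model and ask
# for a faithful transcription (no translation, no "fixing").
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_ocr(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this scanned page exactly. "
                         "Do not translate or correct the text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```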

  • Dharmamitra (Google Gemini, billing covered by grant money)

Google Vision

  • sk individual billing
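
A minimal sketch using the official google-cloud-vision Python client (billing must be set up as noted above):

```python
# Sketch: OCR a single page image with the Google Cloud Vision API.
from google.cloud import vision

def vision_ocr(image_path: str) -> str:
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    # document_text_detection is tuned for dense text such as book pages.
    response = client.document_text_detection(image=image)
    return response.full_text_annotation.text
```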

Tesseract

  • akSharamukhA - Uses Tesseract, which runs entirely in the browser!
  • anunAd’s Colab notebook
  • sw
  • charya
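
A minimal sketch using pytesseract; it assumes the Tesseract binary plus the Devanagari traineddata (san/hin) are installed:

```python
# Sketch: OCR a Devanagari page image via the pytesseract wrapper.
from PIL import Image
import pytesseract

def tesseract_ocr(image_path: str, lang: str = "san+hin") -> str:
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)

print(tesseract_ocr("page_0001.png"))
```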

Grantha-ocr

Gathering training data: [Collecting training data for grantha script OCR](https://docs.google.com/document/d/19a5Qjc4BXItn9TXJ3yh6u1i0mRUyK05wMZCO8sMX_GQ/edit?tab=t.0)

Post processing

  • Ideally, we would get OCRs from multiple sources and then combine them to reduce errors (yet to try this; a naive voting sketch follows).
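
A deliberately naive sketch of the combination idea, voting word-by-word across engines; real inputs would need proper sequence alignment first, since a single insertion or deletion throws positional voting off:

```python
# Sketch: combine several OCR outputs by word-level majority vote.
from collections import Counter
from itertools import zip_longest

def combine_ocr(outputs: list[str]) -> str:
    """Pick, at each word position, the word most engines agree on."""
    tokenized = [o.split() for o in outputs]
    merged = []
    for words in zip_longest(*tokenized, fillvalue=""):
        word, _count = Counter(w for w in words if w).most_common(1)[0]
        merged.append(word)
    return " ".join(merged)

print(combine_ocr([
    "rāmo rājamaṇiḥ sadā vijayate",
    "rāmo rājamaniḥ sadā vijayate",
    "rāmo rājamaṇiḥ sadā vijayatē",
]))  # the majority vote fixes each single-engine error
```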

Fora

Alternatives to OCR

PDF text extraction: see thread here.
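
A minimal sketch of direct text extraction with PyMuPDF; this only works when the PDF carries a real text layer rather than page images:

```python
# Sketch: pull embedded text out of a PDF without any OCR.
import fitz  # PyMuPDF

def extract_text(pdf_path: str) -> str:
    with fitz.open(pdf_path) as doc:
        return "\n\n".join(page.get_text() for page in doc)
```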