Calibre is far from ideal, so I wonder if there is a better way to convert a PDF into EPUB. Maybe a new AI tool exists for that purpose? What do you use?

Cc @nostupidquestions@lemmy.world

  • j4k3@lemmy.world
    1 year ago

    I can’t handle complicated PDFs, and sometimes I still need OCR, depending on how the text is embedded. Usually, though, I run pdftotext on the command line, then write a bash script with a while loop that reads each line and makes a few modifications, such as removing the newline from the end of each line and adding a space in its place. This reflows the text so that it can be wrapped or processed further.
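
The reflow step described above could look roughly like this. It is a minimal sketch, not the poster's actual script: the function name `reflow` is hypothetical, and it assumes that a blank line in the pdftotext output marks a paragraph break, which the post does not state.

```shell
#!/usr/bin/env bash
# Sketch: join hard-wrapped lines into one line per paragraph.
# Assumption (not from the post): a blank line separates paragraphs.

reflow() {
    local line paragraph
    paragraph=""
    # The `|| [ -n "$line" ]` part also picks up a last line
    # that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -z "$line" ]; then
            # Blank line: emit the paragraph collected so far.
            if [ -n "$paragraph" ]; then
                printf '%s\n\n' "${paragraph% }"
                paragraph=""
            fi
        else
            # Drop the newline and add a space instead.
            paragraph="$paragraph$line "
        fi
    done
    # Emit whatever is left after the last line.
    if [ -n "$paragraph" ]; then
        printf '%s\n' "${paragraph% }"
    fi
}
```

With pdftotext installed, something like `pdftotext input.pdf - | reflow > book.txt` would feed it (the file names are placeholders; `-` makes pdftotext write to standard output).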

    Okular also has some powerful text tools built in, though I don’t think anything there is fully automated. I use Okular for extracting tables, as it can extract arbitrary columns and rows.

    It has been several years since I did this one, but there are ways of extracting the text automatically using OCR on the command line. I don’t recall the exact details, but the info is out there if you look hard enough. I used it for a bunch of old scanned datasheets for vintage computing hardware. The problem was that the scan quality was just not good enough for the OCR, so many cases required manual intervention for confusions like 0:O, S:5, 1:l, etc. IIRC the toolchain I came up with was based on Tesseract 3.
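
For reference, one common way to do command-line OCR today is sketched below. This is an assumption, not the Tesseract 3 toolchain mentioned above, whose details are not given: it uses `pdftoppm` from poppler-utils to render pages and the `tesseract` CLI to read them, and the function name `ocr_pdf` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: OCR a scanned PDF page by page and print the text.
# Assumes pdftoppm (poppler-utils) and tesseract are installed.
# This is NOT the poster's original toolchain.

ocr_pdf() {
    local pdf lang workdir img
    pdf=$1              # scanned input PDF
    lang=${2:-eng}      # Tesseract language code, English by default
    workdir=$(mktemp -d) || return 1

    # Render every page to a 300 dpi PNG named page-<n>.png.
    if ! pdftoppm -r 300 -png "$pdf" "$workdir/page"; then
        rm -rf "$workdir"
        return 1
    fi

    for img in "$workdir"/page-*.png; do
        # Using "stdout" as the output base makes tesseract print the text.
        tesseract "$img" stdout -l "$lang"
    done

    rm -rf "$workdir"
}
```

Calling `ocr_pdf scan.pdf > scan.txt` (placeholder file names) would then leave a text file that still needs proofreading for the character confusions listed above.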

    I now have an Android app that is based on Tesseract 5. I rarely use it, but when I have, the results have been flawless. Maybe this will lead you to something useful:

    https://github.com/SubhamTyagi/android-ocr