Calibre is far from ideal, so I wonder if there is a better way to convert a PDF into an EPUB? Maybe a new AI tool exists for that purpose? What do you use?
I had this exact question myself a little while ago, so I’ll share what I learned. I don’t know your level of familiarity with these things, so forgive me if I’m explaining things you already know. And spoiler alert: the answer is “technically, but not how you’d like”.
An EPUB “file” is really a folder containing a bunch of individual HTML files which hold the text of the book, along with things like the table of contents, images (if your ebook has pictures), and CSS for styling. This is the exact medium you’d work in if you were designing a web page, but with an ebook there are different best practices and considerations.
Now, assuming your PDF has a good OCR (optical character recognition) layer, it is possible for Calibre and other programs to grab the text of the PDF and even to create an EPUB from it. But as you’ve noticed, they don’t do a good job of this. The fundamental problem is that creating an EPUB is something of an art, with best practices and personal choices around layout and file structure. When you “convert”, you’re not changing the file type from PDF to EPUB; you’re grabbing the text from the PDF and then distributing it across multiple files, with HTML and CSS instructions throughout to tell the e-reader how to lay things out, which footnotes link to which annotations, where to display pictures, etc.
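To give you a feel for what “making an EPUB” actually involves, here’s a rough sketch using the Python ebooklib package (this isn’t what Calibre does internally, and the titles, file names and CSS here are just placeholders):

```python
from ebooklib import epub

book = epub.EpubBook()
book.set_identifier("placeholder-id")
book.set_title("My Converted Book")
book.set_language("en")
book.add_author("Original Author")

# Each chapter is its own XHTML file inside the EPUB container.
chapter = epub.EpubHtml(title="Chapter 1", file_name="chap_01.xhtml", lang="en")
chapter.content = "<h1>Chapter 1</h1><p>Text pulled out of the PDF goes here.</p>"
book.add_item(chapter)

# A shared stylesheet controls how every chapter is displayed.
css = epub.EpubItem(uid="style", file_name="style/main.css",
                    media_type="text/css",
                    content="body { font-family: serif; line-height: 1.4; }")
book.add_item(css)

# The table of contents and the spine tell the e-reader what to show, in what order.
book.toc = (epub.Link("chap_01.xhtml", "Chapter 1", "chap01"),)
book.add_item(epub.EpubNcx())
book.add_item(epub.EpubNav())
book.spine = ["nav", chapter]

epub.write_epub("my_book.epub", book)
```

Even in this toy example, how you split the chapters and what goes in the CSS are choices you make, not something a converter can just read off the PDF.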
As far as I’m aware, this basically can’t be done (well) with dumb, automatic programs like what Calibre offers because there’s too much “thinking” involved. Perhaps an AI tool could be created that would handle this better, but I’m not aware of one, and it’s a pretty specialised application so it’s possible you’ll need to wait a while before someone gets around to that.
So I realised that if I wanted an EPUB version, I’d need to make it myself. I used Sigil, a free EPUB creation tool, which gave me some nice features to speed up the process, but it’s still a big time commitment (unless you’re working with a very short PDF), especially for your first EPUB, where you’re learning what to do while making it. You’ll also need to learn some HTML and CSS if you haven’t already.
I did it as a sort of fun side project in my free time to learn a new skill, but other than that, I don’t think there’s such a thing as an “EPUBinator” that’s gonna take your PDF and create a well-made ebook.
You’ve identified the main issue: PDF extraction. A PDF can lay out pages in an infinite number of ways.
My personal workflow is to take the PDF, run it through ClearType OCR, and save it as a web-friendly, accessibility-standard-compliant PDF, which extracts all the text and re-lays it out so a screen reader can read it in the correct order.
After that, it’s a matter of exporting the PDF to HTML, chunking it, zipping the results with a CSS file and a manifest, and you’ve got an ePub.
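For the zipping step specifically, here’s a rough Python sketch (the file names are made up, and it assumes you’ve already produced the XHTML chunks, the CSS, the container.xml and the OPF manifest yourself):

```python
import zipfile

# Assumes these already exist on disk (hand-written or exported); names are placeholders.
files = [
    "META-INF/container.xml",   # points the reader at the OPF file
    "OEBPS/content.opf",        # the manifest: lists every file in the book
    "OEBPS/chap_01.xhtml",
    "OEBPS/chap_02.xhtml",
    "OEBPS/style.css",
]

with zipfile.ZipFile("book.epub", "w") as z:
    # The EPUB spec wants 'mimetype' as the first entry, stored uncompressed.
    z.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
    for name in files:
        z.write(name, arcname=name, compress_type=zipfile.ZIP_DEFLATED)
```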
And of course, there are Python libraries to do a lot of the conversion as well.
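For example, PyMuPDF (imported as fitz) handles the text-extraction part; something like this dumps each page’s text to its own file (the input file name is just a placeholder):

```python
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder file name
for i, page in enumerate(doc):
    # "text" gives plain extracted text; "html" preserves more of the layout.
    text = page.get_text("text")
    with open(f"page_{i:03d}.txt", "w", encoding="utf-8") as out:
        out.write(text)
```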
The ideal would have to be some sort of AI translation. The problem is that PDF is a page-layout format and EPUB is a reading format, and you can’t just extract the text without understanding which parts are affected by the page layout; think of reading by columns, for example. And you would need to train the AI on what’s unnecessary for reading comprehension.
By “far from ideal”, I think you mean “not perfect”.
No. They mean really bad. OP is being overly polite.
And ugly!
An ugly powerhouse Linux application? What will they think of next?!
Yeah! Like Audacity!
It can’t handle complicated PDFs, and sometimes I still need to use OCR depending on how the text is embedded, but I usually use pdftotext on the command line, then write a bash script with a while loop that reads each line, appends a space, and strips the trailing newline. This reflows the text so that it can be wrapped or processed further.
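In case it helps, here’s roughly the same idea sketched in Python rather than bash (file names are placeholders; it calls pdftotext from poppler-utils and then joins the hard-wrapped lines back into paragraphs):

```python
import subprocess

# pdftotext comes from poppler-utils; file names here are placeholders.
subprocess.run(["pdftotext", "scan.pdf", "raw.txt"], check=True)

with open("raw.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

# Blank lines mark paragraph breaks; everything else gets joined with a space
# instead of a newline, which reflows the text.
paragraphs, current = [], []
for line in lines:
    if line.strip():
        current.append(line.strip())
    elif current:
        paragraphs.append(" ".join(current))
        current = []
if current:
    paragraphs.append(" ".join(current))

with open("reflowed.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(paragraphs))
```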
Okular also has some powerful text tools built in. I don’t think there is anything fully automated though. I use it for extracting tables as it can arbitrarily extract columns and rows.
It has been several years since I did this, but there are ways of extracting the text automatically using OCR on the command line. I don’t recall the exact details, but the info is out there if you look hard enough. I used it for a bunch of old scanned datasheets for vintage computing hardware. The problem was that the scan quality just wasn’t good enough for the OCR, and it required manual intervention in many cases for things like 0:O, S:5, 1:l, etc. IIRC the toolchain I came up with was based on Tesseract 3.
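I don’t remember my exact commands any more, but a rough modern sketch of the same idea, using the pdf2image and pytesseract wrappers (the file name and DPI are placeholders, not what I actually used), would be something like:

```python
from pdf2image import convert_from_path  # needs poppler installed
import pytesseract                        # needs the tesseract binary installed

# Render each PDF page to an image, then OCR it with Tesseract.
pages = convert_from_path("datasheet.pdf", dpi=300)  # placeholder file name
for i, image in enumerate(pages):
    text = pytesseract.image_to_string(image)
    # Confusions like 0/O, S/5 and 1/l still need manual cleanup afterwards.
    with open(f"page_{i:03d}.txt", "w", encoding="utf-8") as out:
        out.write(text)
```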
I have an Android app now that is based on Tesseract 5. I rarely use it, but when I have, the results have been flawless. Maybe this will lead you to something useful: