Of course, I want to code this in C#, preferably as a web application. It is clear to me that I’ll need a way of looking into a PDF document, extracting the textual contents (be it headings, table content or regular text), and displaying them for indexing purposes.
Following that, I’ll need to make a database containing metadata about each given document. I’ll also need to develop a method of searching.
As for reading a PDF file in C#: numerous Google searches led me to this blog post, which explains the usage of PDFBox. PDFBox was originally a Java implementation, but it is usable from C#. It’s a pretty large library, too. After some testing, I found that this library is not for me, since it can’t handle most of my PDF files, which were scanned and then run through OCR.
A quick Google search turned up this Stack Overflow question. It deals with the problem: “PDFBox chokes on many PDF files which are generated by newer tools, and is not too consistent about PDFs it can handle”. Post #1 then suggests this tool: PDFTextStream. It’s a library for .NET (and Java and Python) that can extract text, which is just what I wished for. Except... it’s not free!
Alas, I’m back at the start. Another tool mentioned was the command line utility PdfToText. It can be downloaded from here, as a precompiled Windows binary. I tried it, and it output just the text I needed. For free!
The plan is then to use this command line utility from the application. It will export a text file that can be read into the application.
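A minimal sketch of that plan, assuming the binary is invoked as `pdftotext <input.pdf> <output.txt>` and that the application knows the path to the executable. The class name `PdfTextExtractor`, its method names, and the temporary file location are my own placeholders, not part of any library.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class PdfTextExtractor
{
    // Builds the command line: quoted input PDF followed by quoted output text file.
    // Quoting keeps paths that contain spaces in one piece.
    public static string BuildArguments(string pdfPath, string textPath)
    {
        return "\"" + pdfPath + "\" \"" + textPath + "\"";
    }

    // Runs the utility, waits for it to finish, and returns the exported text.
    // pdfToTextExe is the (assumed) full path to the downloaded binary.
    public static string ExtractText(string pdfToTextExe, string pdfPath)
    {
        // Hypothetical scratch file; the utility writes its output here.
        string textPath = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".txt");

        try
        {
            var startInfo = new ProcessStartInfo
            {
                FileName = pdfToTextExe,
                Arguments = BuildArguments(pdfPath, textPath),
                UseShellExecute = false, // start the executable directly, no shell
                CreateNoWindow = true    // no console window popping up on the server
            };

            using (Process process = Process.Start(startInfo))
            {
                process.WaitForExit();
                if (process.ExitCode != 0)
                {
                    throw new InvalidOperationException(
                        "Text extraction failed with exit code " + process.ExitCode);
                }
            }

            return File.ReadAllText(textPath);
        }
        finally
        {
            // Clean up the scratch file; Delete does nothing if it was never created.
            File.Delete(textPath);
        }
    }
}
```

Calling `PdfTextExtractor.ExtractText(@"C:\tools\pdftotext.exe", @"C:\docs\sample.pdf")` (both paths are made up) would then hand back the document text as a string, ready for indexing. In a web application the account running the site needs permission to start the process and to write to the temp folder, so that is worth checking early.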