PDF Search Site

Background tools

Naturally, I want to write this in C#, preferably as a web application. That means I need a way to look inside a PDF document, extract its textual content (headings, table contents and regular text) and output it so that it can be indexed.

Following that, I’ll need to make a database containing metadata about each given document. I’ll also need to develop a method of searching.

As for reading a PDF file in C#: numerous Google searches led me to this blog post, which explains how to use PDFBox. PDFBox was originally a Java implementation, but it is usable from C#. It's a pretty large library, too. After some testing, I found that this library is not for me, since it can't handle most of my PDF files, which were scanned and then run through OCR.

Another quick Google search turned up this Stack Overflow question. It deals with exactly this problem: “PDFBox chokes on many PDF files which are generated by newer tools, and is not too consistent about PDFs it can handle”. The first reply suggests PDFTextStream, a library for .NET (as well as Java and Python) that can extract text, which is just what I wished for. Except it's not free!

Alas, I'm back to square one. Another tool that was mentioned is pdftotext, a native command-line utility (it ships with Xpdf; it is not a Java tool). It can be downloaded from here as a precompiled Windows binary. I tried it, and it output exactly the text I needed. For free!

The plan, then, is to call this command-line utility from the application. It will export a text file that the application can read back in.
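To make the plan concrete, here is a minimal sketch of how the application might call the utility and read the result. The class name PdfTextExtractor, its method names, and the assumption that the pdftotext executable is on the PATH are all mine and purely illustrative. The -enc UTF-8 option is also my choice, so that the output file can be read with a known encoding.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

// Hypothetical helper class; the names are illustrative only.
public static class PdfTextExtractor
{
    // Builds the command line: pdftotext -enc UTF-8 "<input.pdf>" "<output.txt>"
    // Paths are quoted so that file names containing spaces survive.
    public static string BuildArguments(string pdfPath, string txtPath)
    {
        return "-enc UTF-8 \"" + pdfPath + "\" \"" + txtPath + "\"";
    }

    // Runs pdftotext on the given PDF and returns the extracted text.
    // Assumption: exePath points at the pdftotext binary (or it is on the PATH).
    public static string ExtractText(string pdfPath, string exePath = "pdftotext")
    {
        // pdftotext writes its output to a file, so use a temporary one.
        string txtPath = Path.GetTempFileName();
        try
        {
            var startInfo = new ProcessStartInfo
            {
                FileName = exePath,
                Arguments = BuildArguments(pdfPath, txtPath),
                UseShellExecute = false,   // required for stream redirection
                CreateNoWindow = true,
                RedirectStandardError = true
            };

            using (Process process = Process.Start(startInfo))
            {
                // Read stderr before waiting, so a full pipe cannot block the tool.
                string errors = process.StandardError.ReadToEnd();
                process.WaitForExit();

                // A non-zero exit code means pdftotext reported an error.
                if (process.ExitCode != 0)
                {
                    throw new InvalidOperationException(
                        "pdftotext exited with code " + process.ExitCode + ": " + errors);
                }
            }

            return File.ReadAllText(txtPath, Encoding.UTF8);
        }
        finally
        {
            // Always clean up the temporary text file.
            File.Delete(txtPath);
        }
    }
}
```

A call would then look like `string text = PdfTextExtractor.ExtractText(@"C:\docs\scan.pdf");`, where the path is a made-up example. The returned text could then be handed to whatever indexing step comes next.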
