An Introduction
So, I’m moving into an apartment in the near future. Once I do, I’ll have to learn how to manage my documents and make sure I can find them again later. I’ve thought of a few solutions, one of them being what I do now, where I simply keep a folder of PDFs of my documents (I digitize everything). This works fine for now (I have around 20 documents), but it may not work in two years when I have a hundred documents spread over many different companies and years.
I then thought of making the system searchable. This means I need a method of indexing all documents, preferably long-term. I thought of making a website, where I attach metadata to each uploaded document (that I make searchable by using OCR). Metadata will include things like the start and end date, a descriptive title, and perhaps one or two categories.
The idea is to archive all documents and bills from cable, ISP, heating, electricity, etc., with their validity dates. When I search for a given date, I want to see all documents that cover that date and match those criteria (for example, “Electricity”).
Using a system like this, it’d be easy to go back in time and view bills from a provider at a given date, without pulling down a 4 kg ring binder full of mixed pages.
Background Tools
I, of course, would want to code this in C#, preferably as a web application. I’ll need a method to extract text from PDF documents (headings, tables, regular text) for indexing purposes.
For reading PDFs in C#, I found this blog post about PDFBox. It’s originally a Java library but can be used in C#. After testing, I found that PDFBox doesn’t handle most of my OCR-scanned PDFs.
A Stack Overflow thread confirmed this problem. The first post recommends PDFTextStream, which works well but isn’t free.
Back to square one, I found PdfToText, a simple command-line tool available here. It outputs exactly what I need—for free!
The plan is to use this utility in the application. It exports a text file that the app can read.
The Database
The database is an important part. I plan to have tables for:
- Users: Security is important.
- Documents: The core of the product.
- Categories: To assist with sorting.
Initially, I won’t store files in the DB—just text for search. I’ll store filename, hash (SHA1), upload date, and the extracted text. The hash maps to the physical file.
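As a rough illustration of the hash-to-file mapping, here is a minimal sketch. The class name `FileHasher`, the `StoredPath` helper, and the idea of naming the stored file after its hash are assumptions made for the example, not details from the actual site.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHasher
{
    // Returns the SHA-1 digest of the stream contents as 40 lowercase hex characters.
    public static string Sha1Hex(Stream content)
    {
        using (var sha1 = SHA1.Create())
        {
            byte[] digest = sha1.ComputeHash(content);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Hypothetical layout: the physical file is named after its hash,
    // so the database row only needs the hash to locate it.
    public static string StoredPath(string storageRoot, string sha1Hex)
    {
        return Path.Combine(storageRoot, sha1Hex + ".pdf");
    }
}
```

Because the stored name depends only on the content, uploading the same PDF twice would resolve to the same physical file.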
Each document also gets a date range (datefrom, dateto). If dateto is empty, only datefrom is relevant.
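The date-range rule can be sketched as a small predicate. This is only an illustration: the `DateRangeFilter` class is hypothetical, and it assumes that an empty dateto means the document applies to the single day given by datefrom (the other reading, an open-ended range, would need a different comparison).

```csharp
using System;

public static class DateRangeFilter
{
    // True when the searched date falls inside the document's validity range.
    // Assumption: a null dateTo means the document covers only the day dateFrom.
    public static bool Covers(DateTime dateFrom, DateTime? dateTo, DateTime searched)
    {
        DateTime day = searched.Date; // ignore the time of day
        if (dateTo == null)
            return dateFrom.Date == day;
        return dateFrom.Date <= day && day <= dateTo.Value.Date;
    }
}
```

For example, a bill valid from 1 January to 31 March would match a search for 15 February, while a single-day receipt would match only its own date.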
The Website
I’ll code the site in C#. This time, I’m using LINQ. It lets me query my database directly in code (rather than writing ADO.NET commands by hand). Really handy for fast development.
I’m also using .ashx handlers to provide downloads without exposing file locations. This lets me use the app_data folder securely.
Key features:
- PDF extraction
- Private security considerations
- Document encryption
- Searching algorithm
- Download page
PDF Extraction
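I settled on pdftotext, the command-line tool from Xpdf, for extracting text. The method below runs it on an uploaded file and reads back the text file it produces.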
// Requires: using System.IO;
// Runs pdftotext.exe on an uploaded PDF and returns the extracted text.
private void PDFToText(string tempName, out string error, out string text)
{
    string path_exe = Server.MapPath("~/app_data/temp/pdftotext.exe");
    string path_pdf = Server.MapPath("~/app_data/temp/" + tempName);
    string path_txt = Server.MapPath("~/app_data/temp/" + tempName + ".txt");

    var pstart = new System.Diagnostics.ProcessStartInfo
    {
        FileName = path_exe,
        // Quote both paths so file names containing spaces survive.
        Arguments = $"\"{path_pdf}\" \"{path_txt}\"",
        UseShellExecute = false,
        RedirectStandardOutput = false,
        RedirectStandardError = true,
        CreateNoWindow = true
    };

    using (var p = new System.Diagnostics.Process { StartInfo = pstart })
    {
        p.Start();
        // Read stderr before waiting: if the pipe buffer fills up while we
        // are blocked in WaitForExit, pdftotext would never finish.
        error = p.StandardError.ReadToEnd();
        p.WaitForExit();
    }

    // Read the text file pdftotext produced, then clean it up.
    text = File.Exists(path_txt) ? File.ReadAllText(path_txt) : "";
    if (File.Exists(path_txt)) File.Delete(path_txt);
}
Considerations of Private Security
Some files contain personal data (e.g., Danish CPR numbers). I created a regex to remove patterns like xxxxxx-xxxx, replacing them with {CPR REMOVED}.
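The exact regex isn't shown here, so the following is only a sketch of what such a scrubber might look like. The `CprScrubber` class and the pattern itself (six digits, a hyphen, four digits, not embedded in a longer run of digits) are assumptions.

```csharp
using System.Text.RegularExpressions;

public static class CprScrubber
{
    // Assumed pattern: six digits, a hyphen, four digits.
    // The lookarounds keep it from matching inside a longer digit sequence.
    private static readonly Regex CprPattern = new Regex(@"(?<!\d)\d{6}-\d{4}(?!\d)");

    // Replaces every CPR-shaped token in the extracted text with a marker.
    public static string Scrub(string text)
    {
        return CprPattern.Replace(text, "{CPR REMOVED}");
    }
}
```

Running the scrubber on the extracted text before it is stored means the number never reaches the search index. Note that this matches on shape only; it does not validate that the digits form a real date or CPR number.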
Encryption of Documents
I encrypt uploaded files using the user’s password and TripleDES, using this wrapper.
If a user changes their password, all documents are re-encrypted. A dencrypthash field tracks which password was used. Not the most secure system, more “security by obscurity”, but sufficient for my use.
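To make the re-encryption step concrete, here is a minimal sketch of password-based TripleDES using only the .NET framework classes. It is not the wrapper referred to above: the `DocCrypto` class, the PBKDF2 key derivation, the iteration count, and the salt | IV | ciphertext layout are all assumptions chosen for the example.

```csharp
using System;
using System.Security.Cryptography;

public static class DocCrypto
{
    private const int SaltSize = 8; // bytes; the minimum Rfc2898DeriveBytes accepts
    private const int IvSize = 8;   // TripleDES block size in bytes

    // Derives a 24-byte (192-bit) TripleDES key from the password via PBKDF2.
    private static byte[] DeriveKey(string password, byte[] salt)
    {
        using (var kdf = new Rfc2898DeriveBytes(password, salt, 10000, HashAlgorithmName.SHA256))
            return kdf.GetBytes(24);
    }

    // Output layout (an assumption of this sketch): salt | IV | ciphertext.
    public static byte[] Encrypt(byte[] plain, string password)
    {
        byte[] salt = new byte[SaltSize];
        using (var rng = RandomNumberGenerator.Create()) rng.GetBytes(salt);

        using (var tdes = TripleDES.Create())
        {
            tdes.Key = DeriveKey(password, salt);
            tdes.GenerateIV();
            using (var enc = tdes.CreateEncryptor())
            {
                byte[] cipher = enc.TransformFinalBlock(plain, 0, plain.Length);
                byte[] output = new byte[SaltSize + IvSize + cipher.Length];
                Buffer.BlockCopy(salt, 0, output, 0, SaltSize);
                Buffer.BlockCopy(tdes.IV, 0, output, SaltSize, IvSize);
                Buffer.BlockCopy(cipher, 0, output, SaltSize + IvSize, cipher.Length);
                return output;
            }
        }
    }

    public static byte[] Decrypt(byte[] data, string password)
    {
        byte[] salt = new byte[SaltSize];
        byte[] iv = new byte[IvSize];
        Buffer.BlockCopy(data, 0, salt, 0, SaltSize);
        Buffer.BlockCopy(data, SaltSize, iv, 0, IvSize);

        using (var tdes = TripleDES.Create())
        {
            tdes.Key = DeriveKey(password, salt);
            tdes.IV = iv;
            using (var dec = tdes.CreateDecryptor())
                return dec.TransformFinalBlock(data, SaltSize + IvSize, data.Length - SaltSize - IvSize);
        }
    }

    // On a password change: decrypt with the old password, encrypt with the new one.
    public static byte[] ReEncrypt(byte[] data, string oldPassword, string newPassword)
    {
        return Encrypt(Decrypt(data, oldPassword), newPassword);
    }
}
```

A password change would then loop over the user's documents, call `ReEncrypt` on each, and update the stored password marker afterwards, so a half-finished run can be detected and resumed.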
Searching Algorithm
This is my proudest part. I used to build huge SQL queries. With LINQ, I build object-based filters.
I:
- Filter by categories (inclusion/exclusion)
- Filter by date ranges
- Extract full-text content
- Use splitters to tokenize text
- Search by name/content/tags using different matching methods (Begins With, Ends With, AND, OR)
Despite being a bit slow (½–1s per search), it’s flexible and understandable.
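The idea of stacking filters on a query object can be sketched like this. It runs against an in-memory list with LINQ to Objects rather than the real database, and the `Doc` class, the `DocSearch.Search` signature, and the whitespace tokenizer are all made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Doc
{
    public string Title;
    public string Category;
    public string Text;
}

public static class DocSearch
{
    // Applies one optional filter after another; each step narrows the result.
    // matchAll = true means every term must occur (AND), false means any term (OR).
    public static List<Doc> Search(IEnumerable<Doc> docs, string category, string query, bool matchAll)
    {
        IEnumerable<Doc> result = docs;

        // Category filter, skipped when no category was chosen.
        if (!string.IsNullOrEmpty(category))
            result = result.Where(d => d.Category == category);

        // Tokenize the query on whitespace.
        string[] terms = (query ?? "").Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        if (terms.Length > 0)
        {
            result = result.Where(d =>
            {
                string haystack = d.Title + " " + d.Text;
                Func<string, bool> hit = t => haystack.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0;
                return matchAll ? terms.All(hit) : terms.Any(hit);
            });
        }

        return result.ToList();
    }
}
```

Nothing is evaluated until `ToList()` is called, which is what makes it cheap to keep appending `Where` clauses as the user ticks more options.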
Download Page
I use .ashx handlers to control downloads securely. The app_data folder is perfect since IIS blocks direct access.
Handlers check session login and document ownership. Then they decrypt the file if necessary and stream it back to the user.
Conclusion
This project taught me a lot about LINQ, secure file storage with app_data, and practical encryption. I now have a working document archive system that I use personally.
The site is live with 23 MB of PDFs and growing.
I’m as happy with this as I was with Mobifinance. The DocsArchive site was previously available here (link may be broken).