Introduction
While reviewing my old college's IT-class backup system (it was running low on disk space), I discovered a major flaw: a weekly 1:1 backup of every user's files, about 40 GB per week. Many of these files never change. Ever.
Solution: Incremental backups.
I challenged myself to code an incremental backup system using only Python and the Linux extfs filesystem.
The Scenario
At my former college, students ran the IT department. They set up a system with a 2 TB volume (two 1 TB disks) that made weekly full backups of every user, including alumni.
It was neat: users could browse weekly snapshots through subdomains like week35.rtgkom.dk. For instance:
- My site: http://rtgkom.dk/~michaelgb07/
- My news feed: index.php?menu=0&sub=1
- Week 1 copy: Week01
Problem: Disk usage ballooned due to full 1:1 weekly backups. I needed a transparent, filesystem-only solution (no DBs, no viewers).
Theories
I discovered hardlinking, which lets multiple filenames point to the same inode (the same data on disk). That means multiple folders can reference the same file without duplicating its contents.
Idea:
- Backup each unique file once.
- Hardlink all future identical files to the original.
- Use SHA1 hashes to identify file uniqueness.
Example:
If 100 users install WordPress (9 MB each), that’s 900 MB total. With hardlinking, we can store just 9 MB + config/plugin deltas.
A “data” folder holds unique files named by SHA1. Backup folders hardlink to these.
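The idea above can be sketched in a few lines of Python. This is a minimal illustration of the hash-then-hardlink technique, not the actual script; the function names and the assumption that paths mirror the root folder are mine:

```python
import hashlib
import os
import shutil

def sha1_of(path, chunk=1 << 20):
    """Hash the file in chunks so large files never load fully into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def backup_file(src, rootfolder, backupfolder, datafolder):
    """Store src's content once under its SHA1 name, then hardlink it
    into the snapshot at the same relative path it had under rootfolder."""
    stored = os.path.join(datafolder, sha1_of(src))
    if not os.path.exists(stored):       # first time we see this content
        shutil.copy2(src, stored)        # the only real copy on disk
    dest = os.path.join(backupfolder, os.path.relpath(src, rootfolder))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.link(stored, dest)                # near-zero-cost hardlink
```

Running this for the same file in week01 and week02 produces two snapshot entries that share one inode, so the content is stored exactly once.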
The Script
Written in Python using the built-in optparse module for command-line parsing.
Command:
python backupsystem.py --help
Options:
--rootfolder=ROOTFOLDER Source directory to mirror from
--backupfolder=BACKUPFOLDER Destination snapshot folder
--datafolder=DATAFOLDER Directory for unique hash-named files
--datalevels=N Subdirectory levels inside datafolder (default: 0)
--createdirs Assume yes to all directory creation prompts
--pauseonerror Pause on errors
--awaituserinput Wait for user input before backup starts
--folderspause Wait before processing folders
--filespause Wait before processing files
--dostats Show stats at end (requires -v)
--doprogress=N Show progress every N files (requires -v)
--ignorefileexists Ignore "file exists" errors
--hash=HASHMETHOD Hash method (e.g., sha1)
-v Verbose output (repeat for more detail)
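For readers unfamiliar with optparse, options like these would be declared roughly as follows. This is a hedged sketch of the parser setup, not the script's actual source; the dest names are my assumptions:

```python
from optparse import OptionParser

# Declare a subset of the options listed above.
parser = OptionParser()
parser.add_option("--rootfolder", dest="rootfolder")
parser.add_option("--backupfolder", dest="backupfolder")
parser.add_option("--datafolder", dest="datafolder")
parser.add_option("--datalevels", dest="datalevels", type="int", default=0)
parser.add_option("--createdirs", action="store_true", default=False)
parser.add_option("--hash", dest="hashmethod", default="sha1")
parser.add_option("-v", action="count", dest="verbosity", default=0)

# Parse a sample command line; repeating -v raises the verbosity count.
options, args = parser.parse_args(
    ["--rootfolder=/files/", "--datalevels=3", "-vv"]
)
```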
Example:
sudo python backupsystem.py --rootfolder=/files/ --backupfolder=/backup/week10/ --datafolder=/backup/data
This creates week10 (the snapshot) and data (the unique content) in /backup/.
Use --createdirs to suppress folder-creation prompts.
Filecount Consideration
To avoid sluggish file managers (e.g., Windows Explorer choking on one huge directory), use --datalevels=N to create subfolders. With --datalevels=3, a file hashed AABBCCDDEEFF is stored as:
/backup/data/AA/BB/CC/AABBCCDDEEFF
This spreads files across multiple directories.
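The mapping from hash to nested path is straightforward. A small sketch (the function name is hypothetical, not from the script):

```python
import os

def data_path(datafolder, digest, datalevels=0):
    """Split the first `datalevels` two-character pairs of the hash into
    nested subfolders, then store the file under its full hash name."""
    parts = [digest[2 * i : 2 * i + 2] for i in range(datalevels)]
    return os.path.join(datafolder, *parts, digest)
```

With datalevels=3, data_path("/backup/data", "AABBCCDDEEFF", 3) yields the nested path shown above; with the default of 0 it falls back to a flat /backup/data/AABBCCDDEEFF.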
Usage
Keep the datafolder and datalevels parameters consistent between runs; changing them effectively resets the system.
Only the backupfolder changes from run to run, since each run produces a new snapshot.
Note: Editing any file in a snapshot affects all linked copies!
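This warning follows directly from how hardlinks work: both names share one inode, so writing through either name changes what the other sees. A small standalone demo (not part of the backup script):

```python
import os
import tempfile

d = tempfile.mkdtemp()
orig = os.path.join(d, "data_copy")
snap = os.path.join(d, "snapshot_copy")

with open(orig, "w") as f:
    f.write("week 10 content")
os.link(orig, snap)              # both names now point at one inode

print(os.stat(orig).st_nlink)    # 2: two directory entries, one file

with open(snap, "w") as f:       # "editing" the snapshot copy...
    f.write("edited")
with open(orig) as f:
    print(f.read())              # ...also changed the data copy: "edited"
```

This is why snapshots made this way must be treated as read-only.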
Advantages & Considerations
Pros:
- Saves space through deduplication and hardlinking.
- Transparent snapshots without special viewers.
Cons:
- Centralized data increases corruption risk (RAID suggested).
- Editing one file alters all linked instances.
Script Download
Script available for educational use: Download Backupsystem.py
This system was designed as a simple, transparent backup solution that leverages hardlinking and hashing for efficient storage.