Introduction
While reviewing my old college's IT-class backup system (it was running low on disk space), I discovered a major flaw: a weekly 1:1 backup of every user's files, about 40 GB per week. Many of these files never change. Ever.
Solution: Incremental backups.
I challenged myself to code an incremental backup system using only Python and the Linux extfs filesystem.
The Scenario
At my former college, students ran the IT department. They set up a system with a 2 TB volume (two 1 TB disks) that made weekly full backups of every user, including alumni.
It was neat: users could browse weekly snapshots through subdomains like week35.rtgkom.dk. For instance:
- My site: http://rtgkom.dk/~michaelgb07/
- My news feed: index.php?menu=0&sub=1
- Week 1 copy: Week01
Problem: Disk usage ballooned due to full 1:1 weekly backups. I needed a transparent, filesystem-only solution (no DBs, no viewers).
Theories
I discovered hardlinking, which lets multiple filenames point to the same inode (the same data on disk). That means multiple folders can reference the same file without duplicating its contents.
Idea:
- Backup each unique file once.
- Hardlink all future identical files to the original.
- Use SHA1 hashes to identify file uniqueness.
Example:
If 100 users install WordPress (9 MB each), that’s 900 MB total. With hardlinking, we can store just 9 MB + config/plugin deltas.
A “data” folder holds unique files named by SHA1. Backup folders hardlink to these.
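The idea above can be sketched in a few lines of Python. This is a minimal illustration of the hash-then-hardlink technique, not the actual script; the function names and the assumption that paths mirror the root folder are mine:

```python
import hashlib
import os
import shutil

def sha1_of(path, chunk=1 << 20):
    """Hash the file in chunks so large files never load fully into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def backup_file(src, rootfolder, backupfolder, datafolder):
    """Store src's content once under its SHA1 name, then hardlink it
    into the snapshot at the same relative path it had under rootfolder."""
    stored = os.path.join(datafolder, sha1_of(src))
    if not os.path.exists(stored):       # first time we see this content
        shutil.copy2(src, stored)        # the only real copy on disk
    dest = os.path.join(backupfolder, os.path.relpath(src, rootfolder))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.link(stored, dest)                # near-zero-cost hardlink
```

Running this for the same file in week01 and week02 produces two snapshot entries that share one inode, so the content is stored exactly once.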
The Script
Written in Python using the built-in optparse module for command-line parsing.
Command:
python backupsystem.py --help
Options:
--rootfolder=ROOTFOLDER Source directory to mirror from
--backupfolder=BACKUPFOLDER Destination snapshot folder
--datafolder=DATAFOLDER Directory for unique hash-named files
--datalevels=N Subdirectory levels inside datafolder (default: 0)
--createdirs Assume yes to all directory creation prompts
--pauseonerror Pause on errors
--awaituserinput Wait for user input before backup starts
--folderspause Wait before processing folders
--filespause Wait before processing files
--dostats Show stats at end (requires -v)
--doprogress=N Show progress every N files (requires -v)
--ignorefileexists Ignore "file exists" errors
--hash=HASHMETHOD Hash method (e.g., sha1)
-v Verbose output (repeat for more detail)
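For readers unfamiliar with optparse, options like these would be declared roughly as follows. This is a hedged sketch of the parser setup, not the script's actual source; the dest names are my assumptions:

```python
from optparse import OptionParser

# Declare a subset of the options listed above.
parser = OptionParser()
parser.add_option("--rootfolder", dest="rootfolder")
parser.add_option("--backupfolder", dest="backupfolder")
parser.add_option("--datafolder", dest="datafolder")
parser.add_option("--datalevels", dest="datalevels", type="int", default=0)
parser.add_option("--createdirs", action="store_true", default=False)
parser.add_option("--hash", dest="hashmethod", default="sha1")
parser.add_option("-v", action="count", dest="verbosity", default=0)

# Parse a sample command line; repeating -v raises the verbosity count.
options, args = parser.parse_args(
    ["--rootfolder=/files/", "--datalevels=3", "-vv"]
)
```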
Example:
sudo python backupsystem.py --rootfolder=/files/ --backupfolder=/backup/week10/ --datafolder=/backup/data
This creates week10 (the snapshot) and data (the unique content) in /backup/.
Use --createdirs to suppress folder-creation prompts.
Filecount Consideration
To avoid sluggish file managers (e.g., Windows Explorer choking on one huge directory), use --datalevels=N to create subfolders. With --datalevels=3, a file hashed AABBCCDDEEFF is stored as:
/backup/data/AA/BB/CC/AABBCCDDEEFF
This spreads files across multiple directories.
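The mapping from hash to nested path is straightforward. A small sketch (the function name is hypothetical, not from the script):

```python
import os

def data_path(datafolder, digest, datalevels=0):
    """Split the first `datalevels` two-character pairs of the hash into
    nested subfolders, then store the file under its full hash name."""
    parts = [digest[2 * i : 2 * i + 2] for i in range(datalevels)]
    return os.path.join(datafolder, *parts, digest)
```

With datalevels=3, data_path("/backup/data", "AABBCCDDEEFF", 3) yields the nested path shown above; with the default of 0 it falls back to a flat /backup/data/AABBCCDDEEFF.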
Usage
Keep the datafolder and datalevels parameters consistent between runs; changing them effectively resets the system.
Only the backupfolder changes from run to run, since each run produces a new snapshot.
Note: Editing any file in a snapshot affects all linked copies!
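This warning follows directly from how hardlinks work: both names share one inode, so writing through either name changes what the other sees. A small standalone demo (not part of the backup script):

```python
import os
import tempfile

d = tempfile.mkdtemp()
orig = os.path.join(d, "data_copy")
snap = os.path.join(d, "snapshot_copy")

with open(orig, "w") as f:
    f.write("week 10 content")
os.link(orig, snap)              # both names now point at one inode

print(os.stat(orig).st_nlink)    # 2: two directory entries, one file

with open(snap, "w") as f:       # "editing" the snapshot copy...
    f.write("edited")
with open(orig) as f:
    print(f.read())              # ...also changed the data copy: "edited"
```

This is why snapshots made this way must be treated as read-only.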
Advantages & Considerations
Pros:
- Saves space through deduplication and hardlinking.
- Transparent snapshots without special viewers.
Cons:
- Centralized data increases corruption risk (RAID suggested).
- Editing one file alters all linked instances.
Script Download
Script available for educational use: Download Backupsystem.py
This system was designed as a simple, transparent backup solution that leverages hardlinking and hashing for efficient storage.