Organizing Data Through the Lens of Deduplication
Our home file server has been running since 2008, and over the last 12 years, it has accumulated more than 4 TB of data. The storage is shared between four people, and it tends to get disorganized over time. We also had a problem with duplicated data (over 500 GB of wasted space), an issue that is intertwined with disorganization. I wanted to solve both of these problems at once, and without losing any of our data. Existing tools didn’t work the way I wanted, so I wrote Periscope to help me clean up our file server.
Periscope works differently from most other duplicate file finders. It’s designed to be used interactively to explore the filesystem, understand which files are duplicated and where duplicates live, and safely delete duplicates, all without losing any data. Periscope enables exploring the filesystem with standard tools — the shell, and commands like cd, ls, tree, and so on — while providing additional duplicate-aware commands that mirror core filesystem utilities. For example, psc ls gives a directory listing that highlights duplicates, and psc rm deletes files only if a duplicate copy exists. Here is Periscope in action on a demo dataset:
The demo uses a small synthetic dataset. For the real thing, there were a lot more duplicates; here are the stats prior to the cleanup:
$ psc summary
tracked 669,718
unique 175,672
duplicate 494,046
overhead 515 GB
Early attempts
The first time I tried to clean up our file server, I used well-known duplicate file finders like fdupes and its enhanced fork jdupes. At a high level, these programs scan the filesystem and output a list of duplicates. After that, you’re on your own. When I scanned the server, the tools found 494,046 duplicate files wasting a total of 515 GB of space. Going through these manually, one at a time, would be infeasible.
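For concreteness, a run looks roughly like the following (the paths are illustrative, not real ones from our server). fdupes prints each set of identical files as a group, with groups separated by blank lines:

$ fdupes -r /Photos /Unorganized
/Photos/2012/02/Ski Trip/DSC_1234.JPG
/Unorganized/D300S Temp Copy Feb 11 2012/DSC_1234.JPG

With hundreds of thousands of such groups, the listing itself is the end of the help these tools provide.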
Many tools have a mode where they can prompt the user and delete files; with such a large number of duplicates, this would not be useful. Some tools have features that help with space savings but not with organization: hard linking duplicates, automatically deleting duplicates chosen arbitrarily, and automatically deleting duplicates chosen based on a heuristic like path depth. These features wouldn’t work for me.
I had a hypothesis that a lot of the duplicate data was the result of entire directories being copied, so if the duplicate summary could merge duplicate directories rather than listing individual files, the output might be more manageable. I tried implementing this functionality, and I soon found that the merging strategy works well for perfect copies but breaks down when folders only partially overlap, which described most of the duplicate data on the server. I tried to handle partial overlap by analyzing subset relationships between directories, but I basically ended up with a gigantic Venn diagram; I couldn’t figure out a clean and useful way to visualize the information.
Patterns of disorganization
I manually inspected some of the data on our file server to understand where duplicates came from and how they should be cleaned up, and I started noticing patterns:
- A directory of organized data alongside a “to organize” directory. For example, we had organized media in “/Photos/{year}/{month}/{event name}”, and unorganized media in “/Unorganized”, in directories like “D300S Temp Copy Feb 11 2012”. In some cases the data inside the copies was fully represented in the organized photos directory hierarchy, but in other cases there were unique files that needed to be preserved and organized.
- A directory snapshotted at different times. In many cases, it wasn’t necessary to keep multiple backups; we just needed the full set of unique files.
- A redundant backup of an old machine. Nowadays we use Borg for machine backups, but in the past, we had cases where entire machines were backed up temporarily, such as before migrating to a new machine. Most of this data was copied to the new machine and subsequently backed up as part of that machine’s backups, but the old copy remained. Most of this data could be deleted, but in some cases there were files that were unique and needed to be preserved.
- Duplicates in individuals’ personal directories. We organize some shared data like photos in a shared location, and other data in personal folders. We had some data that was copied in both locations.
- Manually versioned documents. We had documents like “Essay.doc”, “Essay v2.doc”, “Essay v3.doc”, where some of the versions were identical to each other.
Generalizing from these patterns, I felt that an interactive tool would work best for cleaning up the data. The tool should support organizing data one directory at a time, listing directories and inspecting files to understand where duplicates live. I also wanted a safe wrapper around rm that would let me delete duplicates but not accidentally lose data by deleting a unique file. Additionally, I wanted a way to delete files in one directory only if they were present in another, so I could recursively delete everything in “/Unorganized” that was already present in “/Photos”.
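In terms of commands, the interaction I wanted looked roughly like the sketch below. The first invocation reflects the safety property (refuse to delete a file that has no remaining copy); the flags on the second are hypothetical, shown only to illustrate the idea of deleting from “/Unorganized” what is already present in “/Photos”:

# deletes only if another copy of the file exists somewhere
$ psc rm "/Unorganized/D300S Temp Copy Feb 11 2012/DSC_1234.JPG"
# hypothetical flags, for illustration only: recursively delete files under
# /Unorganized only when a copy already exists under /Photos
$ psc rm --contained /Photos --recursive /Unorganized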
Periscope
Periscope implements the functionality summarized above. A psc scan searches for duplicate files in the same way as other duplicate file finders, but it caches the information in a database. After that, commands like psc ls can run fast by leveraging the database. Commands like psc summary and psc report show high-level information on duplicates, psc ls and psc info enable interactively exploring the filesystem, and psc rm safely deletes duplicates.
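Putting it together, a cleanup session is roughly a loop of scanning, exploring, and deleting. The sketch below illustrates each command’s role; the paths and arguments are illustrative rather than exact invocations:

$ psc scan            # hash files and cache duplicate information in the database
$ psc summary         # high-level statistics, like the numbers shown earlier
$ psc ls /Photos/2012                   # directory listing with duplicates highlighted
$ psc info "/Unorganized/DSC_1234.JPG"  # show where the other copies of a file live
$ psc rm "/Unorganized/DSC_1234.JPG"    # deleted only because a duplicate copy exists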
More information on the Periscope workflow and commands is available in the documentation.
Related work
There are tons of duplicate file finders out there — fdupes, jdupes, rmlint, ddh, rdfind, fslint, duff, fddf, and fclones — to name a few. These tools find and print out duplicates; some have additional features like prompting for deletion or automatically deleting dupes based on heuristics. They were not suitable for my use case.
dupd is a utility that scans for duplicates, saves information to a database, and then allows for exploring the filesystem while querying the duplicate database for information. It was a source of inspiration for Periscope. The tools have somewhat differing philosophies and currently have two key differences: Periscope aims to provide commands that mirror coreutils counterparts (e.g. psc ls is not recursive, unlike dupd), and Periscope provides commands to safely delete files (one of dupd’s design goals is to not delete files). These seem essential for “scaling up” and handling a large volume of duplicates.
Download
Periscope is free and open source software. Documentation, code, and binaries are available on GitHub.