Organizing Data Through the Lens of Deduplication
Our home file server has been running since 2008, and over the last 12 years, it has accumulated more than 4 TB of data. The storage is shared between four people, and it tends to get disorganized over time. We also had a problem with duplicated data (over 500 GB of wasted space), an issue that is intertwined with disorganization. I wanted to solve both of these problems at once, and without losing any of our data. Existing tools didn’t work the way I wanted, so I wrote Periscope to help me clean up our file server.
Periscope works differently from most other duplicate file finders. It’s designed to be used interactively to explore the filesystem, understand which files are duplicated and where duplicates live, and safely delete duplicates, all without losing any data. Periscope enables exploring the filesystem with standard tools — the shell, and commands like cd, ls, tree, and so on — while providing additional duplicate-aware commands that mirror core filesystem utilities. For example, psc ls gives a directory listing that highlights duplicates, and psc rm deletes files only if a duplicate copy exists. Here is Periscope in action on a demo dataset:
The demo uses a small synthetic dataset. For the real thing, there were a lot more duplicates; here are the stats prior to the cleanup:
$ psc summary
tracked 669,718
unique 175,672
duplicate 494,046
overhead 515 GB
Early attempts
The first time I tried to clean up our file server, I used well-known duplicate file finders like fdupes and its enhanced fork jdupes. At a high level, these programs scan the filesystem and output a list of duplicates. After that, you’re on your own. When I scanned the server, the tools found 494,046 duplicate files wasting a total of 515 GB of space. Going through these manually, one at a time, would be infeasible.
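For concreteness, a run looks roughly like the following (the paths are illustrative, not real ones from our server). fdupes prints each set of identical files as a group, with groups separated by blank lines:

$ fdupes -r /Photos /Unorganized
/Photos/2012/02/Ski Trip/DSC_1234.JPG
/Unorganized/D300S Temp Copy Feb 11 2012/DSC_1234.JPG

With hundreds of thousands of such groups, the listing itself is the end of the help these tools provide.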
Many tools have a mode where they can prompt the user and delete files; with such a large number of duplicates, this would not be useful. Some tools have features that help with space savings but not with organization: hard linking duplicates, automatically deleting duplicates chosen arbitrarily, and automatically deleting duplicates chosen based on a heuristic like path depth. These features wouldn’t work for me.
I had a hypothesis that a lot of the duplicate data was the result of entire directories being copied, so if the duplicate summary could merge duplicate directories rather than listing individual files, the output might be more manageable. I tried implementing this functionality, and I soon found that the merging strategy works well for perfect copies but breaks down when folders only partially overlap, which described most of the duplicate data on the server. I tried to handle partial overlap by analyzing subset relationships between directories, but I basically ended up with a gigantic Venn diagram; I couldn’t figure out a clean and useful way to visualize the information.
Patterns of disorganization
I manually inspected some of the data on our file server to understand where duplicates came from and how they should be cleaned up, and I started noticing patterns:
- A directory of organized data alongside a “to organize” directory. For example, we had organized media in “/Photos/{year}/{month}/{event name}”, and unorganized media in “/Unorganized”, in directories like “D300S Temp Copy Feb 11 2012”. In some cases the data inside the copies was fully represented in the organized photos directory hierarchy, but in other cases there were unique files that needed to be preserved and organized.
- A directory snapshotted at different times. In many cases, it wasn’t necessary to keep multiple backups; we just needed the full set of unique files.
- A redundant backup of an old machine. Nowadays we use Borg for machine backups, but in the past, we had cases where entire machines were backed up temporarily, such as before migrating to a new machine. Most of this data was copied to the new machine and subsequently backed up as part of that machine’s backups, but the old copy remained. Most of this data could be deleted, but in some cases there were files that were unique and needed to be preserved.
- Duplicates in individuals’ personal directories. We organize some shared data like photos in a shared location, and other data in personal folders. We had some data that was copied in both locations.
- Manually versioned documents. We had documents like “Essay.doc”, “Essay v2.doc”, “Essay v3.doc”, where some of the versions were identical to each other.
Generalizing from these patterns, I felt that an interactive tool would work best for cleaning up the data. The tool should support organizing data one directory at a time, listing directories and inspecting files to understand where duplicates live. I also wanted a safe wrapper around rm that would let me delete duplicates but not accidentally lose data by deleting a unique file. Additionally, I wanted a way to delete files in one directory only if they were present in another, so I could recursively delete everything in “/Unorganized” that was already present in “/Photos”.
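In terms of commands, the interaction I wanted looked roughly like the sketch below. The first invocation reflects the safety property (refuse to delete a file that has no remaining copy); the flags on the second are hypothetical, shown only to illustrate the idea of deleting from “/Unorganized” what is already present in “/Photos”:

# deletes only if another copy of the file exists somewhere
$ psc rm "/Unorganized/D300S Temp Copy Feb 11 2012/DSC_1234.JPG"
# hypothetical flags, for illustration only: recursively delete files under
# /Unorganized only when a copy already exists under /Photos
$ psc rm --contained /Photos --recursive /Unorganized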
Periscope
Periscope implements the functionality summarized above. A psc scan searches for duplicate files in the same way as other duplicate file finders, but it caches the information in a database. After that, commands like psc ls can run fast by leveraging the database. Commands like psc summary and psc report show high-level information on duplicates, psc ls and psc info enable interactively exploring the filesystem, and psc rm safely deletes duplicates.
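Putting it together, a cleanup session is roughly a loop of scanning, exploring, and deleting. The sketch below illustrates each command’s role; the paths and arguments are illustrative rather than exact invocations:

$ psc scan            # hash files and cache duplicate information in the database
$ psc summary         # high-level statistics, like the numbers shown earlier
$ psc ls /Photos/2012                   # directory listing with duplicates highlighted
$ psc info "/Unorganized/DSC_1234.JPG"  # show where the other copies of a file live
$ psc rm "/Unorganized/DSC_1234.JPG"    # deleted only because a duplicate copy exists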
More information on the Periscope workflow and commands is available in the documentation.
Related work
There are tons of duplicate file finders out there — fdupes, jdupes, rmlint, ddh, rdfind, fslint, duff, fddf, and fclones — to name a few. These tools find and print out duplicates; some have additional features like prompting for deletion or automatically deleting dupes based on heuristics. They were not suitable for my use case.
dupd is a utility that scans for duplicates, saves information to a database, and then allows for exploring the filesystem while querying the duplicate database for information. It was a source of inspiration for Periscope. The tools have somewhat differing philosophies and currently have two key differences: Periscope aims to provide commands that mirror coreutils counterparts (e.g. psc ls is not recursive, unlike dupd), and Periscope provides commands to safely delete files (one of dupd’s design goals is to not delete files). These seem essential for “scaling up” and handling a large volume of duplicates.
Download
Periscope is free and open source software. Documentation, code, and binaries are available on GitHub.