Ever since I discovered Dropbox, I have used it for all of my digital documents, and I have felt weighed down by the clutter of physical ones. After a while, physical documents and loose pieces of paper began to get on my nerves. It was much easier to have searchable PDFs that I could access anywhere. To that end, I wanted to be able to convert any physical document I received into a searchable PDF.
I Used The Following:
- A Mac
- An iPhone
- Dropbox (for storing and syncing the files across my devices)
- Scanner Pro by Readdle (for scanning at home or on-the-go) $3
- Hazel (the magic that makes all the automation possible) $29
- terminal-notifier (displays notifications as documents are processed; Growl could also be used)
- ocrmypdf (for making my scans searchable via Optical Character Recognition)
- pdfgrep (for finding content within the PDFs without having to open them)
- tag (for systematically tagging files for use with Spotlight and Finder)
- The scripts I developed for processing everything via Hazel
Envisioning The Workflow
My goal was to get rid of any piece of paper that was handed to me as quickly as possible. Here is what I envisioned:
- Immediately scan a document whenever I received one
- The document automatically gets uploaded to Dropbox
- When the file is synced to my Mac, a set of Hazel rules run to:
  - Convert the “image” PDF into a text-searchable PDF
  - Delete the original PDF
  - Organize the PDFs into folders based on their content
  - Make the contents of the PDF searchable in Spotlight using tags
Setting It Up
Dropbox – Installed And Connected To Hazel
Dropbox should be installed on both your Mac and your iPhone.
Scanner Pro – Scanning Documents At Home Or On-the-go
Install Scanner Pro and enable auto-upload to Dropbox. I saved the scans in my root Dropbox folder, since the Hazel rules would move them to the next folder once they were done being processed.
Hazel – Automatically Converting The Scanned PDFs
If you are sick of doing anything over and over again, Hazel will be well worth your money. It helps automate mundane tasks. There is plenty you can do in Hazel without scripting, but if you know how to write code, the possibilities are endless.
This was where things started to get difficult as it required some heavy scripting to properly process the files and make them searchable, as well as to move them into the proper folders and apply the tags.
The first Hazel rule is shown below. It looks for PDFs whose name starts with Scan, which is the default filename Scanner Pro sets. I didn’t want to change this each time; I just wanted to hit save > upload and then let the automation take care of the rest. But I also did not like the naming convention they chose. I prefer naming them by date in the format YYYY-MM-DD-HH.MM (for example, 2015-11-20-15.06.pdf), which keeps them in chronological order when sorted by name. Since I was going to be tagging them anyway, this was just fine.
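A date-based name like this can be produced with the standard date command. The following is only a minimal sketch: scan_name is a hypothetical helper, and the pattern is assumed from the 2015-11-20-15.06.pdf filename used later in this post. It puts a period between hours and minutes because colons are awkward in macOS filenames.

```shell
#!/bin/bash
# Build a sortable filename such as 2015-11-20-15.06.pdf from the current time.
# scan_name is a hypothetical helper, not something Scanner Pro or Hazel provides.
scan_name() {
    printf '%s.pdf\n' "$(date +%Y-%m-%d-%H.%M)"
}

# Print the name that a scan processed right now would receive.
scan_name
```

Because the name only goes down to the minute, two scans processed within the same minute would collide, so a real script should check for an existing file before writing.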
So first, I made a rule for my main Dropbox folder called OCR new PDFs from ScannerPro.
This rule looks for PDFs whose filename starts with Scan (what Scanner Pro names files by default) and then runs the script found below.
Here is the script for copying/pasting, but you will need to adjust the desiredDir variable to your Dropbox folder.
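For readers who want a starting point, here is a minimal sketch of what such an OCR action could look like. It is not the original script. It assumes Hazel passes the matched file as $1, that ocrmypdf is installed in one of the listed directories, and that desiredDir points at your own Dropbox folder. The helper name ocr_scan and the default path are hypothetical.

```shell
#!/bin/bash
# Sketch of a Hazel "Run shell script" action; Hazel passes the matched file as $1.
# Assumption: ocrmypdf lives in one of these directories. Adjust for your install.
PATH="/usr/local/bin:/opt/homebrew/bin:$PATH"

# Assumption: where finished PDFs should land. Change this to your Dropbox folder.
desiredDir="${desiredDir:-$HOME/Dropbox/Documents}"

# ocr_scan is a hypothetical helper name used only in this sketch.
ocr_scan() {
    local src="$1"
    local dest="$desiredDir/$(date +%Y-%m-%d-%H.%M).pdf"

    # Never overwrite an earlier scan processed in the same minute.
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi

    # ocrmypdf writes a searchable copy; the original is deleted only on success.
    if ocrmypdf "$src" "$dest"; then
        rm -- "$src"
        echo "$dest"
    else
        echo "OCR failed for $src" >&2
        return 1
    fi
}

# Only run when a file was actually supplied.
if [ -n "${1:-}" ]; then
    ocr_scan "$1"
fi
```

Deleting the original only after ocrmypdf succeeds means a failed OCR run leaves the scan in place for Hazel to retry.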
When the script finishes processing the PDFs, I get a notification courtesy of terminal-notifier. There is even a little QuickLook preview of the document on the far right. This gives me the chance to see what document was processed. Plus, it just looks nice.
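A notification along these lines can be sent with a single terminal-notifier call. The sketch below is an assumption about how that might look, not the original code: notify_processed is a hypothetical helper, the wording of the title and message is invented, and the preview image is left out.

```shell
#!/bin/bash
# Sketch: announce a processed PDF with terminal-notifier (must be installed).
# notify_processed is a hypothetical helper name used only in this sketch.
notify_processed() {
    local pdf="$1"
    # -open takes a URL; this simple form assumes an absolute path without spaces.
    terminal-notifier \
        -title "Scan processed" \
        -message "$(basename "$pdf") is now searchable" \
        -open "file://$pdf"
}

# Only run when a file was actually supplied.
if [ -n "${1:-}" ]; then
    notify_processed "$1"
fi
```

Clicking the notification opens the PDF, which is a quick way to confirm the right document was processed.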
Sorting The PDFs
Now that the PDFs have been OCR’d and are text-searchable, they should have been moved into the desiredDir folder (my Dropbox Documents folder in this example) and are ready to be sorted into their proper folders. There are a number of ways you could do this, but I settled on searching for content within the PDF and sorting them based on that.
Determining What Type Of PDF This Is
As an example, I wanted to sort my bills into one folder, my insurance documents into another, my apartment lease into another, etc. Many of these documents always contain the same information. For example, my insurance documents always have the phrase “Verification of Insurance for.”
Knowing this, I can use pdfgrep to search the content of any PDF that has been processed by the first script, and if it is successful (meaning it found a match), I can move it into the correct folder. To search it, I would run a command like this:
pdfgrep -i 'verification of insurance for' 2015-11-20-15.06.pdf
Notice I used the -i option, which tells pdfgrep to ignore case. Since OCR isn’t perfect, this gives the command a higher success rate (i.e. it might mistake a capital P for a lowercase p).
Set Up The Hazel Rule
I made one rule that applies to my Dropbox Documents folder called Sort Insurance Documents. Once I made this first rule, I was able to duplicate it for all the different documents that I wanted to sort. All I needed to do was change a few variables.
Below is the script that you can copy/paste and then modify the variables to suit your needs.
The way the script above works is by checking the exit status of the pdfgrep command. If it is 0, it was successful (found a match), and if it is anything else, it failed. So I just turned this logic into a simple if statement (this check has to run immediately after the pdfgrep command to work properly, because the exit status is overwritten by whatever command runs next). Then, all I needed to do was duplicate the rule, but change the search phrase and the folder to move it into.
Admittedly, this is far from perfect, as it depends greatly on the quality of your scans and the integrity of your pdfgrep search queries, but it’s a start in the right direction.
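The exit-status logic described above could be sketched roughly as follows. This is an illustration under assumptions rather than the original script: sort_pdf is a hypothetical helper, the three variables and their default values are placeholders to change per rule, and pdfgrep sits directly in the if condition instead of being followed by a separate status check.

```shell
#!/bin/bash
# Sketch of a sorting action for Hazel, which passes the matched file as $1.
# Assumption: pdfgrep and tag live in one of these directories.
PATH="/usr/local/bin:/opt/homebrew/bin:$PATH"

# Assumptions: change these three values for each duplicated rule.
searchPhrase="${searchPhrase:-verification of insurance for}"
targetDir="${targetDir:-$HOME/Dropbox/Documents/Insurance}"
tagName="${tagName:-insurance}"

# sort_pdf is a hypothetical helper name used only in this sketch.
sort_pdf() {
    local pdf="$1"

    # -i ignores case and -q suppresses output; only the exit status matters.
    if pdfgrep -i -q "$searchPhrase" "$pdf"; then
        mkdir -p "$targetDir"
        # Tag first so Spotlight and Finder can find the file after the move.
        tag --add "$tagName" "$pdf"
        # -n refuses to overwrite a file that is already in the target folder.
        mv -n -- "$pdf" "$targetDir/"
    else
        # No match: leave the file where it is for other rules to try.
        return 1
    fi
}

# Only run when a file was actually supplied.
if [ -n "${1:-}" ]; then
    sort_pdf "$1"
fi
```

Putting pdfgrep in the if condition means no other command can run between the search and the check, which sidesteps the ordering pitfall entirely.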