How I Went Paperless And Clutter-free For $32

Ever since I discovered Dropbox, which I use for all of my digital documents, I have felt weighed down by the clutter of physical ones.  After a while, physical documents and pieces of paper began to get on my nerves.  It was much easier to have searchable PDFs that I could access anywhere.  To that end, I wanted to be able to convert any physical document I received into a searchable PDF.  See the video below for the entire workflow in action.

I Used The Following:

  • A Mac
  • An iPhone
  • Dropbox (for storing and syncing the files across my devices)
  • Scanner Pro by Readdle (for scanning at home or on-the-go) $3
  • Hazel (the magic that makes all the automation possible) $29
  • terminal-notifier (displays notifications as documents are processed.  Growl could also be used)
  • ocrmypdf (for making my scans searchable via Optical Character Recognition)
  • pdfgrep (for finding content within the PDFs without having to open them)
  • tag (for systematically tagging files for use with Spotlight and Finder)
  • The scripts I developed for processing everything via Hazel

Envisioning The Workflow

My goal was to get rid of any piece of paper that was handed to me as quickly as possible.  To that end, here is what I envisioned:

  • Immediately scan a document whenever I received one
  • The document automatically gets uploaded to Dropbox
  • When the file is synced to my Mac, a set of Hazel rules run to:
    • Convert the “image” PDF into a text-searchable PDF
    • Delete the original PDF
    • Organize the PDFs into folders based on their content
    • Make the contents of the PDF searchable in Spotlight using tags

Setting It Up

Dropbox – Installed And Connected To Hazel

Dropbox should be installed on both your Mac and your iPhone.

Scanner Pro – Scanning Documents At Home Or On-the-go

Install Scanner Pro and enable auto-upload to Dropbox.  I just saved the scans in my root Dropbox folder, since the Hazel rules would move them to the next folder once they were done being processed.

[Screenshot: enabling auto-upload to Dropbox in Scanner Pro]

Hazel – Automatically Converting The Scanned PDFs

If you are sick of doing anything over and over again, Hazel will be well worth your money.  It helps automate mundane tasks.  There is plenty you can do in Hazel without scripting, but if you know how to write code, the possibilities are endless.

This was where things started to get difficult as it required some heavy scripting to properly process the files and make them searchable, as well as to move them into the proper folders and apply the tags.

The first Hazel rule is shown below.  It looks for PDFs whose name starts with Scan, which is the default filename Scanner Pro sets.  I didn’t want to change this each time; I just wanted to hit save > upload and then let the automation take care of the rest.  But I also did not like the naming convention they chose.  I prefer naming files by date in the format YYYY-MM-DD-HH.MM (for example, 2015-11-20-15.06.pdf), which keeps them in chronological order when sorted by name.  Since I was going to be tagging them anyway, a date-only name was just fine.

So first, I made a rule for my main Dropbox folder called OCR new PDFs from ScannerPro.

[Screenshot: phase 1, the OCR PDF rule]

This rule looks for PDFs whose filename starts with Scan (what Scanner Pro names files by default) and then runs the script found below.

[Screenshot: criteria for rule 1]

Here is the script for copying/pasting, but you will need to adjust the desiredDir variable to point to your Dropbox folder.
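For readers who want a starting point, here is a minimal sketch of what such an OCR step might look like.  It is not the original script.  It assumes Hazel passes the matched file's path as the first argument, that ocrmypdf and terminal-notifier are on the PATH, and that the default desiredDir path and the timestamp format are placeholders to adjust.  The function names timestamp_name and ocr_scan are made up for this illustration.

```shell
#!/bin/sh
# Sketch of an OCR step for a Hazel "Run shell script" action.
# Hazel passes the matched file's path as the first argument ($1).

# Assumption: adjust this to your own Dropbox documents folder.
desiredDir="${desiredDir:-$HOME/Dropbox/Documents}"

# Build a chronological name such as 2015-11-20-15.06.pdf.
timestamp_name() {
    date "+%Y-%m-%d-%H.%M.pdf"
}

# OCR one scan into $desiredDir, then delete the original image-only PDF.
ocr_scan() {
    src="$1"
    mkdir -p "$desiredDir"
    dest="$desiredDir/$(timestamp_name)"

    # ocrmypdf writes a new PDF that carries a searchable text layer.
    if ocrmypdf "$src" "$dest"; then
        rm -- "$src"
        # The notification is best-effort; ignore a failure here.
        terminal-notifier -title "OCR done" \
            -message "$(basename "$dest")" \
            -contentImage "$dest" || true
    else
        echo "OCR failed for $src" >&2
        return 1
    fi
}

# In the Hazel action, the last line would be:  ocr_scan "$1"
```

Note that two scans processed within the same minute would get the same name in this sketch, so adding seconds to the format may be worth considering.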

When the script finishes processing the PDFs, I get a notification courtesy of terminal-notifier.  There is even a little QuickLook preview of the document on the far right.  This gives me the chance to see what document was processed.  Plus, it just looks nice.

[Screenshot: OCR done notification]

Sorting The PDFs

Now that the PDFs have been OCR’d and are text-searchable, they should have been moved into desiredDir (my Dropbox Documents folder in this example) and are ready to be sorted into their proper folders.  There are a number of ways you could do this, but I settled on searching for content within the PDF and sorting them based on that.

Determining What Type Of PDF This Is

As an example, I wanted to sort my bills into one folder, my insurance documents into another, my apartment lease into another, etc.  Many of these documents always contain the same information.  For example, my insurance documents always have the phrase “Verification of Insurance for.”

[Screenshot: pdfgrep insurance search]

Knowing this, I can use pdfgrep to search the content of any PDF that has been processed by the first script and, if it is successful (meaning it found a match), I can move it into the correct folder.  To search it, I would run a command like this:

    pdfgrep -i 'verification of insurance for' 2015-11-20-15.06.pdf

Notice I used the -i option, which means to ignore case.  Since OCR isn’t perfect, this gives the command a higher success rate (e.g. it might mistake a capital P for a lowercase p).
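Ignoring case only helps with upper/lower case slips.  Because pdfgrep treats its pattern as an extended regular expression by default, one possible extension (not part of the original workflow) is to widen a phrase so that a few look-alike characters also match.  The sketch below is a hypothetical helper, tolerant_pattern, that only handles i/l/1 and o/0 and assumes the phrase contains no regex metacharacters.

```shell
#!/bin/sh
# Turn a plain phrase into an extended regex that tolerates a few
# common OCR confusions: i/l/1 and o/0.
# Assumption: the phrase contains no regex metacharacters of its own.
tolerant_pattern() {
    printf '%s\n' "$1" | sed -e 's/[il1]/[il1]/g' -e 's/[o0]/[o0]/g'
}

# Example use with pdfgrep (file name taken from the example above):
#   pdfgrep -i "$(tolerant_pattern 'verification of insurance for')" 2015-11-20-15.06.pdf
```

For the phrase above, the helper yields ver[il1]f[il1]cat[il1][o0]n [o0]f [il1]nsurance f[o0]r, which would still match a scan where the OCR produced “Verificati0n of lnsurance for.”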

Set Up The Hazel Rule

I made one rule that applies to my Dropbox Documents folder called Sort Insurance Documents.  Once I made this first rule, I was able to duplicate it for all the different documents that I wanted to sort.  All I needed to do was change a few variables.

[Screenshot: phase 2, sort based on content]

[Screenshot: rule 2]

Below is the script that you can copy/paste and then modify the variables to suit your needs.

The way the script above works is by checking the exit status of the pdfgrep command.  If it is a 0, it was successful (found a match) and if it is anything else, it failed.  So I just turned this logic into a simple if statement (this command has to run immediately after the pdfgrep command to work properly).  Then, all I needed to do was duplicate the rule, but change the search phrase and folder to move it into.
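A minimal sketch of that exit-status check, not the original script, might look like the following.  It assumes pdfgrep and tag are on the PATH and that Hazel passes the file path as the first argument.  The variable names searchPhrase, targetDir and tagName, their default values, and the function name sort_pdf are hypothetical.

```shell
#!/bin/sh
# Sketch of a content-based sorting step for a Hazel "Run shell script" action.

# Hypothetical values; duplicate the rule and change these per document type.
searchPhrase="${searchPhrase:-verification of insurance for}"
targetDir="${targetDir:-$HOME/Dropbox/Documents/Insurance}"
tagName="${tagName:-insurance}"

# Tag the PDF and move it into $targetDir if it contains $searchPhrase.
# A PDF without a match is left where it is.
sort_pdf() {
    pdf="$1"

    # -i ignores case; -q suppresses output, since only the exit status matters.
    # pdfgrep exits with 0 when it finds a match.
    if pdfgrep -q -i "$searchPhrase" "$pdf"; then
        mkdir -p "$targetDir"
        tag --add "$tagName" "$pdf"
        mv -- "$pdf" "$targetDir/"
    fi
}

# In the Hazel action, the last line would be:  sort_pdf "$1"
```

Putting pdfgrep directly in the if condition tests its exit status in place, which avoids the pitfall of another command overwriting $? before it is checked.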

[Screenshot: more sorting rules]

Admittedly, this is far from perfect, as it depends greatly on the quality of your scans and the integrity of your pdfgrep search queries, but it’s a start in the right direction.

 

11 Replies to “How I Went Paperless And Clutter-free For $32”

  1. You inspired me with your post to do something similar. However, I chose to split the scripts differently:

    * I am using Hazel to rename the file. It has a nice feature where you can define a custom match rule for the input and use that custom rule in the rename (see screenshots).

    * After renaming, I call a single Python script that does OCR and tagging, similar to your script. For tagging, however, I am using a Python script that extracts the text from the OCRed PDF using Poppler’s (http://poppler.freedesktop.org) pdftotext. I then use fuzzysearch (https://pypi.python.org/pypi/fuzzysearch) to search the extracted text for matches of tag signifiers, up to a maximum Levenshtein distance of 3 (the smaller the distance, the better). This makes up for quality differences in the OCR, where a character may be swapped for a similar one, or where characters may be split or combined (e.g. “rn” read as “m”).

    * Once the file is tagged, I am using Hazel again to move the file to the target destination according to the tag.

    Thank you again for your post.

    1. So here I spent all my time scripting my way around renaming the file like a chump when the capability was there all along! 😓

      pdftotext and fuzzysearch sound a little better than what I am doing. I think I will definitely revisit this and take a look at these new tools. Thanks so much!

      1. Did you ever play around with updating this? I have been messing with it also using different software on my phone, so I removed all the renaming stuff anyway before reading the comments here! Pfew 🙂

        1. I never did, no. Too lazy I guess. It has been working fine so I never re-did it. I may look at this again as I manage my health record workflow…

    2. So are you actually making the text file or just sending it to STDOUT? My goal was to keep the PDFs in PDF format for portability, plus any images are still retained.

      1. I am keeping the PDF, just like you. However, I found that the output of pdftotext is nicer to deal with – the tool tries to make intelligent guesses where columns can be found, etc.

        Here’s my script for OCR, grepping, and tagging. My tags are in a dictionary, in the form tags = {"Text to Grep": "Tag"}: https://gist.github.com/nd-net/21b0c64c62a3959f596d (gist because disqus’s editor really does not like indentation)

        I am using a different OCR program as I could not convince ocrmypdf to use German, but the result is pretty much the same. The only difference from your approach is that I modify the PDF in place.

    3. …what about security?

      Are you not worried… I mean it’s not encrypted, what if Dropbox gets hacked?
      I’m thinking of doing something similar, but instead of using Dropbox I might use ownCloud (running on my QNAP NAS).
      My mobile clients will have a VPN connection, so at least the data stays in my home…

      1. Security is about striking a balance without letting it get in the way. There are risks everywhere, but if you take the proper precautions, you can prevent a lot of issues. ownCloud is definitely one way to go.

  2. I also have found an issue with the way Hazel processes rules. It will only apply the first rule that matches. So I had to change the “Do the following…” section to have the first item be “Continue matching rules” then “Run shell script”. Otherwise it would only process the first rule I had created since it returned a match. The script not processing properly doesn’t have any effect on Hazel moving to the next rule for processing.

    1. Yes, I noticed this change with the new version. It did not use to behave that way. Thanks for the tip!
