Blocking Ads From 120,000+ Domains

Thanks to some members of the community, I was able to improve the speed of the gravity-adv.sh script and the number of domains it blocks.  This script is for use with the Pi-hole, the ad-blocking Raspberry Pi setup, which can now be installed automatically.  It can now block over 120,000 ad domains.  Thanks to everyone who helped.

6 Replies to “Blocking Ads From 120,000+ Domains”

  1. Hi Jacob,
    Great tutorial. I’ve set it up on a virtual machine running OpenMediaVault for testing, and it works great.

    Did you make any progress on the YouTube ads thing? I’d love to be able to skip those annoying ads… my daughter can never wait the five seconds and always ends up clicking the next video… 🙂

    1. I have made progress on blocking video ads. However, I am short on time, so it may be a while before I can update the site.

  2. Hi Jacob,

    I ran your most recent gravity.sh to fetch the 120,000+ ad block list. It creates a 40 MB adlist.conf. A quick “head -n 100 adlist.conf” will show you what is going on in that file. All these hundreds of thousands of entries are not necessary to block those ads.

    address=/02w0blksmj.centade.com/127.0.0.1
    address=/02wa2i1cng.centade.com/127.0.0.1
    address=/02wut9eqqk.centade.com/127.0.0.1
    address=/02y4idifex.centade.com/127.0.0.1

    See the sample content above.
    Running “grep -c "centade.com" adlist.conf” will tell you that your adlist.conf is devoting 8,397 lines of entries to blocking various ads from centade.com. Do a Google search for centade.com and you will see that it is a typical ad-serving host.

    Now, one does not need 8,397 entries to block ads from centade.com. We can just use one single line:

    address=/centade.com/127.0.0.1

    This is enough!

    The same goes for:
    “linkbucks.com” 2233 entries
    “babygirl-mail.info” 109 entries
    “doubleclick.net” 12519 entries
    “healthoffers.org” 110 entries

    So the following four lines will cover all 23,368 lines of entries in your adlist.conf:
    address=/centade.com/127.0.0.1
    address=/babygirl-mail.info/127.0.0.1
    address=/doubleclick.net/127.0.0.1
    address=/healthoffers.org/127.0.0.1
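
    By the way, here is how you could produce those per-domain counts in one go (a rough one-liner of my own, not part of gravity.sh; note that naively keeping the last two labels as the “base domain” will miscount ccTLD domains such as example.co.uk):

    # Pull the domain out of each address= line, reduce it to its last
    # two labels, and tally how often each base domain appears.
    awk -F/ '{ n = split($2, p, "."); print p[n-1] "." p[n] }' adlist.conf | sort | uniq -c | sort -rn | head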

    Having an arbitrarily large number of entries in adlist.conf will only slow down your browsing, as dnsmasq has to go through those 120,000 entries for each URL request and then decide the outcome! That is too much work even for a beefy desktop, let alone a tiny RPi. We need to put much more thought and analysis into entries before adding them to adlist.conf; I am not sure whether that is achievable with a bash script. I will shortly start rewriting the gravity bash scripts into my PHP script on the web GUI and will add this sort of analysis and filtering there.

    At this point I already have similar filtering in place when you import an adlist.conf into the database. When importing an adlist.conf into the GUI database from the system info page, the PHP script works hard to find such redundant entries and ignores them. See a sample output of the import functionality:

    Starting Import ‘op’=2, file: /etc/dnsmasq.d/adlist-custom.conf
    SUCESSFULLY OPEN DATABASE FILE!
    Entries Entred: 0 Entries
    Entries Ignored: 123 Entries
    Time Spend: 1.228 Seconds
    Entry Ignored: [something.aboutads.info] Existing Entry Found: [aboutads.info] (Duplicate or redundent entry)
    Entry Ignored: [otherthing.ad-x.co.uk] Existing Entry Found: [ad-x.co.uk] (Duplicate or redundent entry)
    Entry Ignored: [somethingelse.adbayes.com] Existing Entry Found: [adbayes.com] (Duplicate or redundent entry)
    Entry Ignored: [test.addthis.com] Existing Entry Found: [addthis.com] (Duplicate or redundent entry)
    Entry Ignored: [ads.ads.adentifi.com] Existing Entry Found: [adentifi.com] (Duplicate or redundent entry)
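
    The check behind those “Entry Ignored” lines is simple in concept: a candidate is ignored when any parent domain of it is already present. Here is a rough shell sketch of the same idea (purely illustrative, since the real logic lives in the PHP import code; existing-domains.txt stands in for the database):

    # Illustrative only: succeed when the domain or any parent of it
    # already appears (one per line) in existing-domains.txt.
    is_redundant() {
      local domain="$1"
      while [[ "$domain" == *.* ]]; do
        grep -qxF "$domain" existing-domains.txt && return 0
        domain="${domain#*.}"   # drop the leftmost label, try the parent
      done
      return 1
    }

    # e.g. with “aboutads.info” already in the file:
    is_redundant "something.aboutads.info" && echo "Entry Ignored"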

    1. We have tried this before, but in my testing I found that doing so ends up blocking a lot of legitimate services. I’m certainly open to looking into the issue again, but with so many domains, it would be nearly impossible to verify the legitimacy of each one against its subdomains.

      In response to this, my latest version of the gravity script puts all of the domains into the hosts file, which does not seem to affect performance at all, even with over 900,000 domains in the list.
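
      Roughly, the hosts-file approach looks like this (a minimal sketch; the paths are illustrative rather than the exact gravity code). dnsmasq reads a hosts-format file supplied via addn-hosts= into memory once, instead of evaluating a separate address= directive per query:

      # Convert “address=/domain/127.0.0.1” lines into hosts format
      # (“127.0.0.1 domain”) and point dnsmasq at the result.
      sed 's|address=/\(.*\)/127\.0\.0\.1|127.0.0.1 \1|' adlist.conf > /etc/pihole/gravity.list
      echo 'addn-hosts=/etc/pihole/gravity.list' > /etc/dnsmasq.d/02-gravity.conf
      service dnsmasq restart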

      I understand your reasoning and it makes complete sense, but think about sites like Hulu. If we block things using the base domain (by removing the subdomain when counting it as a duplicate), we would be blocking hulu.com, a legitimate service:

      address=/hulu.com/127.0.0.1

      But our real intention is to block only the ads from Hulu:

      address=/ads.hulu.com/127.0.0.1
      address=/ads-a-darwin.hulu.com/127.0.0.1

      I love the idea, and I would use it if we could get it to work without disrupting services. Try using pizzafritta’s code to sort the list and see if you can browse the Web without issue; I was barely able to do anything.

      Also, I do not want my job to become maintaining a huge list (the people managing the source lists already do that for us). I am far more interested in writing a script to gather all that information and then use it to block ads. So as long as the performance of the Pi-hole isn’t affected, I probably won’t be putting much effort into it.

      I know you have already made an entire Web interface for the Pi-hole, which is amazing, and as I mentioned, I’m not completely against the idea if it works. The great thing about open-source software is that you can fork my project and do something like that (and make a pull request if you get it working). Likewise, I could use your WebGUI to supplement the Pi-hole (which I hope to do once I can get things more stable).

      Let me know your thoughts.

      1. Hi Jacob,

        What you said is true. But from observation, I found that issues like hulu.com are an extreme minority compared to the 120,000+ entries, and hulu.com-type issues can be solved using your whitelist.

        Consolidating many subdomain entries into a base domain can be done for base domains that have more than 10 subdomain entries. So the algorithm goes something like this:

        We break every URL down to its base domain name (something.com, etc.).
        Then we keep count of how many subdomain entries exist for that base domain name. Where the number of entries is more than 10, we discard all the subdomain entries and enter only a single block entry for the base domain; otherwise, we keep the multiple subdomain.base-domain.something entries.

        What is the likelihood of a legit domain serving ads from 10+ subdomains? This method can also be tweaked further by changing the number 10 into a number ‘n’, where ‘n’ can be 10, 20, 30, etc.
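
        For what it is worth, here is a rough awk sketch of that thresholding pass (my own illustration, using the same naive last-two-labels split for the base domain, so ccTLDs like co.uk would need special-casing):

        # Collapse groups of more than 10 subdomain entries into a single
        # base-domain entry; keep smaller groups as they are.
        awk -F/ '
          {
            n = split($2, p, ".")
            base = p[n-1] "." p[n]
            count[base]++
            lines[base] = lines[base] $0 "\n"
          }
          END {
            for (base in count)
              if (count[base] > 10)
                printf "address=/%s/127.0.0.1\n", base
              else
                printf "%s", lines[base]
          }
        ' adlist.conf > adlist-consolidated.conf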

        That said, I admit I have it a lot easier than you guys: it is far easier for me to write all of that in PHP than for you to write it in awk and sed. Hence, I am not forking your bash project yet and am just rewriting parts of it in PHP. :p

        anon

        1. As I mentioned, I’m not completely against it, but when I did try to pare the list down from 120,000 to about 97,000 and use it while browsing, I ran into more problems loading regular content.

          I would be interested to see the list you come up with so I can compare it to the one I was using.

          I would like to have a Pi-hole Web interface at some point, too, like what you are working on. I think if we collaborate some more, we can make something very cool.
