Intro
In part one of this series, I explained why you would want to prevent collecting data from bot traffic and described the basic process for doing so. This article presents two ways to identify the traffic you would want to block in the first place.
Excluding Hits Based on the User Agent
An external party, the IAB, provides Adobe with the list of user agents contained in the default IAB Bot List, so Adobe cannot share this list with you as a client. As a consolation, Adobe Workspace does provide a “Bot Name” dimension that lists the “friendly names” tied to the bot-based user agents.
But having just the friendly names of these bots does not necessarily inform us of the exact criteria the IAB list uses when it identifies bots. And so, as much as we'd like to stop collecting data from every single bot in that list, reality shows that it’s not a simple task.
That being said, we can still deploy a solution that prevents at least some bot hits from being collected. For example, let's say we want to stop collecting all hits from Googlebot. That initially sounds like a daunting task because Googlebot is seemingly everywhere. Lucky for us, however, Google has willingly published a list of its Googlebot User-Agent headers, which makes Googlebot identification that much easier to pull off. Interestingly enough, most of the Googlebot User-Agent headers contain the term "Googlebot" (what a shock!).
From a coding perspective, we can read the current browser's User-Agent string via the navigator.userAgent property. You can see your own browser's value by typing "navigator.userAgent" into the Developer Tools console.
Produce Your Own List of Bots
If you want to discover more user agent values tied to bot traffic, you could try out the following experiment via multi-suite tagging. Ironically, in this experiment, you will need to (temporarily) increase your server call collection in order to figure out how to decrease your server call collection!
Here’s the list of steps to follow:
- 1. Create a brand-new report suite that you intend to use only for this experiment. This report suite should never be used for actual reporting.
- 2. Copy all of the settings in your production report suite over to this brand-new report suite.
- 3. Leave the IAB Bot Filtering setting enabled for your production report suite but disable the IAB Bot Filtering setting for the new report suite.
- 4. Ensure that both report suites have an eVar set aside to capture the value of navigator.userAgent.
- 5. Make a small change to your implementation code by setting the aforementioned eVar equal to navigator.userAgent (via doPlugins, onBeforeEventSend, etc.). For example, if you want to use eVar10 to capture the user agent, the code to run in onBeforeEventSend would look like the following:
content.data.__adobe.analytics.eVar10 = navigator.userAgent;
- 6. Temporarily change your implementation to send all production hits to both report suites at the same time. If your contract does not include secondary server calls, these additional calls will be billed as primary server calls. Regardless, you will not want this to be a permanent change - this change will increase your server call usage - so I recommend reversing the change after a few hours or at most a day.
- 7. After reversing the changes, run a report in both report suites that shows all values collected in the aforementioned User-Agent eVar. In theory, the brand-new report suite’s eVar should have values that are not contained in the production report suite’s eVar – these never-before-seen values would represent hits from bot traffic. The production report suite should still filter out the IAB Bot-based hits from reporting while the new report suite does not filter out the IAB Bot-based hits.
- 8. Extract the list of values that show up in only the new report suite’s User-Agent eVar report (i.e., the bots) and analyze them to see if there are any patterns that emerge in their values.
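Steps 7 and 8 boil down to a set difference between two exported lists. Here is a minimal sketch, assuming you have exported each report's User-Agent eVar values as an array of strings. The function name and the sample values are hypothetical and are not something Adobe provides.

```javascript
// Hypothetical helper: returns the user agents seen only in the unfiltered
// (experiment) report suite, i.e., the likely bot user agents.
function findBotOnlyUserAgents(experimentValues, productionValues) {
  const seenInProduction = new Set(productionValues);
  // De-duplicate the experiment values, then keep those absent from production
  return [...new Set(experimentValues)].filter(
    (userAgent) => !seenInProduction.has(userAgent)
  );
}

// Illustrative values only; real exports will contain full User-Agent strings
const productionValues = ["BrowserA/1.0", "BrowserB/2.0"];
const experimentValues = [
  "BrowserA/1.0",
  "ExampleSpider/0.1",
  "BrowserB/2.0",
  "ExampleSpider/0.1",
];

const botOnly = findBotOnlyUserAgents(experimentValues, productionValues);
console.log(botOnly); // logs the values found only in the experiment suite
```

The resulting list is the raw material for spotting common substrings (such as "bot" or "spider") that you can then turn into patterns.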
Once you produce a list of patterns tied to bots from the User-Agent, you will be ready to add the necessary logic to the onBeforeEventSend property (or doPlugins, etc.). For example:
// Cancel the hit when the User-Agent contains a known bot pattern
const ua = navigator.userAgent;
if (ua.includes("Googlebot") || ua.includes("spider") || ua.includes("adbot")) {
  return false;
}
The above JavaScript code determines whether the User-Agent header contains any of “Googlebot”, “spider”, or “adbot”. If so, the onBeforeEventSend callback returns false and thus prevents the server call from being sent out in the first place. That's it.
Once you deploy this type of code in your production environment, the number of server calls collected from your site should drop noticeably. You can confirm this by running a Workspace report using the Bot Name dimension described above.
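One thing to keep in mind is that includes() is case-sensitive, so a check for "spider" would miss a user agent containing "Spider". The following is a minimal sketch of a case-insensitive variant, not a definitive implementation. The names botPatterns and isBotUserAgent are hypothetical, and the pattern list is only a placeholder for whatever your own analysis produces.

```javascript
// Hypothetical list; replace with the patterns your own analysis produced
const botPatterns = ["googlebot", "spider", "adbot"];

// Returns true when the user agent contains any pattern, ignoring case
function isBotUserAgent(userAgent, patterns = botPatterns) {
  if (typeof userAgent !== "string") return false;
  const normalized = userAgent.toLowerCase();
  return patterns.some((pattern) => normalized.includes(pattern.toLowerCase()));
}

// One of Googlebot's published User-Agent strings
const googlebotUA =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
console.log(isBotUserAgent(googlebotUA)); // true
console.log(isBotUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")); // false

// Inside onBeforeEventSend (browser only), the check would be used like this:
//   if (isBotUserAgent(navigator.userAgent)) { return false; }
```

Short patterns can also match legitimate user agents by accident, so it is worth testing each pattern against the values collected from your production report suite before deploying it.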
Excluding Hits Based on the IP Address/CIDR
Preventing Adobe from collecting hits based on the IP Address – rather than user agent – is an admittedly more complex solution and will require a little more development work on your end. The biggest hurdle to overcome is the fact there are no out-of-the-box solutions – at least none that I'm aware of – that will automatically provide a device's IP Address via JavaScript (i.e., there's no navigator.userAgent equivalent for an IP Address).
To start off, you will need to engage your web development team to deploy server-side code (or take advantage of a third-party API) that determines each visitor’s IP Address. Privacy could be a concern here, so please check with your legal/privacy teams whether your development team can actually deploy such code in the first place. Also, keep in mind Adobe will never be able to provide support, technically or legally, for such code.
If your team can deploy the aforementioned code, they will then need to deploy additional code, as high up in each page as possible, that sets a window-level (i.e., client-side) JavaScript variable equal to a string containing the user’s IP Address. I use the name userIP for this variable, as shown in this example:
window.userIP = "128.128.128.52"
You (or your development team) will also need to create an array-based JavaScript variable called ipAddressRanges, which will contain the list of IP Addresses from which you want to prevent data collection. Each entry in the ipAddressRanges array will need to be a string containing either a single IP Address or a range of IP addresses in the form of CIDR notation. Various calculators across the web can convert any IP Address range into CIDR notation. I personally think this calculator on arin.net is worth checking out due to its simple-to-use interface.
Like the userIP variable, you will need to create and set the ipAddressRanges variable as high up in each page as possible. You may leverage an Adobe Data Collection Tags "Library Loaded (Top of Page)" rule to pull this off. Regardless of how you deploy the ipAddressRanges variable, the code to set it would look something like the following:
window.ipAddressRanges = ["128.0.0.0/26","8.8.4.0/24","8.8.8.0/24","174.162.149.228"]
The four entries in the above example (three CIDR ranges and one single address) actually cover 577 different IP Addresses! The number of entries you can include in the ipAddressRanges array is unlimited, which practically means you could block data collection from, for example, a half or a third or a quarter of all the IP addresses in the world! For what it’s worth, I don’t recommend actually doing that.
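To see where the 577 figure comes from, recall that a /N prefix covers 2^(32 - N) addresses and that a bare address counts as a /32. Here is a short sketch of that arithmetic. The countAddresses helper is hypothetical, and it assumes the entries do not overlap.

```javascript
// Counts the IPv4 addresses covered by one entry; a bare address counts as /32
function countAddresses(entry) {
  const [, prefix = "32"] = entry.split("/");
  return 2 ** (32 - Number(prefix));
}

const ipAddressRanges = ["128.0.0.0/26", "8.8.4.0/24", "8.8.8.0/24", "174.162.149.228"];
const total = ipAddressRanges.reduce((sum, entry) => sum + countAddresses(entry), 0);
console.log(total); // 64 + 256 + 256 + 1 = 577
```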
For the final step, deploy the following JavaScript code into the onBeforeEventSend property (or doPlugins, etc.). Keep in mind this code won't work unless the userIP and ipAddressRanges variables have already been set on the page:
window.ipInRange=!1;
const ip4ToInt=b=>b.split(".").reduce((a,c)=>(a<<8)+ parseInt(c,10),0)>>>0,isIp4InCidr=b=>a=>{const [c,d=32]=a.split("/");a=~(2**(32-d)-1);return(ip4ToInt(b)&a)===(ip4ToInt(c)&a)},isIp4InCidrs=(b,a)=>a.some(isIp4InCidr(b));"undefined"!==typeof window.userIP&&"undefined"!==typeof window.ipAddressRanges&& (window.ipInRange=isIp4InCidrs(window.userIP, window.ipAddressRanges));
if(window.ipInRange===true)
return false;
The above code first creates a JavaScript variable called ipInRange and sets it equal to false, with the initial assumption that the user's IP Address (set in userIP) is not contained in the list of "bad" IP Addresses. However, if the code determines the user's IP Address is contained in any of the ranges specified in ipAddressRanges, it will change the value of ipInRange to true. The final two lines of code above say that if ipInRange is equal to true, then the Web SDK should not send out the server call (i.e., onBeforeEventSend returns false).
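Because the snippet above is minified, here is a more readable sketch of the same range check, written as plain functions so it can be tested outside the browser. It is one way to express the logic under the same assumptions (IPv4 only, entries are either a single address or CIDR notation). The shouldBlockIp name is hypothetical.

```javascript
// Converts a dotted IPv4 string to an unsigned 32-bit integer
function ip4ToInt(ip) {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + parseInt(octet, 10), 0) >>> 0;
}

// Checks one address against one entry; a bare address is treated as /32
function isIp4InCidr(ip, cidr) {
  const [base, prefix = "32"] = cidr.split("/");
  const bits = Number(prefix);
  // Build the network mask; a /0 prefix matches every address
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return ((ip4ToInt(ip) & mask) >>> 0) === ((ip4ToInt(base) & mask) >>> 0);
}

// True when the address falls inside any entry in the list
function isIp4InCidrs(ip, cidrs) {
  return cidrs.some((cidr) => isIp4InCidr(ip, cidr));
}

// Hypothetical wrapper: false when either input is missing or malformed
function shouldBlockIp(userIP, ipAddressRanges) {
  return (
    typeof userIP === "string" &&
    Array.isArray(ipAddressRanges) &&
    isIp4InCidrs(userIP, ipAddressRanges)
  );
}

const ranges = ["128.0.0.0/26", "8.8.4.0/24", "8.8.8.0/24", "174.162.149.228"];
console.log(shouldBlockIp("8.8.8.8", ranges)); // true
console.log(shouldBlockIp("128.128.128.52", ranges)); // false

// Inside onBeforeEventSend (browser only), the check would be used like this:
//   if (shouldBlockIp(window.userIP, window.ipAddressRanges)) { return false; }
```

Returning false from the wrapper when userIP or ipAddressRanges is missing means the hit is still sent, which matches the original code's default of ipInRange being false.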
Conclusion
At least one more idea/solution for blocking bot traffic will be coming around the corner soon, so please keep your eyes peeled!
And of course, feel free to chime in with your thoughts and questions about anything you've read here – thanks all!