Segmentation: Bot Traffic Identification & Exclusion Tool

Level 10

1/9/17

I need help identifying bot traffic because we get a ton of it: somewhere between 30-40% of our page views and 5-15% of visits.  I am not referring to known bots (Google Spider) or malicious bots attempting to take down our site or defraud us.  I am referring to third-party scrapers coming to us for information.  This type of traffic is not viewed negatively because 1) it is not harmful to our site experience, 2) everyone does it, and 3) it is difficult to police.

Because we get so much bot traffic, we spend a chunk of time working out whether swings in our KPIs are real or due to non-human traffic.  This slows us down considerably. The bots coming to our site use standard devices, user agent strings, and operating systems, and also change their IP addresses frequently.  I am able to qualitatively identify this traffic because of the following:

1. This traffic is typed/bookmarked.

2. This traffic never has any of our campaign parameters.

3. This traffic lands on pages that would not normally be a direct landing page (e.g. a specific product page).

4. This traffic is from the 'Other' device type.

5. Page Views = 1 per visit.

6. Visits = Visitors, and visits are very high (i.e. > 1k) for a single captured IP address.

So, whoever is crawling our site is clearing their cookies while staying on the same IP address, and viewing a single page per visit.   See attached for a screenshot.

It would be great to somehow aggregate visits from different visitors (cookies) where certain behaviors are taking place.  For example: 

Exclude all 'Visitors' if

1. 'Any value' for a given variable (evar/prop) shows up more than X times.

AND

2. PVs per Visit for each visit <= 1

AND

3. Traffic Source for all visits is typed/bookmarked.

We can solve for this in SQL, but I am not sure it's doable in Adobe.  Any thoughts?
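
For illustration, here is a minimal sketch in Python of the exclusion rule above, applied to exported visit-level data outside of Adobe. It is one possible reading of the rule, not an Adobe feature: the function name and the record fields (`visitor_id`, `var_value`, `page_views`, `traffic_source`) are hypothetical, and `var_value` stands for whatever eVar/prop is being counted (for example a captured IP address).

```python
from collections import defaultdict


def visitors_to_exclude(visits, threshold):
    """Return visitor IDs matching all three conditions of the rule.

    `visits` is a list of dicts with hypothetical keys:
    visitor_id, var_value, page_views, traffic_source.
    """
    # Group visits by the value of the chosen variable (eVar/prop).
    by_value = defaultdict(list)
    for visit in visits:
        by_value[visit["var_value"]].append(visit)

    excluded = set()
    for group in by_value.values():
        shows_up_often = len(group) > threshold                      # condition 1
        single_page = all(v["page_views"] <= 1 for v in group)       # condition 2
        all_direct = all(
            v["traffic_source"] == "Typed/Bookmarked" for v in group
        )                                                            # condition 3
        if shows_up_often and single_page and all_direct:
            excluded.update(v["visitor_id"] for v in group)
    return excluded
```

The key point is that the grouping happens on the variable value rather than on the cookie, so visits from many different "visitors" are evaluated together.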

12 Comments

Employee

1/9/17

 Hi Michael,

 

We're actively investigating bot identification and filtering, so someone from my team may reach out to you for a deeper conversation. Our goal is to automatically identify non-human traffic and filter it out. We'd also like to report on it so you're aware of what content malicious bots are consuming. Would you prefer that the data be excluded entirely, or do you want the full reporting abilities of Analytics to be applied to bot behavior? What kinds of reports would you want on bots? 

 

In the meantime, have you considered classifying the IP address as bot-or-not? If you combine this with Virtual Report Suites, you can create a virtual report suite that excludes bot-visitors via a segment, and that segment can be updated over time. The easiest way to keep that segment up to date is to classify various IP addresses as bot and use the segment to exclude visitors where bot_check (a classification of IP address) equals "bot". Because bot IP addresses can change over time, a more complete solution is to combine IP+day as a Customer ID in the Visitor ID service, and use customer attributes to exclude visitors with a customer ID of IP+day. This would require regular updates via FTP, but would give you flexibility to exclude data from a virtual report suite at any time.
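
As a rough sketch of the upkeep this involves, the Python below builds a tab-delimited lookup of IP address to bot_check, plus an IP+day key for the Customer ID variant. The column header `Key`, the `|` separator in the IP+day key, and both function names are assumptions made for illustration; the exact file layout Adobe expects for classification or customer attribute uploads is not shown here and should be taken from the product documentation.

```python
import csv
import io
from datetime import date


def ip_day_key(ip, day):
    """Combine an IP address and a calendar day into one ID (assumed format)."""
    return f"{ip}|{day.isoformat()}"


def build_bot_check_file(known_ips, bot_ips):
    """Return tab-delimited text mapping each IP to 'bot' or 'not bot'."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Key", "bot_check"])  # header names are assumptions
    for ip in sorted(known_ips):
        writer.writerow([ip, "bot" if ip in bot_ips else "not bot"])
    return buffer.getvalue()
```

Regenerating this file on a schedule and uploading it is what keeps the exclusion segment current as bot IP addresses change.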

Level 5

2/14/17

Bret,

 

We run into similar issues here - lots of researchers mining our content. I like your idea of IP address classifications, but we get enough traffic that many (61%) of our IP addresses show up as (Low Traffic).

 

Regarding your question of reporting on bot traffic: I don't want to pay for these requests. It's interesting, but not worth the add-on to our contract. We have Apache logs, and Kibana or Splunk does what we need. 

 

A game-changer for me would be if we could extend bot rules to include user-agent + IP address combos, and request retroactive reprocessing of data when a weird one comes in. As it stands we need to use a "cleanup" segment that we add on to as we find them, and then a virtual report suite using that cleanup segment. 

 

But, Michael makes a good point - with cloud computing getting cheaper, it's not hard to run a distributed bot across many IPs, spoofing user-agents, and executing JavaScript. I wish Adobe did more to ID and retroactively remove this stuff.

Employee

2/14/17

Thanks for the comments, Danielle, and for weighing in on the value of getting reports about bots. We agree that cloud computing is changing the landscape, making it easier for bots to spoof humans. As a result, our hope is to partner with companies who specialize in bot identification. If there is a group of people whose livelihood depends on identifying clandestine bots, they're more likely to stay on top of the problem than companies focused on larger or related problems.

 

If you want to take action on this in the short term, here are some companies that could help you with real-time bot identification. They're in no particular order: ShieldSquare, White Ops, PerimeterX. Once bots are identified, you can use JavaScript to exclude hits from coming to Analytics, using the s.abort flag (https://marketing.adobe.com/resources/help/en_US/sc/implement/abort.html).

 

Either way, we're still working hard on this, and when we have a target delivery date we plan to update this post. 

Level 10

2/17/17

s.abort is scary for me.  I would still prefer to exclude them after the fact or add them to the bot report.  I would want to know how often we are being hit and what is being hit.  The data can be used for strategic analysis and decision making; e.g. we have been able to hypothesize our competitors' strategic moves based on the bot traffic we have identified. : )

Employee Advisor

2/17/17

I agree that s.abort can be scary. This could be an opportunity to set a Customer Attribute or eVar value in JS based on the response from those bot identification platforms that Bret mentioned above. Filter them out via segment + virtual report suite and you've got the best of both worlds: clean data and the ability to analyze the bot behavior.

 

Mike - I'm assuming that the majority of these bots that you find are only on New Visits? Or do you ever see Return Visits for a bot? My assumption is that the bots clear cookies between each session, but you would know best.

Level 10

2/18/17

@Matifsoff 

 

If bots have 2+ visits, I have not been good at identifying them.  The bots I have clearly identified are New Visits.  This makes up the majority of our conversion-rate-crushing traffic. User agents are the same, IPs stick around for several hundred or thousand Visits, but cookies are cleared.  So we see something like 1500 Visits, 1500 Visitors, 1500 Product Detail Pages, 0 Custom Links per IP address.

 

We are leveraging a virtual report suite, which has been great.  The only issue I have found is that the metrics created in our standard report suite are not available for the virtual report suite within Report Builder.

 

We do leverage a 3rd party bot detector and set it as a variable, but it is mediocre at best. Where I think Adobe can play a role (stated above) is around segmentation, by somehow allowing us to exclude IP addresses that exhibit the behaviors discussed above (1500, 1500, 1500). This logic doesn't fit into the visitor, visit, hit segmentation container methodology.

 

I guess I am asking for an IP address based container.  Something like:

 

IP Address where, within X time period:

1. unique visitors is > X

AND

2. visits is > X

AND

3. page views is > X

 

I am sure there are flaws in my logic as bots will likely behave differently across sites as well as change behavior as companies become smarter at identifying them.
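
To make the proposed IP address container concrete, here is a minimal Python sketch that applies it to exported hit-level data. It is only an illustration of the logic, not something Analytics segmentation supports: the function name and the hit fields (`ip`, `visitor_id`, `visit_num`, `timestamp`) are hypothetical, each hit is assumed to be one page view, and a separate threshold is used for each metric rather than a single X.

```python
from collections import defaultdict
from datetime import datetime


def flag_ips(hits, start, end, min_visitors, min_visits, min_page_views):
    """Return IPs that exceed all three thresholds within [start, end)."""
    visitors = defaultdict(set)    # ip -> distinct visitor IDs (cookies)
    visits = defaultdict(set)      # ip -> distinct (visitor, visit number) pairs
    page_views = defaultdict(int)  # ip -> page view count

    for hit in hits:
        if not (start <= hit["timestamp"] < end):
            continue  # outside the time period
        ip = hit["ip"]
        visitors[ip].add(hit["visitor_id"])
        visits[ip].add((hit["visitor_id"], hit["visit_num"]))
        page_views[ip] += 1

    return {
        ip
        for ip in page_views
        if len(visitors[ip]) > min_visitors
        and len(visits[ip]) > min_visits
        and page_views[ip] > min_page_views
    }
```

With the 1500/1500/1500 pattern described above, an IP that clears cookies on every request exceeds all three thresholds, while a single returning human on a shared IP does not.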