Bot Traffic Through Raw Full URL

Level 5

Hi All,

 

We have a system where we look at the Raw Full URL of the page, then the Operating System, then the IP Address in Adobe Analytics to identify bot traffic. We are getting a full URL with "?test=1" in the query string that seems to be bot traffic. Are there other patterns or query strings in the Raw Full URL that would also indicate bot traffic?

 

Thanks!

12 Replies

Community Advisor and Adobe Champion

Dear @skatofiabah,

'?test=1' does not necessarily indicate a bot; more likely, somebody is testing the page load in production (it might be your QC team) or trying to clear the Akamai cache (we usually use a query string parameter for that).

I'm not at all sure that a query string parameter is a reliable way to identify bots.

Thank You, Pratheep Arun Raj B (Arun) | NextRow Digital | Terryn Winter Analytics

Level 3

Hi @skatofiabah,

 

Agreed that the query string parameter would not be indicative of a bot. Personally, we use a mixture of User Agent, IP address, country, click behaviour data, timestamp and page view events to help identify the larger bots. I would also check that the IAB filter is turned on (though it doesn't capture new or smaller bots).

Cheers,

Vernon

Level 5

Hi @Vernon_H,

What is User Agent? I don't see that dimension in Adobe. Are there any other ways to slice besides those dimensions above?

Level 3

Hi @skatofiabah,

For more information on user agent, here's documentation on the string format.

We capture the user agent in a separate eVar via the JS variable navigator.userAgent. I would personally highly recommend capturing it (especially with the rise of crawlers that power Gen AI tools).
An example of an infamous bot from ByteDance that appeared on our radar earlier this year:  
mozilla/5.0 (linux; android 5.0) applewebkit/537.36 (khtml, like gecko) mobile safari/537.36 (compatible; bytespider; spider-feedback@bytedance.com) 

From the user agent alone, you can do a quick Google search or immediately tell it's some sort of spider from bytedance.com. You can then exclude it from your reports.
Other smaller-scale bots might use legit user agents, so you will need other dimensions and metrics to support your exclusion (e.g. link click actions, bounce rates, time spent on page, etc.).
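
As a rough illustration of pairing several behavioural metrics, here is a minimal sketch. The `visit` object, its field names and every threshold are assumptions made up for this example; they are not Adobe Analytics dimensions and would need tuning against your own data.

```javascript
// Hypothetical visit summary: field names and thresholds are assumptions,
// not Adobe Analytics dimensions or recommended values.
function looksAutomated(visit) {
  const signals = [
    visit.pageViews >= 50,       // unusually many page views in one visit
    visit.linkClicks === 0,      // no link click actions at all
    visit.avgSecondsOnPage < 2,  // almost no time spent per page
  ];
  // Require at least two signals so one odd metric does not flag a human.
  return signals.filter(Boolean).length >= 2;
}
```

Requiring more than one signal is a deliberate choice here: any single metric (a bounce, a short visit) is common among real users too.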

Hope this info helps! Good luck!

 

Cheers,

Vernon

Level 5

Hi @Vernon_H,

 

We will investigate this. What does a normal user agent look like vs. a bad user agent if we capture it in an eVar?

Level 3

Hi @skatofiabah,

A normal user agent will look something like this: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'. You can also find your own user agent by opening dev tools, navigating to the console tab and typing navigator.userAgent.


The first step I would recommend, as @Jennifer_Dungan has mentioned, is to capture these user agents in an eVar (they sometimes exceed the 255 character limit). Then evaluate whether there's a way to identify the slightly more obvious bots: I would look out for Linux OS, headless Chrome and organisational names/URLs. You can do a Google search on the user agent to get more information.

 

Two examples:
"mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) headlesschrome/121.0.6167.0 safari/537.36"
"mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/126.0.0.0 safari/537.36 observepoint"

 

So far, we're only focusing on excluding the larger and more disruptive ones. Smaller ones have legit-looking user agents, so you'll need a couple of other metrics (as mentioned above) to identify them.
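
A minimal sketch of checking a captured user agent for the more obvious markers. The marker list only contains substrings from the user agents quoted in this thread, and `findUaMarker` is a hypothetical helper name; a match is a prompt to look closer, not proof of a bot.

```javascript
// Substrings taken from the user agents quoted in this thread.
// Assumption: a match is a reason to review the traffic, not proof of a bot.
const UA_MARKERS = ["bytespider", "headlesschrome", "observepoint"];

// Returns the first marker found in the user agent, or null if none match.
function findUaMarker(userAgent) {
  const ua = String(userAgent || "").toLowerCase();
  return UA_MARKERS.find((marker) => ua.includes(marker)) || null;
}
```

For example, calling `findUaMarker` with the first of the two user agents above returns "headlesschrome", while a regular desktop Chrome user agent returns null.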

Cheers,
Vernon

Community Advisor and Adobe Champion

Headless Chrome may not even be bot traffic... I know our developers are using it for some of our app webviews. Also, people who "bookmark" a website to their phone's homescreen load the website like an app (in the headless mode).

Level 5

Hi @Vernon_H,

 

What is Headless Chrome and what does that mean?

Community Advisor and Adobe Champion

Headless Chrome means that the Chrome rendering engine is used, but the "Chrome Controls" (such as the menu, previous and next buttons, URL bar, etc.) are all hidden. Basically, just the webpage content is shown, with none of the standard user controls.

 

While this can be common for bots, there are many reasons why general users would see this view as well, so it's not really a "silver bullet" of bot detection.

Level 3

Hi @Jennifer_Dungan,

 

Interesting point regarding the "bookmarks" on mobile phones - something I'll want to look into!

 

100% not a silver bullet; it has to be paired with other behavioural metrics. In our organisation's context, we do not expect a normal user to access the site via headless Chrome, and the behavioural data makes it pretty obvious that it's not a human.

 

The above were some suggested indicators to look out for.

 

Cheers,

Vernon

Community Advisor and Adobe Champion

Oh, I agree about adding other behavioural checks... it's just that, depending on your setup, your own developers may be using headless for real users. Just some added context to watch out for.

Community Advisor and Adobe Champion

User Agent is captured in the Raw Data, but not generally exposed in Workspace... however, you can capture the user agent in an eVar, either by using Launch and setting the eVar directly or by using a Processing Rule to copy the User Agent into an eVar. You can also export the User Agents with Raw Data, but I don't think it's available elsewhere, not even in the Data Warehouse. User Agent is used to determine the Browser and OS, so I guess Adobe didn't feel it was necessary to share the full user agent string... but I find there is more to be gained from keeping an eye on the User Agent itself.

 

I am suggesting an eVar over a prop so that you have more available characters (255 in an eVar vs 100 in a prop), but I would still make sure the eVar is set to Hit expiry so that it acts like a prop.
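
A minimal sketch of setting the eVar directly in custom code, for anyone who wants to see the shape of it. Here `s` stands for the AppMeasurement tracking object, `eVar50` is an assumed free slot and `setUserAgentEvar` is a hypothetical helper name; swap in whatever your implementation uses.

```javascript
// Hypothetical helper. `s` stands for the AppMeasurement tracking object,
// and eVar50 is an assumed free slot; substitute your own eVar number.
function setUserAgentEvar(s, userAgent) {
  // Trim to the 255 character eVar limit mentioned above.
  s.eVar50 = String(userAgent || "").slice(0, 255);
  return s;
}

// In the browser this would be called as:
//   setUserAgentEvar(s, navigator.userAgent);
```

Trimming explicitly in code keeps the cut-off predictable, since some user agents run longer than the limit.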


Good bots, like search engine bots (if not blocked by IAB rules), should identify themselves as bots (the User Agent should say something like "Googlebot"); however, bad bots (like scrapers) may not identify themselves so easily...

 

You can use a User Agent lookup tool like https://www.whatismyip.net/tools/user-agent-lookup.php to get familiar with various user agent strings... generally, bad bots will use weird or very old browser identification. The IE6 User Agent was a very common one a few years ago.