Add capability for clients using a subdomain like analytics.yoursitename.com to disallow search engines from crawling

Level 1

7/18/22

Description - As a user, we would like Adobe to provide us access to the robots.txt file of our analytics subdomain so we can prevent search engines from crawling it.

Why is this feature important to you - Googlebot is crawling resources on our Adobe-hosted analytics subdomain millions of times a week. This is hurting our crawl budget, resulting in less traffic and revenue for our eCommerce sites.

How would you like the feature to work - We would like the robots.txt set to disallow all crawling:

User-agent: *
Disallow: /

 

Current Behaviour - The robots.txt file is set to allow all crawling:

User-agent: *
Disallow:
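
For reference, Python's built-in urllib.robotparser is a quick way to confirm how a compliant crawler should interpret the two policies above; the sketch below is illustrative only, and the test URL on the analytics subdomain is made up.

from urllib import robotparser

# Hypothetical resource on the Adobe-hosted analytics subdomain.
TEST_URL = "https://analytics.yoursitename.com/some/resource.js"

def googlebot_can_fetch(robots_lines):
    """Parse a robots.txt given as a list of lines and check Googlebot access."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch("Googlebot", TEST_URL)

# Current behaviour: an empty Disallow allows everything.
print(googlebot_can_fetch(["User-agent: *", "Disallow:"]))    # True

# Requested behaviour: "Disallow: /" blocks the whole host.
print(googlebot_can_fetch(["User-agent: *", "Disallow: /"]))  # False
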
8 Comments

Community Advisor and Adobe Champion

7/18/22

I hear what you are saying... but this would essentially render your bot reports useless.... if you are really getting inundated with legitimate bots like Google and Bing, you can use Google Search Console and Bing Webmaster Tools to modify the crawl rate on your sites... and if you are being hit by spammy bots, you should be able to block traffic from those sources altogether at your site's server level (which will also reduce strain on your servers, and not just your analytics).
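
To make the server-level suggestion concrete, here is a minimal sketch of user-agent filtering written as a Python WSGI app; the bot names are purely illustrative, and in practice you would usually do this kind of blocking (or IP-based blocking) in your web server, CDN, or WAF configuration rather than in application code.

import re
from wsgiref.simple_server import make_server

# Example signatures of unwanted crawlers -- purely illustrative, extend as needed.
BLOCKED_BOTS = re.compile(r"(AhrefsBot|SemrushBot|MJ12bot)", re.IGNORECASE)

def app(environ, start_response):
    """Tiny WSGI app that refuses requests from blocked user agents."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if BLOCKED_BOTS.search(user_agent):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Blocked.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello.\n"]

if __name__ == "__main__":
    # Serve locally for demonstration purposes only.
    make_server("127.0.0.1", 8000, app).serve_forever()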

 

While an interesting idea... and I will upvote this... I probably wouldn't use this feature myself as I use the Bot Reports frequently.

Level 1

7/19/22

Thanks for the upvote, Jennifer. A question, though: Adobe owns and manages this subdomain, so we can't do anything at the server level, and as a consequence we also have zero access to bot reports. Even if we did, why would we ever want any bots to crawl this subdomain? Perhaps another course of action would be for Adobe to do as you suggest and block bots to this subdomain at the server level.

Community Advisor and Adobe Champion

7/19/22

@danatanseo I'm not sure why you are saying you have no access to Bot Reports... those are available to all clients....  Maybe someone hid those reports in the Admin Panel, but they exist and are important to many clients.

 

Right now they are in the "old" Reports area, but before Reports gets sunset, Adobe will be building this capability into Workspace.

 

 

[Screenshot: the Bot reports in the Reports area]

Bots - gives you a breakdown of traffic per identified bot

Bot Pages - gives you the overall bot traffic to each page

 

While bot traffic is excluded from Workspace right now, this data is also available in Raw Data Feeds, where it can be processed in more detail (and a lot of data lake teams want to use this data as well, to understand server impacts and to work with the SEO team to analyze what is happening).
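
As a rough illustration of that kind of processing, the sketch below tallies hits per crawler from a tab-delimited feed export. The file name, the presence of a header row, and the user_agent column are assumptions made for the example (real data feed deliveries describe their columns in a separate headers file), and the bot patterns are only a starting point.

import csv
import re
from collections import Counter

FEED_FILE = "hit_data_export.tsv"   # hypothetical export with a header row
UA_COLUMN = "user_agent"            # hypothetical column name

# A few well-known crawler signatures; extend as needed.
BOT_PATTERN = re.compile(r"(googlebot|bingbot|yandexbot|duckduckbot)", re.IGNORECASE)

def count_bot_hits(path):
    """Tally hits per matched bot signature in a tab-delimited export."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            match = BOT_PATTERN.search(row.get(UA_COLUMN, "") or "")
            if match:
                counts[match.group(1).lower()] += 1
    return counts

if __name__ == "__main__":
    for bot, hits in count_bot_hits(FEED_FILE).most_common():
        print(f"{bot}: {hits}")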

 

Personally, I use this information a lot to help understand the activity on the site that has been excluded from Workspace.

 

I can understand that not all companies would care, and in this case blocking that traffic is fine (for you), but like I said, I wouldn't use such a feature as it would reduce visibility into what is happening.

Level 1

7/19/22

@Jennifer_Dungan  According to kohn@adobe.com: 

"The bot reports are for a different purpose, and they are setup for Under Armour (I reviewed them with Dian last week).  The Bot reports cover the bots that are hitting www.underarmour.com, and identify which bots and which pages are being crawled.  The reports are at: Reports>Site Metrics>Bots.  The Admin access is needed to configure the bot reports, but they are already configured for Under Armour."

This request I opened pertains to https://analytics.underarmour.com 

Community Advisor and Adobe Champion

7/19/22

But your tracking server is literally just hosting JS files... there's nothing to crawl, opening your files directly doesn't trigger tracking or drive up your costs... only tracking calls affect your budget...

 

Literally the only thing that Google has on record is the domain, and not even the individual JS files... opening it results in an empty page, there are no tracking calls being made, and therefore no costs... Google, Bing, Yahoo, whoever can open this domain once or a million times and there will be no difference to your budget.....

 

Even IF Google or another bot found the individual server JS files, these won't trigger tracking calls... the JS file won't be executed when opened directly, there is no trigger code to execute the script.

Level 1

7/19/22

Also for clarification: We are not concerned that this activity is impacting our budget with Adobe. We are concerned that this activity is seriously impacting our Google crawl budget, which is finite. "Budget" in this sense is not monetary. Here is how Google defines crawl budget: https://developers.google.com/search/docs/advanced/crawling/large-site-managing-crawl-budget 

 

Historically, we have had Googlebot crawl analytics.underarmour.com so much that this activity nearly brought down our site servers, impacting real human user experience, and the site's ability to convert those visitors (lost revenue due to slow site performance). So this isn't just a problem for SEO.

Community Advisor and Adobe Champion

7/19/22

To be honest, I don't understand why it would be getting so much traffic... it's an empty page.... Checking Google Search Console for our analytics subdomain, the last time it was crawled was over 3 months ago (and only because I added it temporarily to a publicly accessible dev environment for another site while I was waiting for a new tracking server to be provisioned, and Google decided it would crawl it due to the cross-domain reference). I checked the tracking server for our next largest site; Google Search Console doesn't even have it in the index....

 

It sounds like you have something else weird going on... are you sure someone didn't add that reference into your sitemaps, which would be crawled much more frequently?
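
If you want to rule that out, something like the sketch below can scan a sitemap for references to the tracking host; the sitemap URL and host name are placeholders, and it only looks one level deep (a sitemap index would need its child sitemaps fetched as well).

import sys
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "https://www.yoursitename.com/sitemap.xml"   # placeholder
TRACKING_HOST = "analytics.yoursitename.com"               # placeholder

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def urls_in_sitemap(url):
    """Yield every <loc> entry found directly in the sitemap at the given URL."""
    with urlopen(url) as response:
        root = ET.fromstring(response.read())
    for loc in root.iter(f"{{{SITEMAP_NS}}}loc"):
        yield (loc.text or "").strip()

if __name__ == "__main__":
    offenders = [u for u in urls_in_sitemap(SITEMAP_URL) if TRACKING_HOST in u]
    if offenders:
        print("Tracking-server URLs referenced in the sitemap:")
        for url in offenders:
            print(" ", url)
        sys.exit(1)
    print("No tracking-server references found in the sitemap.")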

 

This page should essentially render as a soft 404 and Google will pretty much ignore it without any manual intervention..... 

Level 1

11/22/23

I concur with @danatanseo. This is an important item. We too get huge spikes of search engine crawlers hitting the analytics hosts of our domain and want to be able to stop it via a robots.txt disallow.

@Jennifer_Dungan - this is a common issue, as Googlebot and other search engines typically fetch all available resources from a webpage, including executing JS during the web rendering portion of the process, because they don't know which resources/scripts may impact how the page is rendered and whether dynamic content happens to be added into the page after script execution.

Additionally, we can see the volume of crawls being made against the analytics host in the Crawl Stats report of Google Search Console, validating the volume impact in relation to the rest of the site. This is a large waste of limited search crawler resources, and it would be very useful to allow blocking via robots.txt.

 

Crawl Budget Management For Large Sites | Google Search Central | Documentation | Google for Dev...