Sitemap (SEO) best practices

Level 1

Per: https://docs.adobe.com/docs/en/aem/6-1/manage/seo-and-url-management.html

"To programmatically generate a sitemap, register a Sling Servlet listening for a sitemap.xml call. The servlet can then use the resource provided via the servlet API to look at the current page and its children, outputting XML."

I believe this is talking about mysite.com/sitemap.xml (or whatever other specific sitemap XML file I want to use). Someone else thinks that is supposed to mean that every page should be able to change from

mysite.com/home.html to mysite.com/home.sitemap.xml

And then you can see the sitemap for the entire site.

That makes no sense to me. What is really meant by the quote?

12 Replies

Administrator

Hi

I have asked the documentation team to have a look at this.

Thanks and regards

Kautuk Sahni



Administrator

Hi

Ian Reasor is our internal expert. 

I have again asked him to get back to you soon on this.

~kautuk



Employee

"Someone else thinks that is supposed to mean that every page should be able to change from mysite.com/home.html to mysite.com/home.sitemap.xml And then you can see the sitemap for the entire site."

This is accurate. You can register a Sling Servlet to listen for the selector 'sitemap' with the extension 'xml'. The servlet will then process any request whose URL ends in /path/to/page.sitemap.xml. You can get the requested resource from the request and generate a sitemap from that point in the content tree by using the JCR APIs. The benefit of this approach is clearest when multiple sites are served from the same instance: a request to /content/siteA.sitemap.xml would generate a sitemap for siteA, while a request for /content/siteB.sitemap.xml would generate a sitemap for siteB, without the need for writing additional code.
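To make the "generate a sitemap from that point in the content tree" step concrete, here is a minimal sketch in plain Java. It deliberately leaves out the Sling servlet registration and the JCR calls: PageNode is a hypothetical stand-in for the requested resource and its children, and the class name, the externalBase parameter, and the ".html" suffix are assumptions for illustration, not part of any AEM API.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the "walk the tree, emit XML" step; no Sling or JCR types are used. */
final class SitemapSketch {

    /** Hypothetical stand-in for a page resource: a repository path plus child pages. */
    static final class PageNode {
        final String path;
        final List<PageNode> children = new ArrayList<>();

        PageNode(String path) {
            this.path = path;
        }

        PageNode add(PageNode child) {
            children.add(child);
            return this;
        }
    }

    /** Builds a sitemap document for the subtree rooted at the requested page. */
    static String build(PageNode root, String externalBase) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        appendPage(xml, root, externalBase);
        xml.append("</urlset>\n");
        return xml.toString();
    }

    /** Depth-first walk: one <url> entry for the page, then recurse into its children. */
    private static void appendPage(StringBuilder xml, PageNode page, String externalBase) {
        xml.append("  <url><loc>")
           .append(escape(externalBase + page.path + ".html"))
           .append("</loc></url>\n");
        for (PageNode child : page.children) {
            appendPage(xml, child, externalBase);
        }
    }

    /** Entity-escapes the characters the sitemap protocol requires to be escaped. */
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }
}
```

Because the walk starts at whatever node is passed in, calling build with the node for /content/siteA lists only siteA's pages, and the same code serves siteB when it is handed siteB's root.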

Level 1

ireasor, SEO says:

robots.txt points to your 'main' sitemap.xml (usually /sitemap.xml)

If you have "sub sitemaps" you can point to those via loc.

At no point in time does anything SEO (and therefore sitemap) related say that _every single page_ on your site should be able to return a sitemap.xml

We have multiple (AEM) sites, for locales: en_us, es_es, etc, etc so it makes sense to have those 'sites' return a unique sitemap.xml via es_es.sitemap.xml 

But why in the world allow every html to also return the same sitemap content?

You've doubled the number of files dispatcher could cache (.html and sitemap.xml for every page). You've nearly doubled your storage requirements (depending on the number of pages you have, your sitemap.xml might be as big as or bigger than your page's HTML).

I can't understand the rationalization for this interpretation of the documentation. I'm _very_ interested in kautuksahni's response.

Employee

The dispatcher will only cache a page if it has actually been requested.  The way that I have managed this in the past is by using Apache mod_rewrite rules to redirect sitemap requests to the AEM paths that will handle the request.  Following your example of a multilingual site with en_us and es_es, I would configure Apache such that a request for mysite.com/en_us/sitemap.xml would rewrite to /content/my_company/en_us.sitemap.xml and mysite.com/es_es/sitemap.xml would rewrite to /content/my_company/es_es.sitemap.xml.  In theory, nobody would ever request /content/my_company/en_us/some_other_page.sitemap.xml directly and thus this content would never be cached.  If this was a concern, however, you could rewrite any sitemap.xml requests to the language root node or block them entirely at the dispatcher.
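The mapping described above would normally live in Apache mod_rewrite configuration, not in Java. Purely to illustrate what those rules are meant to do, here is a small Java model of the same mapping. The class and method names are hypothetical, and the regular expression is an assumption about how the two example locales would be matched.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Models the public-to-internal sitemap URL mapping; this is not Apache configuration. */
final class SitemapRewriteSketch {

    // Only the two locales from the example are accepted; any other path stays unmapped.
    private static final Pattern PUBLIC_SITEMAP =
            Pattern.compile("^/(en_us|es_es)/sitemap\\.xml$");

    /** Returns the internal selector URL for a public sitemap path, or empty if it does not match. */
    static Optional<String> toInternalPath(String requestPath) {
        Matcher m = PUBLIC_SITEMAP.matcher(requestPath);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of("/content/my_company/" + m.group(1) + ".sitemap.xml");
    }
}
```

Anchoring the pattern at both ends means a deeper path such as /en_us/some_other_page/sitemap.xml is not mapped, which matches the idea of exposing one sitemap per language root.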

Level 1

ireasor

While your logic is correct ("why would anyone request it"), you're missing the point: if it can be requested, it could be requested on purpose and used as an attack vector to fill your file system and take your site offline.

I'll concede this isn't a very strong argument because it's fairly easy to block it via apache, but we are literally creating more work for no reason.

For us (and I assume most?) the locale-specific sitemaps are 'sub' sitemaps that are part of a bigger sitemap, i.e. mysite.com/sitemap.xml, and are referenced via the <loc> tag.

Can you auto generate the mysite.com/sitemap.xml? I don't think so. So we've manually created that to point to our 'sub' (child) sitemaps which are auto generated.
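For reference, a parent file that only points at child sitemaps corresponds to a sitemap index in the sitemaps.org protocol. Below is a minimal Java sketch of producing one from a fixed list of child sitemap URLs. The class name is hypothetical, and the list of URLs is assumed to be supplied by hand or by configuration; nothing here discovers the sites automatically.

```java
import java.util.List;

/** Builds a sitemaps.org sitemap index that points at already-published child sitemaps. */
final class SitemapIndexSketch {

    static String build(List<String> subSitemapUrls) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : subSitemapUrls) {
            // Each child sitemap is referenced by a <sitemap> element containing a <loc>.
            xml.append("  <sitemap><loc>").append(escape(url)).append("</loc></sitemap>\n");
        }
        xml.append("</sitemapindex>\n");
        return xml.toString();
    }

    /** Entity-escapes the characters the sitemap protocol requires to be escaped. */
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }
}
```

Passing mysite.com/en_us/sitemap.xml and mysite.com/es_es/sitemap.xml (as full URLs) gives an index with one <sitemap> entry per locale, which is the file robots.txt would point to.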

Since we want them at specific, predetermined locations, wouldn't it make sense to have only specific URLs respond with a generated sitemap, say mysite.com/en_us/sitemap.xml?

Employee

In the example, we are building a servlet to automatically generate the sitemaps, so in theory you could do anything with them that you want. While it may take some work to map the inbound sitemap.xml requests to the appropriate server URLs, it also takes work for authors to manually update a sitemap file and re-publish it. You could also have logic that pre-generates the sitemap and stores it as a file in a predetermined repository location, but this introduces its own complexities, since you will need to detect when new content is activated and rebuild/reactivate the sitemap. The main reason that I prefer having the sitemap generated by a servlet is that invalidating the dispatcher cache with a low enough statfileslevel setting will invalidate the sitemap automatically.

Level 1

ireasor,

Yep, you could do anything you want. The point is that the link describes "best practices". Is it "best practice" to configure your site so that any request with the .html removed and sitemap.xml added returns the sitemap? That, to me, doesn't make any sense. Best practice would be: /robots.txt points to your /sitemap.xml, which points to your auto-generated sitemaps for each "site" you have configured in AEM. 1 file for each site, not EVERY file.

That's what I'm trying to clarify: what is the best practice, and what did Adobe mean in their link?

Employee

gmisura, I was the author of this document and have clarified what I meant by it. What you have outlined certainly makes sense: you only need to expose a single sitemap for each site. That being said, I haven't seen any benefit in writing additional code to limit the requests, especially since this could be done from the Apache configuration. In addition, as Feike posted above, the Simple Sitemap Generator in ACS Commons (https://adobe-consulting-services.github.io/acs-aem-commons/features/simple-sitemap.html) has a generalized implementation of this approach that should meet the needs of most projects and can save you from writing code entirely.

Level 3

Hi ireasor,

I was reading through your reply here and wanted to ask: is there any way to configure a different domain for each locale site?

For example: a request for mysite.com/sitemap.xml would rewrite to mysite.com/content/my_company/en_us.sitemap.xml, and a request for mysite.es/sitemap.xml would rewrite to /content/my_company/es_es.sitemap.xml.

We are running into an issue where the sitemap URLs for every locale site start with mysite.com/xxx instead of, for example, mysite.de/xxxx.

I also looked into the Externalizer, where we can only configure one domain.

Is there any other way to fix it?

Employee

Hi anuj.pathak. In the future, it would be better to start a new thread instead of adding on to an old one. The use case you outline should be achievable through the use of mod_rewrite, though. Please see the mod_rewrite documentation for Apache HTTP Server Version 2.4 for more information.

arunp99088702, you may be able to use mappings and internalRedirects to meet this use case, but it wouldn't be the preferred approach, due in part to the complexities it would introduce in cache clearing. See the URL rewriting section of SEO and URL Management Best Practices for more information.