Cold Standby for TarMK based AEM 6.5 Author (Does it work?)

I recently opened a ticket with Adobe support about HA and disaster recovery options for AEM 6.5. Unfortunately, the engineer I spoke with didn't have much experience with the topic, referred me to some documentation, and said support could not assist with the configuration. I found this odd, given that the setup is covered in the official documentation. I imagine others have questions about it too, and having looked over the doc, I'm almost certain I will.

This doc:

https://experienceleague.adobe.com/docs/experience-manager-64/deploying/deploying/tarmk-cold-standby...

seems like it makes the most sense for our on-premises AEM 6.5 TarMK (file data store) based instance.

The instructions are somewhat complex and involved. In particular, I'm a little confused by the instructions for the standby instance's crx-quickstart/install config files.

I recall earlier AEM 5.x clustering was not recommended and maybe did not work properly?

Does anyone know if the above cold standby config is supported / recommended by Adobe, and is anyone using it successfully? Have you promoted standby to primary and did it work well?

Is it better just to rely on backups? This might result in more downtime for authors (but would be simpler) vs having a cold standby available. It seems there may be some steps involved to make a standby primary in the event of primary failure. In this situation, does standby become primary and then a new standby would need to be created/cloned/configured to have HA/DR available again?

In the past we had a standby author instance that we kept up to date by publishing to a separate author. Of course this isn't ideal, as many things are not replicated, but it was much easier to configure.

I'm curious to know what other on-premises AEM customers are using (or considering using) for author HA/DR? I'm also curious to know if anyone knows of any supplemental resources/instructions outlining cold standby configurations.

3 Replies

Hi Saravanan, 

Thanks for pointing me to that blog. I've seen the blog before for other articles and it's good. This article doesn't really go into the details of cold standby configuration or how reliable it is. It did confirm clustering in previous versions of AEM was not reliable.

I think I'll need to spend time with the official documentation and try to make sense of it, experimenting with local instances, maybe opening a new ticket with Adobe if something needs clarification.

If anyone else has experience with cold standby for AEM 6.5 (TarMK with file datastore), or links to supplemental info on it, I would be interested to hear about your experience and/or read the info.

Related, I found this post https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/crx-synchronization-of-two... by @tadreeves (the blogger referenced earlier). I'm particularly interested in the following: 

 

"The other thing that many companies will do, for instances that are running in the cloud, is to simply take a snapshot of the primary device and paste it over onto a cold standby device that you could stand up at a moment's notice. You don't get real-time updates with this, HOWEVER that can be a blessing as well. Many of the issues that will take down an author environment (or really create a "mission kill" of an author environment) are unintentional deletions or mass tag corruptions or other such user-initiated items that have nothing whatever to do with a physical outage. Such issues would be replicated automatically if you have an auto-replication setup, but if you're taking snapshots once every [interval] then you have [interval] time to refer to the backup."

 

I think this option might make the most sense for us.

I see pros as:

  1. don't have to deal with complex/brittle configuration of cold standby
  2. don't have to deal with potential performance degradation caused by replication of repo to cold standby
  3. don't have to deal with possible corruption that might be replicated to cold standby

I see cons as:

  1. snapshot standby instance might not have up-to-the-minute data
  2. snapshot standby instance could take more time to bring online
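
To make the snapshot idea a bit more concrete, here is a rough sketch of a "copy and rotate" job in Python. It is only an illustration and not something from Adobe's documentation: the function name `take_snapshot`, the `snapshot-<timestamp>` naming and the `keep` parameter are all made up for this example. It also assumes the source directory is copied while the instance is stopped (or is itself a filesystem-level snapshot), since copying a live repository directory may not give a consistent copy.

```python
import shutil
import time
from datetime import datetime
from pathlib import Path


def take_snapshot(source: Path, snapshot_root: Path, keep: int = 3) -> Path:
    """Copy `source` into a new timestamped directory under `snapshot_root`
    and prune older snapshots so that at most `keep` remain.

    Hypothetical helper for illustration; assumes `source` is not being
    written to while the copy runs.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    snapshot_root.mkdir(parents=True, exist_ok=True)

    # Pick a timestamped name that does not exist yet. The fixed-width
    # timestamp means names sort in chronological order.
    while True:
        stamp = datetime.now().strftime("%Y%m%dT%H%M%S%f")
        target = snapshot_root / f"snapshot-{stamp}"
        if not target.exists():
            break
        time.sleep(0.001)

    shutil.copytree(source, target)

    # Delete the oldest snapshots, keeping only the newest `keep`.
    snapshots = sorted(
        p for p in snapshot_root.iterdir()
        if p.is_dir() and p.name.startswith("snapshot-")
    )
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return target
```

A scheduled job could call something like `take_snapshot(Path("/opt/aem/author/crx-quickstart"), Path("/backup/aem-author"), keep=3)` once per interval (both paths are placeholders). The `keep` value is what gives you the "[interval] time to refer to the backup" window from the quote: with several older snapshots retained, an accidental deletion that made it into the latest copy can still be recovered from an earlier one.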