Cloud Managed Deployment Pipeline Feature Request: Allow pipeline to continue after publisher failure

AG7
Level 1

8/13/24

Request for Feature Enhancement (RFE) Summary: When a server fails to deploy successfully, the pipeline aborts, leaving servers in various states of deployment. Instead, when a publisher fails, the failed publisher/dispatcher pair should be left out of rotation and the deployment should continue to the next server, with a notification or incident sent to CSE and the customer. In an environment with 10 publisher/dispatcher pairs, one or even several failed servers can be left out of rotation while the rest still serve traffic properly. Aborted pipelines currently cause delays of 8-9 hours and sometimes take days to resolve. This costs all parties significant money and time. Please create a JIRA ticket for this request. Keep me posted on when it is created, and share regular updates on its progress.
Use-case: In large infrastructures, deploys can take 9-10 hours to complete. One failure can cost hundreds of thousands of dollars in delays and add anywhere from 12 hours to days of additional deployment time, leaving part of the infrastructure in varying states of deployment. Rolling back or manually deploying to the remaining servers is not an option: after 9-10 hours the content already has a delta, and manually deploying 30 packages across 21 servers takes even longer. This feature would significantly reduce hours and cost, as well as the production server downtime caused by deployment failures due to infrastructure reasons.
Current/Experienced Behavior: There are 10 publisher/dispatcher pairs. The deployment fails on publisher 3 because publisher 3 is experiencing an issue or high load. When this happens, the entire deployment is aborted. This leaves publisher/dispatcher pairs 4 through 10 undeployed, so author, pub 1, and pub 2 serve traffic on new code while the rest of the stack serves traffic on old code. Redeploying the pipeline using the re-execute feature can take another 10 hours.
Improved/Expected Behavior: When a non-author server fails to complete deployment, keep that publisher/dispatcher pair out of rotation and continue the deployment to the rest of the infrastructure. Alert the customer and CSE that a publisher/dispatcher pair is out of rotation due to a deploy failure. Add a feature to the CM pipeline that allows deploying to a single publisher/dispatcher pair, or to any single server in the infrastructure.
Environment Details (AEM version/service pack, any other specifics if applicable): AEM 6.5.21; 21-server infrastructure: 1 author, 10 publishers, 10 dispatchers.
Customer-name/Organization name: Undisclosed
Screenshot (if applicable): N/A
Code package (if applicable): N/A
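
Illustration (not a code package): the difference between the current and the requested behavior can be sketched roughly as below. This is a hypothetical Python model, not Cloud Manager code. `run_pipeline`, `DeployError`, `PipelineResult`, `flaky_deploy`, and the `continue_on_failure` flag are names invented for this sketch, and `deploy` stands in for whatever actually deploys to one publisher/dispatcher pair.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class DeployError(Exception):
    """Hypothetical error raised when deploying to one pair fails."""


@dataclass
class PipelineResult:
    deployed: List[str] = field(default_factory=list)       # running new code
    failed: List[str] = field(default_factory=list)         # deploy failed here
    not_attempted: List[str] = field(default_factory=list)  # never reached


def run_pipeline(
    pairs: List[str],
    deploy: Callable[[str], None],
    continue_on_failure: bool,
    notify: Callable[[str], None] = print,
) -> PipelineResult:
    """Deploy to each publisher/dispatcher pair in order.

    continue_on_failure=False models the current behavior (abort on the
    first failure); True models the requested behavior (skip the failed
    pair, alert, and keep going).
    """
    result = PipelineResult()
    for index, pair in enumerate(pairs):
        try:
            deploy(pair)
        except DeployError as exc:
            result.failed.append(pair)
            # Assumed stand-in for the alert to the customer and CSE.
            notify(f"{pair} failed to deploy: {exc}")
            if not continue_on_failure:
                # Current behavior: everything after the failure is skipped.
                result.not_attempted.extend(pairs[index + 1:])
                break
        else:
            result.deployed.append(pair)
    return result


def flaky_deploy(pair: str) -> None:
    """Example deploy function in which only pub3 fails."""
    if pair == "pub3":
        raise DeployError("publisher under high load")
```

With ten pairs named `pub1` through `pub10` and `flaky_deploy` as the deploy function, `continue_on_failure=False` leaves `pub4` through `pub10` in `not_attempted`, which matches the situation described under Current/Experienced Behavior. With `continue_on_failure=True`, only `pub3` ends up in `failed` and the other nine pairs are deployed.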
1 Comment

Administrator

9/13/24

@AG7 

Thanks for proposing this idea.
This has been reported to engineering under the internal reference SITES-24126. The product team will triage the request and assess its feasibility based on the prioritization model. This post will be updated as the Jira status changes.
Status changed to: Investigating