We are currently using an on-premises AEM instance. As a best practice, we typically remove two publishers from the load balancer, deploy the code packages to them, and then repeat the process for the remaining publishers. This helps avoid user impact or traffic disruption. However, this approach is slow and challenging to automate through a CI/CD pipeline.
We're considering deploying to all publishers simultaneously to streamline the process. Are there any risks associated with this approach? Additionally, how can we ensure that user traffic remains unaffected during bundle restarts following the deployment?
In the context of an AEM (Adobe Experience Manager) on-premises instance, this can be broken down into two distinct scenarios:
If your AEM website is serving primarily static content (e.g., cached pages, clientlibs, assets):
During deployments, you can safely remove all AEM Publish instances from the load balancer, ensuring no live traffic hits them. Since the content is cached (either at dispatcher or CDN level), end users will continue receiving responses without disruption. This allows for a zero-downtime deployment process while avoiding potential cache invalidation or content inconsistency issues.
If your AEM site includes dynamic content or integrations with backend/external services (e.g., APIs, commerce systems, personalization):
In such cases, your current approach—likely involving a rolling deployment or blue-green deployment strategy—is more suitable. This ensures that there’s always at least one active and stable publish instance handling live requests, minimizing the risk of downtime or broken integrations during the deployment process.
Hi @sanjeevkumart45,
Yes, there are several risks:
Bundle Restarts Cause Temporary Downtime
Deploying code packages often causes OSGi bundles to restart. During this time:
Sling may return 503s (Service Unavailable).
Pages may load partially or fail completely.
User experience degrades if all publishers are affected simultaneously.
Cache Invalidation or Inconsistency
If you're using a dispatcher or CDN (e.g., Akamai), simultaneous invalidation or rebuild across all publishers may:
Result in cache thrashing.
Increase origin traffic.
Slow down performance due to concurrent repopulation.
Session or Token Loss (if applicable)
Authenticated sessions or login flows could be interrupted during deployment if the backend (publish) is unstable, especially in secure applications.
Harder to Roll Back
If something breaks, recovering is slower since all publishers need to be reverted or fixed — rather than just one subset.
It's generally recommended to follow a blue-green approach:
Instead of all-at-once:
Keep your publisher nodes grouped (e.g., Group A and Group B).
Deploy to Group A, verify health.
Then switch traffic and deploy to Group B.
This is safer and can still be automated.
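The grouped approach can be driven by a small orchestration script. This is a minimal sketch: the four helpers (`lb_remove`, `deploy_package`, `wait_healthy`, `lb_add`) are hypothetical placeholders whose bodies you would replace with your own load balancer API calls, package install, and health polling; here they just echo.

```shell
#!/bin/sh
# Grouped (blue-green style) deployment sketch. The four helpers are
# placeholders: replace their bodies with your LB API calls, package
# install, and health polling. Host names are illustrative.
GROUP_A="pub1 pub2"
GROUP_B="pub3 pub4"

lb_remove()      { echo "draining $1"; }        # placeholder
deploy_package() { echo "deploying to $1"; }    # placeholder
wait_healthy()   { echo "health ok on $1"; }    # placeholder
lb_add()         { echo "re-attaching $1"; }    # placeholder

deploy_group() {
  # $1 is a space-separated host list; unquoted on purpose so it word-splits
  for host in $1; do
    lb_remove "$host"        # take the node out of rotation first
    deploy_package "$host"   # install the code package
    wait_healthy "$host"     # gate on health before re-attaching
    lb_add "$host"           # only now does it receive traffic again
  done
}

# Deploy Group A, verify, then Group B: traffic always has a live group.
deploy_group "$GROUP_A"
deploy_group "$GROUP_B"
```

The per-host ordering (drain, deploy, verify, re-attach) is what makes this automatable without user impact.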
Structure your deployment to:
Avoid deploying a large number of bundles or deeply interdependent code at once.
Use OSGi configurations and content packages that don't always trigger full restarts.
Pre-process packages using package filters to avoid unnecessary overwrite of existing resources.
Hope that helps!
Hi @sanjeevkumart45,
Yes, there are definite risks with deploying to all publishers simultaneously:
Service Downtime or Disruption
All publishers restart bundles at the same time, potentially causing temporary unavailability or errors for users.
Load Balancer Health Check Failures
If health checks are not precise or slow to detect the node's unhealthy state during restarts, traffic may be routed to nodes still coming up, causing failures.
Widespread Impact of Bugs
Any deployment bug affects all publishers immediately, causing a full outage rather than a limited impact.
Difficult Rollback
Rolling back a faulty deployment is more complex because all nodes are updated simultaneously.
Cache Invalidation Issues
Simultaneous bundle restarts may cause large-scale cache invalidation or warming delays, impacting performance temporarily.
User Experience Degradation
Users may experience errors, slow page loads, or failed asset deliveries during the restart window.
To mitigate the risks of deploying to all publishers simultaneously, and to keep user traffic unaffected during the bundle restarts that follow a deployment, follow these best practices:
Use Robust Health Checks:
Configure the load balancer to perform detailed health checks that verify the full readiness of each publisher (not just basic server up/down). Only route traffic to publishers that have completed bundle restarts and are fully operational.
Graceful Shutdown and Startup:
Implement graceful shutdown hooks on publishers so they stop accepting new requests before restarting bundles, allowing ongoing requests to finish without disruption.
Leverage Caching Layers:
Utilize AEM Dispatcher caching along with any CDN or reverse proxy caching to serve user requests while publishers are restarting. This reduces load on publishers and minimizes user-visible downtime.
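Caching can absorb most of the restart window. As an illustrative fragment (paths and globs are examples, not your actual farm), the relevant part of a Dispatcher farm configuration might look like this; `/serveStaleOnError`, available in newer Dispatcher versions, lets the Dispatcher keep serving stale cached content when a publisher returns an error mid-restart:

```
/cache {
  /docroot "/var/www/html"
  /rules {
    # cache everything cacheable (illustrative; tighten for real sites)
    /0000 { /glob "*" /type "allow" }
  }
  # Serve stale cached content if the render (publish) errors out,
  # e.g. while OSGi bundles are restarting during a deployment
  /serveStaleOnError "1"
}
```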
Handle Direct Application-Level Requests:
Note that caching primarily helps with static or cacheable content. Requests involving query parameters, servlets, or other dynamic endpoints often bypass cache and hit the publisher directly. For these, robust health checks and load balancer routing are critical to avoid sending traffic to restarting instances. Consider session draining or designing stateless, retryable endpoints to improve resilience during restarts.
Deploy During Low-Traffic Periods:
Schedule deployments during off-peak hours to minimize the number of affected users in case of transient issues.
Monitoring and Alerts:
Set up real-time monitoring on health check endpoints, error rates, and response times to detect and respond quickly if issues arise.
Overall, the goal is to make deployments as smooth and low-risk as possible, but zero risk is rarely achievable in complex systems. Therefore, a rolling deployment strategy is generally the safest approach to minimize risk while maintaining availability.
Recommended Approach: Rolling Deployment - a deployment strategy where updates are gradually rolled out to a subset of servers (or nodes) at a time. Commonly used in Kubernetes, AWS, Azure DevOps, and CI/CD pipelines.
Let’s say you have 4 publishers: P1, P2, P3, P4.
Batch 1: P1 and P2
Remove from load balancer or mark as draining.
Deploy packages.
Wait for:
Bundle stability (system/console/bundles status check).
Health check to return 200 (/status.html, custom health endpoint).
Add P1 and P2 back.
Batch 2: P3 and P4
Same as above.
At no point is the entire publishing layer down; you maintain uptime while keeping the overall deployment time reasonable.
To ensure traffic remains unaffected, add these:
Custom endpoint like /bin/healthcheck that checks:
Sling readiness
Key bundles (WCM, DAM, Granite, Project Core)
Dispatcher cache
Pre-warm your dispatcher cache after each batch deploys to avoid slow first-page loads.
Ensure your LB has retry logic when a node is slow/unavailable temporarily.
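The health gate between batches can be a simple polling loop. A minimal sketch, where the endpoint URL and timeout are assumptions to adapt to your own health servlet (e.g. the hypothetical /bin/healthcheck above):

```shell
#!/bin/sh
# Poll an HTTP health endpoint until it returns success (2xx) or a
# timeout expires. Returns 0 when healthy, 1 on timeout.
wait_healthy() {
  url="$1"
  timeout="${2:-300}"   # seconds; default 5 minutes
  waited=0
  until curl -sf -o /dev/null "$url"; do
    waited=$((waited + 5))
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 5
  done
  return 0
}
```

Usage would be something like `wait_healthy http://pub1:4503/bin/healthcheck 600` before re-attaching a node to the load balancer.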
Here’s a clear comparison table between Deploying to All Publishers Simultaneously and Rolling Deployment for your on-prem AEM setup:
| Aspect | Deploying to All Publishers Simultaneously | Rolling Deployment |
| --- | --- | --- |
| Speed | Fastest: all updates happen at once | Slower: updates happen in batches |
| User Impact Risk | Medium to high: all publishers restart at once, which may cause downtime or errors | Low: some publishers remain live, minimizing disruption |
| Automation | Easier to automate in CI/CD pipelines | Requires orchestration but still automatable |
| Load Balancer Dependency | Needs very robust health checks to avoid routing traffic to restarting nodes | Health checks verify small batches, reducing risk |
| Rollback Complexity | Higher: entire environment affected at once | Easier: rollback limited to the updated subset |
| Operational Complexity | Lower: single-step deployment | Moderate: staged deployment and monitoring |
| Best Use Case | Non-critical environments or very robust infrastructure | Production environments needing high availability |
Hope this helps!
Hi @sanjeevkumart45 ,
While others have already highlighted the impact of deploying simultaneously to all publishers, I’d like to share our experience. In our project, we had a similar setup and implemented a shell script to automate the deployment process. The script removes a publisher from the load balancer, deploys the code, verifies the health status, and then adds it back to the load balancer. This approach has allowed us to automate the deployment steps to a good extent, supporting our CI/CD goals.
An alternative approach often suggested is the blue-green deployment model. While effective, we found it to be cost-prohibitive in an on-premises environment due to the need for duplicate infrastructure.
First, rely on Dispatcher caching to make sure the content is cached, so that most requests are served from the cache rather than reaching the AEM publish instances at all.
You can then write a shell script in Jenkins to automate the process: remove a publish instance from the load balancer, install the required code package, check system health, make a call to the homepage to verify the status code, re-attach the instance, and then perform the same steps on the next publish instance.
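Those per-instance steps can be sketched as a single function. This is a hedged sketch, not a finished pipeline: `lb_remove`/`lb_add` are placeholders for your load balancer's API, the package name and credentials are hypothetical, and the install call uses the CRX Package Manager HTTP API (`/crx/packmgr/service.jsp`):

```shell
#!/bin/sh
# Per-publisher deploy step sketch. lb_remove/lb_add are placeholders
# for your LB's drain/attach API; PKG and credentials are hypothetical.
PKG="mysite.all-1.0.zip"

lb_remove() { echo "removing $1 from the load balancer"; }   # placeholder
lb_add()    { echo "adding $1 back to the load balancer"; }  # placeholder

deploy_host() {
  host="$1"
  lb_remove "$host"
  # Upload and install the package via the CRX Package Manager HTTP API
  curl -sf -u "$AEM_USER:$AEM_PASS" \
       -F "file=@$PKG" -F "force=true" -F "install=true" \
       "http://$host:4503/crx/packmgr/service.jsp" || return 1
  # Smoke-test the homepage before returning the node to rotation
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://$host:4503/")
  [ "$code" = "200" ] || return 1
  lb_add "$host"
}
```

A Jenkins job would call `deploy_host` for each publisher in turn, stopping the pipeline if any step returns non-zero.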
Hi @sanjeevkumart45 ,
Deploying simultaneously to all publishers risks end-user disruption, even for a few seconds:
- Bundle restarts during deployment can cause 503 errors.
- All publishers going down means no fallback for traffic.
- Health checks may lag, causing traffic to hit half-ready instances.
- If there's a bug in the package, rollback becomes chaotic.
- Dispatcher cache eviction and warming will hit all publishers at once.
Rolling Deployment Strategy
Deploy in batches (2 at a time if you have 4 publishers) while others continue serving traffic.
Assume Publishers: pub1, pub2, pub3, pub4
Batch 1: pub1 & pub2
- Remove from Load Balancer (use API or route removal).
- Install Code Package (via curl or Jenkins deploy plugin).
- Trigger Bundle Stabilization Check
Example:
curl -s -u admin:admin http://pub1:4503/system/console/bundles/.json | jq -r '.status'
(expect something like "Bundle information: 602 bundles in total - all 602 bundles active.")
curl -f http://pub1:4503/libs/granite/core/content/login.html
Add back to Load Balancer
Batch 2: pub3 & pub4
Repeat the same steps.
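The bundle check above can be made stricter by asserting on the status line the Felix web console returns, rather than just printing it. A small helper, assuming the usual Felix wording (when stable it reads like "Bundle information: 602 bundles in total - all 602 bundles active."; otherwise it lists resolved/installed counts):

```shell
#!/bin/sh
# Return 0 only when the Felix status line says every bundle is active.
bundles_stable() {
  echo "$1" | grep -q "all [0-9]* bundles active"
}
```

Loop on this (with a sleep between attempts) before re-attaching each node to the load balancer.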
Regards,
Amit
@sanjeevkumart45 Just checking in — were you able to resolve your issue?
We’d love to hear how things worked out. If the suggestions above helped, marking a response as correct can guide others with similar questions. And if you found another solution, feel free to share it — your insights could really benefit the community. Thanks again for being part of the conversation!