We are currently using an on-premises AEM instance. As a best practice, we typically remove two publishers from the load balancer, deploy the code packages to them, and then repeat the process for the remaining publishers. This helps avoid user impact or traffic disruption. However, this approach is slow and challenging to automate through a CI/CD pipeline.
We're considering deploying to all publishers simultaneously to streamline the process. Are there any risks associated with this approach? Additionally, how can we ensure that user traffic remains unaffected during bundle restarts following the deployment?
In the context of an AEM (Adobe Experience Manager) on-premises instance, this can be broken down into two distinct scenarios:
If your AEM website is serving primarily static content (e.g., cached pages, clientlibs, assets):
During deployments, you can safely remove all AEM Publish instances from the load balancer, ensuring no live traffic hits them. Since the content is cached (either at dispatcher or CDN level), end users will continue receiving responses without disruption. This allows for a zero-downtime deployment process while avoiding potential cache invalidation or content inconsistency issues.
If your AEM site includes dynamic content or integrations with backend/external services (e.g., APIs, commerce systems, personalization):
In such cases, your current approach—likely involving a rolling deployment or blue-green deployment strategy—is more suitable. This ensures that there’s always at least one active and stable publish instance handling live requests, minimizing the risk of downtime or broken integrations during the deployment process.
Hi @sanjeevkumart45,
Yes, there are several risks:
Bundle Restarts Cause Temporary Downtime
Deploying code packages often causes OSGi bundles to restart. During this time:
Sling may return 503s (Service Unavailable).
Pages may load partially or fail completely.
User experience degrades if all publishers are affected simultaneously.
Cache Invalidation or Inconsistency
If you're using a dispatcher or CDN (e.g., Akamai), simultaneous invalidation or rebuild across all publishers may:
Result in cache thrashing.
Increase origin traffic.
Slow down performance due to concurrent repopulation.
Session or Token Loss (if applicable)
Authenticated sessions or login flows could be interrupted during deployment if the backend (publish) is unstable, especially in secure applications.
Harder to Roll Back
If something breaks, recovering is slower since all publishers need to be reverted or fixed — rather than just one subset.
It's generally recommended to follow a blue-green approach:
Instead of all-at-once:
Keep your publisher nodes grouped (e.g., Group A and Group B).
Deploy to Group A, verify health.
Then switch traffic and deploy to Group B.
This is safer and can still be automated.
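The grouped approach can be driven by a small orchestration script. This is a minimal sketch: the four helpers (`lb_remove`, `deploy_package`, `wait_healthy`, `lb_add`) are hypothetical placeholders whose bodies you would replace with your own load balancer API calls, package install, and health polling; here they just echo.

```shell
#!/bin/sh
# Grouped (blue-green style) deployment sketch. The four helpers are
# placeholders: replace their bodies with your LB API calls, package
# install, and health polling. Host names are illustrative.
GROUP_A="pub1 pub2"
GROUP_B="pub3 pub4"

lb_remove()      { echo "draining $1"; }        # placeholder
deploy_package() { echo "deploying to $1"; }    # placeholder
wait_healthy()   { echo "health ok on $1"; }    # placeholder
lb_add()         { echo "re-attaching $1"; }    # placeholder

deploy_group() {
  # $1 is a space-separated host list; unquoted on purpose so it word-splits
  for host in $1; do
    lb_remove "$host"        # take the node out of rotation first
    deploy_package "$host"   # install the code package
    wait_healthy "$host"     # gate on health before re-attaching
    lb_add "$host"           # only now does it receive traffic again
  done
}

# Deploy Group A, verify, then Group B: traffic always has a live group.
deploy_group "$GROUP_A"
deploy_group "$GROUP_B"
```

The per-host ordering (drain, deploy, verify, re-attach) is what makes this automatable without user impact.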
Structure your deployment to:
Avoid deploying a large number of bundles or deeply interdependent code at once.
Use OSGi configurations and content packages that don't always trigger full restarts.
Pre-process packages using package filters to avoid unnecessary overwrite of existing resources.
Hope that helps!
Hi @sanjeevkumart45,
Yes, there are definite risks with deploying to all publishers simultaneously:
Service Downtime or Disruption
All publishers restart bundles at the same time, potentially causing temporary unavailability or errors for users.
Load Balancer Health Check Failures
If health checks are not precise or slow to detect the node's unhealthy state during restarts, traffic may be routed to nodes still coming up, causing failures.
Widespread Impact of Bugs
Any deployment bug affects all publishers immediately, causing a full outage rather than a limited impact.
Difficult Rollback
Rolling back a faulty deployment is more complex because all nodes are updated simultaneously.
Cache Invalidation Issues
Simultaneous bundle restarts may cause large-scale cache invalidation or warming delays, impacting performance temporarily.
User Experience Degradation
Users may experience errors, slow page loads, or failed asset deliveries during the restart window.
To mitigate the risks of deploying to all publishers simultaneously, and to keep user traffic unaffected during the bundle restarts that follow a deployment, follow these best practices:
Use Robust Health Checks:
Configure the load balancer to perform detailed health checks that verify the full readiness of each publisher (not just basic server up/down). Only route traffic to publishers that have completed bundle restarts and are fully operational.
Graceful Shutdown and Startup:
Implement graceful shutdown hooks on publishers so they stop accepting new requests before restarting bundles, allowing ongoing requests to finish without disruption.
Leverage Caching Layers:
Utilize AEM Dispatcher caching along with any CDN or reverse proxy caching to serve user requests while publishers are restarting. This reduces load on publishers and minimizes user-visible downtime.
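Caching can absorb most of the restart window. As an illustrative fragment (paths and globs are examples, not your actual farm), the relevant part of a Dispatcher farm configuration might look like this; `/serveStaleOnError`, available in newer Dispatcher versions, lets the Dispatcher keep serving stale cached content when a publisher returns an error mid-restart:

```
/cache {
  /docroot "/var/www/html"
  /rules {
    # cache everything cacheable (illustrative; tighten for real sites)
    /0000 { /glob "*" /type "allow" }
  }
  # Serve stale cached content if the render (publish) errors out,
  # e.g. while OSGi bundles are restarting during a deployment
  /serveStaleOnError "1"
}
```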
Handle Direct Application-Level Requests:
Note that caching primarily helps with static or cacheable content. Requests involving query parameters, servlets, or other dynamic endpoints often bypass cache and hit the publisher directly. For these, robust health checks and load balancer routing are critical to avoid sending traffic to restarting instances. Consider session draining or designing stateless, retryable endpoints to improve resilience during restarts.
Deploy During Low-Traffic Periods:
Schedule deployments during off-peak hours to minimize the number of affected users in case of transient issues.
Monitoring and Alerts:
Set up real-time monitoring on health check endpoints, error rates, and response times to detect and respond quickly if issues arise.
Overall, the goal is to make deployments as smooth and low-risk as possible, but zero risk is rarely achievable in complex systems. Therefore, a rolling deployment strategy is generally the safest approach to minimize risk while maintaining availability.
Recommended Approach: Rolling Deployment - a deployment strategy where updates are gradually rolled out to a subset of servers (or nodes) at a time. Commonly used in Kubernetes, AWS, Azure DevOps, and CI/CD pipelines.
Let’s say you have 4 publishers: P1, P2, P3, P4.
Batch 1: P1 and P2
Remove from load balancer or mark as draining.
Deploy packages.
Wait for:
Bundle stability (system/console/bundles status check).
Health check to return 200 (/status.html, custom health endpoint).
Add P1 and P2 back.
Batch 2: P3 and P4
Same as above.
At no point is the entire publishing layer down; you maintain uptime while keeping the overall deployment time reasonable.
To ensure traffic remains unaffected, add these:
Custom endpoint like /bin/healthcheck that checks:
Sling readiness
Key bundles (WCM, DAM, Granite, Project Core)
Dispatcher cache
Pre-warm your dispatcher cache after each batch deploys to avoid slow first-page loads.
Ensure your LB has retry logic when a node is slow/unavailable temporarily.
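The health gate between batches can be a simple polling loop. A minimal sketch, where the endpoint URL and timeout are assumptions to adapt to your own health servlet (e.g. the hypothetical /bin/healthcheck above):

```shell
#!/bin/sh
# Poll an HTTP health endpoint until it returns success (2xx) or a
# timeout expires. Returns 0 when healthy, 1 on timeout.
wait_healthy() {
  url="$1"
  timeout="${2:-300}"   # seconds; default 5 minutes
  waited=0
  until curl -sf -o /dev/null "$url"; do
    waited=$((waited + 5))
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 5
  done
  return 0
}
```

Usage would be something like `wait_healthy http://pub1:4503/bin/healthcheck 600` before re-attaching a node to the load balancer.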
Here’s a clear comparison table between Deploying to All Publishers Simultaneously and Rolling Deployment for your on-prem AEM setup:
| Aspect | Deploying to All Publishers Simultaneously | Rolling Deployment |
| --- | --- | --- |
| Speed | Fastest: all updates happen at once | Slower: updates happen in batches |
| User Impact Risk | Medium to high: all publishers restart at once, which may cause downtime or errors | Low: some publishers remain live, minimizing disruption |
| Automation | Easier to automate in CI/CD pipelines | Requires orchestration but still automatable |
| Load Balancer Dependency | Needs very robust health checks to avoid routing traffic to restarting nodes | Health checks verify small batches, reducing risk |
| Rollback Complexity | Higher: entire environment affected at once | Easier: rollback limited to the updated subset |
| Operational Complexity | Lower: single-step deployment | Moderate: staged deployment and monitoring |
| Best Use Case | Non-critical environments or very robust infrastructure | Production environments needing high availability |
Hope this helps!
Hi @sanjeevkumart45 ,
While others have already highlighted the impact of deploying simultaneously to all publishers, I’d like to share our experience. In our project, we had a similar setup and implemented a shell script to automate the deployment process. The script removes a publisher from the load balancer, deploys the code, verifies the health status, and then adds it back to the load balancer. This approach has allowed us to automate the deployment steps to a good extent, supporting our CI/CD goals.
An alternative approach often suggested is the blue-green deployment model. While effective, we found it to be cost-prohibitive in an on-premises environment due to the need for duplicate infrastructure.
First, rely on Dispatcher caching to make sure the content is cached, so that most requests are served from the cache rather than reaching the AEM publish instances at all.
You can then write a shell script in Jenkins to automate the process: remove a publish instance from the load balancer, install the required code package, check system health, make a call to the homepage to verify the status code, re-attach the instance, and then perform the same steps on the next publish instance.
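Those per-instance steps can be sketched as a single function. This is a hedged sketch, not a finished pipeline: `lb_remove`/`lb_add` are placeholders for your load balancer's API, the package name and credentials are hypothetical, and the install call uses the CRX Package Manager HTTP API (`/crx/packmgr/service.jsp`):

```shell
#!/bin/sh
# Per-publisher deploy step sketch. lb_remove/lb_add are placeholders
# for your LB's drain/attach API; PKG and credentials are hypothetical.
PKG="mysite.all-1.0.zip"

lb_remove() { echo "removing $1 from the load balancer"; }   # placeholder
lb_add()    { echo "adding $1 back to the load balancer"; }  # placeholder

deploy_host() {
  host="$1"
  lb_remove "$host"
  # Upload and install the package via the CRX Package Manager HTTP API
  curl -sf -u "$AEM_USER:$AEM_PASS" \
       -F "file=@$PKG" -F "force=true" -F "install=true" \
       "http://$host:4503/crx/packmgr/service.jsp" || return 1
  # Smoke-test the homepage before returning the node to rotation
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://$host:4503/")
  [ "$code" = "200" ] || return 1
  lb_add "$host"
}
```

A Jenkins job would call `deploy_host` for each publisher in turn, stopping the pipeline if any step returns non-zero.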
Hi @sanjeevkumart45 ,
Deploying simultaneously to all publishers risks end-user disruption, even for a few seconds:
- Bundle restarts during deployment can cause 503 errors.
- All publishers going down means no fallback for traffic.
- Health checks may lag, causing traffic to hit half-ready instances.
- If there's a bug in the package, rollback becomes chaotic.
- Dispatcher cache eviction and warming will hit all publishers at once.
Rolling Deployment Strategy
Deploy in batches (2 at a time if you have 4 publishers) while others continue serving traffic.
Assume Publishers: pub1, pub2, pub3, pub4
Batch 1: pub1 & pub2
- Remove from Load Balancer (use API or route removal).
- Install Code Package (via curl or Jenkins deploy plugin).
- Trigger Bundle Stabilization Check
Example:
curl -s -u admin:admin http://pub1:4503/system/console/bundles/.json | jq -r '.status'
(expect something like "Bundle information: 602 bundles in total - all 602 bundles active.")
curl -f http://pub1:4503/libs/granite/core/content/login.html
Add back to Load Balancer
Batch 2: pub3 & pub4
Repeat the same steps.
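The bundle check above can be made stricter by asserting on the status line the Felix web console returns, rather than just printing it. A small helper, assuming the usual Felix wording (when stable it reads like "Bundle information: 602 bundles in total - all 602 bundles active."; otherwise it lists resolved/installed counts):

```shell
#!/bin/sh
# Return 0 only when the Felix status line says every bundle is active.
bundles_stable() {
  echo "$1" | grep -q "all [0-9]* bundles active"
}
```

Loop on this (with a sleep between attempts) before re-attaching each node to the load balancer.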
Regards,
Amit
@sanjeevkumart45 Just checking in — were you able to resolve your issue?
We’d love to hear how things worked out. If the suggestions above helped, marking a response as correct can guide others with similar questions. And if you found another solution, feel free to share it — your insights could really benefit the community. Thanks again for being part of the conversation!