SOLVED

Deploying code package to all publishers


Level 2

We are currently using an on-premises AEM instance. As a best practice, we typically remove two publishers from the load balancer, deploy the code packages to them, and then repeat the process for the remaining publishers. This helps avoid user impact or traffic disruption. However, this approach is slow and challenging to automate through a CI/CD pipeline.

We're considering deploying to all publishers simultaneously to streamline the process. Are there any risks associated with this approach? Additionally, how can we ensure that user traffic remains unaffected during bundle restarts following the deployment?

1 Accepted Solution


Correct answer by
Level 3

Hi @sanjeevkumart45 

 

In the context of an AEM (Adobe Experience Manager) on-premises instance, this can be broken down into two distinct scenarios:

  1. If your AEM website is serving primarily static content (e.g., cached pages, clientlibs, assets):
    During deployments, you can safely remove all AEM Publish instances from the load balancer, ensuring no live traffic hits them. Since the content is cached (either at dispatcher or CDN level), end users will continue receiving responses without disruption. This allows for a zero-downtime deployment process while avoiding potential cache invalidation or content inconsistency issues (a quick cache check is sketched after this list).

  2. If your AEM site includes dynamic content or integrations with backend/external services (e.g., APIs, commerce systems, personalization):
    In such cases, your current approach—likely involving a rolling deployment or blue-green deployment strategy—is more suitable. This ensures that there’s always at least one active and stable publish instance handling live requests, minimizing the risk of downtime or broken integrations during the deployment process.
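For scenario 1, a minimal sketch of that cache check (the dispatcher docroot, domain, and page path below are assumptions about your setup):

# Confirm a key page is already cached on the dispatcher before pulling publishers out of rotation
ls -l /mnt/var/www/html/content/mysite/en.html
# Or request it through the dispatcher/CDN and verify it still returns 200
curl -s -o /dev/null -w "%{http_code}\n" "https://www.example.com/content/mysite/en.html"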

 


7 Replies


Community Advisor

Hi @sanjeevkumart45,

Yes, there are some risks, listed below:

  1. Bundle Restarts Cause Temporary Downtime
    Deploying code packages often causes OSGi bundles to restart. During this time:

    • Sling may return 503s (Service Unavailable).

    • Pages may load partially or fail completely.

    • User experience degrades if all publishers are affected simultaneously.

  2. Cache Invalidation or Inconsistency
    If you're using a dispatcher or CDN (e.g., Akamai), simultaneous invalidation or rebuild across all publishers may:

    • Result in cache thrashing.

    • Increase origin traffic.

    • Slow down performance due to concurrent repopulation.

  3. Session or Token Loss (if applicable)
    Authenticated sessions or login flows could be interrupted during deployment if the backend (publish) is unstable, especially in secure applications.

  4. Harder to Roll Back
    If something breaks, recovering is slower since all publishers need to be reverted or fixed — rather than just one subset.

It's always recommended to follow a blue-green approach:

Instead of all-at-once:

  • Keep your publisher nodes grouped (e.g., Group A and Group B).

  • Deploy to Group A, verify health.

  • Then switch traffic and deploy to Group B.

This is safer and can still be automated.
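If you automate it, a rough shell sketch of that grouped flow could look like this (host names, credentials, the package file, the /bin/healthcheck endpoint, and the lb_disable/lb_enable helpers are all assumptions; /crx/packmgr/service.jsp is AEM's standard HTTP package install endpoint):

#!/usr/bin/env bash
# Hedged sketch: deploy a package group by group; lb_disable/lb_enable stand in for
# whatever API or route change your load balancer actually exposes.
set -euo pipefail

PKG=mysite-all-1.0.0.zip                 # assumed package file
GROUP_A="pub1:4503 pub2:4503"            # assumed publisher groups
GROUP_B="pub3:4503 pub4:4503"

deploy_group() {
  for host in $1; do
    lb_disable "$host"                   # take the node out of rotation
    curl -f -u "$AEM_USER:$AEM_PASS" \
      -F "file=@${PKG}" -F force=true -F install=true \
      "http://${host}/crx/packmgr/service.jsp"
    # gate on a (custom) health endpoint before returning the node to rotation
    until curl -sf -o /dev/null "http://${host}/bin/healthcheck"; do sleep 10; done
    lb_enable "$host"
  done
}

deploy_group "$GROUP_A"                  # deploy and verify Group A first
deploy_group "$GROUP_B"                  # then switch traffic and do Group B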

Minimize bundle disruption during deployment

Structure your deployment to:

  • Avoid deploying a large number of bundles or deeply interdependent code at once.

  • Use OSGi configurations and content packages that don't always trigger full restarts.

  • Pre-process packages using package filters to avoid unnecessary overwrite of existing resources.


Hope that helps!


Santosh Sai

AEM Blogs | LinkedIn



Level 2

Hi @sanjeevkumart45,

 

Yes, there are definite risks with deploying to all publishers simultaneously:

 

  1. Service Downtime or Disruption
    All publishers restart bundles at the same time, potentially causing temporary unavailability or errors for users.

  2. Load Balancer Health Check Failures
    If health checks are imprecise or slow to detect an unhealthy node during restarts, traffic may be routed to nodes that are still coming up, causing failures.

  3. Widespread Impact of Bugs
    Any deployment bug affects all publishers immediately, causing a full outage rather than a limited impact.

  4. Difficult Rollback
    Rolling back a faulty deployment is more complex because all nodes are updated simultaneously.

  5. Cache Invalidation Issues
    Simultaneous bundle restarts may cause large-scale cache invalidation or warming delays, impacting performance temporarily.

  6. User Experience Degradation
    Users may experience errors, slow page loads, or failed asset deliveries during the restart window.

 

To mitigate the risks of deploying to all publishers simultaneously and to keep user traffic unaffected during bundle restarts after deployment, you can follow these best practices:

 

  1. Use Robust Health Checks:
    Configure the load balancer to perform detailed health checks that verify the full readiness of each publisher (not just basic server up/down). Only route traffic to publishers that have completed bundle restarts and are fully operational.

  2. Graceful Shutdown and Startup:
    Implement graceful shutdown hooks on publishers so they stop accepting new requests before restarting bundles, allowing ongoing requests to finish without disruption (a simple drain sketch appears right after this list).

  3. Leverage Caching Layers:
    Utilize AEM Dispatcher caching along with any CDN or reverse proxy caching to serve user requests while publishers are restarting. This reduces load on publishers and minimizes user-visible downtime.

  4. Handle Direct Application-Level Requests:
    Note that caching primarily helps with static or cacheable content. Requests involving query parameters, servlets, or other dynamic endpoints often bypass cache and hit the publisher directly. For these, robust health checks and load balancer routing are critical to avoid sending traffic to restarting instances. Consider session draining or designing stateless, retryable endpoints to improve resilience during restarts.

  5. Deploy During Low-Traffic Periods:
    Schedule deployments during off-peak hours to minimize the number of affected users in case of transient issues.

  6. Monitoring and Alerts:
    Set up real-time monitoring on health check endpoints, error rates, and response times to detect and respond quickly if issues arise.

  7. Quick Rollback Plan:
    Always have a rollback plan ready to quickly revert to the previous stable version if unexpected issues occur during or after deployment. This minimizes downtime and user impact.
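For point 2 above, one common drain pattern is sketched below, assuming your load balancer's health probe requests a small static file on each publisher's web tier (the file paths and host names are assumptions):

# Fail the LB health probe on pub1 deliberately, let traffic drain, then deploy
ssh pub1 'mv /var/www/html/health.txt /var/www/html/health.txt.off'   # probe now fails; LB stops sending new requests
sleep 60                                                              # allow the LB detection interval and in-flight requests to finish
# ... install the package and wait for the instance to become healthy ...
ssh pub1 'mv /var/www/html/health.txt.off /var/www/html/health.txt'   # probe passes again; node rejoins rotation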

Overall, the goal is to make deployments as smooth and low-risk as possible, but zero risk is rarely achievable in complex systems. Therefore, implementing a rolling deployment strategy is generally the safest approach to minimize risk while maintaining availability.

 

 

Recommended Approach: Rolling Deployment - a deployment strategy where updates are gradually rolled out to a subset of servers (or nodes) at a time. Commonly used in Kubernetes, AWS, Azure DevOps, and CI/CD pipelines.

 

Let’s say you have 4 publishers: P1, P2, P3, P4.

Step-by-Step CI/CD Flow:

  1. Batch 1: P1 and P2

    • Remove from load balancer or mark as draining.

    • Deploy packages.

    • Wait for:

      • Bundle stability (system/console/bundles status check).

      • Health check to return 200 (/status.html, custom health endpoint).

    • Add P1 and P2 back.

  2. Batch 2: P3 and P4

    • Same as above.

At no point is the entire publishing layer down, so you maintain uptime while reducing overall deployment time.
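A hedged sketch of that "wait for" gate, assuming admin credentials in environment variables, jq available on the build agent, and /status.html as the health page (substitute your own endpoint):

# Poll until the Felix console reports no unresolved bundles and the health page answers 200
wait_until_healthy() {
  local host="$1"
  until curl -s -u "$AEM_USER:$AEM_PASS" "http://${host}/system/console/bundles.json" \
      | jq -e '.s[3] == 0 and .s[4] == 0' > /dev/null; do
    sleep 10   # .s = [total, active, fragments, resolved, installed]
  done
  until [ "$(curl -s -o /dev/null -w '%{http_code}' "http://${host}/status.html")" = "200" ]; do
    sleep 10
  done
}

wait_until_healthy pub1:4503   # repeat for pub2, then move on to the next batch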

 

How to Make This Even Safer

To ensure traffic remains unaffected, add these:

1. Robust Health Checks

  • Custom endpoint like /bin/healthcheck that checks:

    • Sling readiness

    • Key bundles (WCM, DAM, Granite, Project Core)

    • Dispatcher cache

2. Dispatcher Caching

  • Pre-warm your dispatcher cache after each batch deploys to avoid slow first-page loads (a pre-warm sketch is included below).

3. Retry Logic in Load Balancer

  • Ensure your LB has retry logic when a node is slow/unavailable temporarily.
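For point 2 (cache pre-warming), a minimal sketch, assuming you keep a flat file of key public URLs (warm-urls.txt is an assumed file, one URL per line):

# Re-populate the dispatcher/CDN cache by requesting key pages after each batch deploys
while read -r url; do
  curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" "$url"
done < warm-urls.txt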

 

Here’s a clear comparison table between Deploying to All Publishers Simultaneously and Rolling Deployment for your on-prem AEM setup:

 

Aspect | Deploying to All Publishers Simultaneously | Rolling Deployment
Speed | Fastest; all updates happen at once | Slower; updates happen in batches
User Impact Risk | Medium to high; all publishers restart at once, which may cause downtime or errors | Low; some publishers remain live, minimizing disruption
Automation | Easier to automate in CI/CD pipelines | Requires orchestration but still automatable
Load Balancer Dependency | Must have very robust health checks to avoid routing traffic to restarting nodes | Health checks verify small batches, reducing risk
Rollback Complexity | Higher; entire environment affected at once | Easier; rollback limited to the updated subset
Operational Complexity | Lower; single-step deployment | Moderate; staged deployment and monitoring
Best Use Case | Non-critical environments or very robust infra | Production environments needing high availability

 

Hope this helps!

 

 


Level 8

Hi @sanjeevkumart45 ,

While others have already highlighted the impact of deploying simultaneously to all publishers, I’d like to share our experience. In our project, we had a similar setup and implemented a shell script to automate the deployment process. The script removes a publisher from the load balancer, deploys the code, verifies the health status, and then adds it back to the load balancer. This approach has allowed us to automate the deployment steps to a good extent, supporting our CI/CD goals.

An alternative approach often suggested is the blue-green deployment model. While effective, we found it to be cost-prohibitive in an on-premises environment due to the need for duplicate infrastructure.


Level 5

Hi @sanjeevkumart45 

 

You first need to rely on dispatcher caching to make sure the content is cached, so that most requests are served from the cache rather than reaching the AEM publish instances at all.

 

You can write a shell script in Jenkins to automate this: first remove a publish instance from the load balancer, install the required code package, check system health, and call the homepage to verify the status code; once that's done, reattach the instance and repeat the same steps on the next publish instance.
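For the homepage status check in that script, a hedged one-liner (the host and page path are assumptions):

# Prints only the HTTP status code; expect 200 before reattaching the node to the load balancer
curl -s -o /dev/null -w "%{http_code}" "http://pub1:4503/content/mysite/en/home.html"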


Community Advisor

Hi @sanjeevkumart45 ,

Deploying simultaneously to all publishers risks end-user disruption, even for a few seconds:

  - Bundle restarts during deployment can cause 503 errors.

  - All publishers going down means no fallback for traffic.

  - Health checks may lag, causing traffic to hit half-ready instances.

  - If there's a bug in the package, rollback becomes chaotic.

  - Dispatcher cache eviction and warming will hit all publishers at once.

Rolling Deployment Strategy

Deploy in batches (2 at a time if you have 4 publishers) while others continue serving traffic.

Assume Publishers: pub1, pub2, pub3, pub4

Batch 1: pub1 & pub2

  - Remove from Load Balancer (use API or route removal).

  - Install Code Package (via curl or Jenkins deploy plugin).

  - Trigger Bundle Stabilization Check
Example:

curl -s -u admin:admin http://pub1:4503/system/console/bundles.json | jq '.s'
# .s = [total, active, fragments, resolved, installed]; stable once resolved and installed are 0
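For the load-balancer step, if you happen to be on HAProxy, a hedged sketch using its runtime admin socket (the backend/server names and socket path are assumptions; other load balancers expose equivalent APIs):

# Take pub1 out of rotation before deploying, then re-enable it once the bundle check passes
echo "disable server aem_publish/pub1" | socat stdio /var/run/haproxy/admin.sock
# ... deploy the package and run the bundle/health checks above ...
echo "enable server aem_publish/pub1" | socat stdio /var/run/haproxy/admin.sock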



Administrator

@sanjeevkumart45 Just checking in — were you able to resolve your issue?
We’d love to hear how things worked out. If the suggestions above helped, marking a response as correct can guide others with similar questions. And if you found another solution, feel free to share it — your insights could really benefit the community. Thanks again for being part of the conversation!



Kautuk Sahni