Our current deploy to Production takes around 3 hours. I would like some feedback on how we can reduce it to less than 1 hour.
We have 16 AEM publish instances in 2 data centers. We install AEM packages using HTTP requests. We use Akamai to cache the static pages. Here are the steps to our deployment
Some challenges we run into
3 hours is not that bad ... but of course there's room for improvement.
First of all, the installation of the packages should be optimized.
* Unchanged JCR nodes must not be removed and recreated. This is especially important for OSGi configuration, because that often causes restarts of product services, which can take quite some time. It might help to extract the OSGi configuration nodes into a dedicated content package, so you can fine-tune the filter rules.
* Same for bundles. It helps if you can version each bundle individually. If a bundle has not changed, no new version is created, and the old version is deployed again. This reduces deployment time and has some other useful benefits.
* Observe the log files and check which services are restarted when you deploy your services. Understand why this happens and whether you can influence or avoid it.
This is quite some work, but it really pays off. Combining all packages into a single package can help, but that depends a lot on your individual circumstances.
Some other remarks:
* If new bundle versions are not picked up, but the old ones remain, please raise a Daycare ticket. I also observed it very rarely, but it did happen. Don't have a solution for it though 😕
* There is no 100% reliable way to ensure that all installations work flawlessly. You need to work on the deployment package structure as above and reduce the amount of activity it triggers.
* Use health checks to monitor the progress of the deployment, especially to understand whether the system is working properly afterwards.
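As a rough illustration of the health check point, here is a minimal Python sketch that polls a health check endpoint until it reports OK. The JSON shape (`overallResult`, `results[].status`) and the idea of passing in a URL such as a `/system/health.json` servlet are assumptions about your setup, so adjust them to whatever your instances actually expose. `wait_until_healthy` and its parameters are hypothetical names.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=10.0):
    """Fetch a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_healthy(report):
    """True if the overall result and every individual check are OK.

    Assumes a report shaped like
    {"overallResult": "OK", "results": [{"name": ..., "status": "OK"}]}.
    """
    if report.get("overallResult") != "OK":
        return False
    return all(r.get("status") == "OK" for r in report.get("results", []))


def wait_until_healthy(url, fetch=fetch_json, attempts=30, delay=10.0,
                       sleep=time.sleep):
    """Poll the endpoint until it is healthy or the attempts run out."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch(url)):
                return True
        except (OSError, ValueError):
            # Instance unreachable or body not valid JSON: treat as not ready.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Running this against each instance before it goes back into rotation gives a yes/no answer per instance instead of a manual check.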
Regarding timing: In my last project we did publish deployments (using the same blue-green deployment approach as you) in a bit more than 1 hour. The biggest chunk of time (15 min each) was spent on draining the sessions on the to-be-deployed instances. The installation of the content packages (~ 100 bundles plus a lot of content packages) took about 10 minutes.
So it's doable, you just need to understand where your time is actually spent.
The issue of the bundles not getting updated by the package install is likely due to some performance issue with the application, meaning some thread is still using the classes from the older version. I would suggest you capture thread dumps from the instance and check the stack traces to debug the issue. If you need help debugging why the bundles aren't updating, please file a case with AEM support.
In addition, it is generally suggested to take each instance out of the load balancer during a deployment and put it back afterwards. This might solve the problem of failed bundle updates.
To speed up the deployment, I would also suggest automating the deployments and as many test cases as possible. In addition, implement a crawler to prime the dispatcher cache after the deploy. This helps avoid post-deployment crashes.
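For the cache priming part, a minimal sketch in Python could look like the following. It is a list-driven warmer rather than a link-following crawler, and `prime_cache`, the base URL, and the paths are hypothetical; in practice the path list would come from a sitemap or from your access logs.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=30.0):
    """Request a URL and return its HTTP status code (0 if unreachable)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except urllib.error.URLError:
        return 0


def prime_cache(base_url, paths, fetch=http_status):
    """Request every path once so the cache in front of AEM gets filled.

    Returns a dict of URL -> status for every request that did not return 200.
    """
    failures = {}
    for path in paths:
        url = base_url.rstrip("/") + "/" + path.lstrip("/")
        status = fetch(url)
        if status != 200:
            failures[url] = status
    return failures
```

An empty result means every page was served successfully at least once; anything else is a list of pages to look at before traffic returns.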
1. Ensure you have a good versioning strategy for your bundles, as Felix has its own way of picking priority. Ideally you may not need the restart step, as Felix should refresh itself.
2. Your pipeline follows good standards. It would be good to see a timeline breakdown of these steps across the 3 hours to understand which step is taking the time.
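To get such a timeline breakdown, one option is to wrap each pipeline step in a small timer. This is only a sketch in Python; `StepTimer` and the step names in the usage example are hypothetical, and the real steps would be whatever your deployment script runs.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Collects wall-clock durations per named deployment step."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.durations = {}

    @contextmanager
    def step(self, name):
        """Time the enclosed block and add it to the step's total."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def report(self):
        """Return (name, seconds, percent of total), slowest step first."""
        total = sum(self.durations.values())
        rows = sorted(self.durations.items(), key=lambda kv: kv[1], reverse=True)
        return [(name, secs, 100.0 * secs / total if total else 0.0)
                for name, secs in rows]
```

Usage would be along these lines:

```python
timer = StepTimer()
with timer.step("drain sessions"):
    pass  # call your real drain logic here
with timer.step("install packages"):
    pass  # call your real install logic here
for name, secs, pct in timer.report():
    print(f"{name}: {secs:.0f}s ({pct:.0f}%)")
```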
Bundle deployments can get fragile, because they might trigger different kinds of activity within the OSGi container, and this depends entirely on your code plus runtime randomness. The best way to get around this is to carefully evaluate your bundles and not export Java packages that are already exported by others.
Regarding the "unchanged JCR nodes must not be removed and recreated again": With package filters you can force the Package Manager to clean up the path before installing into it again (I never got my head around which parameter does what, but you can clearly see it in the package installation logs). You should avoid this kind of filter, because especially for OSGi configurations it can be expensive in terms of runtime.
And on top: As Andrew wrote, automate your checks and deployments as well as you can. The example I gave above uses automation heavily. Basically there is not a single manual step in there (and it works flawlessly in 99.9% of all cases), otherwise it would not be possible.
Can you elaborate on this?
Can you give an example of what you mean?
One thing I have noticed is that after a package is deployed, sometimes not all the bundles are in the active state. I believe it's because we are installing the next package before all the bundles are active.
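One way to avoid installing the next package too early is to poll the bundle list between installs and only continue once everything has been active for a few checks in a row. The Python sketch below assumes the Felix web console's `/system/console/bundles.json` endpoint with a `data` array whose entries carry `symbolicName` and `state`; please verify that against your instances. `wait_for_bundles` and `stable_checks` are hypothetical names.

```python
import base64
import json
import time
import urllib.request


def fetch_bundles(base_url, user, password, timeout=30.0):
    """Read the bundle list from the web console (endpoint is an assumption)."""
    req = urllib.request.Request(base_url.rstrip("/") + "/system/console/bundles.json")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def inactive_bundles(payload):
    """Symbolic names of bundles that are neither active nor fragments."""
    pending = []
    for bundle in payload.get("data", []):
        if bundle.get("fragment") or bundle.get("state") == "Fragment":
            continue  # fragments never reach the Active state
        if bundle.get("state") != "Active":
            pending.append(bundle.get("symbolicName", "?"))
    return pending


def wait_for_bundles(fetch, attempts=60, delay=5.0, stable_checks=3,
                     sleep=time.sleep):
    """Return True once all bundles were active for stable_checks polls in a row."""
    stable = 0
    for attempt in range(attempts):
        try:
            pending = inactive_bundles(fetch())
        except (OSError, ValueError):
            pending = ["<console unreachable>"]
        stable = stable + 1 if not pending else 0
        if stable >= stable_checks:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Requiring several consecutive good polls is a deliberate choice here: bundles can look active for a moment and then restart when a refresh kicks in, so a single good reading may not be enough. A call would look like `wait_for_bundles(lambda: fetch_bundles(base_url, user, password))` after each package install.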
A lot of our time is spent on the verification step, since we are not confident the code package is deployed properly to all AEM publish instances. I'm trying to see how others are solving this problem.