As part of our 6.5 upgrade, we've decided to derive a standby instance from the primary author instance. I've carefully followed the steps in the official Adobe documentation. However, when I go to the JMX console on the primary author, it says 'channel unregistered', whereas on the standby instance the status is stuck at 'initializing'. I've restarted these instances multiple times, only to run into the same problem over and over again. Any pointers on how to troubleshoot this problem would be greatly appreciated. Thanks!
Hey @Pavan_kalyan. Unfortunately, I haven't had any luck resolving it. I did follow every step as mentioned here. My last resort is to remove all the old config files on the primary instance, set up new configuration files, and then derive a standby instance from it. I'm going to try it in a few hours. Hopefully that resolves it.
Also, I see the 'writing segments' and 'loading segments' entries in the tarmk-coldstandby.log file on the primary instance. However, all I see in the standby instance logs is 'head state did not change: skipping flush'. Something is preventing the standby from establishing a TCP/IP connection with the primary instance. I noticed that the TCP port on our primary instance is active. Also, the entry for the standby instance (with the standby's clientid) is missing from the JMX console on our primary instance, which more or less confirms that the connection hasn't been established yet.
Update: I did some more digging and realised I have all the right configurations in place. The TCP port '8088' (the one we're using instead of the default 8023) on the primary is active and listening. However, the standby isn't able to establish a connection with it. I tried using curl commands to verify whether a connection is possible or not. Any idea why this could be happening, or where to look within AEM to troubleshoot this problem (other than SegmentNodeStoreService.config and StandbyStoreService.config)?
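For anyone comparing the two StandbyStoreService.config files by hand, here is a rough Python sketch of that check. It assumes the usual OSGi .config syntax (key=T"value" lines) and the mode, port, secure and primary.host property names. The helper names are hypothetical, and the sample values (including the host 10.0.0.1) are made up for illustration.

```python
import re

# Matches simple lines such as: port=I"8088"  or  mode="primary"
_LINE = re.compile(r'^\s*([\w.\-]+)\s*=\s*([A-Za-z]?)"(.*)"\s*$')


def parse_osgi_config(text):
    """Parse simple key=T"value" lines from an OSGi .config file."""
    values = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue  # skip blanks, comments and anything more complex
        key, type_code, raw = match.groups()
        if type_code == "I":
            values[key] = int(raw)
        elif type_code == "B":
            values[key] = raw.lower() == "true"
        else:
            values[key] = raw
    return values


def find_mismatches(primary, standby):
    """Return human-readable problems found between the two configs."""
    problems = []
    if primary.get("mode") != "primary":
        problems.append("primary config does not set mode=primary")
    if standby.get("mode") != "standby":
        problems.append("standby config does not set mode=standby")
    for key in ("port", "secure"):
        if primary.get(key) != standby.get(key):
            problems.append(
                f"{key} differs: primary={primary.get(key)!r}, "
                f"standby={standby.get(key)!r}"
            )
    if not standby.get("primary.host"):
        problems.append("standby config has no primary.host")
    return problems


# Illustrative file contents only; not taken from a real instance.
PRIMARY_SAMPLE = '''
mode="primary"
port=I"8088"
secure=B"false"
'''

STANDBY_SAMPLE = '''
mode="standby"
primary.host="10.0.0.1"
port=I"8088"
secure=B"false"
'''

print(find_mismatches(parse_osgi_config(PRIMARY_SAMPLE),
                      parse_osgi_config(STANDBY_SAMPLE)))
```

An empty list only means the two files agree with each other on these properties. It does not rule out a problem elsewhere.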
Hi, we seem to be having the same issue.
I have noticed that the standby never, ever makes requests to the primary for some reason.
On the standby server, I ran curl -v http://ipaddress:8088 and got a response back indicating the port was open.
On the primary, I saw in the log "ClientFilterHandler Client /10.x.x.x:randomhighport is allowed."
Then, on the following line: io.netty.handler.codec.DecoderException: not an SSL/TLS record.
This is despite the fact that we have configured a NON-SECURE cold standby.
I can't make sense of why an AEM 6.4 instance that has had an in-place upgrade to v6.5 is failing like this.
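One way to see what the primary's standby port actually expects, without relying on curl, is to attempt a TLS handshake against it. The Python sketch below is only a diagnostic idea and not anything shipped with AEM. probe_port is a hypothetical helper, and it deliberately skips certificate verification because it only cares whether the other side answers with TLS at all.

```python
import socket
import ssl


def probe_port(host, port, timeout=3.0):
    """Return (tcp_open, tls_handshake_ok) for host:port."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return (False, False)  # refused, unreachable or timed out

    # Diagnostic only: accept any certificate, we just want to know
    # whether the server completes a TLS handshake.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with context.wrap_socket(sock):
            return (True, True)
    except (ssl.SSLError, OSError):
        # Handshake rejected, connection dropped, or no reply in time.
        return (True, False)
    finally:
        sock.close()


# Example (placeholders taken from the curl command above):
# print(probe_port("ipaddress", 8088))
```

A result of (True, True) on a port that is configured as non-secure would suggest the primary is answering with TLS anyway, which would fit the DecoderException seen in the log. (True, False) only says that no TLS handshake completed.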
We noticed the exact same error this morning!! I'm so relieved that at least I can see a relevant error. It was frustrating having to just monitor things with no luck. Also, I created two fresh AEM 6.5 instances and setting up this connection between the two was a breeze, so that confirms that the steps I've been following were correct all along.
Also, please post here if you do manage to resolve it. Thanks 🙂
Sounds like you're repeating the exact same tests and environment checks that we are.
If we have any success in isolating the exact root cause, we will share it.
At this point our theory is that the http:// request from the standby server using curl is hitting the primary's http:// interface and AEM is doing some very hidden rewrite/redirect to https://.
The curl request is NOT going through a system proxy either, so we are now focusing on the primary author and what it is doing.
We haven't been successful in resolving this.
But we did a few things today to try and isolate it.
1: Built 2 new servers.
2: Set up AEM v6.4 on one as the 'primary', then shut it down and copied it to the 'standby'.
3: Configured the 'standby' and verified it was making requests to the primary and pulling data across.
4: Installed the latest v6.4 service pack and retested that the 'standby' was still synchronising.
5: Stopped the 'standby', then did an in-place upgrade on the 'primary' to v6.5.
6: Stopped the primary, copied it to the 'standby' server and reconfigured it to start up as a 'standby'.
7: Started up the 'standby' server and checked the logs to see if the data was synchronising. It was.
So what we have proven is that a clean install of 6.4 followed by in-place upgrades works with a cold standby.
But our original servers still have the problem, and it doesn't appear to be very easy to track down.
No matter what we have tried, we cannot find a valid reason why AEM is converting an http:// request to an https:// request and triggering an SSL error.
We appear to have resolved this after a lot of testing.
We are waiting for Adobe Support to confirm and validate the bug and the fix, but for us the fix was to install oak-segment-tar-1.24.0.jar on the primary and then restart it.
Once it was up, we ran our curl command to see whether it would be rejected and whether an SSL error would occur.
For us it wasn't rejected and we got no SSL error, so we shut down the primary, copied it to another server to create a standby, and followed the known process.
As soon as we fired up the standby, it worked and synced as expected.
The problem apparently lies in the io.netty library, so v1.24.0 of oak-segment-tar, it seems, was compatible with AEM v6.5.11, which we were using.
Be careful, good luck. Hopefully more documentation comes through and a fix appears in a servicepack.
Hi @Alisahali, sorry for the delayed response; I was away for a while.
Have you tried installing the latest service pack? If not, please refer to https://experienceleague.adobe.com/docs/experience-manager-65/release-notes/release-notes.html?lang=...
and try installing it.
Once done, restart and check. Let me know if that doesn't help. Many thanks.
@Alisahali, do you see any relevant error logs? If so, can you share them? I am trying to replicate the issue. I disabled the JMX console provider, but in that case the JMX console disappears entirely and I am not able to reproduce the 'channel unregistered' error that you mentioned.
We ran into the same problem with a clean install of AEM 6.5 + SP12. Everything seems correct, but the standby does not seem to call the primary instance. I didn't find anything relevant in the standby or primary logs.
Trying a curl to the primary also results in the error "io.netty.handler.codec.DecoderException: not an SSL/TLS record".
In the meantime I will try the workaround suggested by @jugs: installing oak-segment-tar-1.24.0.jar.
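Since the error only shows up on the primary after something connects to the standby port, it can be easy to miss. Here is a small Python sketch that scans log lines for the message quoted in this thread. The function name is hypothetical, and which log file the line ends up in depends on your logging setup.

```python
def find_tls_decoder_errors(log_lines):
    """Yield (line_number, line) for lines carrying the error seen in this thread."""
    needle = "not an SSL/TLS record"
    for number, line in enumerate(log_lines, start=1):
        if needle in line:
            yield number, line.rstrip("\n")


# Example with a path you would substitute yourself:
# with open("path/to/primary.log", encoding="utf-8", errors="replace") as handle:
#     for number, line in find_tls_decoder_errors(handle):
#         print(number, line)
```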
Try v1.22.9 of oak-segment-tar, which is what Adobe Support gave us and said will be in the next service pack.
It worked for us as well.
As mentioned earlier in this thread, the bug resides in a third-party library called netty that oak-segment-tar uses. Note that this bug was fixed in netty version 4.1.68.
Ensure that whatever version of oak-segment-tar you're using incorporates a non-faulty version of netty. Netty versions 4.1.68 and 4.1.14 are the ones I've experimented with, and both seem to work with no dependency issues. oak-segment-tar v1.24.0 would be your best bet if you haven't heard back from Adobe yet (in case you have an open Adobe support ticket). I remember them sending us a faulty build of oak-segment-tar v1.22.9 that was using netty 4.1.66. They updated the netty version to 4.1.68 and resent the jar, which then resolved the issue for us.
Hope that helps. Please let us know how that works out for you. Would be really helpful to people facing the same problem, given that there's barely any documentation out there. Thanks!
I just double-checked the oak-segment-tar v1.22.9 that's available on the Maven Central repository. It uses netty v4.1.66, which contains the bug. Attaching a screenshot here for your reference.
I'd recommend trying oak-segment-tar v1.24.0. In the meantime, you could create an Adobe support ticket, if you haven't already, and ask Adobe for a patched version of oak-segment-tar v1.22.9.
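If you want to check a given oak-segment-tar jar yourself before installing it, something along these lines might help. This is a sketch under an assumption I have not verified for every build: that the jar carries netty's Maven metadata at META-INF/maven/io.netty/<artifact>/pom.properties. If it doesn't, the function simply returns an empty dict and you would need to look at the jar's manifest or POM instead. Both function names are hypothetical.

```python
import re
import zipfile

_NETTY_POM = re.compile(r"META-INF/maven/io\.netty/([^/]+)/pom\.properties")


def embedded_netty_versions(jar_path):
    """Return {artifactId: version} for netty pom.properties entries in a jar."""
    found = {}
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            match = _NETTY_POM.fullmatch(name)
            if not match:
                continue
            text = jar.read(name).decode("utf-8", errors="replace")
            for line in text.splitlines():
                if line.startswith("version="):
                    found[match.group(1)] = line.split("=", 1)[1].strip()
    return found


def version_tuple(version):
    """Turn a string like '4.1.66.Final' into (4, 1, 66) for comparison."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


# Example with a jar you have downloaded yourself:
# for artifact, version in embedded_netty_versions("oak-segment-tar-1.22.9.jar").items():
#     ok = version_tuple(version) >= (4, 1, 68)
#     print(artifact, version, "ok" if ok else "older than 4.1.68")
```

The 4.1.68 threshold comes from the fix version mentioned above. It would flag 4.1.14 as older even though that version also worked for me, so treat the comparison as a hint rather than a verdict.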