Cold standby instance not syncing plus the primary instance says channel unregistered

Level 4

Hello there,

As part of our 6.5 upgrade, we decided to derive a standby instance from the primary author instance. I carefully followed the steps in the official Adobe documentation. However, when I go to the JMX console on the primary author, it says 'channel unregistered', while on the standby instance the status is stuck at 'initializing'. I've restarted both instances multiple times, only to run into the same problem again. Any pointers on how to troubleshoot this would be greatly appreciated. Thanks!

P.S. Tagging @milind_bachani and @Pavan_Kalyan for a quick response. As always, thanks in advance guys for being sooo helpful.

15 Replies

Level 4

Hi @Alisahali 

I believe you must be missing something in the process.

Please refer to the link below:

https://www.youtube.com/watch?v=Sr0n-P5PnhA

Let me know if you still face the same issue.

Thanks!

Level 4

Hey @Pavan_Kalyan. Unfortunately, I haven't had any luck resolving it, even though I followed every step as mentioned here. My last resort is to remove all the old config files on the primary instance, set up new configuration files, and then derive a standby instance from it. I'm going to try that in a few hours. Hopefully that resolves it.

Also, I see the 'writing segments' and 'loading segments' entries in the primary instance's tarmk-coldstandby.log file. However, all I see in the standby instance's logs is "head state did not change: skipping flush". Something is preventing the standby from establishing a TCP/IP connection with the primary instance. I noticed that the TCP port on our primary instance is active. Also, the entry for the standby instance (with the standby's client ID) is missing from the JMX console on our primary instance, which more or less confirms that the connection hasn't been established yet.

Level 4

Update: I did some more digging and realised I have all the right configurations in place. The TCP port 8088 (the one we're using instead of the default 8023) on the primary is active and listening. However, the standby isn't able to establish a connection with it. I tried using curl commands to verify whether a connection is possible. Any idea why this could be happening, or where to look within AEM to troubleshoot this problem (other than SegmentNodeStoreService.config and StandbyStoreService.config)?
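For anyone who wants to repeat the reachability check without curl, here is a minimal sketch in Python, run from the standby server. It only tests whether a plain TCP connection to the primary's standby port can be opened; it does not speak the cold standby protocol. The function name and the host name in the usage note are placeholders, not anything from AEM.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling can_connect("primary.example.internal", 8088) with your own primary's address should return True if the port is reachable from the standby. A True result only rules out firewall or routing problems; it says nothing about whether the two instances then agree on the protocol.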

Level 2

Hi, we seem to be having the same issue.

I have noticed that the standby never makes any requests to the primary, for some reason.

On the standby server, I ran curl -v http://ipaddress:8088 and got a response back indicating the port was open.

On the primary, I saw "ClientFilterHandler Client /10.x.x.x:randomhighport is allowed" in the log.

Then, on the following line: io.netty.handler.codec.DecoderException: not an SSL/TLS record.

This is despite the fact that we have configured a NON-SECURE cold standby.

I can't make sense of why an AEM 6.4 instance that has had an in-place upgrade to v6.5 is failing like this.
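To illustrate what "not an SSL/TLS record" means, here is a rough sketch in Python. The message is what netty reports when a handler that expects TLS receives bytes that do not start with a TLS record header. The check below is a simplification written for illustration, not netty's actual code, and the function name is made up.

```python
# TLS record content types: change_cipher_spec, alert, handshake, application_data.
TLS_CONTENT_TYPES = {20, 21, 22, 23}


def looks_like_tls_record(data: bytes) -> bool:
    """Rough check: do the first bytes resemble a TLS record header?"""
    # A TLS record header is 5 bytes: type, version major, version minor, length (2 bytes).
    if len(data) < 5:
        return False
    content_type, major = data[0], data[1]
    # All SSL 3.0 / TLS 1.x records carry major version 3.
    return content_type in TLS_CONTENT_TYPES and major == 3
```

A plaintext request such as b"GET / HTTP/1.1\r\n" fails this check because it starts with the letter G (0x47), whereas a TLS handshake starts with 0x16 0x03. So the error suggests that something on the primary is treating the incoming plaintext connection as if it should be TLS, even though the standby is configured as non-secure.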

Level 4

We noticed the exact same error this morning!! I'm so relieved that at least I can see a relevant error. It was frustrating having to just monitor things with no luck. Also, I created two fresh AEM 6.5 instances and setting up this connection between the two was a breeze, so that confirms that the steps I've been following were correct all along.

 

Also, please post here if you do manage to resolve it. Thanks

Level 2

Sounds like you're repeating the exact same tests and environment checks that we are.

If we have any success in isolating the exact root cause, we will share it.

At this point, our theory is that the http:// request from the standby server using curl is hitting the primary's http:// interface, and AEM is doing some very hidden rewrite/redirect to https://.

The curl request is NOT going through a system proxy either, so we are now focusing on the primary author and what it is doing.

Level 2

We haven't been successful in resolving this.

But we did a few things today to try to isolate it.

1: Built 2 new servers.

2: Set up AEM v6.4 on one as the 'primary', then shut it down and copied it to the 'standby'.

3: Configured the 'standby' and verified it was making requests to the primary and pulling data across.

4: Installed the latest v6.4 service pack and retested that the 'standby' was still synchronising.

5: Stopped the 'standby', then did an in-place upgrade on the 'primary' to v6.5.

6: Stopped the primary, copied it to the 'standby' server, and reconfigured it to start up as a 'standby'.

7: Started up the 'standby' server and checked the logs to see if the data was synchronising. It was.

So what we have proven is that a clean install of 6.4, followed by in-place upgrades, works with a cold standby.

But our original servers still have the original problem, which doesn't appear to be very easy to track down.

No matter what we have tried, we cannot find a valid reason why AEM is converting an http:// request to an https:// request and triggering an SSL error.

Level 2

We appear to have resolved this after a lot of testing.

We are waiting for Adobe Support to confirm and validate the bug and the fix, but for us the fix was to install oak-segment-tar-1.24.0.jar on the primary and then restart it.

Once it was up, we ran our curl command to see whether it would be rejected and whether an SSL error would occur.

For us it wasn't rejected and we got no SSL error, so we shut down the primary, copied it to another server to create a standby, and followed the known process.

As soon as we fired up the standby, it worked and synced as expected.

The problem apparently lies in the io.netty library, so v1.24.0 was compatible, it seems, with AEM v6.5.11, which we were using.

Be careful, and good luck. Hopefully more documentation comes through and a fix appears in a service pack.

Employee Advisor

Hi @Alisahali, sorry for the delayed response; I was away for a while.

Have you installed the latest service pack and tried again? If not, please refer to https://experienceleague.adobe.com/docs/experience-manager-65/release-notes/release-notes.html?lang=...
and try installing it.

Once done, restart and check. Let me know if that doesn't help. Many thanks.

Level 4

@milind_bachani yes we had already upgraded to 6.5.11.0 using the service pack, hence we can rule out the possibility of it being a service pack related issue. 

Employee Advisor

@Alisahali do you see any relevant error logs? If yes, can you share them? I am trying to replicate the issue. I disabled the JMX console provider, but in that case the JMX console disappears entirely and I am not able to reproduce the 'channel unregistered' error that you mentioned.

Level 1

Hi,

we ran into the same problem with a clean install of AEM 6.5 + SP12. Everything seems correct, but the standby does not seem to call the primary instance. I didn't find anything relevant in the standby or primary logs.

Trying a curl to the primary also results in the error "io.netty.handler.codec.DecoderException: not an SSL/TLS record".

In the meantime I will try the workaround suggested by @jugs, installing oak-segment-tar-1.24.0.jar.

Level 2

Hi,

 

Try v1.22.9 of oak-segment-tar, which is what Adobe Support gave us and said will be in the next service pack.

 

It worked for us as well.

Level 4

Hello there,

As mentioned earlier in this thread, the bug resides in a third-party library called netty that oak-segment-tar uses. Note that this bug was fixed in netty version 4.1.68.

Ensure that whatever version of oak-segment-tar you're using incorporates a non-faulty version of netty. netty versions 4.1.68 and 4.1.14 are the ones I've experimented with, and both seem to work with no dependency issues. oak-segment-tar v1.24.0 would be your best bet if you haven't heard back from Adobe yet (in case you have an open Adobe support ticket). I remember them sending us a faulty build of oak-segment-tar v1.22.9 that was using netty 4.1.66. They updated the netty version to 4.1.68 and resent the jar, which then resolved the issue for us.

 

Hope that helps. Please let us know how that works out for you. Would be really helpful to people facing the same problem, given that there's barely any documentation out there. Thanks!

Level 4

 

I just double-checked the oak-segment-tar v1.22.9 that's available on the Maven Central repository. It uses netty v4.1.66, which contains the bug. Attaching a screenshot here for your reference.

I'd recommend trying oak-segment-tar v1.24.0. In the meantime, you could create an Adobe support ticket, if you haven't already, and ask Adobe for a patched version of oak-segment-tar v1.22.9.
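If you'd rather check a jar you already have on disk (for example one sent by support) instead of the dependency list on Maven Central, here is a minimal Python sketch. It assumes the bundle embeds netty as nested jar files whose names carry the version, such as netty-codec-4.1.66.Final.jar. That layout is an assumption on my part, and the function name is made up; if netty is packaged differently in your jar, the function simply returns an empty set.

```python
import re
import zipfile

# Matches version strings such as "4.1.68.Final" inside an entry name.
VERSION_RE = re.compile(r"(\d+\.\d+\.\d+)\.Final")


def embedded_netty_versions(jar_path: str) -> set:
    """Collect netty versions that appear in the names of entries inside a jar."""
    versions = set()
    # A jar is a zip archive, so the standard zipfile module can list its entries.
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if "netty" not in name.lower():
                continue
            match = VERSION_RE.search(name)
            if match:
                versions.add(match.group(1))
    return versions
```

Run it as embedded_netty_versions("oak-segment-tar-1.22.9.jar") with the path to your own file. Going by what was reported in this thread, a result containing 4.1.66 would indicate the affected netty build, while 4.1.68 would indicate the fixed one.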

Alisahali_0-1646445037609.png