
SOLVED

TarMK replication is not working

Level 3

Hi All,

We use the TarMK replication process to replicate from our primary author to two standby authors. The following activities were in progress on the servers:

  • Backup of the DAM assets. The backups were built in small packages and were deleted after downloading.
  • Ran offline compaction successfully on all three servers (a rough sketch of this step follows the list).
  • Auto-injection of the pages.
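
A minimal sketch of how such an offline compaction run typically looks with the oak-run tool; the paths, heap size and oak-run jar name here are placeholders (match the jar to your AEM's Oak version), not values taken from this post, and the instance must be stopped first:

    crx-quickstart/bin/stop                                                                             # stop AEM before compacting
    java -Xmx4g -jar oak-run.jar checkpoints crx-quickstart/repository/segmentstore rm-unreferenced     # drop unreferenced checkpoints first
    java -Xmx4g -jar oak-run.jar compact crx-quickstart/repository/segmentstore                         # rewrite the tar files without old revisions
    crx-quickstart/bin/start                                                                            # start AEM again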

During the auto-injection we noticed that the TarMK log had stopped updating. Replication stopped completely for more than 10 hours, so we recycled all three servers. Replication went well for a couple of hours after the restart but then stopped again. Below are the TarMK logs:

16.08.2016 20:17:52.759 *INFO* [FelixStartLevel] org.apache.jackrabbit.oak.plugins.segment.file.FileStore TarMK closed: <path>/repos/segmentstore

16.08.2016 20:17:52.759 *ERROR* [defaultEventExecutorGroup-5-1] org.apache.jackrabbit.oak.plugins.segment.standby.client.StandbyClientHandler Exception caught, closing channel.
java.lang.IllegalStateException: null
        at com.google.common.base.Preconditions.checkState(Preconditions.java:134)
        at org.apache.jackrabbit.oak.plugins.segment.file.TarWriter.containsEntry(TarWriter.java:167)
        at org.apache.jackrabbit.oak.plugins.segment.file.FileStore.containsSegment(FileStore.java:814)
        at org.apache.jackrabbit.oak.plugins.segment.file.FileStore.containsSegment(FileStore.java:803)
        at org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStore.readSegment(StandbyStore.java:93)
        at org.apache.jackrabbit.oak.plugins.segment.SegmentTracker.getSegment(SegmentTracker.java:136)
        at org.apache.jackrabbit.oak.plugins.segment.SegmentId.getSegment(SegmentId.java:108)
        at org.apache.jackrabbit.oak.plugins.segment.Record.getSegment(Record.java:82)
        at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.getTemplate(SegmentNodeState.java:79)
        at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:447)
        at org.apache.jackrabbit.oak.plugins.segment.standby.client.SegmentLoaderHandler.initSync(SegmentLoaderHandler.java:105)
        at org.apache.jackrabbit.oak.plugins.segment.standby.client.SegmentLoaderHandler.channelActive(SegmentLoaderHandler.java:78)
        at org.apache.jackrabbit.oak.plugins.segment.standby.client.StandbyClientHandler.setHead(StandbyClientHandler.java:105)
        at org.apache.jackrabbit.oak.plugins.segment.standby.client.StandbyClientHandler.channelRead0(StandbyClientHandler.java:77)
        at org.apache.jackrabbit.oak.plugins.segment.standby.client.StandbyClientHandler.channelRead0(StandbyClientHandler.java:39)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32)
        at io.netty.channel.AbstractChannelHandlerContext$8.run(AbstractChannelHandlerContext.java:324)
        at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:36)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:745)

16.08.2016 20:37:58.132 *INFO* [FelixStartLevel] org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStoreService Component still not activated. Ignoring the initialization call

16.08.2016 20:37:58.160 *INFO* [FelixStartLevel] org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStoreService Initializing SegmentNodeStore with BlobStore [DataStore backed BlobStore [org.apache.jackrabbit.oak.plugins.blob.datastore.OakFileDataStore]]

16.08.2016 20:37:58.322 *INFO* [FelixStartLevel] org.apache.jackrabbit.oak.plugins.segment.file.FileStore TarMK opened: <path>/repos/segmentstore (mmap=true)

16.08.2016 20:37:58.473 *INFO* [FelixStartLevel] org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStoreService SegmentNodeStore initialized

16.08.2016 20:37:59.134 *INFO* [FelixStartLevel] org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService started standby sync with 159.231.172.44:40901 at 5 sec.

16.08.2016 22:53:21.200 *ERROR* [defaultEventExecutorGroup-4-1] org.apache.jackrabbit.oak.plugins.segment.standby.client.SegmentLoaderHandler Exception caught, closing channel.
io.netty.handler.timeout.ReadTimeoutException: null

Please advise.

11 Replies

Employee

Hi Swathi,

First, let's sort out the terminology: it's not replication that is used in a cold standby setup. There is a sync process, initiated from the standby, that pulls the latest head revision from the primary instance.
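
For context, that sync is driven by the StandbyStoreService OSGi configuration on both sides. A minimal sketch, with placeholder host, port and IP range (not values from this thread) and property names as per the cold standby documentation linked at the end of this reply:

    On the primary, crx-quickstart/install/org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.config:
        mode="primary"
        port=I"8023"
        primary.allowed-client-ip-ranges=["10.0.0.0/24"]

    On each standby, the same PID:
        mode="standby"
        primary.host="10.0.0.10"
        port=I"8023"
        interval=I"5"

The interval is what shows up in your log as "started standby sync with ... at 5 sec".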

If you perform any heavy activity on your primary server, I would recommend you wait for this to complete before stopping the primary instance to perform offline compaction.

Unfortunately the docs are incorrect: you should not be running offline compaction on the standby instances. Instead, just as when you are installing a hotfix or service pack, you should re-create your standby instances after running offline compaction on the primary instance. This error has been raised internally, and the docs team should be updating the documentation shortly.
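
A rough sketch of that re-creation step, assuming a default crx-quickstart layout and an external datastore as in your setup (hostname and paths are placeholders, not values from this thread):

    crx-quickstart/bin/stop                          # stop the standby
    rm -rf crx-quickstart/repository/segmentstore    # discard the old standby segment store
    rsync -a primary-host:/path/to/author/crx-quickstart/repository/segmentstore/ crx-quickstart/repository/segmentstore/    # copy the freshly compacted primary segment store (keep the primary stopped during the copy)
    crx-quickstart/bin/start                         # start the standby; it syncs forward from this state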

There have been cases in the past where it takes a long time for the standby to sync with the primary after running offline compaction on the primary, which is why re-creating the standby is recommended. Another issue was that the standby instance would grow several times larger than the primary instance.

Regards,

Opkar

[0] https://docs.adobe.com/docs/en/aem/6-2/deploy/recommended-deploys/tarmk-cold-standby.html

Level 7

Nice explanation, Opkar. I too was wondering why Adobe suggested not to run offline compaction on the standby instance in our project.

Have you found the root cause for this issue, i.e. why the standby size increases if we run offline compaction on the standby?

Employee

Hi,

According to the Oak JIRA issue below, it's because the result is roughly "size = old standby size + primary size", hence at least a doubling in size (for example, a 2 GB standby syncing against a 2 GB compacted primary can end up around 4 GB). Running compaction on the standby will most likely not produce the same repository as running compaction on the primary, so in essence the primary repository is copied across to the standby, even after running compaction on the standby.

Regards,

Opkar

[0] https://issues.apache.org/jira/browse/OAK-2535

Level 7

But as mentioned in the JIRA ticket https://issues.apache.org/jira/browse/OAK-2535, this issue is fixed and closed. Does this mean the issue is fixed from AEM 6.2 onwards, or is it still there?

Employee

The ticket was just meant to show the reason behind the growth. The issue is still present in 6.2, so there may be other relevant issues in JIRA which are still to be resolved.

Level 3

Hi Opkar,

Thank you very much for the reply. Can you please clarify the inline responses below?

"If you perform any heavy activity on your primary server, I would recommend you wait for this to complete before stopping the primary instance to perform offline compaction." - If you are referring to the DAM packages built on the primary author: yes, we had finished downloading and deleting that package by 11:30, and offline compaction ran at 3:00 AM. The second time, after we restarted the servers, the sync process went fine for a couple of hours and then stopped, as mentioned in my previous post.

"Instead, just as when you are installing a hotfix or service pack, you should re-create your standby instances after running offline compaction on the primary instance." - We were not installing any hotfix or service pack. Rather, we have been running offline compaction on all three servers for the past 5 months. As you mentioned, it always takes a long time for the standby to sync with the primary after running offline compaction on the primary, but we haven't seen it take this long; it's been more than 3 days now. If we keep waiting, how long should we expect to wait? Please suggest.

We have scheduled offline compaction on our servers twice a week, so it would not be possible to re-create the standby every time we run offline compaction. We haven't noticed any growth in the standby authors either.

Employee

Hi Swathi,

Can I ask why you are running offline compaction twice a week? The guidance in the docs is every two weeks, but it can be sooner if you have a lot of activity on your instance and the size of the primary is growing rapidly. The guideline is to compact when your repository is twice its original size or you have reached 50% of your disk space. How big is your repository, and are you using an external file datastore?

If the sync process has not finished after 3 days, I would suggest getting in touch with Daycare about this.

Regards,

Opkar

Level 3

Hi Opkar,

We run offline compaction weekly on all primary and standby authors, as per a suggestion from Adobe in a Daycare ticket. Yes, we use an external file datastore. The repository sizes are as below:

Standby Author:

  • datastore: 28G
  • segmentstore: 2.2G
  • /apps/logs: 40G total, 5.1G used, 33G available (14% used)
  • /apps/aemdomains: 256G total, 50G used, 194G available (21% used)

Primary Author:

  • datastore: 28G
  • segmentstore: 2.2G
  • /apps: 976M total, 286M used, 640M available (31% used)
  • /apps/aemdomains: 255G total, 76G used, 167G available (32% used)
  • /apps/logs: 40G total, 3.6G used, 34G available (10% used)
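
For reference, figures like these are typically gathered with du -sh (datastore, segmentstore) and df -h (the mount-point lines); a sketch with assumed paths, not taken from this post:

    du -sh /path/to/datastore crx-quickstart/repository/segmentstore
    df -h /apps/logs /apps/aemdomains

In the mount-point lines above, the columns read as total size, used, available and use%.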

Employee

Hi Swathi,

I would definitely raise a Daycare ticket about the long sync time; it shouldn't take days to sync the standby instance.

Regards,

Opkar

Correct answer by
Level 7

Hi Swathi,

I have tried to replicate the mentioned scenario; below are my findings:

Test case: Ran offline TAR compaction on the standby.

Result: After 3 days the sync had not started, and the standby size is lower than the primary's.

So, as advised by Opkar, it is better and safer to copy the primary repo across to the standby instance.

Note: I am using AEM 6.1.

Level 7

Hi Opkar,

I have a question regarding taking a backup of the primary.
1. When taking a backup (crx-quickstart) of the primary instance, should we stop the standby?
OR
2. Do we just need to stop the sync via JMX on the standby (with the standby instance still running) and take the backup of the primary?

Can you please update?