AEM 6.3: Author -> dispatcher flush also executes replication event to publishers

Level 5

Let me first describe our 2 agents on our author environment:

- Dispatcher flush agent (we have a dispatcher in front of our author environment to add an extra layer of caching for the content authors)

- Default agent (used for publishing to the publish environment)

NOTE: obviously 2 different entities.

The dispatcher flush agent is new in our setup and gets triggered on modification. After adding the agent, our content authors complained that their changes were automatically published for some reason. After some investigation/debugging of com.day.cq.replication.impl.ReplicatorImpl, it seems that there is also a replication event fired by the RolloutManager with the following ReplicationOptions (note: no AgentIdFilter or ReplicateOnModification filter):

ReplicationOptions{synchronous=false, revision='null', suppressStatusUpdate=false, suppressVersions=false, filter=null, aggregateHandler=null}

With no filter, all enabled agents are triggered, so the path gets replicated to the publishers, which is NOT what we want...
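To illustrate why a null filter matters, here is a minimal Java sketch of filter-based agent selection. It is a simplified model, not the AEM implementation: AgentModel, AgentSelectionSketch, select and demoAgents are hypothetical names, and the model ignores trigger flags such as "Ignore default".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a replication agent; NOT the AEM Agent interface.
final class AgentModel {
    final String id;
    final boolean enabled;

    AgentModel(String id, boolean enabled) {
        this.id = id;
        this.enabled = enabled;
    }
}

final class AgentSelectionSketch {

    // Returns the ids of the agents that would handle a replication request.
    // Assumption: a null filter means "no restriction", so every enabled agent matches.
    static List<String> select(List<AgentModel> agents, Predicate<AgentModel> filter) {
        List<String> selected = new ArrayList<>();
        for (AgentModel agent : agents) {
            if (agent.enabled && (filter == null || filter.test(agent))) {
                selected.add(agent.id);
            }
        }
        return selected;
    }

    // Two enabled agents, mirroring the author setup described in this thread.
    static List<AgentModel> demoAgents() {
        return Arrays.asList(new AgentModel("publish", true), new AgentModel("flush", true));
    }

    public static void main(String[] args) {
        // No filter: both agents are selected.
        System.out.println(select(demoAgents(), null)); // prints [publish, flush]
        // Filter on the agent id: only the flush agent is selected.
        System.out.println(select(demoAgents(), a -> a.id.equals("flush"))); // prints [flush]
    }
}
```

In this model, a request that carries an id-based filter reaches only the flush agent, while a request with filter=null (like the one logged above) reaches both agents.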

1669544_pastedImage_0.png

It seems to me that the combination of a dispatcher flush agent and default agent is not working properly. Is anyone experiencing the same problem?

21 Replies

Level 10

Since the flush agent is on author, could you validate that the "Triggers" tab of the flush agent doesn't have any checkbox turned on esp. Ignore Default or On Modification?

Are you doing programmatic replication/flushing or any workflow based replication/flushing of the content using ReplicationOptions?

Level 5

This is our triggers tab: 1670072_pastedImage_0.png
And no workflows are doing such a thing in our codebase at the moment. Wouldn't I see our custom class if we had one when debugging?

Employee Advisor

In any case, the RolloutManager always fires a replication event if it's configured that way, regardless of whether a flush agent is configured (or even a combination of standard replication agent and flush agent).

I assume that it's just a coincidence of these 2 activities, and that the flush agent has nothing to do with it. Have you changed your rollout configuration recently to include replication?

Jörg

Level 5

I am testing it on a local machine with a default and a dispatcher flush agent pointing to our acceptance env., so it can't be a coincidence: I see it every time and I am the only one testing on my local machine. We have the following rollout configs on our livecopy:

- /etc/msm/rolloutconfigs/pushonmodify
- /etc/msm/rolloutconfigs/activate
- /etc/msm/rolloutconfigs/deactivate

We had the above config from the start of the project and never had any pages publish automatically on edit. We only want the dispatcher flush on edit for now.

Employee Advisor

Ah, that's a tough one then. But if you can replicate it, it's getting easier :-)

So you think that the invalidation agent is also triggering the default replication agent, which is then publishing the changed page?

Level 5

Something like that is happening I think. Not sure how to fix it.

Employee Advisor

How is your "invalidation" agent configured on author? You should have these checkboxes ticked on the "Triggers" tab:

* Ignore default

* On Modification

* No Status Update

* No Versioning

This should prevent a number of unwanted side effects on regular processes (especially "no status update" and "no versioning"). On the other hand, every invalidation request will also trigger an OSGI event about a successful replication event and I don't think that this can be suppressed.

Then again, I am not aware of any side effect of such an invalidation agent; it will definitely not trigger any other replication agent.

The problem you face might be caused indirectly by this. Does your code react to these OSGI events?

Jörg

Level 5

I've updated the triggers for the invalidation agent, but it still seems to trigger a replication for some reason. If it were our own code, wouldn't I be able to see it while debugging?

When I set the logging level for the agent to info I can see the following: https://pastebin.com/raw/kc6tVY4r On line 14 you can see 'com.day.cq.replication.impl.ReplicatorImpl' setting up a replication with no options for some reason:

*INFO* [Thread-1352] com.day.cq.replication.impl.ReplicatorImpl Setting up replication with options: ReplicationOptions{synchronous=false, revision='null', suppressStatusUpdate=false, suppressVersions=false, filter=null, aggregateHandler=null} 

I see no custom code intervening whatsoever.

UPDATE: I tested something on 2 different livecopies. On LiveCopy A and B I had 3 rollout configurations set:

- /etc/msm/rolloutconfigs/pushonmodify

- /etc/msm/rolloutconfigs/activate

- /etc/msm/rolloutconfigs/deactivate

When I removed '/etc/msm/rolloutconfigs/activate' from LiveCopy A, only LiveCopy B got automatically published when modifying the page. Of course we need '/etc/msm/rolloutconfigs/activate' to be there for every LiveCopy, because we don't want our authors to publish each LiveCopy page individually. This really seems to be an AEM related bug.

Level 10

I'm able to reproduce the behavior you mentioned only if I turn on "On modification" checkbox of either publish agent or flush/invalidation agent and this makes sense because 'pushonmodify' would trigger 'onModify' event which these agents would capture and trigger the replication & flush accordingly.

If 'on modification' is left unchecked on both the publish agent and the invalidation agent, I do not see any publish activity with either the 'pushonmodify' or the 'activate' rollout unless the author publishes the content manually. Even removing 'pushonmodify' and doing a manual rollout doesn't trigger publish activity.

A couple of questions/tasks:

  • How have you configured the flush agent to be triggered automatically on modification as you have mentioned in the description? Is it via "On modification" checkbox under "Triggers" tab or something else?
  • Do you have "On modification" checkbox turned on under "Triggers" tab for default publish agent?
  • Could you disable both default publish agent and dispatcher flush agent and check the blocked queues for both agents - http://localhost:4502/etc/replication/agents.author/flush.html  and http://localhost:4502/etc/replication/agents.author/publish.html
    • Validate the user name that triggers the replication and flush? Is it 'msm-service' or 'replication-service' or some other?
    • Validate the sequence of tasks for flush and replication. If the pushonmodify rollout/msm triggers that replication, then you should see the user name as 'msm-service'. If it's triggered by the flush agent, then it would be with user 'replication-service'
  • Just to be clear - If you remove/disable the flush/invalidation agent then you do not observe this kind of auto-replication behavior? You could validate same with above mentioned step of checking publish agent queue/logs.
  • Could you remove 'pushonmodify' from live copies and do a manual rollout from source page and check the behavior if automated publish still happens?

reference - MSM Best Practices,

onModify

When using the rollout trigger onModify you should consider that:

  • Automating rollouts with onModify triggers may have a negative impact on authoring performance as they trigger rollouts after every page modification.
  • The rollout result may differ from the one expected as:
    • You cannot specify the order of the resulting modify events.
    • The event-based architecture cannot guarantee the sequence of the events passed to the Rollout Manager.
  • Using such a rollout configuration could lead to commit conflicts if concurrent updates of the same resource occur.

Therefore, it is recommended that you only use onModify triggers if the benefits of automatic rollout initiation outweigh any potential performance issues.

Level 5

How have you configured the flush agent to be triggered automatically on modification as you have mentioned in the description? Is it via "On modification" checkbox under "Triggers" tab or something else?

The dispatcher flush agent is the only agent that has the trigger 'On Modification' checked on its triggers tab.

Do you have "On modification" checkbox turned on under "Triggers" tab for default publish agent?

The publish agent has no triggers checked on its triggers tab.

Could you disable both default publish agent and dispatcher flush agent and check the blocked queues for both agents http://localhost:4502/etc/replication/agents.author/flush.html  and http://localhost:4502/etc/replication/agents.author/publish.html

  • Validate the user name that triggers the replication and flush? Is it 'msm-service' or 'replication-service' or some other?
    • NOTE: I have not disabled the agents but just changed their target IPs, because disabling them resulted in no queued items at all for either of them.
    • Editing a component
      • Dispatcher flush agent
        • component-edit-dispatcher-flush-agent.png
      • Publish agent

        • component-edit-publish-agent.png
    • Publishing the page
      • Dispatcher flush agent
        • publising-dispatcher-flush-agent.png
      • Publish agent
        • publising-publish-agent.png
  • Validate the sequence of tasks for flush and replication. If the pushonmodify rollout/msm triggers that replication, then you should see the user name as 'msm-service'. If it's triggered by the flush agent, then it would be with user 'replication-service'
    • See answer above, both 'msm-service' and 'replication-service' are listed.

Just to be clear - If you remove/disable the flush/invalidation agent then you do not observe this kind of auto-replication behavior? You could validate same with above mentioned step of checking publish agent queue/logs.

    • Editing a component
      • Dispatcher flush agent
        • No items on queue due to the agent being disabled
      • Publish agent
        • NO item is being added to the queue
    • Publishing a page
      • Dispatcher flush agent
        • No items on queue due to the agent being disabled
      • Publish agent
        • publishing2-publish-agent.png

It seems that when the dispatcher flush agent is disabled, I indeed don't see the auto-replication behaviour on modification. Only when I actually publish the page is the corresponding item added to the publish agent queue.

Could you remove 'pushonmodify' from live copies and do a manual rollout from source page and check the behavior if automated publish still happens?

When I replace my LiveCopy rollout configuration:

/etc/msm/rolloutconfigs/activate

/etc/msm/rolloutconfigs/deactivate

/etc/msm/rolloutconfigs/pushonmodify

with:

/etc/msm/rolloutconfigs/activate

/etc/msm/rolloutconfigs/deactivate

/etc/msm/rolloutconfigs/default

and trigger a manual rollout using http://localhost:4502/etc/blueprints.html , the auto-replication behaviour is NOT happening anymore. We benefit a lot from the '/etc/msm/rolloutconfigs/pushonmodify' config, and we are aware of the potential negative performance impact.

Level 10

Editing a component -

pushonmodify rollout pushed the content changes from source to live copies and then generated 'onmodify' event which was captured by flush agent and hence you see two invalidation requests with 'replication-service' triggered by flush agent.

I'm not sure why 'webservice-support-replication' got triggered for the child pages. Here's the explanation & fix for same [1]. Now 'msm-service' publish and invalidation request got triggered from the 'onmodify' trigger that was launched by 'pushonmodify' rollout. I believe this is what you mentioned as the problem statement. I'm not sure if this is a bug or expected functionality.

On publish-

Since the admin user triggered the publish, the first row is fine. In this case 'webservice-support-replication' [1] still got triggered. The rows with 'msm-service' are fine, as it was a publish + invalidation request.

[1] - Flush replications are triggered by webservice-support-replication user | AEM 6.x

To my knowledge, the common listener for the publish agent and the invalidation agent doesn't distinguish which action to take if the agent filter is null (which is the case here for the onmodify event). Probably Jörg Hoh can help here.

The stack trace for 'Thread-1349' might be waiting on the agent itself because the queue was blocked.

Level 5

"Now 'msm-service' publish and invalidation request got triggered from the 'onmodify' trigger that was launched by 'pushonmodify' rollout. I believe this is what you mentioned as the problem statement. I'm not sure if this is a bug or expected functionality." Indeed I am not sure as well :S

Employee Advisor

Strange. I see that the thread "ReplicateOnModification Processor" performs a replication with options and a filter set. That looks good. On the other hand, in line 14 a replication is set up without any filter (as you mentioned), but this is done by "Thread-1352". In AEM and Sling I would expect the product to use threadpools to manage such threads; an unmanaged thread looks a bit suspicious to me. Can you try to repeat your scenario and take thread dumps to get a full stacktrace? That can give us a further indication of the root cause.

Level 5

When I navigate to http://localhost:4502/system/console/status-jstack-threaddump and repeat the scenario and look for 'Thread-1349' I find the following:

"Thread-1349" #3948 daemon prio=1 os_prio=31 tid=0x00007fa7c8aeb000 nid=0x16107 waiting on condition [0x00007000112a1000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x000000074d6967e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I am not sure if this is of any use.

Employee Advisor

That's the stacktrace of an unused thread in a threadpool. To me it looks like this threadpool was not created using Sling mechanics; otherwise the thread name would be more meaningful (something like "pool-20-thread-7") or even a speaking name ("oak-observation-1"). Are you sure that this thread is the one in question?

On the other hand, I see a "thread-25" here on my local AEM 6.4 instance which has the same stacktrace. That means it can still be part of the product. Let me investigate that.
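As a rough illustration of the naming point, the plain-JDK sketch below runs one task on a pool whose ThreadFactory creates unnamed threads (these get JDK default names of the form "Thread-N") and on a pool using Executors.defaultThreadFactory() (names of the form "pool-N-thread-M"). ThreadNameSketch and workerName are hypothetical names, and nothing here is taken from AEM or Sling code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

final class ThreadNameSketch {

    // Runs one task on a single-thread pool built with the given factory
    // and returns the name of the worker thread that executed the task.
    static String workerName(ThreadFactory factory) {
        ExecutorService pool = Executors.newFixedThreadPool(1, factory);
        try {
            Callable<String> task = () -> Thread.currentThread().getName();
            return pool.submit(task).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            // Without shutdown the idle worker would keep waiting for new tasks.
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Factory that creates unnamed threads: the JDK assigns "Thread-N".
        System.out.println(workerName(Thread::new));
        // JDK default factory: names follow "pool-N-thread-M".
        System.out.println(workerName(Executors.defaultThreadFactory()));
        // Factory that assigns an explicit, speaking name (hypothetical name).
        System.out.println(workerName(r -> new Thread(r, "demo-worker-1")));
    }
}
```

A pool worker created this way that has no work sits parked in LinkedBlockingQueue.take via ThreadPoolExecutor.getTask, which is the same shape as the stacktrace posted above.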

Level 5

I investigated the empty ReplicationOptions thread using VisualVM:

[Thread-3788] com.day.cq.replication.impl.ReplicatorImpl Setting up replication with options:

ReplicationOptions{synchronous=false, revision='null', suppressStatusUpdate=false, suppressVersions=false, filter=null, aggregateHandler=null}

1674304_pastedImage_2.png

I see no custom code interfering, it seems like it's all part of the product.

Employee Advisor

Your dump indicates to me that the MSM rollout invokes the TargetActivateActionFactory, which invokes replication. And this replication might trigger all agents. But that does not seem to match your initial problem description, which I read as: the invalidation agent (with the on-modification trigger set) causes the "standard" replication to get fired.

Level 5

Yes, when I enable it, it does trigger the other default replication for some reason. So it still seems like a bug to me.

Level 10

An old thread but this might help-

Replicating and controlling live copy replication

In this use case, the live copy had the rollout config 'Activate on blueprint activation'.