Adobe Experience Manager Sites & More

Rob_Romito · 10/15/15

Hi Everyone. We're having some trouble with our prod CQ 5.6.1 SP1 environment related to login tokens.

We have a 1 author node / 2 publish node setup, TAR PM, no clustering. We have a caching dispatcher in front of each publish node and use a custom login module. Everything has been working great for a couple of years. The problem started occurring a couple of months ago and occurs almost daily at this point.

At arbitrary times during the day on one of the publish nodes, the repo will start to fill up. A new tar file will be written almost every minute. When I look at the change history, I can see a user under /home/users/[s]/[so]/[som]/[some random user] who will have hundreds of login tokens. Tokens will keep on being added until I restart the node. Sometimes after restarting the exploding node, the other publish node will start exploding with the same user.

Has anyone ever seen something like this before?

When I look in the logs, the only thing I see are some lines like this. I don't know if they're related to the problem or not. Nothing stood out on Google.

error.log.2015-08-31:31.08.2015 21:20:10.801 *WARN* [pool-5-thread-5] org.apache.jackrabbit.core.ItemSaveOperation /home/users/[users different from the exploding .tokens user]/.tokens: failed to restore transient state

I appreciate any guidance I can get on this one. Thanks.

Jörg_Hoh · 10/15/15

Hi,

when this situation occurs, can you create some threaddumps to get insight, what function is going crazy? Also: It's quite unlikely that application behaviour suddenly changes without any interaction from the outside. So what changes (even very minor ones) have you executed in the days before this phenomenon started to appear?

kind regards;
Jörg

View solution in original post

Jörg_Hoh · 10/15/15

Hi,

when this situation occurs, can you create some threaddumps to get insight, what function is going crazy? Also: It's quite unlikely that application behaviour suddenly changes without any interaction from the outside. So what changes (even very minor ones) have you executed in the days before this phenomenon started to appear?

kind regards;
Jörg

Rob_Romito · 10/15/15

Thanks for the tips. I'll run a a thread dump next time it happens and see if anything stands out.

I agree that something must have changed in our system. We do a lot of syncing with AD and I think somehow the problem lies in there.

I'm also in the middle of testing out 5.6.1 SP2. The release notes mention login tokens a couple of times. I'm *hoping* it will help out. SP 2 is supposed to have all of the previously released hotfixes, so we'll see.

sriv57192237 · 10/15/15

Hi,

What is the user volume now and month before - possible number of user nodes in publish?

Is there a possibility the users have increased over a period of time and AEM being a append only trying to save parent with many child nodes causing this issue - not sure but trying to get a different view as this being CRX2 (5.6.1)

Also look into thread dumps from these

1) any user node create / update flow

2) any user related workflows

Is there any change to actual user node which was created and it is not found subsequent flows. This warn comes from possibly below operation look like some user node inconsitency

https://svn.apache.org/repos/asf/jackrabbit/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbi...

3) How effective is tar optimization & also run a repository consistency check for the user nodes.

Also similar warn reported in jackrabbit for session save failure - so user node save failure could also be a possibility - not sure http://osdir.com/ml/users.jackrabbit.apache.org/2011-06/msg00096.html

Rob_Romito · 10/15/15

Hi Sri. Thanks for your feedback.

The number of user nodes is pretty consistent, but there is a lot of turn over. For most users that are added, a similar number of users will be deleted. So we have a lot of writes, but a small amount of growth.

I may have found the problem, but it's still very early so I'm hesitant to claim victory.

In addition to the exploding .tokens problem, we were also having a problem with reverse replication with one of our publish nodes. I did a bunch of research and ended up manually restarting the granite worfklow core bundle on the bad publish node. This fixed the reverse replication problem. I did this Wednesday afternoon and the .token explosion hasn't happened since.

We have a couple of workflows connected to user logins, so it's feasible that the .token explosion was a symptom of the underlying worfkflow issue.

I'm going to monitor the issue for a few days. I'll update the post then.