Highlighted

Problem with cache "Cache has closed down no further action is allowed"

Avatar

Avatar

diogomiranda

Avatar

diogomiranda

diogomiranda

07-08-2018

Hello,

I don't know if I'm asking this in the correct section but here it goes,

Cyclically, after a few days the server begins to give cache problems, the AEM Forms stops working, requiring a restart.

[6/4/18 12:12:06:456 WEST] 0000120f DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11527262566011) is still active but was recovered by another instance in the cluster.  This may cause inconsistent behavior.

[6/4/18 12:12:06:528 WEST] 0000005d ApplicationMo W   DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv01: Did not receive adequate CPU time slice. Last known CPU usage time at 12:11:08:593 WEST. Inactivity duration was 27 seconds.

[6/4/18 12:12:07:752 WEST] 0000120a UMCacheManage A   Cache has closed down no further action is allowed

[6/4/18 12:12:07:753 WEST] 0000120a UMCacheManage W   Exception caught while dealing with cache : Action - Get, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE

[6/4/18 12:12:07:771 WEST] 0000120a UMCacheManage A   Cache has closed down no further action is allowed

[6/4/18 12:12:07:771 WEST] 0000120a UMCacheManage W   Exception caught while dealing with cache : Action - Get, ObjectType - UM_ASSERTION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Local:UM_ASSERTION_CACHE

[6/4/18 12:12:07:780 WEST] 0000120a Reference     I org.apache.xml.security.signature.Reference verify Verification successful for URI "#cd7fe19fd72faaf5ca1291759053efe4"

[6/4/18 12:12:08:968 WEST] 0000120a ServiceRegist W   Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:09:601 WEST] 0000120a ServiceRegist W   Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:09:602 WEST] 0000120a ServiceRegist W   Cache get failed for service configuration provider.file_scan_service.1.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:09:951 WEST] 0000120a ServiceRegist W   Cache put failed for service configuration provider.file_scan_service.1.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:09:952 WEST] 0000120a ServiceRegist W   Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:008 WEST] 0000120a ServiceRegist W   Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:009 WEST] 0000120a ServiceFactor E   Error accessing the cache container - Error on GET action for cache Local:SERVICE_FACTORY_CACHE

[6/4/18 12:12:10:009 WEST] 0000120a ServiceFactor E   Error accessing the cache container - Error on PUT action for cache Local:SERVICE_FACTORY_CACHE

[6/4/18 12:12:10:013 WEST] 0000120a POJOCache     W   POJO factory cache lookup failed for service class.Error accessing the cache container - Error on GET action for cache Local:POJO_INVOKER_CACHE

[6/4/18 12:12:10:013 WEST] 0000120a POJOCache     W   POJO Invoker cache creation failed.Error accessing the cache container - Error on PUT action for cache Local:POJO_INVOKER_CACHE

[6/4/18 12:12:10:013 WEST] 0000120a ServiceRegist W   Cache get failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:014 WEST] 0000120a ServiceRegist W   Cache put failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:014 WEST] 0000120a ServiceRegist W   Cache get failed for service configuration system.endpoint_registry.0.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:10:018 WEST] 0000120a ServiceRegist W   Cache put failed for service configuration system.endpoint_registry.0.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:10:018 WEST] 0000120a ServiceRegist W   Cache get failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:019 WEST] 0000120a ServiceRegist W   Cache put failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

Do any of you know what can cause this problem? Is there a configuration that we can change so this does not happen?

Any help would be appreciated.

Thank you!

Replies

Highlighted

Avatar

Avatar

smacdonald2008

Total Posts

12.7K

Likes

1.4K

Correct Answer

2.3K

Avatar

smacdonald2008

Total Posts

12.7K

Likes

1.4K

Correct Answer

2.3K
smacdonald2008

07-08-2018

Arer you using Dispatcher for Caching?

Highlighted

Avatar

Avatar

Mayank_Gandhi

Employee

Avatar

Mayank_Gandhi

Employee

Mayank_Gandhi
Employee

07-08-2018

Hi,

We would need your complete infra details i.e app server, cluster, topology, db , OS etc.

Highlighted

Avatar

Avatar

DarrenBiz

Avatar

DarrenBiz

DarrenBiz

07-08-2018

You could be running into CPU starvation. The error message DCSV0004W relates to this [1]. You could also potentially see the error "HMGR0152W: CPU Starvation detected" in the logs as well?

Check your CPU utilisation and see if any processes are causing 100% utilisation. This could be caused by badly tuned watched folders attempting too many concurrent processes, but there are many other causes e.g.

[1] Help! My CPU is starved. What do I feed it? - IBM Business Process Management Products Support Blog ...

[2] https://www-01.ibm.com/support/docview.wss?uid=swg27042626&aid=1

Highlighted

Avatar

Avatar

diogomiranda

Avatar

diogomiranda

diogomiranda

10-08-2018

app server:

1546128_pastedImage_3.png

cluster, topology

2 application servers, 1 cluster

1546127_pastedImage_2.png


db:

DB2 V11.1

OS:

AIX 7200-02-01-1731

Highlighted

Avatar

Avatar

diogomiranda

Avatar

diogomiranda

diogomiranda

10-08-2018

@DarrenBiz

I never get that error "HMGR0152W: CPU Starvation detected", the pattern is always the same, one of application servers gives this 3 erros and then all the cache acess returns errors, the other application server still works.

It even starts to happen during the nigth when no one is using the platform...

1)

[8/10/18 13:26:12:207 WEST] 00001218 DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11533660400448) is still active but was recovered by another instance in the cluster.  This may cause inconsistent behavior.

2)

[8/10/18 13:26:12:380 WEST] 0000005b ApplicationMo W   DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv01: Did not receive adequate CPU time slice. Last known CPU usage time at 13:25:23:543 WEST. Inactivity duration was 18 seconds.

3)

[8/10/18 13:26:15:690 WEST] 000011a4 UMCacheManage A   Cache has closed down no further action is allowed

[8/10/18 13:26:15:691 WEST] 000011a4 UMCacheManage W   Exception caught while dealing with cache : Action - Put, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE

[8/10/18 13:26:16:312 WEST] 00003863 ServiceRegist W   Cache get failed for service DesignTimeServiceRegistry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

etc...

Other example at 3AM!

[8/5/18 3:26:00:033 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:034 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:036 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:037 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:040 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:043 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:045 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:056 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:107 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:108 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:109 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:109 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:124 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:139 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:141 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:142 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:33:06:228 WEST] 0000005a ApplicationMo W   DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv02: Did not receive adequate CPU time slice. Last known CPU usage time at 03:32:13:527 WEST. Inactivity duration was 22 seconds.

[8/5/18 3:33:13:391 WEST] 00001212 DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11532097190491) is still active but was recovered by another instance in the cluster.  This may cause inconsistent behavior.

[8/5/18 3:33:06:185 WEST] 0000005b CoordinatorCo W   HMGR0152W: A thread scheduling delay of 32 seconds was detected.

[8/5/18 3:33:16:124 WEST] 000011a4 UMCacheManage A   Cache has closed down no further action is allowed

[8/5/18 3:33:16:174 WEST] 000011a4 UMCacheManage W   Exception caught while dealing with cache : Action - Get, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE

[8/5/18 3:33:17:132 WEST] 00001210 Reference     I org.apache.xml.security.signature.Reference verify Verification successful for URI "#e53f9977ea98501cadf01ac8a0e487d6"

[8/5/18 3:33:18:628 WEST] 00001210 ServiceRegist W   Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[8/5/18 3:33:19:703 WEST] 00001210 ServiceRegist W   Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[8/5/18 3:33:19:703 WEST] 00001210 Service

What you mean by badly tunned watched folders?

I have three, two listening every 5 seconds and one every 10 seconds

Highlighted

Avatar

Avatar

diogomiranda

Avatar

diogomiranda

diogomiranda

10-08-2018

@smacdonald2008

Sorry, didn't understood what you mean with dispatcher for cache. At the moment we only use AEM Forms (Previously we used adobe livecycle 3 and we are migrating to AEM 6.3 forms)

We have installed AEM Forms and only imported the applications (workbench) that we had in Adobe Livecycle, no further change or development has been done using new functionalities.

If the product does that by default then yes, as we didn't change anything regarding cache.

Highlighted

Avatar

Avatar

DarrenBiz

Avatar

DarrenBiz

DarrenBiz

20-08-2018

Watched folders have a bunch of settings that can affect the way that they process contents and how much performance you extract from the system without starving it of CPU cycles. This mostly affects watched folders that are used for batched processes, but ad-hoc can benefit from tuning as well. When a watched folder executes to find files, it scans the whole watched folder directory. This scan takes up processing overhead, especially when there are lots of files in the folder, so you want to make sure you have finished processing that last batch of documents before scanning and picking up a batch of new documents to process. Thottling [1] can also help with this as it will not do another scan for files until the last batch has finished. You have a clustered instance so these jobs will be sent between the instances by the Quartz scheduler so you should get a lot more overhead here. You can also get some benefits from tuning the Scheduler service thread pool if you find yourself running into issues between the other services in the system [2]

Its hard to say what your exact problem is without looking into other things like other VMs running on your underlying infrastructure and if that is affecting anything. There might be another VM taking resources away from a cluster member causing it to be removed from the cluster.

[1] LiveCycle ES4 * About throttling

[2] LiveCycle ES4 * Performance and scalability

Highlighted

Avatar

Avatar

Mayank_Gandhi

Employee

Avatar

Mayank_Gandhi

Employee

Mayank_Gandhi
Employee

22-08-2018

Apart from the suggestions shared by Darren can you confirm the time difference between the two instances and if there is any Antivirus scanning the server folders or if you run any backup utility( for ex. veritas netbackup) on the servers?

Highlighted

Avatar

Avatar

diogomiranda

Avatar

diogomiranda

diogomiranda

23-08-2018

I think it is maybe the time difference, in production we have two clusters with one application server each.

we don't any adobe process in production yet and it is also giving similar errors.

just checked 20 seconds diff between them

waiting untill this problem happens again in this environment to check if it is or not.