Level 3

Problem with cache "Cache has closed down no further action is allowed"

Forum|Forum|7 years ago
August 7, 2018
6 replies
17827 views

Hello,

I don't know if I'm asking this in the correct section but here it goes,

Cyclically, after a few days the server begins to give cache problems, the AEM Forms stops working, requiring a restart.

[6/4/18 12:12:06:456 WEST] 0000120f DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11527262566011) is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.

[6/4/18 12:12:06:528 WEST] 0000005d ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv01: Did not receive adequate CPU time slice. Last known CPU usage time at 12:11:08:593 WEST. Inactivity duration was 27 seconds.

[6/4/18 12:12:07:752 WEST] 0000120a UMCacheManage A Cache has closed down no further action is allowed

[6/4/18 12:12:07:753 WEST] 0000120a UMCacheManage W Exception caught while dealing with cache : Action - Get, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE

[6/4/18 12:12:07:771 WEST] 0000120a UMCacheManage A Cache has closed down no further action is allowed

[6/4/18 12:12:07:771 WEST] 0000120a UMCacheManage W Exception caught while dealing with cache : Action - Get, ObjectType - UM_ASSERTION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Local:UM_ASSERTION_CACHE

[6/4/18 12:12:07:780 WEST] 0000120a Reference I org.apache.xml.security.signature.Reference verify Verification successful for URI "#cd7fe19fd72faaf5ca1291759053efe4"

[6/4/18 12:12:08:968 WEST] 0000120a ServiceRegist W Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:09:601 WEST] 0000120a ServiceRegist W Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:09:602 WEST] 0000120a ServiceRegist W Cache get failed for service configuration provider.file_scan_service.1.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:09:951 WEST] 0000120a ServiceRegist W Cache put failed for service configuration provider.file_scan_service.1.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:09:952 WEST] 0000120a ServiceRegist W Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:008 WEST] 0000120a ServiceRegist W Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:009 WEST] 0000120a ServiceFactor E Error accessing the cache container - Error on GET action for cache Local:SERVICE_FACTORY_CACHE

[6/4/18 12:12:10:009 WEST] 0000120a ServiceFactor E Error accessing the cache container - Error on PUT action for cache Local:SERVICE_FACTORY_CACHE

[6/4/18 12:12:10:013 WEST] 0000120a POJOCache W POJO factory cache lookup failed for service class.Error accessing the cache container - Error on GET action for cache Local:POJO_INVOKER_CACHE

[6/4/18 12:12:10:013 WEST] 0000120a POJOCache W POJO Invoker cache creation failed.Error accessing the cache container - Error on PUT action for cache Local:POJO_INVOKER_CACHE

[6/4/18 12:12:10:013 WEST] 0000120a ServiceRegist W Cache get failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:014 WEST] 0000120a ServiceRegist W Cache put failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:014 WEST] 0000120a ServiceRegist W Cache get failed for service configuration system.endpoint_registry.0.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:10:018 WEST] 0000120a ServiceRegist W Cache put failed for service configuration system.endpoint_registry.0.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:10:018 WEST] 0000120a ServiceRegist W Cache get failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:019 WEST] 0000120a ServiceRegist W Cache put failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

Do any of you know what can cause this problem? Is there a configuration that we can change so this does not happen?

Any help would be appreciated.

Thank you!

This post is no longer active and is closed to new replies. Need help? Start a new post to ask your question.

smacdonald2008

Level 10

Arer you using Dispatcher for Caching?

diogomirandaAuthor

Level 3

@@smacdonald2008

Sorry, didn't understood what you mean with dispatcher for cache. At the moment we only use AEM Forms (Previously we used adobe livecycle 3 and we are migrating to AEM 6.3 forms)

We have installed AEM Forms and only imported the applications (workbench) that we had in Adobe Livecycle, no further change or development has been done using new functionalities.

If the product does that by default then yes, as we didn't change anything regarding cache.

Mayank_Gandhi

Adobe Employee

Apart from the suggestions shared by Darren can you confirm the time difference between the two instances and if there is any Antivirus scanning the server folders or if you run any backup utility( for ex. veritas netbackup) on the servers?

Mayank_Gandhi

Adobe Employee

Hi,

We would need your complete infra details i.e app server, cluster, topology, db , OS etc.

diogomirandaAuthor

Level 3

app server:

cluster, topology

2 application servers, 1 cluster

db:

DB2 V11.1

OS:

AIX 7200-02-01-1731

K

kjaeggin

Adobe Employee

Are you using AEM Forms 6.4? It's not clear what your version of AEM Forms is. I would have a look at the gemfire.logs, since its the gemfire cache that is shutting down. As Mayank points out, 20s is far too long a delay between nodes. the default timeout is 5000ms.

DarrenBiz

Level 6

You could be running into CPU starvation. The error message DCSV0004W relates to this [1]. You could also potentially see the error "HMGR0152W: CPU Starvation detected" in the logs as well?

Check your CPU utilisation and see if any processes are causing 100% utilisation. This could be caused by badly tuned watched folders attempting too many concurrent processes, but there are many other causes e.g.

[1] Help! My CPU is starved. What do I feed it? - IBM Business Process Management Products Support Blog Blog

[2] https://www-01.ibm.com/support/docview.wss?uid=swg27042626&aid=1

diogomirandaAuthor

Level 3

@DarrenBiz

I never get that error "HMGR0152W: CPU Starvation detected", the pattern is always the same, one of application servers gives this 3 erros and then all the cache acess returns errors, the other application server still works.

It even starts to happen during the nigth when no one is using the platform...

1)

[8/10/18 13:26:12:207 WEST] 00001218 DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11533660400448) is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.

2)

[8/10/18 13:26:12:380 WEST] 0000005b ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv01: Did not receive adequate CPU time slice. Last known CPU usage time at 13:25:23:543 WEST. Inactivity duration was 18 seconds.

3)

[8/10/18 13:26:15:690 WEST] 000011a4 UMCacheManage A Cache has closed down no further action is allowed

[8/10/18 13:26:15:691 WEST] 000011a4 UMCacheManage W Exception caught while dealing with cache : Action - Put, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE

[8/10/18 13:26:16:312 WEST] 00003863 ServiceRegist W Cache get failed for service DesignTimeServiceRegistry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

etc...

Other example at 3AM!

[8/5/18 3:26:00:033 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:034 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:036 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:037 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:040 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:043 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:045 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:056 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:107 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:108 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:109 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:124 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:139 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:141 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:142 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:33:06:228 WEST] 0000005a ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv02: Did not receive adequate CPU time slice. Last known CPU usage time at 03:32:13:527 WEST. Inactivity duration was 22 seconds.

[8/5/18 3:33:13:391 WEST] 00001212 DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11532097190491) is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.

[8/5/18 3:33:06:185 WEST] 0000005b CoordinatorCo W HMGR0152W: A thread scheduling delay of 32 seconds was detected.

[8/5/18 3:33:16:124 WEST] 000011a4 UMCacheManage A Cache has closed down no further action is allowed

[8/5/18 3:33:16:174 WEST] 000011a4 UMCacheManage W Exception caught while dealing with cache : Action - Get, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE

[8/5/18 3:33:17:132 WEST] 00001210 Reference I org.apache.xml.security.signature.Reference verify Verification successful for URI "#e53f9977ea98501cadf01ac8a0e487d6"

[8/5/18 3:33:18:628 WEST] 00001210 ServiceRegist W Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[8/5/18 3:33:19:703 WEST] 00001210 ServiceRegist W Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[8/5/18 3:33:19:703 WEST] 00001210 Service

What you mean by badly tunned watched folders?

I have three, two listening every 5 seconds and one every 10 seconds

DarrenBiz

Level 6

Watched folders have a bunch of settings that can affect the way that they process contents and how much performance you extract from the system without starving it of CPU cycles. This mostly affects watched folders that are used for batched processes, but ad-hoc can benefit from tuning as well. When a watched folder executes to find files, it scans the whole watched folder directory. This scan takes up processing overhead, especially when there are lots of files in the folder, so you want to make sure you have finished processing that last batch of documents before scanning and picking up a batch of new documents to process. Thottling [1] can also help with this as it will not do another scan for files until the last batch has finished. You have a clustered instance so these jobs will be sent between the instances by the Quartz scheduler so you should get a lot more overhead here. You can also get some benefits from tuning the Scheduler service thread pool if you find yourself running into issues between the other services in the system [2]

Its hard to say what your exact problem is without looking into other things like other VMs running on your underlying infrastructure and if that is affecting anything. There might be another VM taking resources away from a cluster member causing it to be removed from the cluster.

[1] LiveCycle ES4 * About throttling

[2] LiveCycle ES4 * Performance and scalability

diogomirandaAuthor

Level 3

Hi again,

I attach the GemFire logs.

Any idea?

https://drive.google.com/file/d/1iQHfjt8KMpr8hmHD199AqTVFt4vGhVA9/view?usp=sharing

https://drive.google.com/file/d/1RIpmGmJoqanrqWfpFy9WTUy4cifTaVF-/view?usp=sharing

Thank you!

K

kjaeggin

Adobe Employee

The gemfire logs show cache time offset as being too long and it shuts down:

[info 2018/09/04 15:45:25.546 WEST GC_UzmE <IDPSchedulerService_Worker-5> tid=0x120a] Initialization of region Replicated_FormsClusterCache completed

[warning 2018/09/05 05:09:12.703 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 207 ms from earlier offset.

[warning 2018/09/05 11:37:35.186 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 659 ms from earlier offset.

[info 2018/09/06 16:24:35.992 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] The negotiated cache time from shaemaq1(GC_UzmE:21103040)<v0>:35873 is 2018/09/06 16:24:41.357 WEST. Delta from the system clock is 5,656 milliseconds.

[warning 2018/09/06 16:26:16.174 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 5,656 ms from earlier offset.

[info 2018/09/06 17:52:19.546 WEST GC_UzmE <WebContainer : 8> tid=0x296f] Created disk store Global_DeployLockMonitorCache with unique id e48ef1f85c184519-94bb514ce53e7729

[info 2018/09/06 17:52:19.583 WEST GC_UzmE <WebContainer : 8> tid=0x296f] recovery region initialization took 1 ms

[info 2018/09/06 17:52:19.631 WEST GC_UzmE <WebContainer : 8> tid=0x296f] Initializing region Global_DeployLockMonitorCache

[info 2018/09/06 17:52:19.675 WEST GC_UzmE <WebContainer : 8> tid=0x296f] Initialization of region Global_DeployLockMonitorCache completed

[info 2018/09/07 15:11:27.081 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] The negotiated cache time from shaemaq1(GC_UzmE:21103040)<v0>:35873 is 2018/09/07 15:11:27.995 WEST. Delta from the system clock is 1,060 milliseconds.

[warning 2018/09/07 15:13:07.224 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 1,060 ms from earlier offset.

[warning 2018/09/10 11:33:20.203 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 346 ms from earlier offset.

[info 2018/09/10 17:02:28.485 WEST GC_UzmE <UDP mcast receiver> tid=0x11b3] Membership: received new view [shaemaq1(GC_6T7h:17170734)<v1>:16233|12] [shaemaq1(GC_6T7h:17170734)<v1>:16233/63465] crashed mbrs: [shaemaq1(GC_UzmE:21103040)<v0>:35873/14545]

[info 2018/09/10 17:02:28.903 WEST GC_UzmE <CloserThread> tid=0x13f45] CacheServer configuration saved

[severe 2018/09/10 17:02:28.906 WEST GC_UzmE <CloserThread> tid=0x13f45] Membership service failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: This member has been forced out of the distributed system by shaemaq1(GC_6T7h:17170734)<v1>:16233. Please consult GemFire logs to find the reason. (GMS shun)

[info 2018/09/10 17:02:29.191 WEST GC_UzmE <ReconnectThread> tid=0x13f45] GemFireCache[id = 2139782586; isClosing = true; isShutDownAll = false; closingGatewayHubsByShutdownAll = false; created = Mon Sep 03 17:42:19 WEST 2018; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.

diogomirandaAuthor

Level 3

I've added this two options:

-Dgemfire.member-timeout=15000 -Xgcpolicy:gencon

hopefully it will solve the problem, let's wait to see.

thanks to all.

https://helpx.adobe.com/aem-forms/kb/Error-accessing-the-cache-container-Error-on-PUT-action-for-cache-Local-SERVICE-FACTORY-CACHE-causing-cluster-to-crash.html

There are few options to resolve the issue:

1) Optimize GC collection on your system to make sure to complete within the Gemfire timeout. For the guidelines, refer to https://www.cubrid.org/blog/how-to-tune-java-garbage-collection.
2) Generate a Heap Dump when Full GC is executed and analyze which objects are being heavily used. Use it as a basis to optimize your workflow.
3) Increase the Gemfire Membership Timeout to a value high enough to allow the Full GC to complete. For example, the JVM argument below increases the timeout to 15 s which covers the situation found in the GC logs above. For more information, refer to https://pubs.vmware.com/vfabric5/index.jsp?topic=/com.vmware.vfabric.gemfire.6.6/reference/topics/gemfire_properties.html:

“-Dgemfire.member-timeout=15000”

diogomirandaAuthor

Level 3

I confirm this problem is solved by adding

-Dgemfire.member-timeout=15000

-Xgcpolicy:gencon

Sign up

Login with SSO

Login to the community

Login with SSO

Scanning file for viruses.

This file cannot be downloaded