Problem with cache "Cache has closed down no further action is allowed"

Level 4

Hello,

I don't know if I'm asking this in the correct section, but here goes:

Every few days, the server starts throwing cache errors and AEM Forms stops working, requiring a restart.

[6/4/18 12:12:06:456 WEST] 0000120f DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11527262566011) is still active but was recovered by another instance in the cluster.  This may cause inconsistent behavior.

[6/4/18 12:12:06:528 WEST] 0000005d ApplicationMo W   DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv01: Did not receive adequate CPU time slice. Last known CPU usage time at 12:11:08:593 WEST. Inactivity duration was 27 seconds.

[6/4/18 12:12:07:752 WEST] 0000120a UMCacheManage A   Cache has closed down no further action is allowed

[6/4/18 12:12:07:753 WEST] 0000120a UMCacheManage W   Exception caught while dealing with cache : Action - Get, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE

[6/4/18 12:12:07:771 WEST] 0000120a UMCacheManage A   Cache has closed down no further action is allowed

[6/4/18 12:12:07:771 WEST] 0000120a UMCacheManage W   Exception caught while dealing with cache : Action - Get, ObjectType - UM_ASSERTION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Local:UM_ASSERTION_CACHE

[6/4/18 12:12:07:780 WEST] 0000120a Reference     I org.apache.xml.security.signature.Reference verify Verification successful for URI "#cd7fe19fd72faaf5ca1291759053efe4"

[6/4/18 12:12:08:968 WEST] 0000120a ServiceRegist W   Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:09:601 WEST] 0000120a ServiceRegist W   Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:09:602 WEST] 0000120a ServiceRegist W   Cache get failed for service configuration provider.file_scan_service.1.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:09:951 WEST] 0000120a ServiceRegist W   Cache put failed for service configuration provider.file_scan_service.1.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:09:952 WEST] 0000120a ServiceRegist W   Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:008 WEST] 0000120a ServiceRegist W   Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:009 WEST] 0000120a ServiceFactor E   Error accessing the cache container - Error on GET action for cache Local:SERVICE_FACTORY_CACHE

[6/4/18 12:12:10:009 WEST] 0000120a ServiceFactor E   Error accessing the cache container - Error on PUT action for cache Local:SERVICE_FACTORY_CACHE

[6/4/18 12:12:10:013 WEST] 0000120a POJOCache     W   POJO factory cache lookup failed for service class.Error accessing the cache container - Error on GET action for cache Local:POJO_INVOKER_CACHE

[6/4/18 12:12:10:013 WEST] 0000120a POJOCache     W   POJO Invoker cache creation failed.Error accessing the cache container - Error on PUT action for cache Local:POJO_INVOKER_CACHE

[6/4/18 12:12:10:013 WEST] 0000120a ServiceRegist W   Cache get failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:014 WEST] 0000120a ServiceRegist W   Cache put failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:014 WEST] 0000120a ServiceRegist W   Cache get failed for service configuration system.endpoint_registry.0.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:10:018 WEST] 0000120a ServiceRegist W   Cache put failed for service configuration system.endpoint_registry.0.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration

[6/4/18 12:12:10:018 WEST] 0000120a ServiceRegist W   Cache get failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[6/4/18 12:12:10:019 WEST] 0000120a ServiceRegist W   Cache put failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

Do any of you know what can cause this problem? Is there a configuration that we can change so this does not happen?

Any help would be appreciated.

Thank you!

21 Replies

Level 4

@smacdonald2008

Sorry, I didn't understand what you mean by the dispatcher for caching. At the moment we only use AEM Forms (previously we used Adobe LiveCycle 3 and we are migrating to AEM Forms 6.3).

We have installed AEM Forms and only imported the applications (Workbench) that we had in Adobe LiveCycle; no further changes or development have been done using the new functionality.

If the product does that by default then yes, as we didn't change anything regarding cache.

Employee Advisor

Apart from the suggestions shared by Darren, can you confirm the time difference between the two instances, and whether any antivirus is scanning the server folders or any backup utility (e.g. Veritas NetBackup) runs on the servers?

Level 4

I think it may be the time difference; in production we have two clusters with one application server each.

We don't have any Adobe processes running in production yet, and it is also giving similar errors.

I just checked: there is a 20-second difference between them.

I'm waiting until this problem happens again in this environment to check whether that is the cause.

Employee Advisor

20 seconds is a big difference. For a cluster to operate smoothly, it is essential that the clocks on all of the cluster nodes be closely synchronized. This cannot be done adequately by hand and has to be done by some form of time-sync service that runs very regularly. The clocks on all nodes must be within a second of each other. Best practice dictates that not only the cluster nodes, but also the load balancer, database server, GDS NAS server, and any other components be synchronized as well.

Level 4

By the way, what is the best way to check a node's time?

I managed to find the difference in PRODUCTION because it's a big one, but small differences would be hard to spot...

QUALIFICATION is OK for now; I'm waiting for the errors to return to check whether that's the cause.

Employee

From a command prompt you could run echo %TIME% on both, for example; it will show that instant in time.
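If you need something a bit less manual than eyeballing %TIME% on each box, a small Java check works as well. This is only a sketch (the class name is made up for the example): it assumes the Apache Commons Net library is on the classpath and that the machine can reach an NTP server - swap pool.ntp.org for your internal time source if you have one.

// ClockOffsetCheck.java - minimal sketch: report this machine's clock offset from an NTP server
import java.net.InetAddress;
import org.apache.commons.net.ntp.NTPUDPClient;
import org.apache.commons.net.ntp.TimeInfo;

public class ClockOffsetCheck {
    public static void main(String[] args) throws Exception {
        // Use your internal NTP server if you have one; pool.ntp.org is only a public fallback.
        String ntpHost = args.length > 0 ? args[0] : "pool.ntp.org";
        NTPUDPClient client = new NTPUDPClient();
        client.setDefaultTimeout(5000); // don't hang forever if the NTP server is unreachable
        try {
            TimeInfo info = client.getTime(InetAddress.getByName(ntpHost));
            info.computeDetails(); // calculates offset and round-trip delay
            System.out.println("Clock offset from " + ntpHost + ": " + info.getOffset() + " ms");
        } finally {
            client.close();
        }
    }
}

Run it on each node and compare the reported offsets; anything drifting beyond a second or so on a cluster member is worth fixing with proper time sync before chasing anything else.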

Level 4

Thank you, that's useful for PRODUCTION, where we have two different physical servers, but it would only give me the time of the OS itself. I already found that there is a 20-second difference in PROD, so that needs to be fixed.

For now, though, I am trying to figure out QUALIFICATION.

There I have only 2 application servers in 1 cluster (so, could the two application servers deviate from one another even though they are running on the same machine? And if so, how could I check the time of each one individually?)

Someone told me they once had that problem: the application servers stopped synchronizing with the operating system. So I am wondering if that could be the case.

Employee Advisor

Hi,

We would need your complete infrastructure details, i.e. app server, cluster, topology, DB, OS, etc.

Level 4

app server:

[screenshot: app server details]

cluster, topology

2 application servers, 1 cluster

[screenshot: cluster topology]


db:

DB2 V11.1

OS:

AIX 7200-02-01-1731

Employee

Are you using AEM Forms 6.4? It's not clear what your version of AEM Forms is. I would have a look at the gemfire logs, since it's the GemFire cache that is shutting down. As Mayank points out, 20 s is far too long a delay between nodes; the default timeout is 5000 ms.

Level 4

AEM Forms 6.3.

Can you help me figure out where those logs are?

Unfortunately, I don't have full access to the server.

Employee

The gemfire logs are located in the AEM Forms temp folder:

Linux: %AEMForms_TEMP%/adobejb_server1/Caching/Gemfire.log

Windows: %AEMForms_TEMP%\adobejb_server1\Caching\Gemfire.log

Where %AEMForms_TEMP% is the location you configured as the AEM Forms temp folder in LCM.
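If you can at least get read access to that file, the lines worth pulling out are the GemFire Time Service warnings about the cache time offset and anything mentioning a membership failure or forced disconnect. A throwaway filter in plain Java (no extra libraries; the default path below is only an example, point it at your real Gemfire.log):

// GemfireLogScan.java - print the GemFire log lines relevant to this thread
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class GemfireLogScan {
    public static void main(String[] args) throws IOException {
        // Example location only - use the Gemfire.log under your own AEM Forms temp folder.
        Path log = Paths.get(args.length > 0 ? args[0]
                : "/opt/adobe/temp/adobejb_server1/Caching/Gemfire.log");
        try (Stream<String> lines = Files.lines(log)) {
            lines.filter(l -> l.contains("cache time offset")
                           || l.contains("Membership service failure")
                           || l.contains("ForcedDisconnectException"))
                 .forEach(System.out::println);
        }
    }
}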

Level 7

You could be running into CPU starvation. The error message DCSV0004W relates to this [1]. You might also see the error "HMGR0152W: CPU Starvation detected" in the logs.

Check your CPU utilisation and see if any processes are driving it to 100%. This could be caused by badly tuned watched folders attempting too many concurrent processes, but there are many other possible causes, e.g.:

[1] Help! My CPU is starved. What do I feed it? - IBM Business Process Management Products Support Blog ...

[2] https://www-01.ibm.com/support/docview.wss?uid=swg27042626&aid=1

Level 4

@DarrenBiz

I never get the error "HMGR0152W: CPU Starvation detected". The pattern is always the same: one of the application servers gives these 3 errors and then all cache access returns errors, while the other application server keeps working.

It even starts happening during the night when no one is using the platform...

1)

[8/10/18 13:26:12:207 WEST] 00001218 DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11533660400448) is still active but was recovered by another instance in the cluster.  This may cause inconsistent behavior.

2)

[8/10/18 13:26:12:380 WEST] 0000005b ApplicationMo W   DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv01: Did not receive adequate CPU time slice. Last known CPU usage time at 13:25:23:543 WEST. Inactivity duration was 18 seconds.

3)

[8/10/18 13:26:15:690 WEST] 000011a4 UMCacheManage A   Cache has closed down no further action is allowed

[8/10/18 13:26:15:691 WEST] 000011a4 UMCacheManage W   Exception caught while dealing with cache : Action - Put, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE

[8/10/18 13:26:16:312 WEST] 00003863 ServiceRegist W   Cache get failed for service DesignTimeServiceRegistry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

etc...

Other example at 3AM!

[8/5/18 3:26:00:033 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:034 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:036 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:037 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:040 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:043 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:045 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:056 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:107 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:108 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:109 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:109 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:124 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:139 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:141 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:26:00:142 WEST] 000011a2 NamingHelper  I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}

[8/5/18 3:33:06:228 WEST] 0000005a ApplicationMo W   DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv02: Did not receive adequate CPU time slice. Last known CPU usage time at 03:32:13:527 WEST. Inactivity duration was 22 seconds.

[8/5/18 3:33:13:391 WEST] 00001212 DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11532097190491) is still active but was recovered by another instance in the cluster.  This may cause inconsistent behavior.

[8/5/18 3:33:06:185 WEST] 0000005b CoordinatorCo W   HMGR0152W: A thread scheduling delay of 32 seconds was detected.

[8/5/18 3:33:16:124 WEST] 000011a4 UMCacheManage A   Cache has closed down no further action is allowed

[8/5/18 3:33:16:174 WEST] 000011a4 UMCacheManage W   Exception caught while dealing with cache : Action - Get, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE

[8/5/18 3:33:17:132 WEST] 00001210 Reference     I org.apache.xml.security.signature.Reference verify Verification successful for URI "#e53f9977ea98501cadf01ac8a0e487d6"

[8/5/18 3:33:18:628 WEST] 00001210 ServiceRegist W   Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[8/5/18 3:33:19:703 WEST] 00001210 ServiceRegist W   Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service

[8/5/18 3:33:19:703 WEST] 00001210 Service

What do you mean by badly tuned watched folders?

I have three: two scanning every 5 seconds and one every 10 seconds.

Level 7

Watched folders have a bunch of settings that affect how they process contents and how much performance you can extract from the system without starving it of CPU cycles. This mostly affects watched folders used for batch processing, but ad-hoc ones can benefit from tuning as well. When a watched folder executes to find files, it scans the whole watched folder directory. That scan adds processing overhead, especially when there are lots of files in the folder, so you want to make sure you have finished processing the last batch of documents before scanning for and picking up a new batch. Throttling [1] also helps with this, as it will not do another scan for files until the last batch has finished (see the toy sketch after the links below). You have a clustered instance, so these jobs will be distributed between the instances by the Quartz scheduler, which adds yet more overhead. You can also get some benefit from tuning the Scheduler service thread pool if you find yourself running into contention between the other services in the system [2].

It's hard to say what your exact problem is without looking into other things, like other VMs running on your underlying infrastructure and whether they are affecting anything. There might be another VM taking resources away from a cluster member, causing it to be removed from the cluster.

[1] LiveCycle ES4 * About throttling

[2] LiveCycle ES4 * Performance and scalability
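Just to illustrate the throttling idea (a toy sketch only, not how the watched folder service is implemented internally): the point is that a scheduled scan is skipped while the previous batch is still being processed, so scans never stack up on top of each other and burn CPU.

// ThrottledScanSketch.java - toy illustration of "don't scan again until the last batch is done"
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ThrottledScanSketch {
    private static final AtomicBoolean batchInFlight = new AtomicBoolean(false);
    private static final ExecutorService worker = Executors.newSingleThreadExecutor();

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Tick every 5 seconds, like the watched folder intervals mentioned above.
        scheduler.scheduleAtFixedRate(ThrottledScanSketch::maybeScan, 0, 5, TimeUnit.SECONDS);
    }

    private static void maybeScan() {
        // Throttle: skip this tick entirely if the previous batch is still in flight.
        if (!batchInFlight.compareAndSet(false, true)) {
            System.out.println("Previous batch still running - skipping this scan");
            return;
        }
        worker.submit(() -> {
            try {
                System.out.println("Scanning folder and processing batch...");
                Thread.sleep(12_000); // pretend the batch takes longer than the scan interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                batchInFlight.set(false);
            }
        });
    }
}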

Employee

The gemfire logs show the cache time offset drifting too far; eventually the member is forced out of the distributed system and the cache shuts down:

[info 2018/09/04 15:45:25.546 WEST GC_UzmE <IDPSchedulerService_Worker-5> tid=0x120a] Initialization of region Replicated_FormsClusterCache completed

[warning 2018/09/05 05:09:12.703 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 207 ms from earlier offset.

[warning 2018/09/05 11:37:35.186 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 659 ms from earlier offset.

[info 2018/09/06 16:24:35.992 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] The negotiated cache time from shaemaq1(GC_UzmE:21103040)<v0>:35873 is 2018/09/06 16:24:41.357 WEST.  Delta from the system clock is 5,656 milliseconds.

[warning 2018/09/06 16:26:16.174 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 5,656 ms from earlier offset.

[info 2018/09/06 17:52:19.546 WEST GC_UzmE <WebContainer : 8> tid=0x296f] Created disk store Global_DeployLockMonitorCache with unique id e48ef1f85c184519-94bb514ce53e7729

[info 2018/09/06 17:52:19.583 WEST GC_UzmE <WebContainer : 8> tid=0x296f] recovery region initialization took 1 ms

[info 2018/09/06 17:52:19.631 WEST GC_UzmE <WebContainer : 8> tid=0x296f] Initializing region Global_DeployLockMonitorCache

[info 2018/09/06 17:52:19.675 WEST GC_UzmE <WebContainer : 8> tid=0x296f] Initialization of region Global_DeployLockMonitorCache completed

[info 2018/09/07 15:11:27.081 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] The negotiated cache time from shaemaq1(GC_UzmE:21103040)<v0>:35873 is 2018/09/07 15:11:27.995 WEST.  Delta from the system clock is 1,060 milliseconds.

[warning 2018/09/07 15:13:07.224 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 1,060 ms from earlier offset.

[warning 2018/09/10 11:33:20.203 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 346 ms from earlier offset.

[info 2018/09/10 17:02:28.485 WEST GC_UzmE <UDP mcast receiver> tid=0x11b3] Membership: received new view  [shaemaq1(GC_6T7h:17170734)<v1>:16233|12] [shaemaq1(GC_6T7h:17170734)<v1>:16233/63465] crashed mbrs: [shaemaq1(GC_UzmE:21103040)<v0>:35873/14545]

[info 2018/09/10 17:02:28.903 WEST GC_UzmE <CloserThread> tid=0x13f45] CacheServer configuration saved

[severe 2018/09/10 17:02:28.906 WEST GC_UzmE <CloserThread> tid=0x13f45] Membership service failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: This member has been forced out of the distributed system by shaemaq1(GC_6T7h:17170734)<v1>:16233.  Please consult GemFire logs to find the reason. (GMS shun)

[info 2018/09/10 17:02:29.191 WEST GC_UzmE <ReconnectThread> tid=0x13f45] GemFireCache[id = 2139782586; isClosing = true; isShutDownAll = false; closingGatewayHubsByShutdownAll = false; created = Mon Sep 03 17:42:19 WEST 2018; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.

Level 4

I've added these two options:

-Dgemfire.member-timeout=15000 -Xgcpolicy:gencon

Hopefully it will solve the problem; let's wait and see.

Thanks to all.

https://helpx.adobe.com/aem-forms/kb/Error-accessing-the-cache-container-Error-on-PUT-action-for-cac...

There are a few options to resolve the issue:

  1. Optimize GC on your system so that collections complete within the GemFire timeout. For guidelines, refer to https://www.cubrid.org/blog/how-to-tune-java-garbage-collection.
  2. Generate a heap dump when a Full GC is executed and analyze which objects are being heavily used. Use that as a basis to optimize your workflow.
  3. Increase the GemFire membership timeout to a value high enough to allow the Full GC to complete. For example, the JVM argument below increases the timeout to 15 s, which covers the situation found in the GC logs above. For more information, refer to https://pubs.vmware.com/vfabric5/index.jsp?topic=/com.vmware.vfabric.gemfire.6.6/reference/topics/ge...:

-Dgemfire.member-timeout=15000
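If you want to sanity-check whether 15000 ms really covers your worst pauses, a rough scan of the verbose GC log helps. This sketch assumes pause durations show up as durationms="..." attributes (IBM verbose:gc XML style output); adjust the regex to whatever your GC log format actually contains:

// GcPauseCheck.java - rough scan of a verbose GC log for pauses approaching the GemFire member timeout
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseCheck {
    // Assumption: durations are reported as durationms="..." (IBM verbose:gc XML style).
    private static final Pattern DURATION = Pattern.compile("durationms=\"([0-9.]+)\"");

    public static void main(String[] args) throws IOException {
        double memberTimeoutMs = 15000; // the value being set via -Dgemfire.member-timeout
        double maxPauseMs = 0;
        long pausesOverHalfTimeout = 0;

        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = DURATION.matcher(line);
            while (m.find()) {
                double pause = Double.parseDouble(m.group(1));
                maxPauseMs = Math.max(maxPauseMs, pause);
                if (pause > memberTimeoutMs / 2) {
                    pausesOverHalfTimeout++;
                }
            }
        }
        System.out.println("Longest reported GC duration: " + maxPauseMs + " ms");
        System.out.println("Durations over half the member timeout: " + pausesOverHalfTimeout);
    }
}

If the longest pauses get anywhere near the member timeout, either keep tuning the collector or raise the timeout further, as the KB article suggests.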