Hello,
I don't know if I'm asking this in the correct section but here it goes,
Cyclically, after a few days the server begins to give cache problems, the AEM Forms stops working, requiring a restart.
[6/4/18 12:12:06:456 WEST] 0000120f DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11527262566011) is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.
[6/4/18 12:12:06:528 WEST] 0000005d ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv01: Did not receive adequate CPU time slice. Last known CPU usage time at 12:11:08:593 WEST. Inactivity duration was 27 seconds.
[6/4/18 12:12:07:752 WEST] 0000120a UMCacheManage A Cache has closed down no further action is allowed
[6/4/18 12:12:07:753 WEST] 0000120a UMCacheManage W Exception caught while dealing with cache : Action - Get, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE
[6/4/18 12:12:07:771 WEST] 0000120a UMCacheManage A Cache has closed down no further action is allowed
[6/4/18 12:12:07:771 WEST] 0000120a UMCacheManage W Exception caught while dealing with cache : Action - Get, ObjectType - UM_ASSERTION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Local:UM_ASSERTION_CACHE
[6/4/18 12:12:07:780 WEST] 0000120a Reference I org.apache.xml.security.signature.Reference verify Verification successful for URI "#cd7fe19fd72faaf5ca1291759053efe4"
[6/4/18 12:12:08:968 WEST] 0000120a ServiceRegist W Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
[6/4/18 12:12:09:601 WEST] 0000120a ServiceRegist W Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
[6/4/18 12:12:09:602 WEST] 0000120a ServiceRegist W Cache get failed for service configuration provider.file_scan_service.1.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration
[6/4/18 12:12:09:951 WEST] 0000120a ServiceRegist W Cache put failed for service configuration provider.file_scan_service.1.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration
[6/4/18 12:12:09:952 WEST] 0000120a ServiceRegist W Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
[6/4/18 12:12:10:008 WEST] 0000120a ServiceRegist W Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
[6/4/18 12:12:10:009 WEST] 0000120a ServiceFactor E Error accessing the cache container - Error on GET action for cache Local:SERVICE_FACTORY_CACHE
[6/4/18 12:12:10:009 WEST] 0000120a ServiceFactor E Error accessing the cache container - Error on PUT action for cache Local:SERVICE_FACTORY_CACHE
[6/4/18 12:12:10:013 WEST] 0000120a POJOCache W POJO factory cache lookup failed for service class.Error accessing the cache container - Error on GET action for cache Local:POJO_INVOKER_CACHE
[6/4/18 12:12:10:013 WEST] 0000120a POJOCache W POJO Invoker cache creation failed.Error accessing the cache container - Error on PUT action for cache Local:POJO_INVOKER_CACHE
[6/4/18 12:12:10:013 WEST] 0000120a ServiceRegist W Cache get failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
[6/4/18 12:12:10:014 WEST] 0000120a ServiceRegist W Cache put failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
[6/4/18 12:12:10:014 WEST] 0000120a ServiceRegist W Cache get failed for service configuration system.endpoint_registry.0.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration
[6/4/18 12:12:10:018 WEST] 0000120a ServiceRegist W Cache put failed for service configuration system.endpoint_registry.0.0 Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service_configuration
[6/4/18 12:12:10:018 WEST] 0000120a ServiceRegist W Cache get failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
[6/4/18 12:12:10:019 WEST] 0000120a ServiceRegist W Cache put failed for service system.endpoint_registry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
Do any of you know what can cause this problem? Is there a configuration that we can change so this does not happen?
Any help would be appreciated.
Thank you!
Arer you using Dispatcher for Caching?
Sorry, didn't understood what you mean with dispatcher for cache. At the moment we only use AEM Forms (Previously we used adobe livecycle 3 and we are migrating to AEM 6.3 forms)
We have installed AEM Forms and only imported the applications (workbench) that we had in Adobe Livecycle, no further change or development has been done using new functionalities.
If the product does that by default then yes, as we didn't change anything regarding cache.
Views
Replies
Total Likes
Apart from the suggestions shared by Darren can you confirm the time difference between the two instances and if there is any Antivirus scanning the server folders or if you run any backup utility( for ex. veritas netbackup) on the servers?
I think it is maybe the time difference, in production we have two clusters with one application server each.
we don't any adobe process in production yet and it is also giving similar errors.
just checked 20 seconds diff between them
waiting untill this problem happens again in this environment to check if it is or not.
Views
Replies
Total Likes
20 seconds is a big difference,. For a cluster to operate smoothly, it is essential that the clocks on all of the cluster nodes be closely synchronized. This cannot be done adequately by hand and has to be done by some form of time-sync service that runs very regularly. The clocks on all nodes must be within a second of each other. Best practice dictates that not only the cluster nodes, but also the load balancer, database server, GDS NAS server, and any other components all be synchronized as well.
Views
Replies
Total Likes
by the way what is the best way to check a node time?
I managed to find the difference in PRODUCTION because it's a big difference. but in small differences would be hard...
QUALIFICATION is OK for now, waiting for the errors to return to check if thats the cause.
Views
Replies
Total Likes
you could use from a command prompt: c:\echo %TIME%, for example on both. it will show that instant in time.
Views
Replies
Total Likes
For linux you may use clockdiff (https://linux.die.net/man/8/clockdiff) while
For Win server you may use new-TimeSpan (Use PowerShell to Easily Compare the Time Between Two Computers – Hey, Scripting Guy! Blog ), also you can refer w32time - checking windows time sync from 2 computers with different times - Server Fault
Thanks
Views
Replies
Total Likes
thank you, thats usefull for PRODUCTION where we have two different physical servers, that would only give me the time of the OS itself, I already found that there is a 20 seconds difference in PROD, so that needs to be fixed.
but for now I am trying to figure out QUALIFICATION
I have only 2 application servers, 1 cluster (so , could the two application servers deviate from one another even considering they are runnig in the same machine? and if so how could I check the time individually of each one?)
Someone told me that they once had that problem, the application servers stopped syncronizing with the operating system. So I am wondering if that could be the case
Views
Replies
Total Likes
Hi,
We would need your complete infra details i.e app server, cluster, topology, db , OS etc.
app server:
cluster, topology
2 application servers, 1 cluster
db:
DB2 V11.1
OS:
AIX 7200-02-01-1731
Views
Replies
Total Likes
Are you using AEM Forms 6.4? It's not clear what your version of AEM Forms is. I would have a look at the gemfire.logs, since its the gemfire cache that is shutting down. As Mayank points out, 20s is far too long a delay between nodes. the default timeout is 5000ms.
AEM forms 6.3
can you help me figure out where that logs are?
I don't have full acess to the server unfortunatly.
Views
Replies
Total Likes
The gemfire logs are located in the AEM Forms temp folder:
Linux: %AEMForms_TEMP%/adobejb_server1/Caching/Gemfire.log
Windows: %AEMForms_TEMP%\adobejb_server1\Caching/Gemfire.log
Where %AEMForms_TEMP% is the location you configured the AEM Forms temp folder in LCM..
You could be running into CPU starvation. The error message DCSV0004W relates to this [1]. You could also potentially see the error "HMGR0152W: CPU Starvation detected" in the logs as well?
Check your CPU utilisation and see if any processes are causing 100% utilisation. This could be caused by badly tuned watched folders attempting too many concurrent processes, but there are many other causes e.g.
[2] https://www-01.ibm.com/support/docview.wss?uid=swg27042626&aid=1
@DarrenBiz
I never get that error "HMGR0152W: CPU Starvation detected", the pattern is always the same, one of application servers gives this 3 erros and then all the cache acess returns errors, the other application server still works.
It even starts to happen during the nigth when no one is using the platform...
1)
[8/10/18 13:26:12:207 WEST] 00001218 DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11533660400448) is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.
2)
[8/10/18 13:26:12:380 WEST] 0000005b ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv01: Did not receive adequate CPU time slice. Last known CPU usage time at 13:25:23:543 WEST. Inactivity duration was 18 seconds.
3)
[8/10/18 13:26:15:690 WEST] 000011a4 UMCacheManage A Cache has closed down no further action is allowed
[8/10/18 13:26:15:691 WEST] 000011a4 UMCacheManage W Exception caught while dealing with cache : Action - Put, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE
[8/10/18 13:26:16:312 WEST] 00003863 ServiceRegist W Cache get failed for service DesignTimeServiceRegistry Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
etc...
Other example at 3AM!
[8/5/18 3:26:00:033 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:034 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:036 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:037 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:040 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:043 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:045 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:056 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:107 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:108 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:109 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:109 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:124 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:139 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:140 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:141 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:26:00:142 WEST] 000011a2 NamingHelper I org.hibernate.util.NamingHelper getInitialContext JNDI InitialContext properties:{}
[8/5/18 3:33:06:228 WEST] 0000005a ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup at Member shaemaq1Cell01\shaemaq1Node01\AEMSrv02: Did not receive adequate CPU time slice. Last known CPU usage time at 03:32:13:527 WEST. Inactivity duration was 22 seconds.
[8/5/18 3:33:13:391 WEST] 00001212 DSCJobStoreTX W org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances This scheduler instance (shaemaq11532097190491) is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.
[8/5/18 3:33:06:185 WEST] 0000005b CoordinatorCo W HMGR0152W: A thread scheduling delay of 32 seconds was detected.
[8/5/18 3:33:16:124 WEST] 000011a4 UMCacheManage A Cache has closed down no further action is allowed
[8/5/18 3:33:16:174 WEST] 000011a4 UMCacheManage W Exception caught while dealing with cache : Action - Get, ObjectType - UM_CLUSTER_INVALIDATION_CACHE, Exception message - Error accessing the cache container - Error on GET action for cache Replicated:UM_CLUSTER_INVALIDATION_CACHE
[8/5/18 3:33:17:132 WEST] 00001210 Reference I org.apache.xml.security.signature.Reference verify Verification successful for URI "#e53f9977ea98501cadf01ac8a0e487d6"
[8/5/18 3:33:18:628 WEST] 00001210 ServiceRegist W Cache get failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
[8/5/18 3:33:19:703 WEST] 00001210 ServiceRegist W Cache put failed for service provider.file_scan_service Reason: Error accessing the cache container - Error on GET action for cache ReplicatedGuaranteed:dsc_sc_service
[8/5/18 3:33:19:703 WEST] 00001210 Service
What you mean by badly tunned watched folders?
I have three, two listening every 5 seconds and one every 10 seconds
Views
Replies
Total Likes
Watched folders have a bunch of settings that can affect the way that they process contents and how much performance you extract from the system without starving it of CPU cycles. This mostly affects watched folders that are used for batched processes, but ad-hoc can benefit from tuning as well. When a watched folder executes to find files, it scans the whole watched folder directory. This scan takes up processing overhead, especially when there are lots of files in the folder, so you want to make sure you have finished processing that last batch of documents before scanning and picking up a batch of new documents to process. Thottling [1] can also help with this as it will not do another scan for files until the last batch has finished. You have a clustered instance so these jobs will be sent between the instances by the Quartz scheduler so you should get a lot more overhead here. You can also get some benefits from tuning the Scheduler service thread pool if you find yourself running into issues between the other services in the system [2]
Its hard to say what your exact problem is without looking into other things like other VMs running on your underlying infrastructure and if that is affecting anything. There might be another VM taking resources away from a cluster member causing it to be removed from the cluster.
Views
Replies
Total Likes
Hi again,
I attach the GemFire logs.
Any idea?
https://drive.google.com/file/d/1iQHfjt8KMpr8hmHD199AqTVFt4vGhVA9/view?usp=sharing
https://drive.google.com/file/d/1RIpmGmJoqanrqWfpFy9WTUy4cifTaVF-/view?usp=sharing
Thank you!
Views
Replies
Total Likes
The gemfire logs show cache time offset as being too long and it shuts down:
[info 2018/09/04 15:45:25.546 WEST GC_UzmE <IDPSchedulerService_Worker-5> tid=0x120a] Initialization of region Replicated_FormsClusterCache completed
[warning 2018/09/05 05:09:12.703 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 207 ms from earlier offset.
[warning 2018/09/05 11:37:35.186 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 659 ms from earlier offset.
[info 2018/09/06 16:24:35.992 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] The negotiated cache time from shaemaq1(GC_UzmE:21103040)<v0>:35873 is 2018/09/06 16:24:41.357 WEST. Delta from the system clock is 5,656 milliseconds.
[warning 2018/09/06 16:26:16.174 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 5,656 ms from earlier offset.
[info 2018/09/06 17:52:19.546 WEST GC_UzmE <WebContainer : 8> tid=0x296f] Created disk store Global_DeployLockMonitorCache with unique id e48ef1f85c184519-94bb514ce53e7729
[info 2018/09/06 17:52:19.583 WEST GC_UzmE <WebContainer : 8> tid=0x296f] recovery region initialization took 1 ms
[info 2018/09/06 17:52:19.631 WEST GC_UzmE <WebContainer : 8> tid=0x296f] Initializing region Global_DeployLockMonitorCache
[info 2018/09/06 17:52:19.675 WEST GC_UzmE <WebContainer : 8> tid=0x296f] Initialization of region Global_DeployLockMonitorCache completed
[info 2018/09/07 15:11:27.081 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] The negotiated cache time from shaemaq1(GC_UzmE:21103040)<v0>:35873 is 2018/09/07 15:11:27.995 WEST. Delta from the system clock is 1,060 milliseconds.
[warning 2018/09/07 15:13:07.224 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 1,060 ms from earlier offset.
[warning 2018/09/10 11:33:20.203 WEST GC_UzmE <GemFire Time Service> tid=0x11c1] New cache time offset calculated is off more than 346 ms from earlier offset.
[info 2018/09/10 17:02:28.485 WEST GC_UzmE <UDP mcast receiver> tid=0x11b3] Membership: received new view [shaemaq1(GC_6T7h:17170734)<v1>:16233|12] [shaemaq1(GC_6T7h:17170734)<v1>:16233/63465] crashed mbrs: [shaemaq1(GC_UzmE:21103040)<v0>:35873/14545]
[info 2018/09/10 17:02:28.903 WEST GC_UzmE <CloserThread> tid=0x13f45] CacheServer configuration saved
[severe 2018/09/10 17:02:28.906 WEST GC_UzmE <CloserThread> tid=0x13f45] Membership service failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: This member has been forced out of the distributed system by shaemaq1(GC_6T7h:17170734)<v1>:16233. Please consult GemFire logs to find the reason. (GMS shun)
[info 2018/09/10 17:02:29.191 WEST GC_UzmE <ReconnectThread> tid=0x13f45] GemFireCache[id = 2139782586; isClosing = true; isShutDownAll = false; closingGatewayHubsByShutdownAll = false; created = Mon Sep 03 17:42:19 WEST 2018; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
I've added this two options:
-Dgemfire.member-timeout=15000 -Xgcpolicy:gencon
hopefully it will solve the problem, let's wait to see.
thanks to all.
There are few options to resolve the issue:
|
Views
Likes
Replies