We migrated our AEM v5.6.1 stack from on-premise hardware to AWS as part of a major effort in 2014/2015 to revamp our (TE Connectivity) entire web site (www.te.com), including how all of the product catalog information is maintained and dynamically provided to the front end (AEM). The project was very large for us, so we hired RazorFish to basically do the project for us, but the AWS infrastructure part of it was largely on us, especially at the start.
We knew that ordinarily you cannot take a hot backup of AEM unless you use the mechanism provided by AEM, but as things worked out we were scrambling to provide AEM stacks in the AWS environment in time to avoid holding up development activity. It was a near thing. So we didn't really have our ducks in a row regarding backups before the development team destroyed the QA Authoring instance to the point where it needed to be recovered from backup, and RazorFish was asking for this as if we should have it.
Well, we were smart enough to place AEM on an EBS volume all by itself. The idea for that had nothing to do with backups. The thought was that if we wanted to adjust the host EC2 instance's memory, CPU, etc., we could do that very easily by adjusting our CloudFormation scripts, creating a new host server with whatever adjustment we wanted, and then throwing out the AEM volume on the new stack and replacing it with one created from a snapshot of the existing EC2 instance's AEM volume, taken with AEM off (cold). That plan assumed we would stop the instance to get the snapshot, and that it would stay off until we got the new stack up with an AEM volume created from that snapshot.
Here we didn't have a snapshot of the AEM author instance taken while it was down; all we had were the backups that AWS was taking of every volume in our VPC every day at some specific time. So we went with what we had to recover the destroyed QA AEM Authoring instance, and were amazed when it actually came up on a new volume created from a snapshot taken while AEM was hot. This in fact happened a second time; we did the same thing and again it worked. We started figuring there was something special about AWS snapshots that allowed them to work as a means of taking AEM hot backups, so we just kind of went with it for a while; bigger fish to fry.
Further along in the project somebody realized what we were doing for an AEM backup strategy and questioned it. At that point we laid it all out on the table: the AWS snapshots seemed to have some magical powers that allowed them to be viable as AEM hot backups. RazorFish wasn't sure, so they said they'd get back to us on it. When they did, to our surprise, they confirmed that what we were doing was viable, and that was that; the AWS snapshots of the AEM volumes mounted to our AEM EC2 instances became our backup strategy for AEM instances.
Almost a year later we got our first call to restore an AEM QA publishing instance, and we tried to restore it the same way, using a snapshot taken of the other AEM publishing instance in the cluster while it was still up. But wouldn't you know, now we're having problems: the instance we're trying to recover won't start up from the hot backup of the other instance taken as an AWS snapshot. I decided to go to Adobe about this myself with a trouble ticket. They told me that you can do what I'm doing, but first you have to issue an fsfreeze (Linux utility) against the volume, only then take the AWS snapshot, and then, of course, unfreeze the volume. But that's all they told me. They wouldn't provide me with a script. Anybody know if that is really all there is to it? What happens if your script aborts before you get back to that fsfreeze -u to do the unfreeze? I imagine that can't go on for very long before all the writes it's queuing up start to become a problem of some kind.
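
To make the abort concern concrete, here is a minimal Python sketch of the freeze, snapshot, unfreeze sequence. It is not a script from Adobe, and it is not confirmed to be everything AEM needs. Assumptions: `fsfreeze` is run as root against the mount point of the AEM filesystem (its `-f` and `-u` options freeze and unfreeze), and `take_snapshot` is a hypothetical callable that you supply to start the EBS snapshot. The `try`/`finally` is there so the unfreeze still runs if the snapshot step raises an error.

```python
import subprocess


def snapshot_with_freeze(mount_point, take_snapshot, run=subprocess.run):
    """Freeze the filesystem, start a snapshot, and always unfreeze.

    mount_point   -- where the AEM volume is mounted (assumed path)
    take_snapshot -- hypothetical callable that starts the snapshot and
                     returns whatever identifies it (e.g. a snapshot id)
    run           -- command runner; defaults to subprocess.run
    """
    # Flush pending writes and block new ones on this filesystem.
    run(["fsfreeze", "-f", mount_point], check=True)
    try:
        # Keep this step short: writers are blocked while we are frozen.
        return take_snapshot()
    finally:
        # Runs whether take_snapshot() returned or raised.
        run(["fsfreeze", "-u", mount_point], check=True)
```

In use, `take_snapshot` might wrap a call such as `aws ec2 create-snapshot --volume-id <your volume id>`. The `finally` only covers ordinary failures inside the script; it cannot help if the process is killed outright or the host goes down while frozen, so an independent watchdog that runs `fsfreeze -u` after a timeout may be worth considering. As far as I understand it, an EBS snapshot captures the volume as of the moment it is initiated, so the freeze should only need to span the call that starts it, but that is worth verifying against the AWS documentation.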
Anybody got the full story on this?