Content Archival strategy

Employee Advisor

9/30/20

Because different clients have different data archival standards, we need to create archival processes for various clients. In AEM, content lives in the repository, so anything that goes into the repo becomes data that is moved across environments. Moreover, we need a process that lets content editors move archived content to some other tree or space, so that the content tree stays clean and holds only the content that is needed.

 

It would be great if AEM supported an archival strategy OOTB. The capabilities I am looking for in this feature are:

1. Easy to manage from authoring experience

2. Archived content is reduced to a small size

3. It should be outside the repository so that we don't need to move it as content in the repo.

4. We should be able to search through the metadata

5. We should be able to retrieve the content with some action.

6 Comments

Employee Advisor

9/30/20

What are your requirements regarding archival? If these are legal requirements, you can use an archival solution which satisfies all these legal requirements.

 

I typically recommend this archival strategy for pages: Together with the publishing process you create a PDF/A of the published page, which contains all relevant metadata as well. Store this in your archival solution.

 The archival solution must then provide the necessary features like retrieval, search for metadata, audit trail, etc.

 

Jörg

Employee Advisor

10/1/20

Thanks @Jörg_Hoh for the feedback. Yes, agreed: generating a PDF and storing it will help with archival for legal compliance.

 

One client's requirements are as follows.

1. They have IaC standards and recreate environments from prod AWS backup snapshots.

2. Since the whole repo is in the snapshot, they want it to be as lean as possible.

3. They have tight content workflows where pages are created, activated, deactivated and deleted on schedule.

4. However, they don't actually want to delete a page after deactivation; they still want it to be searchable.

5. They want to search through the archives and bring a page back if needed.

 

I have a content packaging solution: after moving pages into an archival tree, I package that tree and install the package in a special archive environment. But I am looking for a more standardised approach.

Employee Advisor

10/1/20

Archival is a tough thing, also because it's so diverse.

 

First of all, from my point of view, archival is not backup. You archive things which are finished and closed, and you need to keep them around, mostly for legal reasons. Just storing files on a fileshare is not an option: everyone can read and write them (so you cannot check their integrity), you don't have an audit trail, and you cannot delete these records on time (you must not keep them longer than required).

For this requirement I always recommend PDFs and a dedicated archival solution, because AEM is not an application for archival.
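
To make the integrity and retention points concrete, here is a minimal sketch in plain Java of what an archival solution tracks per record. It is not an AEM or product API: the class `ArchiveRecordSketch`, the `ArchiveRecord` type, its fields and the example path are all hypothetical names chosen for illustration, and a real solution would add access control and an audit trail on top.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.LocalDate;

class ArchiveRecordSketch {

    // Hypothetical archive entry: checksum and retention end are fixed at archive time.
    static final class ArchiveRecord {
        final String pagePath;
        final String sha256;
        final LocalDate retainUntil;

        ArchiveRecord(String pagePath, byte[] pdfBytes, LocalDate retainUntil) {
            this.pagePath = pagePath;
            this.sha256 = sha256Hex(pdfBytes);
            this.retainUntil = retainUntil;
        }

        // Integrity check: the stored bytes must still hash to the recorded checksum.
        boolean isIntact(byte[] storedBytes) {
            return sha256.equals(sha256Hex(storedBytes));
        }

        // Retention check: true once the retention period is over.
        boolean mustBeDeleted(LocalDate today) {
            return today.isAfter(retainUntil);
        }
    }

    // Hex-encoded SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        byte[] pdf = "rendered page".getBytes(StandardCharsets.UTF_8);
        ArchiveRecord record =
                new ArchiveRecord("/content/site/en/offer", pdf, LocalDate.of(2030, 9, 30));
        System.out.println("intact: " + record.isIntact(pdf));
        System.out.println("tampered copy intact: "
                + record.isIntact("tampered".getBytes(StandardCharsets.UTF_8)));
        System.out.println("delete on 2030-10-01: "
                + record.mustBeDeleted(LocalDate.of(2030, 10, 1)));
    }
}
```

A plain fileshare gives you neither of these two checks, which is the reason for recommending a dedicated solution.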

 

When you want to have "old" content available and searchable, I don't see any other option than to retain it in an archive folder, remove write access for most/all users and try to reduce the number of versions of these pages (you don't need them). But be sure that you understand the drawbacks:

  • No one can guarantee the integrity of these unpublished pages.
  • They create some overhead, as they consist of JCR nodes, and that increases the size of some indexes.
  • You will always have them with you; you cannot externalize them.
  • You always have to consider them when your application is evolving. That means you either consider them with every style and component change, or you find a different solution for it. What about the assets referenced in these pages? And what if the rendering components/scripts require certain OSGi services/components in a certain version?

Especially the last item can be a long-term burden on your development velocity: the more old stuff you have, the more you need to invest in it.

 

Some suggestions for how you can improve this (of course all of it is customization and not available OOTB):

  • Think about how you can reduce the overhead of these pages, and potentially even transform them into something which can stand alone and does not have any dependency on the application itself. For example a PDF. You should still be able to find all text in the fulltext index.
  • Then you could reconfigure the OOTB indexes not to cover your archive area anymore, but instead feed all this data into an external search engine, and let the authors search the archive only there.

And there are probably a ton more possibilities.
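
To illustrate the external search idea, here is a minimal sketch in plain Java. The in-memory list only stands in for a real search engine, and every name in it (`ArchiveSearchSketch`, `Entry`, the fields, the example paths and locations) is hypothetical. The point is only that the archive entry carries the metadata, the extracted text and a pointer to the stored PDF, so authors can find it and locate the copy without any of it living in the repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ArchiveSearchSketch {

    // Hypothetical archive entry: metadata, extracted text and a pointer to the stored PDF.
    static final class Entry {
        final String originalPath;
        final String title;
        final String text;
        final String pdfLocation;

        Entry(String originalPath, String title, String text, String pdfLocation) {
            this.originalPath = originalPath;
            this.title = title;
            this.text = text;
            this.pdfLocation = pdfLocation;
        }
    }

    // Stand-in for the external search engine's index.
    private final List<Entry> entries = new ArrayList<>();

    void index(Entry entry) {
        entries.add(entry);
    }

    // Case-insensitive substring match on title and extracted text.
    List<Entry> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Entry> hits = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.title.toLowerCase(Locale.ROOT).contains(needle)
                    || entry.text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ArchiveSearchSketch archive = new ArchiveSearchSketch();
        archive.index(new Entry("/content/site/en/summer-offer", "Summer Offer",
                "Ten percent off all items in July", "archive://2020/summer-offer.pdf"));
        archive.index(new Entry("/content/site/en/terms", "Terms and Conditions",
                "General terms of use", "archive://2020/terms.pdf"));

        for (Entry hit : archive.search("offer")) {
            System.out.println(hit.originalPath + " -> " + hit.pdfLocation);
        }
    }
}
```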

 

The most important thing you should consider is the impact of an "in-repo" archive on your application velocity. As long as you maintain this old content as pages, you have to test it.

 

HTH,

Jörg

Employee Advisor

10/7/20

Makes sense, @Jörg_Hoh. I can see that two more ideas have come up which are better than mine and can fulfill my requirements: trash bin and jcr:versionhistory offloading.

 

As you said, archival is better placed outside the repository, as a PDF copy that is not meant to be brought back as a page.