CQ 5.6 - Saving Published Web Pages?

Gl369

08-12-2017

Hello,

Our legal department has asked that we archive all of the Intranet web pages regarding our company news & history as PDFs. What they want is a viewable version of what was published and, believe it or not, each week a person opens each page and saves it as a PDF!

Is there a way to automate crawling the libraries and saving a PDF file within the CQ environment (author or publisher)?  


At this point, I still think a PDF is best: we could load these artifacts into AEM Assets and use OCR/AI-type features to make any part of the asset findable. We could then publish them within our Asset Share libraries and maintain the records along with other Corporate Archives. But I am open to other ideas/solutions.

FYI, we discussed the following options but determined that none is the best solution for a multi-billion dollar company, at least until we run out of alternatives:

  1. Set up a separate instance of AEM, copy the content and use it as an archive - but this would require additional hardware & maintenance (too scrappy)
  2. Use an open source tool like Heritrix to crawl the published pages (too risky for a large corporate enterprise; Infosec would have to approve)
  3. Create a new PRINT template that includes all of the images, text and comments on the page to facilitate the PDF creation (but this doesn't solve for the volume of pages)
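
To make the "automate the weekly save-as-PDF" idea concrete, here is a minimal sketch of what such a job might look like if it ran outside CQ against the publish URLs. It is only an illustration under several assumptions: a Chrome/Chromium binary is installed and reachable as `chromium`, the list of page URLs is supplied by hand or from some other source, and the function names (`pdf_name_for`, `build_command`, `archive_pages`) are hypothetical. It relies on headless Chrome's `--print-to-pdf` flag, is not a CQ/AEM feature, and would still need Infosec approval like any other tool.

```python
import re
import subprocess
from pathlib import Path
from typing import Iterable, List
from urllib.parse import urlparse

# Assumption: a Chrome/Chromium binary is on PATH under this name.
CHROME_BINARY = "chromium"


def pdf_name_for(url: str) -> str:
    """Derive a filesystem-safe PDF file name from a page URL."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of non-alphanumeric characters into one underscore.
    safe = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return (safe or "page") + ".pdf"


def build_command(url: str, out_dir: Path) -> List[str]:
    """Build the headless-Chrome command that prints one page to PDF."""
    target = out_dir / pdf_name_for(url)
    return [
        CHROME_BINARY,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={target}",
        url,
    ]


def archive_pages(urls: Iterable[str], out_dir: Path) -> List[Path]:
    """Render each URL to a PDF in out_dir and return the saved paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # check=True raises if the browser exits with an error;
        # the timeout keeps one slow page from stalling the whole run.
        subprocess.run(build_command(url, out_dir), check=True, timeout=120)
        saved.append(out_dir / pdf_name_for(url))
    return saved
```

A scheduled job could call `archive_pages(list_of_urls, Path("archive_pdfs"))` once a week and then upload the resulting files into AEM Assets. Pages that require a login, or that load content lazily, would need extra handling that this sketch does not attempt.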

Gl369

08-12-2017

Thanks! Do you know of any other customers that have addressed the need to 'save off' web pages?