CQ 5.6 - Saving Published Web Pages?

Level 3

Hello,

Our legal department has asked that we archive all of the intranet web pages about our company news and history as PDFs. They want a viewable version of what was published and, believe it or not, each week a person opens each page and saves it as a PDF!

Is there a way to automate crawling the libraries and saving each page as a PDF file within the CQ environment (author or publish)?

At this point, I still think PDF is the best format: we could load these artifacts into AEM Assets and use OCR/AI-type features to make any part of an asset findable. We could then publish them within our Asset Share libraries and maintain the records along with other Corporate Archives. But I am open to other ideas/solutions.

FYI, we discussed the following options but determined that none of them is a good solution for a multi-billion dollar company, at least until we run out of alternatives:

  1. Set up a separate instance of AEM, copy the content and use it as an archive - but this would require additional hardware & maintenance (too scrappy)
  2. Use an open source tool like Heritrix to crawl the published pages (too risky for a large enterprise; Infosec would have to approve)
  3. Create a new PRINT template that includes all of the images, text and comments on the page to facilitate the PDF creation (but this doesn't solve for the volume of pages)

4 Replies

Level 10

In AEM 5.6, there is no out-of-the-box (OOTB) feature that covers this use case, so it will likely require a custom solution. If you need help building one, you can also reach out to the AEM consulting team.

Level 3

Thanks! Do you know of any other customers that have addressed the need to 'save off' web pages?

Level 10

Not that I know of. I will reach out to AEM people internally.

Level 10

One way to proceed here is to look at this blog:

Get the rendered HTML for an AEM resource, component or page - Adobe Experience Manager | AEM/CQ | A...

So you can read the rendered HTML (as shown in that post) and then use a library such as PDFBox to generate the PDF.
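
To make the first half of that idea concrete, here is a minimal sketch of the "read the HTML and store it" step. It is not the approach from the linked blog (which renders inside AEM); it simply does an HTTP GET against a publish instance from a standalone Java 11+ program, so treat it as one possible starting point. The class name `PageSnapshotSketch`, the base URL, the page paths and the dated file naming scheme are all assumptions made up for illustration. The PDF conversion is deliberately left out, since it depends on which library you pick and how faithfully it has to reproduce the page layout.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: fetches rendered pages from a publish instance and
// stores one dated HTML snapshot per page. Not part of any AEM API.
final class PageSnapshotSketch {

    // Turn a page path such as /content/intranet/news/history.html
    // into a flat, dated file name (naming scheme is an assumption).
    static String snapshotName(String pagePath, LocalDate date) {
        String trimmed = pagePath.replaceAll("^/+", "").replaceAll("\\.html$", "");
        String safe = trimmed.replaceAll("[^A-Za-z0-9._-]+", "_");
        return date + "_" + safe + ".html";
    }

    // Fetch the rendered HTML for one page; fail loudly on anything but 200
    // so that error pages are never archived as if they were content.
    static String fetchHtml(HttpClient client, String baseUrl, String pagePath)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + pagePath)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode() + " for " + pagePath);
        }
        return response.body();
    }

    // Write one snapshot into the archive directory and return its path.
    static Path saveSnapshot(Path archiveDir, String pagePath, LocalDate date, String html)
            throws IOException {
        Files.createDirectories(archiveDir);
        Path target = archiveDir.resolve(snapshotName(pagePath, date));
        return Files.writeString(target, html, StandardCharsets.UTF_8);
    }

    // Fetch and store every page in the list; returns the files written.
    static List<Path> archiveAll(String baseUrl, List<String> pagePaths, Path archiveDir)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        List<Path> written = new ArrayList<>();
        for (String pagePath : pagePaths) {
            String html = fetchHtml(client, baseUrl, pagePath);
            written.add(saveSnapshot(archiveDir, pagePath, LocalDate.now(), html));
        }
        return written;
    }
}
```

A weekly scheduled job could call something like `PageSnapshotSketch.archiveAll("http://publish.example.internal:4503", paths, Path.of("archive"))`, where the host and the `paths` list are placeholders for your own publish instance and page inventory. Each stored HTML file would then be handed to whatever PDF step you settle on, and the resulting PDFs loaded into AEM Assets as described in the question.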