Our legal department has asked that we archive all of the intranet web pages covering our company news & history as PDFs. They want a viewable version of what was published and, believe it or not, each week a person opens every page and saves it as a PDF!
Is there a way to automate crawling the libraries and saving a PDF file within the CQ environment (author or publisher)?
At this point, I still think PDF is the best format. We could load these artifacts into AEM Assets and use OCR/AI-type features to make any part of an asset findable. We could then publish them within our Asset Share libraries and maintain the records along with other Corporate Archives. But I am open to other ideas/solutions.
FYI, we discussed the following options but determined that none of them is a good fit for a multi-billion-dollar company, at least until we run out of alternatives:
Set up a separate instance of AEM, copy the content, and use it as the archive - but this would require additional hardware & maintenance (too scrappy)
Use an open source tool like Heritrix to crawl the published pages (too risky for a large corporate enterprise; Infosec would have to approve)
Create a new PRINT template that includes all of the images, text and comments on the page to facilitate PDF creation (but this doesn't address the volume of pages)