I need to create a report on broken links within the site. Please suggest how to extract the links from the page content and how to check programmatically whether each link is valid or invalid.
Thanks in advance.
Using Jsoup, you can parse the HTML and extract the links. Once the links are retrieved, you can check whether each one is valid.
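For the validity check, here is a minimal sketch using only the JDK's HttpURLConnection. The class name LinkChecker, the 5-second timeouts, and the rule "2xx/3xx is valid, anything else is broken" are my assumptions, not something AEM or Jsoup prescribes, so adjust them to your reporting needs.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: decides whether an absolute http(s) link is reachable.
public class LinkChecker {

    /** Assumption: 2xx and 3xx responses count as valid, everything else as broken. */
    public static boolean isValidStatus(int status) {
        return status >= 200 && status < 400;
    }

    /** Sends a HEAD request and reports whether the link answered with a valid status. */
    public static boolean isLinkValid(String link) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(link).toURL().openConnection();
            conn.setRequestMethod("HEAD");   // no response body is needed for a check
            conn.setConnectTimeout(5000);    // assumed timeouts, in milliseconds
            conn.setReadTimeout(5000);
            try {
                return isValidStatus(conn.getResponseCode());
            } finally {
                conn.disconnect();
            }
        } catch (IOException | ClassCastException | IllegalArgumentException e) {
            // Malformed, relative, non-HTTP, or unreachable links are reported as broken.
            return false;
        }
    }
}
```

Relative links such as /content/... need to be made absolute (host and port prepended) before calling isLinkValid. Some servers reject HEAD requests, so falling back to GET for those may be worth considering.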
Hope this helps!
You can write a Groovy script that crawls /content/<site> looking for strings that start with /content. Then use ResourceResolver to verify whether those paths exist.
The following links may be helpful:
a) Sample Groovy Script => https://gist.github.com/trekawek/72b3515a6641ca5f4b29
b) ResourceResolver API => https://helpx.adobe.com/experience-manager/6-4/sites/developing/using/reference-materials/javadoc/or...
c) Community Article => https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/broken-link-scan/qaq-p/220...
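The same idea (find strings starting with /content, then check whether each path exists) can also be sketched in plain Java. This is only an illustration: ContentPathScanner, the regular expression, and the stripping of a trailing .html are my assumptions. The existence check is passed in as a Predicate so the sketch runs outside AEM; inside AEM it could be something like path -> resolver.getResource(path) != null, assuming resolver is a ResourceResolver you already hold.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds /content/... paths in text and reports the missing ones.
public class ContentPathScanner {

    // Assumed pattern for repository-style paths such as /content/site/en/page.html
    private static final Pattern CONTENT_PATH =
            Pattern.compile("/content/[A-Za-z0-9_./-]+");

    /** Returns the distinct /content paths found in the text, without a trailing .html. */
    public static Set<String> findContentPaths(String text) {
        Set<String> paths = new LinkedHashSet<>();
        Matcher m = CONTENT_PATH.matcher(text);
        while (m.find()) {
            String path = m.group();
            if (path.endsWith(".html")) {
                path = path.substring(0, path.length() - ".html".length());
            }
            paths.add(path);
        }
        return paths;
    }

    /** Returns the paths for which the supplied existence check fails. */
    public static List<String> findBrokenPaths(String text, Predicate<String> exists) {
        List<String> broken = new ArrayList<>();
        for (String path : findContentPaths(text)) {
            if (!exists.test(path)) {
                broken.add(path);
            }
        }
        return broken;
    }
}
```

Keeping the existence check as a parameter also makes the scanner easy to unit test with a fixed list of known paths.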
I hope it helps. 🙂
Thanks for your suggestion. I actually need to get the HTML content of an internal page in my servlet/service so that I can read the href values present in it. Can you help me read the content of a page in AEM?
I have no experience with Groovy; my requirement has to be implemented in Java using a servlet/service.
Please suggest a way to get the content of an internal page, read the hrefs present in it, and check whether those links are valid, using Java.
Thank you for clarifying it. 🙂
You can use any HTML parser library (e.g. the Jsoup HTML parser) to do that. Add the dependency to your pom.xml file and then use it to read the HTML content, or just the links, of any internal page.
Sample Reference code can be found here:
You can include similar code in your servlet to achieve your use case.
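To show the shape of the link-extraction step without any extra dependency, here is a rough sketch that pulls href values out of an HTML string with a regular expression. Please treat it as a stand-in, not as Jsoup's API: HrefExtractor and its pattern are hypothetical, and a regex is much less robust than a real parser. With Jsoup on the classpath, the equivalent would be along the lines of Jsoup.parse(html).select("a[href]"). The html string is assumed to have been obtained already (for example by requesting the page's .html URL).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts href values from anchor tags in an HTML string.
public class HrefExtractor {

    // Assumed pattern: an <a ...> tag with a quoted href attribute.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    /** Returns href values in document order, skipping empty and in-page (#...) links. */
    public static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            if (!href.isEmpty() && !href.startsWith("#")) {
                links.add(href);
            }
        }
        return links;
    }
}
```

Each extracted link can then be passed to whatever validity check you choose, and the failures collected into the report.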
I hope it helps !! 🙂