Need recommendations to delete duplicate data in publishers

Question

Hi Team,

Issue: our AEM author instance has around 35k documents, and the publishers have at least 2k more documents than author. Those extra documents are duplicates and need to be deleted. Below are the options we are considering. Please verify and comment on them.

1) Remove all documents from the publishers and copy the whole content from author to the publishers (not using the replication engine, just copying the content). If we copy this way through Package Manager, WebDAV, or any other tool, will the author and publisher pages stay in sync or not?
2) Remove all documents from the publishers and activate them again from author through the replication engine.
3) Identify the extra documents and delete them on the publishers directly. This requires a lot of manual work.

Please suggest the recommended approach for this issue. Thanks a lot in advance.

Regards,
Chandra

Sanket_Kumbharkhane · Accepted Answer

Hi @chandramohanred ,

I hope the code snippet below helps.

Upload the sheet with the paths to the DAM (saved as CSV, since the code checks for the text/csv MIME type) and enter its DAM path in the input field of the servlet.

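// Needs java.nio.charset.StandardCharsets in addition to the servlet's existing imports.

// Renders a minimal form that asks for the DAM path of the CSV file.
@Override
protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) throws ServletException, IOException {
    response.setContentType("text/html");
    response.getWriter().write("<form action=\"/bin/sync/author-publish/content\" method=\"POST\"><input type=\"text\" name=\"path\"><input type=\"submit\" value=\"submit\"></form>");
}

// Reads the CSV asset at the submitted path and calls deletePage once per line.
@Override
protected void doPost(SlingHttpServletRequest request, SlingHttpServletResponse response) throws ServletException, IOException {
    response.setContentType("text/html");
    String path = request.getParameter("path");

    if (StringUtils.isEmpty(path)) {
        response.sendError(400, "No path found");
        return;
    }

    // Keep the resolver local: a servlet instance is shared across requests,
    // so request-scoped objects should not be stored in a field.
    ResourceResolver resourceResolver = request.getResourceResolver();
    Resource resource = resourceResolver.getResource(path);
    if (resource == null) {
        response.sendError(400, "No Resource found at the path: " + path);
        return;
    }

    Asset asset = resource.adaptTo(Asset.class);
    if (asset == null) {
        response.sendError(400, "Invalid Asset at path: " + path);
        return;
    }

    // Only accept assets whose original rendition is a CSV file.
    Rendition rendition = asset.getOriginal();
    if (rendition == null || !"text/csv".equalsIgnoreCase(rendition.getMimeType())) {
        response.sendError(400, "Asset is missing its original rendition or is not a CSV file: " + path);
        return;
    }

    // try-with-resources closes the reader and the underlying stream.
    try (BufferedReader br = new BufferedReader(new InputStreamReader(rendition.getStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = br.readLine()) != null) {
            if (!line.trim().isEmpty()) { // skip blank lines
                // deletePage is yours to implement; the resolver is passed in so it needs no shared field.
                deletePage(resourceResolver, line.trim(), response);
            }
        }
    }
}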

You can write the logic that deletes the pages/nodes in the deletePage method, using the Resource, Page, or Node API.
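Since each line of the uploaded file ends up as a delete target, it may be worth validating it before deletePage touches the repository. The following is a minimal sketch of such a guard in plain Java. The class name DeletePathGuard, the method sanitize, and the idea of restricting deletes to one allowed root are assumptions for illustration and are not part of the original snippet.

```java
/** Hypothetical guard that cleans one CSV line and decides whether it is safe to delete. */
final class DeletePathGuard {

    private DeletePathGuard() {
    }

    /**
     * Returns the cleaned repository path, or null if the line should be skipped.
     * A path is accepted only if it lies strictly below allowedRoot.
     */
    static String sanitize(String line, String allowedRoot) {
        if (line == null) {
            return null;
        }
        String path = line.trim();
        // Spreadsheet CSV exports often wrap values in double quotes; strip one pair.
        if (path.length() >= 2 && path.startsWith("\"") && path.endsWith("\"")) {
            path = path.substring(1, path.length() - 1).trim();
        }
        if (path.isEmpty()) {
            return null;
        }
        // Reject parent-directory segments so a line cannot escape the allowed root.
        if (path.contains("/../") || path.endsWith("/..")) {
            return null;
        }
        // The root itself is never a valid delete target, only its descendants.
        String prefix = allowedRoot.endsWith("/") ? allowedRoot : allowedRoot + "/";
        if (!path.startsWith(prefix) || path.length() == prefix.length()) {
            return null;
        }
        return path;
    }
}
```

Inside deletePage, a non-null result could then be resolved with resourceResolver.getResource(path), removed with resourceResolver.delete(resource), and persisted with resourceResolver.commit(). Committing in batches rather than once per line is probably kinder to the publish instance when the list has around 2k entries.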

Sanket_Kumbharkhane · Answer

Hi @chandramohanred ,

If your documents are all under one parent folder, the best way is to delete that folder from the publish instance, create and build a new package on author with the parent folder path, and replicate it from the More option. Keep in mind that the documents will be unavailable in the meantime if you run this directly on Production.

If the documents are scattered across multiple folders and hierarchies, you can use the Query Builder debugger to get the list of documents from both instances (Author and Publish):

/libs/cq/search/content/querydebug.html
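For a scripted export, the same kind of query can usually be sent to the Query Builder JSON servlet instead of the debugger page. The sketch below only builds the request URL. The class name PathListQuery, the host, the root path, and the node type are placeholders, and the predicates (path, type, p.limit, p.hits, p.properties) should be checked against the Query Builder documentation for your AEM version.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds a Query Builder URL listing all paths under a root. */
final class PathListQuery {

    private PathListQuery() {
    }

    /**
     * Builds a URL for /bin/querybuilder.json that returns only jcr:path
     * for every node of the given type below rootPath (p.limit=-1 means no paging).
     */
    static String buildUrl(String host, String rootPath, String nodeType) {
        return host + "/bin/querybuilder.json"
                + "?path=" + encode(rootPath)
                + "&type=" + encode(nodeType)
                + "&p.limit=-1"
                + "&p.hits=selective"
                + "&p.properties=" + encode("jcr:path");
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling buildUrl once with the author host and once with each publish host (for example with type dam:Asset or cq:Page, depending on what the documents are) gives two result sets whose jcr:path values can then be compared.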

Use any diff tool or editor to compare the two lists and extract the roughly 2k extra paths that exist only on Publish. Dump that list into an Excel sheet, one path per row.
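If you would rather not compare the lists by hand, the difference is a simple set operation. Here is a minimal sketch, assuming both exports have already been reduced to one path per entry. PathDiff and extraOnPublish are illustrative names, not an AEM API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper that finds paths present on publish but missing on author. */
final class PathDiff {

    private PathDiff() {
    }

    /**
     * Returns the publish paths that do not appear in the author list,
     * sorted and without duplicates. Entries are trimmed before comparing,
     * and null or blank entries are ignored.
     */
    static List<String> extraOnPublish(Collection<String> authorPaths, Collection<String> publishPaths) {
        Set<String> author = new HashSet<>();
        for (String p : authorPaths) {
            if (p != null && !p.trim().isEmpty()) {
                author.add(p.trim());
            }
        }
        // TreeSet keeps the result sorted and drops repeated publish entries.
        Set<String> extra = new TreeSet<>();
        for (String p : publishPaths) {
            if (p == null) {
                continue;
            }
            String trimmed = p.trim();
            if (!trimmed.isEmpty() && !author.contains(trimmed)) {
                extra.add(trimmed);
            }
        }
        return new ArrayList<>(extra);
    }
}
```

The two inputs could be loaded with Files.readAllLines and the result written back with Files.write, which gives a one-path-per-line file ready to upload.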

You can then develop a small utility, such as a mini servlet, that reads the sheet of paths and deletes the nodes programmatically.

This way you can reuse the script and restore sync between Author and Publish whenever needed, simply by uploading a new sheet.

Note: The servlet path should be accessible only to administrators.
