jerryle · 4/6/21

Hi,

I'm hoping there is a way to identify PDFs (we have almost 20,000 in the DAM, spread across a number of folders) that are image-based (scanned). These were usually created when the owner could not locate the original digital copy and ended up scanning a paper copy to create a PDF. Over the years, we have accumulated hundreds of these.

I thought a very simple solution might work: in Site Admin > AEM Assets, use the search with "-the" in fulltext and file type = pdf (the idea being that since the PDFs are image-based, there would be no text to find, and "the" is a very common word). This does find some, but it also returns lots of "false positives".

Would there be a way of doing this via coding/programming (not something I can do, but I can pass it on to our programmer) or even an "advanced search query"?

Jerry

Vijayalakshmi_S · 4/7/21

Hi @jerryle,

One way is to search on the PDF filenames by framing a regex, if the filenames follow some pattern. For example, a query can bring in JPG images whose names start with "DSC" from two different locations within the DAM.
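A minimal sketch of what such a query could look like, written as a plain predicate map so it runs without AEM on the classpath. The two folder paths and the class name are hypothetical, and the `nodename` predicate takes `*`/`?` wildcards rather than a full regex, so treat this as an approximation of the idea and not the exact sample being described:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DscImageQuery {

    // Query Builder predicates: JPG assets named DSC* under either of two folders.
    static Map<String, String> predicates() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("type", "dam:Asset");
        p.put("nodename", "DSC*.jpg");                   // wildcard match on the node name
        p.put("group.p.or", "true");                     // OR the path predicates in this group
        p.put("group.1_path", "/content/dam/folder-a");  // hypothetical folder
        p.put("group.2_path", "/content/dam/folder-b");  // hypothetical folder
        p.put("p.limit", "-1");                          // return all hits
        return p;
    }

    public static void main(String[] args) {
        // Prints the predicates in the key=value form accepted by the Query Builder debugger.
        predicates().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a servlet, the same map would be passed to `PredicateGroup.create(...)` and then to `queryBuilder.createQuery(...)`.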

Could you please elaborate on your use case for identifying these PDFs? Specifically, is this needed at the author level in the existing Omnisearch, or is it to be displayed as part of your web application?

The query or approach might differ depending on the need.

SNBpatrickv · 4/7/21

Hello @Vijayalakshmi_S,

I work with Jerry. Allow me to elaborate. We have PDF files in our DAM that are purely image-based: they are non-OCR scanned copies of documents, which means they are not searchable. As a result, they do not meet accessibility standards and potentially hurt our SEO.

We are trying to identify which of the 20,000+ PDFs in the DAM are as described above, so we can either re-scan them if we can find the originals or attempt to OCR the existing files.

Is there a way that we can do this without manually opening every PDF in the DAM?

Thanks!

Vijayalakshmi_S · 4/7/21

Hi @SNBpatrickv,

We can write a simple one-time utility (to be executed only on author) that lists which PDFs are searchable and which are not. The idea is that a scanned PDF will not have any text content in it.

Please find a sample implementation below for reference.

List all PDFs in the DAM using the Query Builder API, and
Extract the text from each PDF using Apache PDFBox and log its type based on a length check.

You can refine the query predicates based on PDF asset metadata (the more specific they are, the fewer nodes the query engine needs to traverse).

If you would like to run it in batches instead of from one root path, fetch the path dynamically from the request and execute in batches, or create multiple path predicates with a group (as mentioned in my previous comment).
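As a rough sketch of the batching idea, the predicates can be built from a list of root paths instead of a single hard-coded one. The class name, method name, and the comma-separated input format are assumptions for illustration; in a servlet the string might come from something like `req.getParameter("paths")`:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BatchPathPredicates {

    // Builds PDF-search predicates with one grouped path predicate per root path.
    static Map<String, String> forPaths(String commaSeparatedPaths) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("type", "dam:Asset");
        p.put("nodename", "*.pdf");
        p.put("p.limit", "-1");
        p.put("group.p.or", "true"); // match assets under ANY of the listed paths

        int index = 1;
        for (String path : commaSeparatedPaths.split(",")) {
            String trimmed = path.trim();
            if (!trimmed.isEmpty()) {
                p.put("group." + index + "_path", trimmed);
                index++;
            }
        }
        if (index == 1) {
            throw new IllegalArgumentException("At least one root path is required");
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical folders; each call could cover one batch of the DAM.
        forPaths("/content/dam/batch-1, /content/dam/batch-2")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Running the utility once per batch keeps each query and each response smaller than a single pass over all 20,000+ assets.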

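// Imports for this method (assumes AEM 6.x APIs and Apache PDFBox 2.x).
// queryBuilder (an injected QueryBuilder) and LOG (an SLF4J logger) are fields of the servlet.
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.ResourceResolver;
import com.day.cq.dam.api.Asset;
import com.day.cq.dam.api.Rendition;
import com.day.cq.search.PredicateGroup;
import com.day.cq.search.Query;
import com.day.cq.search.result.Hit;
import com.day.cq.search.result.SearchResult;

	@Override
	protected void doGet(final SlingHttpServletRequest req, final SlingHttpServletResponse resp) {

		resp.setContentType("text/html");
		try {
			PDFTextStripper pdfStripper = new PDFTextStripper();
			ResourceResolver rescResolver = req.getResourceResolver();
			Session session = rescResolver.adaptTo(Session.class);

			// Find every PDF asset under the root path; p.limit=-1 returns all hits.
			Map<String, String> predicatesMap = new HashMap<>();
			predicatesMap.put("path", "/content/dam/demo");
			predicatesMap.put("nodename", "*.pdf");
			predicatesMap.put("type", "dam:Asset");
			predicatesMap.put("p.limit", "-1");

			Query query = queryBuilder.createQuery(PredicateGroup.create(predicatesMap), session);
			SearchResult queryResults = query.getResult();
			LOG.debug("Total number of results={}", queryResults.getTotalMatches());

			for (Hit hit : queryResults.getHits()) {
				try {
					String assetPath = hit.getPath();
					Asset assetObj = rescResolver.resolve(assetPath).adaptTo(Asset.class);
					Rendition original = (assetObj != null) ? assetObj.getOriginal() : null;
					if (original == null) {
						LOG.warn("No original rendition found at path = {}", assetPath);
						continue;
					}
					// try-with-resources closes the stream and the document even if parsing fails.
					try (InputStream is = original.getStream(); PDDocument pdfDoc = PDDocument.load(is)) {
						String pdfContent = pdfStripper.getText(pdfDoc);
						// Guard against null before measuring the trimmed text.
						int length = (pdfContent == null) ? 0 : pdfContent.trim().length();
						LOG.debug("PDF Content length={}", length);
						if (length > 0) {
							LOG.info("PDF at the path = {} is Searchable", assetPath);
							resp.getWriter().write("PDF at the path " + assetPath + " is Searchable<br>");
						} else {
							LOG.info("PDF at the path = {} is Non-Searchable", assetPath);
							resp.getWriter().write("PDF at the path " + assetPath + " is <b>Non-Searchable</b><br>");
						}
					}
				} catch (RepositoryException e) {
					LOG.error("RepositoryException while reading a search hit", e);
				} catch (IOException e) {
					// One unreadable PDF should not stop the rest of the scan.
					LOG.error("IOException while processing a PDF", e);
				}
			}
		} catch (IOException e1) {
			LOG.error("IOException while creating the PDFTextStripper", e1);
		}

	}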
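
The length check above treats any non-empty text as searchable. If some scans turn out to carry a few stray characters, a slightly stricter check may help. This is only a sketch: the class name, method name, and the threshold of 20 characters are assumptions to tune against your own PDFs, not values from PDFBox or AEM.

```java
class PdfTextCheck {

    // Hypothetical threshold: fewer visible characters than this counts as "no real text".
    static final int MIN_TEXT_CHARS = 20;

    // Returns true when the extracted text has enough non-whitespace characters.
    static boolean looksSearchable(String extractedText) {
        if (extractedText == null) {
            return false;
        }
        long visible = extractedText.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        return visible >= MIN_TEXT_CHARS;
    }

    public static void main(String[] args) {
        System.out.println(looksSearchable("  \n \f "));                                // false
        System.out.println(looksSearchable("Annual report 2019: summary of results")); // true
    }
}
```

In the servlet, the result of `pdfStripper.getText(pdfDoc)` would be passed to `looksSearchable(...)` in place of the `length > 0` test.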