Identify image-based (scanned) PDFs in the DAM?

jerryle
Level 1

06-04-2021

Hi,

Hoping there might be a way to identify PDFs (we have almost 20,000 in the DAM across a number of folders) that are image-based (scanned). These were usually created when the owner could not locate the original digital copy and ended up scanning a paper copy to create a PDF. Over the years, we have ended up with hundreds of these.

 

I thought a very simple solution might work: in Site Admin > AEM Assets, use the search with "-the" in fulltext and file type = pdf (the idea being that since the PDFs are image-based, there would not be any text to find, and "the" is a pretty common word). This does find some, but it also returns lots of "false positives".

 

Would there be a way of doing this via coding/programming (not something I can do, but I can pass it on to our programmer) or even an "advanced search query"?

 

Jerry

 

Accepted Solutions (1)

Vijayalakshmi_S
MVP

07-04-2021

Hi @jerryle,

One way is to search on the PDF filenames, if they follow some pattern, by framing a pattern (regex) for them. Consider the sample below, which returns JPG images whose names start with "DSC" from two different locations within the DAM.

Vijayalakshmi_S_0-1617803299896.png
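
As a rough sketch of the kind of query described above (not necessarily the exact predicates from the screenshot), the Query Builder predicates can be assembled as a plain map. The class name `DamPredicates`, the method name and the two folder paths are hypothetical and only for illustration. In AEM the resulting map would be passed to `PredicateGroup.create(...)`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds Query Builder predicates as plain strings; no AEM classes are needed here. */
final class DamPredicates {

    private DamPredicates() {}

    /**
     * Matches dam:Asset nodes whose node name fits the wildcard pattern,
     * under any of the given root paths (the paths are OR-ed in one group).
     */
    static Map<String, String> byNameUnderPaths(String nodeNamePattern, List<String> roots) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("type", "dam:Asset");
        p.put("nodename", nodeNamePattern);
        p.put("group.p.or", "true");
        for (int i = 0; i < roots.size(); i++) {
            // Produces group.1_path, group.2_path, ...
            p.put("group." + (i + 1) + "_path", roots.get(i));
        }
        p.put("p.limit", "-1");
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical folders, for illustration only.
        Map<String, String> p = byNameUnderPaths("DSC*.jpg",
                List.of("/content/dam/folder-a", "/content/dam/folder-b"));
        p.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The same helper could be called with `*.pdf` and a list of DAM folders to cover several locations in one query.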

 

Could you please elaborate on your use case for identifying these PDFs? What exactly is the need: is it at author level in the existing Omnisearch, or should the results be displayed as part of your web application?

The query/approach may differ depending on the need.

SNBpatrickv

Hello @Vijayalakshmi_S,

 

I work with Jerry. Allow me to elaborate. We have .PDF files in our DAM that are purely image-based: they are non-OCR scanned copies of documents, which means they are not searchable. As a result, they do not meet accessibility standards and potentially impact our SEO.

 

We are trying to identify which of the 20000+ PDFs in the DAM are as described above so we can try to either re-scan them if we can find the originals, or attempt to OCR the existing file.

 

Is there a way that we can do this without manually opening every PDF in the DAM?

 

Thanks!

Vijayalakshmi_S

Hi @SNBpatrickv,

We can write a simple one-time utility (to be executed only on author) that lists which PDFs are searchable and which are non-searchable. The idea is that a scanned PDF will not have any text content within it.

Please find below a sample implementation for reference.

  • List all PDFs in the DAM using the Query Builder API, and
  • Extract the text from each PDF using Apache PDFBox and log its type based on a length check
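
The length check in the second step can be sketched on its own, without AEM or PDFBox. The class name `PdfTextCheck` and the `minChars` threshold are assumptions added for illustration: a value of 1 corresponds to a plain "is there any text at all" check, while a higher value might help if scanned files carry a few stray characters such as a stamp or page number.

```java
/** Classifies extracted PDF text as searchable or not, based on its trimmed length. */
final class PdfTextCheck {

    private PdfTextCheck() {}

    /**
     * @param extractedText text returned by the extractor; may be null
     * @param minChars      hypothetical threshold; 1 means "any non-whitespace text"
     * @return true if the trimmed text has at least minChars characters
     */
    static boolean isSearchable(String extractedText, int minChars) {
        if (extractedText == null) {
            return false;
        }
        return extractedText.trim().length() >= minChars;
    }

    public static void main(String[] args) {
        System.out.println(isSearchable("  \n\r\n ", 1));           // whitespace only
        System.out.println(isSearchable("Annual report 2020", 1));  // real text
        System.out.println(isSearchable("p. 3", 20));               // below the threshold
    }
}
```

Any threshold above 1 would need to be tuned against a sample of known scanned and known digital PDFs.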

The query predicates can be refined based on PDF asset metadata (to be more specific, so that the query engine does not need to traverse as many nodes).

If you would like to run it in batches instead of from one root path, either fetch the path dynamically from the request and run one batch per request, or create multiple path predicates in a group (as mentioned in my previous comment).

 

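// Imports needed by this method (PDFBox 2.x package names).
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import com.day.cq.dam.api.Asset;
import com.day.cq.dam.api.Rendition;
import com.day.cq.search.PredicateGroup;
import com.day.cq.search.Query;
import com.day.cq.search.result.SearchResult;

// LOG (logger) and queryBuilder (an injected QueryBuilder) are fields of the servlet.
@Override
	protected void doGet(final SlingHttpServletRequest req, final SlingHttpServletResponse resp) {

		resp.setContentType("text/html");
		try {
			final PDFTextStripper pdfStripper = new PDFTextStripper();
			final PrintWriter out = resp.getWriter();

			Map<String, String> predicatesMap = new HashMap<>();
			ResourceResolver rescResolver = req.getResourceResolver();
			Session session = rescResolver.adaptTo(Session.class);

			// Every dam:Asset named *.pdf under the root path; p.limit=-1 returns all hits
			predicatesMap.put("path", "/content/dam/demo");
			predicatesMap.put("nodename", "*.pdf");
			predicatesMap.put("type", "dam:Asset");
			predicatesMap.put("p.limit", "-1");

			Query query = queryBuilder.createQuery(PredicateGroup.create(predicatesMap), session);
			SearchResult queryResults = query.getResult();
			LOG.debug("Total number of results={}", queryResults.getTotalMatches());

			queryResults.getHits().forEach(hit -> {
				try {
					String assetPath = hit.getPath();
					Resource assetResc = rescResolver.getResource(assetPath);
					Asset assetObj = assetResc != null ? assetResc.adaptTo(Asset.class) : null;
					Rendition original = assetObj != null ? assetObj.getOriginal() : null;
					if (original == null) {
						LOG.warn("No original rendition found at {}", assetPath);
						return;
					}
					String pdfContent;
					// try-with-resources closes the stream and the document even when extraction fails
					try (InputStream is = original.getStream(); PDDocument pdfDoc = PDDocument.load(is)) {
						pdfContent = pdfStripper.getText(pdfDoc);
					}
					// Treat null as empty so the length check cannot throw a NullPointerException
					pdfContent = pdfContent == null ? "" : pdfContent.trim();
					LOG.debug("PDF Content length={}", pdfContent.length());
					if (!pdfContent.isEmpty()) {
						LOG.info("PDF at the path = {} is Searchable", assetPath);
						out.write("PDF at the path " + assetPath + " is Searchable<br>");
					} else {
						LOG.info("PDF at the path = {} is Non-Searchable", assetPath);
						out.write("PDF at the path " + assetPath + " is <b>Non-Searchable</b><br>");
					}
				} catch (RepositoryException e) {
					LOG.error("RepositoryException={}", e.getMessage());
				} catch (IOException e) {
					LOG.error("IOException={}", e.getMessage());
				}
			});
		} catch (IOException e1) {
			LOG.error("IOException={}", e1.getMessage());
		}

	}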

 

Vijayalakshmi_S_0-1617822415109.png

jerryle
Thank you very much. We have forwarded your response to our AEM programmer.

Answers (0)