I'm hoping there is a way to identify PDFs that are image-based (scanned). We have almost 20,000 PDFs in the DAM, spread across a number of folders. The scanned ones were usually created when the owner could not locate the original digital copy and ended up scanning a paper copy to create the PDF. Over the years, we have ended up with hundreds of these.
I thought a very simple solution might work: in Site Admin > AEM Assets, run a search with "-the" in fulltext and file type = pdf. The idea was that image-based PDFs have no text to find, and "the" is a very common word, so excluding it should leave only the scanned files. This does find some of them, but it also returns lots of false positives.
Would there be a way of doing this via coding/programming (not something I can do, but I can pass it on to our programmer), or even with an "advanced search query"?
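For a programmer picking this up, one possible direction is to measure how much text each PDF actually yields instead of searching for a single word. The sketch below is only an illustration under assumptions: it presumes some PDF library has already extracted the text of each page into a list of strings, and the function name `looks_image_based` and the threshold of 20 characters per page are hypothetical values chosen for the example, not anything provided by AEM.

```python
def looks_image_based(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned, based on how little text it yields.

    page_texts: one string of extracted text per page (assumed to come
    from whatever PDF text-extraction library is available).
    min_chars_per_page: hypothetical cut-off; tune it on known samples.
    """
    # A document with no pages yields no text, so treat it as image-based.
    if not page_texts:
        return True
    # Count only non-whitespace characters so blank lines do not inflate the score.
    total_chars = sum(len("".join(text.split())) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


# Usage with made-up extraction results:
scanned = ["", "  \n"]
digital = ["This is an ordinary page with a sentence or two of real text."]
print(looks_image_based(scanned))
print(looks_image_based(digital))
```

A scanned PDF that was later run through OCR carries a text layer, so a check like this would classify it as text-based.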
One way is to search by PDF filename using a regex, if the filenames follow some pattern. Consider the sample below, which brings in JPG images whose names start with "DSC" from two different locations within the DAM.
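As a rough illustration of the filename idea (not the original sample from this reply, and not an AEM query), the following sketch filters a list of asset paths by folder and by a filename regex. The folder names, the paths, and the helper `match_assets` are all hypothetical; in practice the path list would come from whatever query or export is available.

```python
import re

# Filenames that start with "DSC" and end in .jpg or .jpeg (case-insensitive).
DSC_JPG = re.compile(r"^DSC.*\.jpe?g$", re.IGNORECASE)


def match_assets(paths, folders, pattern=DSC_JPG):
    """Return the paths that sit under one of `folders` and whose
    filename matches `pattern`."""
    hits = []
    for path in paths:
        # Split "/a/b/name.jpg" into the folder "/a/b" and the name "name.jpg".
        folder, _, name = path.rpartition("/")
        in_scope = any(folder == f or folder.startswith(f + "/") for f in folders)
        if in_scope and pattern.match(name):
            hits.append(path)
    return hits


# Hypothetical DAM paths and the two locations to search.
paths = [
    "/content/dam/site-a/photos/DSC0001.jpg",
    "/content/dam/site-a/photos/logo.jpg",
    "/content/dam/site-b/DSC_0420.JPG",
    "/content/dam/site-c/DSC0002.jpg",
    "/content/dam/site-a/photos/DSC0003.pdf",
]
folders = ["/content/dam/site-a", "/content/dam/site-b"]
print(match_assets(paths, folders))
```

The same approach would apply to PDFs by swapping in a pattern that matches the scanner's naming scheme, provided the scanned files were named consistently.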
Could you please elaborate on your use case for identifying these PDFs? What exactly is the need: is this required at the author level in the existing Omnisearch, or is it to be displayed as part of your web application?
The query or approach may differ depending on the need.