How to scan assets while uploading to find if an asset contains personal data | Community
This post is no longer active and is closed to new replies.
4 replies

Himanshu_Jain
Community Advisor
September 16, 2024

Hi @vpasam ,

There is currently no OOTB utility in AEM as a Cloud Service to scan assets for PII.

There are several ways to reduce this risk, such as using third-party tools to validate assets before uploading them to AEM, or setting up a workflow with a manual approval step.

 

Thanks

Himanshu Jain
Saravanan_Dharmaraj
Community Advisor
Accepted solution
September 16, 2024

@vpasam You can write a simple workflow step that strips PII from the asset metadata during the publishing process, so that the delivered or consumed asset won't carry that data.

Below is sample code; you can add other properties too if you know which ones hold PII.

 

/**
 * The method called by the AEM Workflow Engine to perform Workflow work.
 *
 * @param workItem        the work item representing the resource moving through the Workflow
 * @param workflowSession the workflow session
 * @param args            arguments for this Workflow Process defined on the Workflow Model (PROCESS_ARGS, argSingle, argMulti)
 * @throws WorkflowException when the Workflow Process step cannot complete. This will cause the WF to retry.
 */
@Override
public void execute(WorkItem workItem, WorkflowSession workflowSession, MetaDataMap args) throws WorkflowException {
    // Note: log, TYPE_JCR_PATH, getAssetPathFromPayload(...) and commit(...) belong to the
    // surrounding class and are not shown in this post.

    // Get the Workflow data (the data that is being passed through for this work item)
    final WorkflowData workflowData = workItem.getWorkflowData();
    final String type = workflowData.getPayloadType();
    final ResourceResolver resourceResolver = workflowSession.adaptTo(ResourceResolver.class);

    // Check if the payload is a path in the JCR; the other (less common) type is JCR_UUID
    if (resourceResolver == null || !StringUtils.equals(type, TYPE_JCR_PATH)) {
        return;
    }

    // Get the asset path from the payload, then resolve the asset's metadata node
    final String path = getAssetPathFromPayload(workflowData);
    log.debug("MetadataCleanup payload path: {}", path);

    final Resource assetResource = resourceResolver.getResource(path);
    final Resource assetMetadataRes =
            assetResource == null ? null : assetResource.getChild("jcr:content/metadata");
    final ModifiableValueMap modifiableValueMap =
            assetMetadataRes == null ? null : assetMetadataRes.adaptTo(ModifiableValueMap.class);
    if (modifiableValueMap == null) {
        // Nothing to clean: the asset or its metadata node is missing or not writable
        log.debug("No modifiable metadata found for payload path '{}'.", path);
        return;
    }

    // Properties that may carry PII, mapped to the blank value that replaces them
    final Map<String, Object> properties = new HashMap<>();
    properties.put("dc:creator", new String[] { "" });
    properties.put("xmp:CreatorTool", "");
    properties.put("dam:Author", "");
    properties.put("dam:Producer", "");
    properties.put("pdf:Producer", "");
    properties.put("dc:rights", "");
    properties.put("dc:Rights", "");
    properties.put("photoshop:Credit", "");

    for (final Entry<String, Object> propertyEntry : properties.entrySet()) {
        // Remove any existing value first, then write the blank replacement
        if (modifiableValueMap.containsKey(propertyEntry.getKey())) {
            modifiableValueMap.remove(propertyEntry.getKey());
        }
        modifiableValueMap.put(propertyEntry.getKey(), propertyEntry.getValue());
        log.debug("Updating property '{}' with value '{}' for resource at path '{}'.",
                propertyEntry.getKey(), propertyEntry.getValue(), assetMetadataRes.getPath());
    }

    commit(resourceResolver);
}
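As a variation on the sample above, here is a minimal sketch that removes the listed properties outright instead of overwriting them with blank values. It works on a plain java.util.Map so it can be tried outside AEM; the class name MetadataScrubSketch and the scrub method are hypothetical, and the property list is simply copied from the sample. Since ModifiableValueMap is itself a Map<String, Object>, the same method should in principle accept it, but that has not been verified against a running instance.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of AEM: drops PII-bearing metadata keys from a map. */
final class MetadataScrubSketch {

    // Property names copied from the sample workflow step; extend as needed.
    static final List<String> PII_KEYS = Arrays.asList(
            "dc:creator", "xmp:CreatorTool", "dam:Author", "dam:Producer",
            "pdf:Producer", "dc:rights", "dc:Rights", "photoshop:Credit");

    /** Removes the listed keys in place and returns how many were actually present. */
    static int scrub(Map<String, Object> metadata) {
        int removed = 0;
        for (String key : PII_KEYS) {
            if (metadata.containsKey(key)) {
                metadata.remove(key);
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        // Made-up metadata, for illustration only
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("dc:creator", new String[] { "Jane Doe" });
        metadata.put("dc:title", "Brochure");
        int removed = scrub(metadata);
        System.out.println("Removed " + removed + " properties, kept " + metadata.keySet());
    }
}
```

Removing the keys avoids leaving empty properties on assets that never had them, whereas blanking (as in the sample) keeps the property present with an empty value. Which one fits depends on how downstream consumers treat missing metadata.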

 

h_kataria
Community Advisor
September 17, 2024

You could also implement a Sling filter that intercepts your upload requests and validates them according to your requirements.

vpasam (Author)
September 17, 2024

Hi,

How can I read the asset content in the filter class?

Rohan_Garg
Community Advisor
September 17, 2024

You can use a custom workflow to trim or mask the metadata that might expose PII.
For the actual asset binary, the trick is to identify the affected assets. Once they are identified, you can either use AEM's built-in image editor or a custom image processing library such as OpenCV to blur the PII areas, and then re-upload the sanitised version to AEM.
For PDFs, libraries like Apache PDFBox or iText can be used to programmatically redact text.


For identification, you can explore an OCR library such as Tesseract OCR to process images and extract text for PII scanning. Once the text is extracted, you could use regular expressions to identify the PII (emails, phone numbers, names, etc.).
Similarly, use libraries like Apache PDFBox (for PDFs) or Apache POI (for Word documents) to extract text from documents.
These services would have to be integrated into AEM as part of a post-processing workflow so that the asset is processed after it has been uploaded.
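To make the regex step concrete, here is a minimal sketch of scanning already-extracted text for PII-like strings. It assumes the text has already been pulled out of the asset by OCR or a document library; the class name PiiTextScanner and both patterns are illustrative assumptions, not a vetted PII rule set. The phone pattern in particular is loose and will also match other digit runs such as dates, and names cannot realistically be caught by a regex at all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds PII-like strings in text extracted from an asset. */
final class PiiTextScanner {

    // Deliberately simple, illustrative patterns; real detection needs tuning and review.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("EMAIL", Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"));
        PATTERNS.put("PHONE", Pattern.compile("\\+?\\d[\\d\\s().-]{7,}\\d"));
    }

    /** Returns the matches per PII type; types with no match are omitted. */
    static Map<String, List<String>> scan(String text) {
        Map<String, List<String>> findings = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            List<String> hits = new ArrayList<>();
            while (matcher.find()) {
                hits.add(matcher.group());
            }
            if (!hits.isEmpty()) {
                findings.put(entry.getKey(), hits);
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for OCR or PDF extraction output
        String extracted = "Contact jane.doe@example.com or call +1 (555) 010-9999 today.";
        System.out.println(scan(extracted));
    }
}
```

In a post-processing workflow, a non-empty result could be used to flag the asset for manual review rather than to block or alter it automatically, since false positives are likely with patterns this simple.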

Hope this can help!