SOLVED

How to scan assets while uploading to find if an asset contains personal data

My project has a requirement to restrict asset uploads if an asset contains personal data. How can this be achieved on AEM as a Cloud Service (AEMaaCS)?

 


1 Accepted Solution

Correct answer by
Community Advisor

@vpasam You can write a simple workflow process step that removes PII from the asset metadata during the publishing process, so that the asset delivered or consumed won't carry the PII data.

Below is some sample code; you can add other properties too if you know of any.

 

	/**
	 * The method called by the AEM Workflow Engine to perform Workflow work.
	 * Note: log, TYPE_JCR_PATH, getAssetPathFromPayload() and commit() are not shown here;
	 * they are assumed to be defined elsewhere in the enclosing WorkflowProcess class.
	 *
	 * @param workItem        the work item representing the resource moving through the Workflow
	 * @param workflowSession the workflow session
	 * @param args            arguments for this Workflow Process defined on the Workflow Model (PROCESS_ARGS, argSingle, argMulti)
	 * @throws WorkflowException when the Workflow Process step cannot complete. This will cause the WF to retry.
	 */
	@Override
	public void execute(WorkItem workItem, WorkflowSession workflowSession, MetaDataMap args) throws WorkflowException {
		// Get the Workflow data (the data that is being passed through for this work item)
		final WorkflowData workflowData = workItem.getWorkflowData();
		final String type = workflowData.getPayloadType();
		// Check if the payload is a path in the JCR; the other (less common) type is JCR_UUID
		if (!StringUtils.equals(type, TYPE_JCR_PATH)) {
			return;
		}
		final ResourceResolver resourceResolver = workflowSession.adaptTo(ResourceResolver.class);
		// Get the path of the asset from the payload
		final String path = getAssetPathFromPayload(workflowData);
		log.debug("MetadataCleanup payload path: {}", path);
		// Guard against a missing resolver, asset, or metadata node instead of failing with a NullPointerException
		final Resource assetResource = resourceResolver == null ? null : resourceResolver.getResource(path);
		final Resource assetMetadataRes = assetResource == null ? null : assetResource.getChild("jcr:content/metadata");
		final ModifiableValueMap modifiableValueMap = assetMetadataRes == null ? null
				: assetMetadataRes.adaptTo(ModifiableValueMap.class);
		if (modifiableValueMap == null) {
			log.debug("No writable metadata found for payload '{}'; skipping cleanup.", path);
			return;
		}
		// Metadata properties that may carry personal data, mapped to the blank values that replace them
		final Map<String, Object> properties = new HashMap<>();
		properties.put("dc:creator", new String[] { "" });
		properties.put("xmp:CreatorTool", "");
		properties.put("dam:Author", "");
		properties.put("dam:Producer", "");
		properties.put("pdf:Producer", "");
		properties.put("dc:rights", "");
		properties.put("dc:Rights", "");
		properties.put("photoshop:Credit", "");
		for (final Entry<String, Object> propertyEntry : properties.entrySet()) {
			// Remove any existing value first, then write the blank replacement
			if (modifiableValueMap.containsKey(propertyEntry.getKey())) {
				modifiableValueMap.remove(propertyEntry.getKey());
			}
			modifiableValueMap.put(propertyEntry.getKey(), propertyEntry.getValue());
			log.debug("Updating property '{}' with value '{}' for resource at path '{}'.",
					propertyEntry.getKey(), propertyEntry.getValue(), assetMetadataRes.getPath());
		}
		// Persist the changes
		commit(resourceResolver);
	}

 


5 Replies

Community Advisor

Hi @vpasam ,

There is no OOTB utility available as of now in AEM as a Cloud Service to scan assets for PII.

There are several ways to minimize this risk, such as using third-party tools to validate assets before uploading them to AEM, or setting up a workflow with a manual approval step.

 

Thanks

Himanshu Jain

Community Advisor

You could also implement a Sling filter that intercepts your upload requests and validates them against your requirements.

Community Advisor

You can use a custom workflow to trim or mask metadata that might expose PII.
For the actual asset binary, the hard part is identifying the affected assets. Once they are identified, you can use either AEM's built-in image editor or a custom image-processing library such as OpenCV to blur the PII areas, and then re-upload the sanitised version to AEM.
For PDFs, libraries such as Apache PDFBox or iText can be used to redact text programmatically.
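
To illustrate the metadata masking idea, here is a minimal plain-Java sketch. It assumes the metadata has already been read into a Map (for example from a ValueMap); the class name MetadataMasker, the "[REDACTED]" placeholder and the simple e-mail pattern are all hypothetical choices for illustration, not part of any AEM API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: masks e-mail addresses found in metadata values. */
final class MetadataMasker {

    // Deliberately simple e-mail pattern (an assumption); tune it to your own definition of PII.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final String MASK = "[REDACTED]";

    private MetadataMasker() {
    }

    /** Replaces every e-mail address in the value with the mask; null stays null. */
    static String maskValue(String value) {
        return value == null ? null : EMAIL.matcher(value).replaceAll(MASK);
    }

    /** Returns a masked copy; String and String[] values are masked, other types are copied unchanged. */
    static Map<String, Object> mask(Map<String, Object> metadata) {
        Map<String, Object> masked = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : metadata.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof String) {
                masked.put(entry.getKey(), maskValue((String) value));
            } else if (value instanceof String[]) {
                String[] source = (String[]) value;
                String[] copy = new String[source.length];
                for (int i = 0; i < source.length; i++) {
                    copy[i] = maskValue(source[i]);
                }
                masked.put(entry.getKey(), copy);
            } else {
                masked.put(entry.getKey(), value);
            }
        }
        return masked;
    }

    public static void main(String[] args) {
        System.out.println(maskValue("Shot by jane.doe@example.com"));
    }
}
```

In a workflow step, the masked values would then be written back through a ModifiableValueMap; that wiring is left out here on purpose.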


For identification, you can explore an OCR library such as Tesseract OCR to process images and extract text for PII scanning. Once the text is extracted, you could use regular expressions to identify the PII (emails, phone numbers, names, etc.).
Similarly, use libraries such as Apache PDFBox (for PDFs) or Apache POI (for Word documents) to extract text from documents.
These services would have to be integrated into AEM as part of a post-processing workflow so that the asset is processed after it is uploaded to AEM.
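
As a rough sketch of the regex step described above, the following plain-Java class scans text that has already been extracted (by OCR, PDFBox, POI, or similar) for e-mail addresses and phone-like numbers. The class name PiiTextScanner and both patterns are assumptions made for illustration; they are intentionally simple, will produce false positives and misses, and do not attempt to detect names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: scans already-extracted text for e-mail addresses and phone-like numbers. */
final class PiiTextScanner {

    // Simple illustrative patterns (assumptions); real PII detection needs broader, locale-aware rules.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern PHONE =
            Pattern.compile("\\+?\\d[\\d\\s().-]{7,}\\d");

    private PiiTextScanner() {
    }

    /** Returns every match as "TYPE:value", e-mail matches first, then phone matches. */
    static List<String> findPii(String text) {
        List<String> findings = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return findings;
        }
        collect(EMAIL, "EMAIL", text, findings);
        collect(PHONE, "PHONE", text, findings);
        return findings;
    }

    /** True if at least one pattern matched somewhere in the text. */
    static boolean containsPii(String text) {
        return !findPii(text).isEmpty();
    }

    private static void collect(Pattern pattern, String type, String text, List<String> findings) {
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            findings.add(type + ":" + matcher.group());
        }
    }

    public static void main(String[] args) {
        System.out.println(findPii("Contact jane.doe@example.com or +1 415 555 0100"));
    }
}
```

A post-processing workflow step could call containsPii on the extracted text and then flag, quarantine, or route the asset for manual review, depending on the project's policy.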

Hope this can help!