
Content quality checks


Community Advisor

The majority of our assets are ingested through RESTful APIs, and the metadata often fails to meet compliance standards. I am seeking recommendations for frameworks or approaches that can be used to

  • validate metadata via RESTful services.
  • perform regular content quality checks for existing assets.

Aanchal Sikka

7 Replies


Level 5

Hi @aanchal-sikka 

 

Is it possible to expand a little more on the subject? What does your current import process look like? How do all the parties involved in the ingestion integrate? When and where do you expect the regular quality checks on the content to happen? What does it mean for you that metadata compliance is failing? Some diagrams or screenshots also would not hurt.


Community Advisor

@Tethich 

 

Thanks for the queries. Sharing details below:

  • The import is done by AEM, which reads the assets and related metadata from a location.
  • Integration is a pull mechanism: AEM reads from S3 buckets.
  • Checks can happen at asset import:
    • Assets should be rejected if there is a severe compliance issue in the metadata.
    • Assets with minor compliance issues in the metadata are accepted; these should be reported by regular health checks.

Non-compliance can mean metadata that falls outside a specific set of allowed values, an expected format, etc.


Aanchal Sikka


Community Advisor

Hi @aanchal-sikka 

 

For the regular scheduled quality check, you can use a scheduled job that fetches the list of assets updated within a given time period, performs the specified checks, and then adds/updates a metadata field that keeps track of the last review date.
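
A minimal sketch of such a job, as a Sling-scheduled OSGi component, is below. The service user mapping ("metadata-check-service"), the QueryBuilder predicates, and the placeholder compliance rule are assumptions you would adapt to your setup.

```java
// Hedged sketch: nightly job that finds recently updated assets, runs checks,
// and stamps a "lastReviewDate" metadata property.
import java.util.Calendar;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import javax.jcr.Session;

import org.apache.sling.api.resource.ModifiableValueMap;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

import com.day.cq.search.PredicateGroup;
import com.day.cq.search.QueryBuilder;

@Component(service = Runnable.class, property = {
        "scheduler.expression=0 0 2 * * ?",  // nightly at 02:00
        "scheduler.concurrent=false"
})
public class MetadataQualityCheckJob implements Runnable {

    @Reference
    private ResourceResolverFactory resolverFactory;

    @Reference
    private QueryBuilder queryBuilder;

    @Override
    public void run() {
        try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(
                Collections.singletonMap(ResourceResolverFactory.SUBSERVICE, "metadata-check-service"))) {

            // QueryBuilder query: assets under /content/dam modified in the last day
            Map<String, String> predicates = new HashMap<>();
            predicates.put("path", "/content/dam");
            predicates.put("type", "dam:Asset");
            predicates.put("relativedaterange.property", "jcr:content/jcr:lastModified");
            predicates.put("relativedaterange.lowerBound", "-1d");
            predicates.put("p.limit", "-1");

            Iterator<Resource> assets = queryBuilder
                    .createQuery(PredicateGroup.create(predicates), resolver.adaptTo(Session.class))
                    .getResult().getResources();

            while (assets.hasNext()) {
                Resource metadata = assets.next().getChild("jcr:content/metadata");
                if (metadata == null) continue;
                ModifiableValueMap mvm = metadata.adaptTo(ModifiableValueMap.class);
                // Run the specified checks (placeholder rule: title must exist)
                if (mvm.get("dc:title", String.class) == null) {
                    mvm.put("complianceStatus", "non-compliant");
                }
                mvm.put("lastReviewDate", Calendar.getInstance()); // track last review
            }
            resolver.commit();
        } catch (Exception e) {
            // log the failure; the next scheduled run will retry
        }
    }
}
```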

 

For the API to fetch metadata of assets, you can explore this - https://developer.adobe.com/experience-cloud/experience-manager-apis/api/experimental/assets/author/
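
As a hedged illustration of calling that API, here is a sketch using Java's built-in HttpClient. The host, asset ID, endpoint path, and headers are assumptions based on the linked (experimental) spec; verify them against the documentation, since experimental APIs can change.

```java
// Hedged sketch: fetch an asset's metadata over HTTP for external checks.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchAssetMetadata {
    public static void main(String[] args) throws Exception {
        String author = "https://author-p12345-e67890.adobeaemcloud.com"; // hypothetical host
        String assetId = "<asset-id-per-the-spec>";                       // hypothetical ID

        HttpRequest request = HttpRequest.newBuilder(
                URI.create(author + "/adobe/assets/" + assetId + "/metadata")) // assumed path, check the spec
                .header("Authorization", "Bearer <access-token>")
                .header("X-Api-Key", "<client-id>")
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body()); // JSON metadata document to run checks against
    }
}
```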

 

Hope this helps!

 

Narendra

 


Level 5

I am thinking that you might lift some of the burden from AEM and have a separate app, maybe built with microservices, that runs periodically, checks the objects' metadata in Amazon S3 against your criteria, and marks the metadata accordingly. That way, when AEM pulls the assets, it will already know whether the data was validated before ingesting it.
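
To make the idea concrete, here is a hedged sketch using the AWS SDK for Java v2: it reads an object's user-defined metadata, applies a placeholder criterion, and records the verdict as an object tag that the AEM pull could filter on. The bucket, key, and criterion are hypothetical.

```java
// Hedged sketch of the standalone checker: read S3 user metadata, tag the verdict.
import java.util.Map;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;
import software.amazon.awssdk.services.s3.model.PutObjectTaggingRequest;
import software.amazon.awssdk.services.s3.model.Tag;
import software.amazon.awssdk.services.s3.model.Tagging;

public class S3MetadataChecker {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            String bucket = "asset-drop-zone";  // hypothetical bucket
            String key = "incoming/hero.jpg";   // hypothetical key

            HeadObjectResponse head = s3.headObject(
                    HeadObjectRequest.builder().bucket(bucket).key(key).build());
            Map<String, String> userMetadata = head.metadata(); // x-amz-meta-* values

            boolean valid = userMetadata.containsKey("dc-title"); // placeholder criterion

            // Mark the verdict as an object tag; the AEM pull can skip "invalid" objects
            s3.putObjectTagging(PutObjectTaggingRequest.builder()
                    .bucket(bucket).key(key)
                    .tagging(Tagging.builder()
                            .tagSet(Tag.builder()
                                    .key("metadata-validated")
                                    .value(valid ? "valid" : "invalid")
                                    .build())
                            .build())
                    .build());
        }
    }
}
```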

 

Some good ways to implement a validator were already posted here.


Level 2

To ensure metadata compliance and perform regular content quality checks for assets ingested via RESTful APIs, you can use a combination of tools and methodologies. Here's a structured approach to help you:


1. Validate Metadata via RESTful Services

Implement a robust metadata validation framework using the following tools and techniques:

a. Schema Validation

  • Use JSON Schema or XML Schema Definition (XSD) to define metadata standards.
  • Validate metadata payloads using libraries like:
    • For JSON:
      • Ajv (Node.js)
      • Jackson (Java)
      • FastAPI Pydantic models (Python)
    • For XML:
      • JAXB (Java)
      • lxml (Python)
  • Integrate validation in API middleware to reject non-compliant metadata during ingestion (a minimal sketch follows below).
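
For instance, here is a minimal Java sketch of option (a) using the networknt json-schema-validator library with Jackson; the schema and payload are illustrative only.

```java
// Validate a metadata payload against a JSON Schema; non-empty error set => non-compliant.
import java.util.Set;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.networknt.schema.JsonSchema;
import com.networknt.schema.JsonSchemaFactory;
import com.networknt.schema.SpecVersion;
import com.networknt.schema.ValidationMessage;

public class MetadataSchemaValidator {
    public static void main(String[] args) throws Exception {
        // Illustrative schema: two required fields, one restricted to an allowed set
        String schemaJson = "{"
                + "\"type\": \"object\","
                + "\"required\": [\"dc:title\", \"dc:format\"],"
                + "\"properties\": {"
                + "  \"dc:format\": {\"enum\": [\"image/jpeg\", \"image/png\"]}"
                + "}}";

        String payload = "{\"dc:title\": \"Hero banner\", \"dc:format\": \"image/bmp\"}";

        ObjectMapper mapper = new ObjectMapper();
        JsonSchema schema = JsonSchemaFactory
                .getInstance(SpecVersion.VersionFlag.V7)
                .getSchema(schemaJson);

        Set<ValidationMessage> errors = schema.validate(mapper.readTree(payload));
        errors.forEach(e -> System.out.println(e.getMessage()));
        // Non-empty set => reject at ingestion (severe) or flag for health checks (minor)
    }
}
```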

b. API Contracts

  • Define API contracts with tools like OpenAPI/Swagger or Postman.
  • Use tools such as Prism or Postman Tests to simulate and validate API requests and responses.

c. Custom Rules Engine

  • Build a rules engine for dynamic validation:
    • Use libraries like Drools (Java), json-rules-engine (JavaScript), or custom scripts.
    • Define rules for required fields, value ranges, and relationships between metadata fields (see the plain-Java sketch below).
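
If a full rules engine like Drools is more than you need, a hand-rolled version can be as simple as the following sketch; the rules and severities shown are illustrative.

```java
// Tiny hand-rolled rules engine: each rule pairs a predicate with a severity,
// so severe failures can reject the asset while minor ones are only reported.
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class MetadataRules {

    enum Severity { SEVERE, MINOR }

    record Rule(String name, Severity severity, Predicate<Map<String, Object>> check) {}

    static final List<Rule> RULES = List.of(
            new Rule("title present", Severity.SEVERE,
                    m -> m.get("dc:title") != null),
            new Rule("format allowed", Severity.SEVERE,
                    m -> List.of("image/jpeg", "image/png").contains(m.get("dc:format"))),
            new Rule("copyright present", Severity.MINOR,
                    m -> m.get("dc:rights") != null));

    public static void main(String[] args) {
        Map<String, Object> metadata = Map.of("dc:title", "Hero banner", "dc:format", "image/jpeg");
        for (Rule rule : RULES) {
            if (!rule.check().test(metadata)) {
                System.out.println(rule.severity() + " violation: " + rule.name());
            }
        }
    }
}
```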

2. Regular Content Quality Checks

For existing assets, implement regular checks using automated tools and manual reviews:

a. Automated Metadata Audits

  • Write scripts or use tools like:
    • Apache Tika: Extract metadata from assets and compare it with standards (see the sketch after this list).
    • Elasticsearch/Kibana: Query and analyze metadata stored in indices.
    • Custom Python Scripts: Use libraries like pandas or sqlalchemy to audit metadata.
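
A minimal sketch of the Apache Tika option: parse an asset, collect its embedded metadata, and report any required fields that are missing. The file path and required keys are illustrative.

```java
// Audit embedded metadata with Apache Tika and diff it against expected fields.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaMetadataAudit {
    public static void main(String[] args) throws Exception {
        List<String> requiredKeys = List.of("dc:title", "dc:creator"); // illustrative

        try (InputStream in = Files.newInputStream(Path.of("asset.jpg"))) {
            Metadata metadata = new Metadata();
            // BodyContentHandler(-1) disables the default write limit
            new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata, new ParseContext());

            for (String key : requiredKeys) {
                if (metadata.get(key) == null) {
                    System.out.println("Missing required metadata: " + key);
                }
            }
        }
    }
}
```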

b. Data Quality Frameworks

  • Tools like Great Expectations (Python) or dbt (Data Build Tool) can validate data and enforce metadata compliance rules.
  • Regularly schedule checks via CI/CD pipelines or cron jobs.

c. Reporting and Alerts

  • Set up automated reports and dashboards (e.g., Tableau, Power BI, or Grafana) to monitor compliance levels.
  • Trigger alerts using notification systems like Slack, PagerDuty, or email when discrepancies are detected.

3. Metadata Enhancement

Use AI/ML or rule-based systems to improve metadata quality:

  • Tools:
    • Amazon Rekognition, Google Vision API, or Azure Computer Vision for auto-tagging and metadata generation.
  • Integrate these tools to fill gaps in metadata post-ingestion.

4. Workflow Integration

  • Integrate quality checks in your content lifecycle management system (e.g., AEM, Drupal, or custom DAMs).
  • Use AEM’s built-in tools for metadata extraction and validation (if applicable).

5. Governance and Compliance

  • Define metadata governance policies and standards (e.g., Dublin Core, IPTC).
  • Train teams and enforce compliance with clear guidelines and regular reviews.

Example Architecture

  1. API Gateway: Validate metadata during ingestion.
  2. Metadata Repository: Store and index metadata in Elasticsearch or a relational database.
  3. Validation Layer: Periodic checks with automated scripts and tools like Great Expectations.
  4. Dashboard: Visualize compliance trends and issues.
  5. Notifications: Automated alerts for discrepancies.

6. Tools to Consider

  • Validation:
    • Postman, Ajv, JSON Schema Validator, XML Validator
  • Automation:
    • Great Expectations, dbt
  • Metadata Management:
    • Apache Tika, Adobe Experience Manager (AEM)
  • Dashboards:
    • Power BI, Tableau, Grafana
  • Logging and Alerts:
    • Elastic Stack, Splunk

By combining API-level validations, automated audits, and a governance framework, you can ensure consistent metadata quality and compliance for your assets.


Community Advisor

Hello @aanchal-sikka ,

I hope you're doing well.

 

  • Given the additional computational overhead, it might be better to perform validation or sanity checks outside of AEM, prior to asset ingestion, if feasible.

  • If the compliance violations occur when assets are exposed to traffic from the publisher, could we consider using an asset replication interceptor (such as a replication preprocessor) to validate and allow only compliant assets to be replicated? (See the sketch after this list.)

  • To monitor faulty assets, we could set up a scheduled Sling job to generate reports that identify non-compliant entries.
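
A hedged sketch of the replication preprocessor idea is below: an OSGi component that vetoes activation of non-compliant assets by throwing a ReplicationException. The service user mapping and the compliance rule are assumptions.

```java
// Blocks replication of assets that fail a metadata compliance check.
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

import com.day.cq.replication.Preprocessor;
import com.day.cq.replication.ReplicationAction;
import com.day.cq.replication.ReplicationActionType;
import com.day.cq.replication.ReplicationException;
import com.day.cq.replication.ReplicationOptions;

@Component(service = Preprocessor.class)
public class CompliancePreprocessor implements Preprocessor {

    @Reference
    private ResourceResolverFactory resolverFactory;

    @Override
    public void preprocess(ReplicationAction action, ReplicationOptions options)
            throws ReplicationException {
        if (action == null || action.getType() != ReplicationActionType.ACTIVATE
                || !action.getPath().startsWith("/content/dam/")) {
            return; // only guard asset activations
        }
        try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(
                java.util.Collections.singletonMap(
                        ResourceResolverFactory.SUBSERVICE, "metadata-check-service"))) { // assumed mapping
            Resource metadata = resolver.getResource(action.getPath() + "/jcr:content/metadata");
            // Placeholder rule: a title must exist for the asset to be publishable
            if (metadata == null || metadata.getValueMap().get("dc:title", String.class) == null) {
                // Throwing here blocks replication of the non-compliant asset
                throw new ReplicationException("Asset fails metadata compliance: " + action.getPath());
            }
        } catch (org.apache.sling.api.resource.LoginException e) {
            throw new ReplicationException("Could not obtain service resolver: " + e.getMessage());
        }
    }
}
```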

Let me know if this approach makes sense.


Community Advisor

Hi @aanchal-sikka -

 

My thoughts -

From what you shared above, you have a custom process set up in AEM that reads assets and metadata from an S3 location. Are you referring to issues within the metadata that is stored separately, or to metadata like XMP that is extracted out of an asset?

If it is the separately managed metadata - can you not validate or run your compliance check during the ingestion phase in your custom process?

If it is the XMP metadata - you will need to create a custom process and configure it to be invoked as part of the DAM metadata writeback workflow itself, or as a separate scheduler, as per the need.
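
For the XMP case, a hedged sketch of such a custom workflow process step is below; the compliance rule shown is a placeholder to replace with your own checks.

```java
// Workflow step that could be added to the DAM Metadata Writeback workflow
// to flag assets whose extracted metadata fails compliance.
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.osgi.service.component.annotations.Component;

import com.adobe.granite.workflow.WorkflowException;
import com.adobe.granite.workflow.WorkflowSession;
import com.adobe.granite.workflow.exec.WorkItem;
import com.adobe.granite.workflow.exec.WorkflowProcess;
import com.adobe.granite.workflow.metadata.MetaDataMap;

@Component(service = WorkflowProcess.class,
           property = {"process.label=Metadata Compliance Check"})
public class MetadataComplianceProcess implements WorkflowProcess {

    @Override
    public void execute(WorkItem item, WorkflowSession session, MetaDataMap args)
            throws WorkflowException {
        String payloadPath = item.getWorkflowData().getPayload().toString();
        ResourceResolver resolver = session.adaptTo(ResourceResolver.class);
        Resource metadata = resolver.getResource(payloadPath + "/jcr:content/metadata");
        if (metadata == null) {
            return;
        }
        // Placeholder check on extracted XMP: flag assets missing a title
        if (metadata.getValueMap().get("dc:title", String.class) == null) {
            org.apache.sling.api.resource.ModifiableValueMap mvm =
                    metadata.adaptTo(org.apache.sling.api.resource.ModifiableValueMap.class);
            mvm.put("complianceStatus", "non-compliant");
            try {
                resolver.commit();
            } catch (org.apache.sling.api.resource.PersistenceException e) {
                throw new WorkflowException("Could not persist compliance flag", e);
            }
        }
    }
}
```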

 

Regards,

Fani