Level 2
April 1, 2026
Solved

Batches stuck indefinitely in the loading state during batch ingestion from Azure Blob storage


Hi Folks,

@jainarundeep ​@edgar_herrera ​@sbral ​@BahadurAh ​@MatheusPa1 
I would appreciate some suggestions on the issue below.

Over a two-month period we have observed several hundred batch IDs stuck in a “loading” state, which suggests potential data loss. None of these batches were surfaced by the Adobe alerts we subscribe to.

The source system successfully sent the data, however these batches never transitioned out of the loading state and appear to be stuck indefinitely. As a result, the data is not available downstream and we are currently missing critical data.

We have subscribed to dataflow ingestion alerts, but these alerts are not providing any visibility into batches that remain in the loading state, and no updates are triggered for these batch IDs.

Given that the data is now outdated and cannot be backfilled or re‑ingested, what is the most efficient and recommended approach to immediately identify, monitor, and detect batches that are stuck in the loading state?

We are currently using the "Retrieve list of batches" call with query parameters to find the batches that are still in the loading state.

Any suggestions would be appreciated 

    Best answer by AmitVishwakarma

    Hi ​@ReshmikaPu 

    Batches that sit in loading for days/weeks will never self‑recover – you must treat them as failed, and you need your own guardrail on top of the standard Flow/Alert rules to catch them early.

    1. Understand what loading really means

    For batch ingestion, loading means:

    • Files were staged, but
    • The batch was never promoted into the Data Lake (no ing_load_success or ing_load_failure event, status never moves to staging/success/failure).

    In other words, a batch that sits in loading for hours+ is effectively a failed batch, just without the "failed" flag yet. See the batch lifecycle and statuses in Batch Ingestion API Overview (https://experienceleague.adobe.com/en/docs/experience-platform/ingestion/batch/overview) and the monitoring guidance in Retrieving data ingestion error diagnostics (https://experienceleague.adobe.com/en/docs/experience-platform/ingestion/quality/error-diagnostics).

    2. Use product alerts where they apply (Flow‑level)

    For your Azure Blob Source dataflow you should still enable:

    • Sources Flow Run Delay
    • Sources Flow Run Failure
    • Sources Ingestion Error Rate Exceeded

    from Administration > Alerts as described in https://experienceleague.adobe.com/en/docs/experience-platform/observability/alerts/overview

    These will catch most connector‑level failures (wrong folder, credentials, mapping, etc.), but they do not detect every case where the Catalog batch gets stuck in loading.

    3. Add a small, custom "stuck loading" monitor (this is the key)

    This is the sustainable, long‑term pattern that actually closes the gap:

    Option A – Easiest: periodic Catalog scan

    • On a schedule (e.g. every 10–15 minutes), call the Catalog Batches API: GET /data/foundation/catalog/batches?property=status==loading&property=created<={now-2h} (filter by datasetId or sandboxId if needed). Here {now-2h} is a placeholder for the cutoff timestamp, not literal syntax.
    • Treat any batch as "stuck" when it:
      • has status = "loading", and
      • was created longer ago than your SLA (e.g. 60–120 min).
    • For each stuck batch:
      • send an internal alert (email/Slack/Teams, etc.)
      • log {imsOrgId, sandboxName, datasetId, batchId} for triage
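A minimal Python sketch of the periodic scan described above, under several assumptions: that the list response is a JSON object keyed by batch ID, that `created` is an epoch timestamp in milliseconds, and that related datasets appear under `relatedObjects` with type `dataSet`. Verify each of these against a real response from your sandbox. The function names (`build_query`, `find_stuck_batches`, `fetch_loading_batches`) are hypothetical, and the caller is expected to supply the usual Platform request headers (Authorization, x-api-key, x-gw-ims-org-id, x-sandbox-name).

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL for the Catalog batches endpoint.
CATALOG_BATCHES_URL = "https://platform.adobe.io/data/foundation/catalog/batches"


def build_query(now_ms: int, sla_minutes: int) -> str:
    """Query string for: status == loading AND created <= (now - SLA)."""
    cutoff_ms = now_ms - sla_minutes * 60_000
    params = [
        ("property", "status==loading"),
        ("property", f"created<={cutoff_ms}"),
    ]
    # urlencode percent-encodes "=" and "<" inside the values.
    return urllib.parse.urlencode(params)


def find_stuck_batches(batches: dict, now_ms: int, sla_minutes: int) -> list:
    """Re-check the response client-side and return stuck batches, oldest first."""
    cutoff_ms = now_ms - sla_minutes * 60_000
    stuck = []
    for batch_id, batch in batches.items():
        created = batch.get("created")
        if batch.get("status") != "loading" or created is None:
            continue
        if created <= cutoff_ms:
            # Assumption: datasets are listed as relatedObjects of type "dataSet".
            dataset_ids = [
                obj.get("id")
                for obj in batch.get("relatedObjects", [])
                if obj.get("type") == "dataSet"
            ]
            stuck.append(
                {"batchId": batch_id, "created": created, "datasetIds": dataset_ids}
            )
    return sorted(stuck, key=lambda b: b["created"])


def fetch_loading_batches(headers: dict, now_ms: int, sla_minutes: int) -> dict:
    """Call the Catalog API; `headers` must carry your own credentials."""
    url = f"{CATALOG_BATCHES_URL}?{build_query(now_ms, sla_minutes)}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A scheduler (cron, a cloud function, or similar) would call `fetch_loading_batches` with the current time in milliseconds, pass the result to `find_stuck_batches`, and alert on any non-empty list. Re-checking status and age on the client is deliberate: it keeps the alert correct even if the server-side filter behaves differently than expected.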

    Option B – More advanced: use ingestion notification events

    If you prefer an event‑driven approach:

    • In Adobe Developer Console, subscribe your webhook to data ingestion notification events.
    • In your webhook logic:
      • For each new ing_load_* event, store {batchId, datasetId, createdAt}.
      • If no ing_load_success or ing_load_failure arrives for that batchId within your SLA window (e.g. 60–120 minutes), mark it as stuck in loading and alert.
      • Optionally confirm with a Catalog call (GET /catalog/batches/{batchId}) before paging anyone.
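The bookkeeping behind the event-driven option can be sketched as a small in-memory tracker. This is an illustration only: the class and method names are hypothetical, parsing of the webhook payload is left out because its exact shape is not shown here, and how a batch first gets registered (for example from the source system's send log) is an assumption. A real deployment would keep this state in a persistent store rather than in process memory.

```python
from dataclasses import dataclass, field

# Events that close out a batch, as named in the answer above.
TERMINAL_EVENTS = {"ing_load_success", "ing_load_failure"}


@dataclass
class PendingBatch:
    batch_id: str
    dataset_id: str
    first_seen: float  # epoch seconds


@dataclass
class StuckBatchTracker:
    sla_seconds: float
    pending: dict = field(default_factory=dict)

    def register(self, batch_id: str, dataset_id: str, seen_at: float) -> None:
        """Start the SLA clock for a batch; repeated calls keep the first timestamp."""
        self.pending.setdefault(batch_id, PendingBatch(batch_id, dataset_id, seen_at))

    def on_event(self, event_code: str, batch_id: str) -> None:
        """Stop tracking a batch once a terminal ingestion event arrives."""
        if event_code in TERMINAL_EVENTS:
            self.pending.pop(batch_id, None)

    def overdue(self, now: float) -> list:
        """Batches with no terminal event within the SLA window."""
        return [
            batch
            for batch in self.pending.values()
            if now - batch.first_seen >= self.sla_seconds
        ]
```

A periodic job would call `overdue(time.time())`, confirm each candidate against Catalog if desired, and then raise the alert.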

    4. Hard truth about the existing two‑month‑old loading batches

    Those specific batches will not progress by themselves.

    Given your consent‑data constraints, it's reasonable to not re‑ingest them now; instead, put the monitor in place so that future stuck batches are detected within minutes/hours, not months.

     

    2 replies

    Level 1
    April 2, 2026

    I faced a similar issue where the dataflow remained stuck in the processing state indefinitely. Based on my experience, the most reliable solution—especially considering the urgency—is to raise a P1 or P2 support ticket with Adobe. Their support team should be able to assist and resolve the issue promptly.

    Level 2
    April 2, 2026

    @BahadurAh 

    Thank you for your response. Yes, we did engage Adobe Support, and their recommendation was to delete the batch that has been stuck in a Loading state for over two months and perform a re-ingestion.

    However, this batch contains older customer data, including consent information, which has since been updated by newer records. Given that this involves thousands of records, it is not a feasible solution to manually validate and compare individual old and missed records against the latest data. From a data integrity and operational standpoint, this approach poses a significant risk and is not scalable.

    We are therefore looking for a more sustainable, long-term solution. Any guidance or best practices on how to efficiently monitor dataflows and batch processing so that critical data issues are identified early and no essential records are missed would be greatly appreciated.

    AmitVishwakarma
    Community Advisor
    Accepted solution
    April 2, 2026


    Amit Vishwakarma - Adobe Commerce Champion 2025 | 16x Adobe certified | 4x Adobe SME
    bjoern__koth
    Community Advisor and Adobe Champion
    April 3, 2026

    We're also facing a similar issue at a client where a Google Cloud Storage job never finishes, whereas the same file can be imported without problems through DLZ. We are still waiting for engineering to come back to us.

    Cheers from Switzerland!