Batches stuck in a permanent loading state during batch ingestion from Azure Blob storage

Question

Hi folks, @jainarundeep @edgar_herrera @sbral @BahadurAh @MatheusPa1, I would appreciate some suggestions on the issue below.

Over a two-month period we have observed several hundred batch IDs stuck in a "loading" state, which suggests potential data loss. None of them were ever reported through our Adobe alert subscriptions. The source system sent the data successfully, but these batches never transitioned out of the loading state and appear to be stuck indefinitely. As a result, the data is not available downstream and we are currently missing critical data.

We have subscribed to dataflow ingestion alerts, but these alerts give no visibility into batches that remain in the loading state, and no updates are triggered for these batch IDs.

Given that the data is now outdated and cannot be backfilled or re-ingested, what is the most efficient and recommended approach to immediately identify, monitor, and detect batches that are stuck in the loading state?

We are currently using the "Retrieve a list of batches" call with query parameters so that we can see the missing batches that are still in the loading state.

Any suggestions would be appreciated.

AmitVishwakarma · Accepted Answer

Hi @ReshmikaPu

Batches that sit in loading for days or weeks will never self-recover. You must treat them as failed, and you need your own guardrail on top of the standard Flow/Alert rules to catch them early.

1. Understand what loading really means

For batch ingestion, loading means:

- Files were staged, but
- The batch was never promoted into the Data Lake (no ing_load_success or ing_load_failure event, and the status never moves to staging/success/failure).

In other words, a batch that sits in loading for hours or more is effectively a failed batch, just without the "failed" flag yet. See the batch lifecycle and statuses in Batch Ingestion API Overview and the monitoring guidance in Retrieving data ingestion error diagnostics.

https://experienceleague.adobe.com/en/docs/experience-platform/ingestion/batch/overview
https://experienceleague.adobe.com/en/docs/experience-platform/ingestion/quality/error-diagnostics

2. Use product alerts where they apply (Flow-level)

For your Azure Blob Source dataflow you should still enable:

- Sources Flow Run Delay
- Sources Flow Run Failure
- Sources Ingestion Error Rate Exceeded

from Administration > Alerts, as described in
https://experienceleague.adobe.com/en/docs/experience-platform/observability/alerts/overview

These will catch most connector-level failures (wrong folder, credentials, mapping, etc.), but they do not detect every case where the Catalog batch gets stuck in loading.

3. Add a small, custom "stuck loading" monitor (this is the key)

This is the sustainable, long-term pattern that actually closes the gap.

Option A: Easiest, periodic Catalog scan

On a schedule (e.g. every 10–15 minutes), call the Catalog Batches API:

GET /data/foundation/catalog/batches?property=status==loading&property=created<={now-2h}
(filter by datasetId or sandboxId if needed)

Treat any batch that:

- has status = "loading"
- and has a created time older than your SLA (e.g. 60–120 min)

as "stuck" and:

- send an internal alert (email/Slack/Teams, etc.)
- log {imsOrgId, sandboxName, datasetId, batchId} for triage

Option B: More advanced, use ingestion notification events

If you prefer an event-driven approach:

In Adobe Developer Console, subscribe your webhook to data ingestion notification events: ing_load_success and ing_load_failure for Data Lake batches (and optionally ps_load_*, ig_load_* for Profile/Identity), as described in
https://experienceleague.adobe.com/en/docs/experience-platform/ingestion/quality/error-diagnostics

In your webhook logic:

- For each new ing_load_* event, store {batchId, datasetId, createdAt}.
- If no ing_load_success or ing_load_failure arrives for that batchId within your SLA window (e.g. 60–120 minutes), mark it as stuck in loading and alert.
- Optionally confirm with a Catalog call (GET /catalog/batches/{batchId}) before paging anyone.

4. Hard truth about the existing two-month-old loading batches

Those specific batches will not progress by themselves:

- They have effectively failed; the platform will not "resume" them after weeks.
- The official support recommendation (delete/re-ingest) is aligned with how the batch lifecycle works in
  https://experienceleague.adobe.com/en/docs/experience-platform/ingestion/batch/overview

Given your consent-data constraints, it's reasonable to not re-ingest them now; instead, put the monitor in place so that future stuck batches are detected within minutes or hours, not months.
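For anyone who wants a starting point for the Option A scan, here is a rough Python sketch of the "stuck" check. It is only an illustration under several assumptions, not an official Adobe sample: the host `platform.adobe.io`, the response shape (a JSON object keyed by batch ID, with `status`, `created` in epoch milliseconds, and `relatedObjects` entries of type `dataSet`), and the names `build_scan_url` and `find_stuck_batches` are all mine. Authentication headers and the actual HTTP call are left out, so check the Catalog API reference before relying on any of it.

```python
import urllib.parse

# Assumption: standard Platform API gateway host; the path is the one quoted above.
CATALOG_URL = "https://platform.adobe.io/data/foundation/catalog/batches"
SLA_MINUTES = 120  # assumption: choose the threshold that matches your own SLA


def build_scan_url(now_ms, sla_minutes=SLA_MINUTES):
    """Build the scan URL for batches in 'loading' created before the SLA cutoff."""
    cutoff_ms = now_ms - sla_minutes * 60 * 1000
    query = urllib.parse.urlencode([
        ("property", "status==loading"),
        ("property", f"created<={cutoff_ms}"),
    ])
    return f"{CATALOG_URL}?{query}"


def find_stuck_batches(batches, now_ms, sla_minutes=SLA_MINUTES):
    """Return the batches that are still 'loading' and older than the SLA.

    `batches` is assumed to be the parsed response body: a dict keyed by
    batch ID whose values carry 'status' and 'created' (epoch milliseconds).
    """
    cutoff_ms = now_ms - sla_minutes * 60 * 1000
    stuck = []
    for batch_id, batch in batches.items():
        created = batch.get("created")
        if created is None or batch.get("status") != "loading":
            continue
        if created <= cutoff_ms:
            # Assumption: dataset IDs are listed under relatedObjects.
            dataset_ids = [
                obj.get("id")
                for obj in batch.get("relatedObjects", [])
                if obj.get("type") == "dataSet"
            ]
            stuck.append({
                "batchId": batch_id,
                "datasetIds": dataset_ids,
                "created": created,
            })
    # Oldest first, so the longest-stuck batches are triaged first.
    return sorted(stuck, key=lambda item: item["created"])
```

Re-checking status and age on the client side, even when the query already filters, means the alert stays correct if a filter is silently ignored. A scheduler (cron, Azure Function, etc.) would fetch the URL with your usual Platform credentials every 10–15 minutes, pass the parsed JSON and `int(time.time() * 1000)` to `find_stuck_batches`, and forward any results to email/Slack/Teams.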

BahadurAh · Answer

I faced a similar issue where the dataflow remained stuck in the processing state indefinitely. Based on my experience, the most reliable solution—especially considering the urgency—is to raise a P1 or P2 support ticket with Adobe. Their support team should be able to assist and resolve the issue promptly.
