I am creating a query to obtain a list of emails from a dataset that is contaminated with invalid email formats, I have a query with many filters but there are many that cannot obtain the filters and when using the programmed query it generates an error. ingestion into the dataset. (Format validation for field :personalEmail.address of type: String failed. The input value is daimer100@gmail.conm and expected format was: email. Please update the field and try again.) The list of errors is quite extensive and varied, not I see it viable to do this through query filters, how else could I do this query?
QUERY:
WITH valid_emails AS (
SELECT
uuid() AS _id,
CASE
WHEN identityMap['email'][0]['id'] LIKE '%_@__%.__%'
AND identityMap['email'][0]['id'] ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$' -- Validación RFC2822
AND NOT (identityMap['email'][0]['id'] LIKE '%..%') -- No contiene puntos consecutivos
AND NOT (identityMap['email'][0]['id'] LIKE '%.@%') -- No contiene un punto antes de la arroba
AND NOT (identityMap['email'][0]['id'] LIKE '%.%@%') -- No contiene un punto antes o después de la arroba
AND NOT (identityMap['email'][0]['id'] LIKE '%@%.') -- No contiene un punto después de la arroba
AND NOT (identityMap['email'][0]['id'] LIKE '%.con') -- No termina en .con
AND NOT (identityMap['email'][0]['id'] LIKE '%/%') -- No contiene una barra inclinada
THEN named_struct('address', identityMap['email'][0]['id'])
WHEN identityMap['email'][1]['id'] LIKE '%_@__%.__%'
AND identityMap['email'][1]['id'] ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$' -- Validación RFC2822
AND NOT (identityMap['email'][1]['id'] LIKE '%..%') -- No contiene puntos consecutivos
AND NOT (identityMap['email'][1]['id'] LIKE '%.@%') -- No contiene un punto antes de la arroba
AND NOT (identityMap['email'][1]['id'] LIKE '%.%@%') -- No contiene un punto antes o después de la arroba
AND NOT (identityMap['email'][1]['id'] LIKE '%@%.') -- No contiene un punto después de la arroba
AND NOT (identityMap['email'][1]['id'] LIKE '%.con') -- No termina en .con
AND NOT (identityMap['email'][1]['id'] LIKE '%/%') -- No contiene una barra inclinada
THEN named_struct('address', identityMap['email'][1]['id'])
WHEN identityMap['email'][2]['id'] LIKE '%_@__%.__%'
AND identityMap['email'][2]['id'] ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$' -- Validación RFC2822
AND NOT (identityMap['email'][2]['id'] LIKE '%..%') -- No contiene puntos consecutivos
AND NOT (identityMap['email'][2]['id'] LIKE '%.@%') -- No contiene un punto antes de la arroba
AND NOT (identityMap['email'][2]['id'] LIKE '%.%@%') -- No contiene un punto antes o después de la arroba
AND NOT (identityMap['email'][2]['id'] LIKE '%@%.') -- No contiene un punto después de la arroba
AND NOT (identityMap['email'][2]['id'] LIKE '%.con') -- No termina en .con
AND NOT (identityMap['email'][2]['id'] LIKE '%/%') -- No contiene una barra inclinada
THEN named_struct('address', identityMap['email'][2]['id'])
-- Agrega más condiciones si hay más posiciones
ELSE NULL
END AS personalEmail
FROM
profile_snapshot_export_*profile_snapshot_number*
WHERE
segmentmembership['ups']['*segmentmembership NUMBER*'].status IN ('realized', 'existing')
AND (identityMap['email'][0]['id'] IS NOT NULL
OR identityMap['email'][1]['id'] IS NOT NULL
OR identityMap['email'][2]['id'] IS NOT NULL -- Agrega más condiciones si hay más posiciones
)
)
SELECT
_id,
personalEmail
FROM
valid_emails
WHERE
personalEmail IS NOT null;
Solved! Go to Solution.
Views
Replies
Total Likes
Adobe is using IANA TLD to validate and insert email in AJO when using the personalEmail.address
TLD stands for Top Level Domain commonly known as extension .com .info .org etc
During the insertion, it's checking the format (email regex or %@%.% to make it easier) and then the extension. If the extension is not in IANA TLD then your email will fail the validation e.g .conm
To reduce the failures, I will suggest you to implement a email regex validation at the collection point.
Thanks,
David
Views
Replies
Total Likes
Hi @Juan_SebastianQu ,
Well, I tried with a complex regex pattern and it failed. In many SQL implementations, especially those used in services like Adobe Experience Platform (AEP), more complex/advanced regex functionality might not be supported directly in queries. We will need to use basic SQL pattern matching with LIKE
We have to keep adding multiple patterns to get all invalid email formats instead of complex regEx like - email !~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
Views
Replies
Total Likes
@Juan_SebastianQu I would suggest consider creating a query to retrieve emails from a dataset excluding domains such as "gmail.com" and "yahoo.com". The criteria should focus on identifying fewer valid domains in comparison to the N number of invalid domains.
Views
Replies
Total Likes
Adobe is using IANA TLD to validate and insert email in AJO when using the personalEmail.address
TLD stands for Top Level Domain commonly known as extension .com .info .org etc
During the insertion, it's checking the format (email regex or %@%.% to make it easier) and then the extension. If the extension is not in IANA TLD then your email will fail the validation e.g .conm
To reduce the failures, I will suggest you to implement a email regex validation at the collection point.
Thanks,
David
Views
Replies
Total Likes
Were you able to resolve this query with the help of the given solutions or do you still need more help here? Do let us know. In case the given solutions were helpful, then kindly choose the one that helped you the most as the 'Correct Reply'.
Thanks!
Views
Replies
Total Likes
Views
Likes
Replies