Expand my Community achievements bar.

Dive in, experiment, and see how our AI Assistant Content Accelerator can transform your workflows with personalized, efficient content solutions through our newly designed playground experience.
SOLVED

Query to validate valid emails.

Avatar

Level 1

I am creating a query to obtain a list of emails from a dataset that is contaminated with invalid email formats, I have a query with many filters but there are many that cannot obtain the filters and when using the programmed query it generates an error. ingestion into the dataset. (Format validation for field :personalEmail.address of type: String failed. The input value is daimer100@gmail.conm and expected format was: email. Please update the field and try again.) The list of errors is quite extensive and varied, not I see it viable to do this through query filters, how else could I do this query?

QUERY:

WITH valid_emails AS (
    SELECT
        uuid() AS _id,
        CASE
            WHEN identityMap['email'][0]['id'] LIKE '%_@__%.__%' 
                 AND identityMap['email'][0]['id'] ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$' -- Validación RFC2822  
                 AND NOT (identityMap['email'][0]['id'] LIKE '%..%') -- No contiene puntos consecutivos
                 AND NOT (identityMap['email'][0]['id'] LIKE '%.@%') -- No contiene un punto antes de la arroba
                 AND NOT (identityMap['email'][0]['id'] LIKE '%.%@%') -- No contiene un punto antes o después de la arroba
                 AND NOT (identityMap['email'][0]['id'] LIKE '%@%.') -- No contiene un punto después de la arroba
                 AND NOT (identityMap['email'][0]['id'] LIKE '%.con') -- No termina en .con
                 AND NOT (identityMap['email'][0]['id'] LIKE '%/%') -- No contiene una barra inclinada
                 THEN named_struct('address', identityMap['email'][0]['id'])
            WHEN identityMap['email'][1]['id'] LIKE '%_@__%.__%' 
                 AND identityMap['email'][1]['id'] ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$' -- Validación RFC2822 
                 AND NOT (identityMap['email'][1]['id'] LIKE '%..%') -- No contiene puntos consecutivos
                 AND NOT (identityMap['email'][1]['id'] LIKE '%.@%') -- No contiene un punto antes de la arroba
                 AND NOT (identityMap['email'][1]['id'] LIKE '%.%@%') -- No contiene un punto antes o después de la arroba
                 AND NOT (identityMap['email'][1]['id'] LIKE '%@%.') -- No contiene un punto después de la arroba
                 AND NOT (identityMap['email'][1]['id'] LIKE '%.con') -- No termina en .con
                 AND NOT (identityMap['email'][1]['id'] LIKE '%/%') -- No contiene una barra inclinada
                 THEN named_struct('address', identityMap['email'][1]['id'])
            WHEN identityMap['email'][2]['id'] LIKE '%_@__%.__%' 
                 AND identityMap['email'][2]['id'] ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$' -- Validación RFC2822
                 AND NOT (identityMap['email'][2]['id'] LIKE '%..%') -- No contiene puntos consecutivos
                 AND NOT (identityMap['email'][2]['id'] LIKE '%.@%') -- No contiene un punto antes de la arroba
                 AND NOT (identityMap['email'][2]['id'] LIKE '%.%@%') -- No contiene un punto antes o después de la arroba
                 AND NOT (identityMap['email'][2]['id'] LIKE '%@%.') -- No contiene un punto después de la arroba
                 AND NOT (identityMap['email'][2]['id'] LIKE '%.con') -- No termina en .con
                 AND NOT (identityMap['email'][2]['id'] LIKE '%/%') -- No contiene una barra inclinada
                 THEN named_struct('address', identityMap['email'][2]['id'])
            -- Agrega más condiciones si hay más posiciones
            ELSE NULL
        END AS personalEmail
    FROM
        profile_snapshot_export_*profile_snapshot_number*
    WHERE
        segmentmembership['ups']['*segmentmembership NUMBER*'].status IN ('realized', 'existing')
        AND (identityMap['email'][0]['id'] IS NOT NULL 
             OR identityMap['email'][1]['id'] IS NOT NULL
             OR identityMap['email'][2]['id'] IS NOT NULL -- Agrega más condiciones si hay más posiciones
            )
)
SELECT 
    _id,
    personalEmail 
FROM 
    valid_emails
WHERE 
    personalEmail IS NOT null;



1 Accepted Solution

Avatar

Correct answer by
Community Advisor

@Juan_SebastianQu 

Adobe is using IANA TLD to validate and insert email in AJO when using the personalEmail.address

 

TLD stands for Top Level Domain commonly known as extension .com .info .org etc

 

During the insertion, it's checking the format (email regex or %@%.% to make it easier) and then the extension. If the extension is not in IANA TLD then your email will fail the validation e.g .conm

 

To reduce the failures, I will suggest you to implement a email regex validation at the collection point.

 

Thanks,

David

 



David Kangni

View solution in original post

4 Replies

Avatar

Community Advisor

Hi @Juan_SebastianQu ,

 

Well, I tried with a complex regex pattern and it failed. In many SQL implementations,  especially those used in services like Adobe Experience Platform (AEP), more complex/advanced regex functionality might not be supported directly in queries. We will need to use basic SQL pattern matching with LIKE 
We have to keep adding multiple patterns to get all invalid email formats instead of complex regEx like - email !~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'

Pradeep_Kumar_Srivastav_0-1722389663189.png

 

Avatar

Community Advisor

@Juan_SebastianQu I would suggest consider creating a query to retrieve emails from a dataset excluding domains such as "gmail.com" and "yahoo.com". The criteria should focus on identifying fewer valid domains in comparison to the N number of invalid domains.

Thanks, Sathees

Avatar

Correct answer by
Community Advisor

@Juan_SebastianQu 

Adobe is using IANA TLD to validate and insert email in AJO when using the personalEmail.address

 

TLD stands for Top Level Domain commonly known as extension .com .info .org etc

 

During the insertion, it's checking the format (email regex or %@%.% to make it easier) and then the extension. If the extension is not in IANA TLD then your email will fail the validation e.g .conm

 

To reduce the failures, I will suggest you to implement a email regex validation at the collection point.

 

Thanks,

David

 



David Kangni

Avatar

Administrator

Hi @Juan_SebastianQu,

Were you able to resolve this query with the help of the given solutions or do you still need more help here? Do let us know. In case the given solutions were helpful, then kindly choose the one that helped you the most as the 'Correct Reply'.
Thanks!



Sukrity Wadhwa