Find urls in nodes property using regex? | Community
Skip to main content
July 17, 2024
Solved

Find urls in nodes property using regex?

  • July 17, 2024
  • 2 replies
  • 1006 views

I've got a request to replace instances of internal 'http' urls to 'https' inside rich text components.

So I'm using a SQL2 query using CONTAINS to find all nodes that have the text "http://%mysite.com" inside the text property

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND CONTAINS(node.text, 'http://%mysite.com')

This is working but is also matching cases where I have a external link followed by the text "mysite.com" further down.

For example:

 

"Lorem ipsum dolor sit amet, consectetur adipiscing elit <a href="http://mysite.com">internal link</a> Duis viverra orci a ullamcorper mattis." - MATCHES 

 

"Lorem ipsum dolor sit amet, consectetur adipiscing elit <a href="http://wikipedia.com">external link</a> Duis viverra orci a ullamcorper <a href="http://mysite.com">internal link</a> mattis." - ALSO MATCHES

 

"Lorem ipsum dolor sit amet, consectetur adipiscing elit <a href="http://wikipedia.com">external link</a> Duis viverra orci a ullamcorper mysite.com mattis." - ALSO MATCHES

 

So I would like to know if there is a way to modify the LIKE pattern or use REGEX instead to only find internal urls inside a long string.

This post is no longer active and is closed to new replies. Need help? Start a new post to ask your question.
Best answer by kapil_rajoria

Hi @csamaniego97 since your query use CONTAINS(node.text, 'http://%mysite.com') it is matching the external links as well because it literally means http://..(anything)...mysite.com i.e http:// from the external link part then anything then mysite.com text.

For simpler approach you can use:

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND CONTAINS(node.text, '<a href="http://%mysite\.com[^"]*')

 



You can use a more specified regular expression like this:

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND REGEX_CONTAINS(node.text, '(?i)http://([^"]+\.)?mysite\.com[^<]*', 'm')

 

or even more specific

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND REGEX_CONTAINS(node.text, '(?i)http://(?:([^"]+\.)?mysite\.com|(?:[^<]+))(?<=href=")', 'm')

 


You can also try using the LIKE with a wildcard operator:

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND node.text LIKE '%http://%mysite.com%[^<]*%'

 

or even more specific

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND node.text LIKE '%<a href="http://%[^<]*mysite\.com[^"]*%[^>]*%'

 

 

2 replies

Lokesh_Vajrala
Community Advisor
Community Advisor
July 18, 2024

Hi @csamaniego97 

 

Try this pattern http://mysite.com%  as % symbol matches with any characters. 

 

Thanks,

Lokesh

kapil_rajoria
Community Advisor
kapil_rajoriaCommunity AdvisorAccepted solution
Community Advisor
July 18, 2024

Hi @csamaniego97 since your query use CONTAINS(node.text, 'http://%mysite.com') it is matching the external links as well because it literally means http://..(anything)...mysite.com i.e http:// from the external link part then anything then mysite.com text.

For simpler approach you can use:

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND CONTAINS(node.text, '<a href="http://%mysite\.com[^"]*')

 



You can use a more specified regular expression like this:

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND REGEX_CONTAINS(node.text, '(?i)http://([^"]+\.)?mysite\.com[^<]*', 'm')

 

or even more specific

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND REGEX_CONTAINS(node.text, '(?i)http://(?:([^"]+\.)?mysite\.com|(?:[^<]+))(?<=href=")', 'm')

 


You can also try using the LIKE with a wildcard operator:

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND node.text LIKE '%http://%mysite.com%[^<]*%'

 

or even more specific

 

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE('/content/mysite') AND node.text LIKE '%<a href="http://%[^<]*mysite\.com[^"]*%[^>]*%'