Expand my Community achievements bar.

SOLVED

Find urls in nodes property using regex?

Avatar

Level 1

I've got a request to replace instances of internal 'http' urls to 'https' inside rich text components.

So I'm using a SQL2 query using CONTAINS to find all nodes that have the text "http://%mysite.com" inside the text property

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND CONTAINS(node.text, 'http://%mysite.com')

This is working but is also matching cases where I have a external link followed by the text "mysite.com" further down.

For example:

 

"Lorem ipsum dolor sit amet, consectetur adipiscing elit <a href="http://mysite.com">internal link</a> Duis viverra orci a ullamcorper mattis." - MATCHES 

 

"Lorem ipsum dolor sit amet, consectetur adipiscing elit <a href="http://wikipedia.com">external link</a> Duis viverra orci a ullamcorper <a href="http://mysite.com">internal link</a> mattis." - ALSO MATCHES

 

"Lorem ipsum dolor sit amet, consectetur adipiscing elit <a href="http://wikipedia.com">external link</a> Duis viverra orci a ullamcorper mysite.com mattis." - ALSO MATCHES

 

So I would like to know if there is a way to modify the LIKE pattern or use REGEX instead to only find internal urls inside a long string.

Topics

Topics help categorize Community content and increase your ability to discover relevant content.

1 Accepted Solution

Avatar

Correct answer by
Level 6

Hi @csamaniego97 since your query use CONTAINS(node.text, 'http://%mysite.com') it is matching the external links as well because it literally means http://..(anything)...mysite.com i.e http:// from the external link part then anything then mysite.com text.

For simpler approach you can use:

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND CONTAINS(node.text, '<a href="http://%mysite\.com[^"]*')

 



You can use a more specified regular expression like this:

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND REGEX_CONTAINS(node.text, '(?i)http://([^"]+\.)?mysite\.com[^<]*', 'm')

 

or even more specific

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND REGEX_CONTAINS(node.text, '(?i)http://(?:([^"]+\.)?mysite\.com|(?:[^<]+))(?<=href=")', 'm')

 


You can also try using the LIKE with a wildcard operator:

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND node.text LIKE '%http://%mysite.com%[^<]*%'

 

or even more specific

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND node.text LIKE '%<a href="http://%[^<]*mysite\.com[^"]*%[^>]*%'

 

 

View solution in original post

2 Replies

Avatar

Community Advisor

Hi @csamaniego97 

 

Try this pattern http://mysite.com%  as % symbol matches with any characters. 

 

Thanks,

Lokesh

Avatar

Correct answer by
Level 6

Hi @csamaniego97 since your query use CONTAINS(node.text, 'http://%mysite.com') it is matching the external links as well because it literally means http://..(anything)...mysite.com i.e http:// from the external link part then anything then mysite.com text.

For simpler approach you can use:

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND CONTAINS(node.text, '<a href="http://%mysite\.com[^"]*')

 



You can use a more specified regular expression like this:

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND REGEX_CONTAINS(node.text, '(?i)http://([^"]+\.)?mysite\.com[^<]*', 'm')

 

or even more specific

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND REGEX_CONTAINS(node.text, '(?i)http://(?:([^"]+\.)?mysite\.com|(?:[^<]+))(?<=href=")', 'm')

 


You can also try using the LIKE with a wildcard operator:

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND node.text LIKE '%http://%mysite.com%[^<]*%'

 

or even more specific

 

SELECT * FROM [nt:unstructured] AS node
WHERE ISDESCENDANTNODE('/content/mysite')
AND node.text LIKE '%<a href="http://%[^<]*mysite\.com[^"]*%[^>]*%'