Issue while migrating HTML from JSON object into text component in AEM

samikshaa223429
Level 1

24-06-2021

Hi,

I am creating a migration tool in Java to migrate content from another CMS into AEM pages. The data is in JSON format and consists of some metadata and HTML areas.

After migration, I convert the HTML areas into text components in AEM, but the data does not display properly on the page.

After decoding the HTML as UTF-8, I see odd characters such as �??

Is there a way to handle this in Java?

I tried StringEscapeUtils and jsoup without any success.

Thanks,

Sam

Accepted Solutions (0)

Answers (3)


Bimmi_Soi
Level 5

24-06-2021

@samikshaa223429: Use @ context='html' in your Sightly (HTL) expression.

vmadala
Level 3

24-06-2021

Hello @samikshaa223429,

As I understand it, your HTML areas contain some non-standard, non-ASCII characters. You need to remove those characters when converting the HTML area to the text component's text by replacing every non-ASCII character with an empty string, something like the following:

 

     String  textContent = htmlContent.replaceAll("[^\\p{ASCII}]", "");
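
To see what that replacement does in practice, here is a small sketch. The class name, method name, and sample string are made up for illustration. Note that the pattern removes every non-ASCII character, including legitimate ones such as accented letters, so it may hide the symptom rather than fix the underlying encoding problem.

```java
public class AsciiStripDemo {
    // Removes every character outside the 7-bit ASCII range.
    static String stripNonAscii(String htmlContent) {
        return htmlContent.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample: contains a valid accented letter (U+00E9)
        // and the Unicode replacement character (U+FFFD).
        String sample = "<p>Caf\u00e9 \uFFFD menu</p>";
        // Both non-ASCII characters are dropped, not just the broken one.
        System.out.println(stripNonAscii(sample));
    }
}
```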

 

Thanks,

Venkat.M

 

ibishika
Level 3

24-06-2021

How are you getting the data? If it is through a servlet, you can set the encoding like this:

response.setCharacterEncoding("UTF-8");

Also, if the displayed text shows HTML tags that should not appear, add @ context='html' right after the text content in the HTL expression.

If the above doesn't help, let me know how you are getting the JSON in your Java code.

samikshaa223429

Hi Ibshikha,

Can you provide details about adding @ context='html' to the text content? Is there a way to add it programmatically in the JCR? I am building the pages programmatically.

Yes, I am using a servlet, and I have already set the character encoding (UTF-8) at the response level, but this does not resolve the issue. I use the code below to fetch the JSON from the API.

StringBuilder json = new StringBuilder();
url = new URL(src);
URLConnection tc = url.openConnection();
// No charset is passed, so InputStreamReader uses the platform default encoding.
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
// Only the first line of the response is read.
String line1 = in.readLine();
json.append(line1);
return json.toString().trim();
The method above fetches the JSON from the API into a StringBuilder. The URL is passed in as src.
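
One thing worth checking in a fetch like this: if the InputStreamReader is created without an explicit charset, it falls back to the platform default, which may not be UTF-8, and a single readLine() call returns only the first line of the response. A minimal sketch that reads the whole stream as UTF-8 follows. It assumes the API really sends UTF-8; the class and method names are hypothetical, and a byte array stands in for the network stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class JsonFetchSketch {
    // Reads the entire stream as UTF-8 text (assumes the source sends UTF-8).
    static String readUtf8(InputStream stream) {
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            // Loop until the end of the stream instead of reading one line.
            while ((line = in.readLine()) != null) {
                json.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return json.toString().trim();
    }

    public static void main(String[] args) {
        // Stand-in for the connection's input stream.
        byte[] body = "{\"html\":\"<p>Caf\u00e9</p>\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(readUtf8(new ByteArrayInputStream(body)));
    }
}
```

In the servlet, the stream would come from tc.getInputStream() rather than a byte array.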

 

Please let me know if you have a solution to try.

 

Thanks,

Samiksha

ibishika
Hi samikshaa223429, I can think of the following:

1. Check the encoding used in both systems, i.e. in AEM and in the system you are getting the data from.
2. Check the data/text that you are receiving. If it is already encoded, it needs to be decoded using the correct encoding and then encoded again using the one AEM uses. In AEM you can check the encoding in this config: Apache Sling Request Parameter Handling.

@ context='html' is used in Sightly (HTL) to display the text content as HTML. From what you describe, the issue should not be related to the context.

I would also suggest debugging to see at what point the weird characters appear: is it when you fetch the text from the third party, or after you save it in the AEM nodes?
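
To make the debugging suggestion concrete, the sketch below shows how the same bytes turn into a replacement character when decoded with the wrong charset, plus a simple check for U+FFFD that could be logged right after the fetch and again after the save. The class name, method names, and sample text are assumptions for illustration, and ISO-8859-1 only stands in for whatever encoding the source system might actually use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheckSketch {
    // True if the text contains U+FFFD, which a decoder inserts for bytes it cannot map.
    static boolean hasReplacementChar(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    // Decodes the bytes with the source charset, then re-encodes them as UTF-8.
    static byte[] transcodeToUtf8(byte[] raw, Charset source) {
        return new String(raw, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = "Caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong charset: the lone 0xE9 byte is not valid UTF-8, so it becomes U+FFFD.
        String wrong = new String(latin1, StandardCharsets.UTF_8);

        // Right charset first, then UTF-8: the text survives intact.
        byte[] utf8 = transcodeToUtf8(latin1, StandardCharsets.ISO_8859_1);
        String right = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(hasReplacementChar(wrong)); // prints "true"
        System.out.println(hasReplacementChar(right)); // prints "false"
    }
}
```

If the check is already true right after the fetch, the problem is in how the response is decoded; if it only becomes true after saving, the write path into AEM is the place to look.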