Issue while migrating HTML from JSON object into text component in AEM

Level 3

Hi,

I am creating a migration tool in Java to migrate content from another CMS into AEM pages. The data is in JSON format and consists of some metadata and HTML areas.

I am converting the HTML areas into Text components in AEM during the migration; however, the data does not display properly on the page.

I see some weird characters after decoding the HTML as UTF-8, like �??

Is there a way to handle this in Java?

I tried using StringEscapeUtils and jsoup without any success.

 

Thanks,

Sam

7 Replies

Level 6

How are you getting the data? If it is through a servlet, you can set the response encoding like below:

response.setCharacterEncoding("UTF-8");

 

Also, if you are displaying some text and see HTML tags in it that should not appear, you need to add the display context option @ context='html' right after the text content in the HTL (Sightly) expression.

 

If the above doesn't help, let me know how you are getting the JSON in the Java code.

Level 3

Hi Ibshikha,

Can you provide details about adding @ context='html' to the text content? Is there a way of adding it programmatically in the JCR, as I am building the pages programmatically?

Yes, I am using a servlet, and I have already set the character encoding (UTF-8) at the response level; however, this does not resolve the issue. I am using the code below to fetch the JSON from the API.

StringBuilder json = new StringBuilder();
url = new URL(src);
URLConnection tc = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()),"");
String line1 = in.readLine();
json.append(line1);
return json.toString().trim();
The above method is used to fetch the JSON from the API into a StringBuilder. The API URL is passed in as src.

 

Please let me know if you have a solution to try.

 

Thanks,

Samiksha

Level 6
Hi samikshaa223429,

I can think of the below things:

1. Check the encoding type in both systems, i.e. in AEM and in the one from which you are getting the data.
2. Check the data/text that you are receiving. If it is already encoded, it needs to be decoded using the correct encoding type and encoded again using the one in AEM.

In AEM you can check the encoding in this config: Apache Sling Request Parameter Handling.

@ context='html' is used in Sightly (HTL) to display the text content as HTML. From what I understood of your statement, the issue should not be related to the context.

Also, I would request that you debug and see at what point you are getting the weird characters. Is it when you fetch the text from the third party, or after you save it in AEM nodes?
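
For the decoding step mentioned above, here is a minimal sketch of reading the response with an explicit charset. It assumes the source API really serves UTF-8, which is worth confirming from its Content-Type header. The class and method names are made up for illustration, and the ByteArrayInputStream only stands in for url.openStream(). When no charset is given, InputStreamReader falls back to the platform default, which is not necessarily UTF-8, and that is one common way to end up with characters like �.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of AEM or of the original tool.
final class JsonFetchSketch {

    // Decodes the whole stream as UTF-8 instead of relying on the platform default charset.
    static String readUtf8(InputStream stream) {
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            // Read every line of the response, not only the first one.
            while ((line = in.readLine()) != null) {
                json.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return json.toString().trim();
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(): UTF-8 bytes of a JSON snippet with non-ASCII text.
        byte[] body = "{\"html\":\"<p>caf\u00e9 d\u00e9j\u00e0 vu</p>\"}"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(readUtf8(new ByteArrayInputStream(body)));
    }
}
```

If the text already looks wrong right after this step, the source is probably not sending UTF-8; if it only breaks after saving, the problem is more likely on the AEM side.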

Level 4

Hello @samikshaa223429,

 

As per my understanding, there are some special non-standard characters in your HTML areas, i.e. non-ASCII characters. You can remove them while converting the HTML area to the Text component's text by replacing every non-ASCII character with an empty string, something like below:

 

     String  textContent = htmlContent.replaceAll("[^\\p{ASCII}]", "");
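
To show what this replacement does in practice, here is a small sketch with a made-up sample string (the class name and the sample are only for illustration). Note that it removes every non-ASCII character, so legitimate accented letters and typographic quotes are dropped too, which may or may not be acceptable for your content.

```java
// Hypothetical demo class wrapping the one-line replacement above.
final class AsciiStripSketch {

    // Removes every character outside the 7-bit ASCII range.
    static String stripNonAscii(String htmlContent) {
        return htmlContent.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Made-up sample: an accented letter and curly quotes inside a paragraph.
        String sample = "<p>Caf\u00e9 \u201cmenu\u201d</p>";
        // Prints: <p>Caf menu</p>
        System.out.println(stripNonAscii(sample));
    }
}
```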

 

Thanks,

Venkat.M

 

Level 3
Hi Bimmi, currently I am storing it in an AEM Core Text component, so I don't have any Sightly code for rendering the HTML.