
SOLVED

Get list of pages with English content


Level 8

Hi,

 

We have numerous pages under the directory "/content/products/zh-cn/construction-equipment/service".

Some examples include:

/content/products/zh-cn/construction-equipment/service/parts-and-service-for-cobra-petrol-breakers/rodsandbits
/content/products/zh-cn/construction-equipment/service/parts-and-service-for-pneumatic-rock-drills/rodsandbits
It has come to our attention that some of these pages contain English content, despite having their language setting configured as Chinese.

 

 

Can anyone help me get a list of the pages that contain English content?

1 Accepted Solution


Correct answer by
Community Advisor

@Vani1012 

 

There is nothing available out of the box (OOTB) to validate this. However, if image paths are also localized, a zh-cn page that still references EN images is probably untranslated, so you could check the non-EN pages for references to EN images.

 

Sample code is available here: https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/how-to-find-the-list-of-al...

 

 

Alternatively, if any metadata can help (such as tags or a language property), it might be worth checking that as well.
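
The image-reference check suggested above could be sketched roughly as follows. This is only an illustration in plain Java: the EN asset root `/content/dam/products/en/`, the image file names, and the map of page path to referenced image paths are all hypothetical, and in a real AEM project the references would have to be collected from the repository first (for example with the sample code linked above).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EnImageReferenceCheck {

    // Hypothetical DAM root for English assets; adjust to your project's structure.
    static final String EN_ASSET_ROOT = "/content/dam/products/en/";

    /** Returns the zh-cn pages that reference at least one asset under the EN root. */
    static List<String> pagesWithEnImages(Map<String, List<String>> imageRefsByPage) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : imageRefsByPage.entrySet()) {
            boolean isZhPage = entry.getKey().contains("/zh-cn/");
            boolean hasEnImage = entry.getValue().stream()
                    .anyMatch(ref -> ref.startsWith(EN_ASSET_ROOT));
            if (isZhPage && hasEnImage) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Hypothetical input: page path -> image paths referenced by that page.
        Map<String, List<String>> refs = new LinkedHashMap<>();
        refs.put("/content/products/zh-cn/construction-equipment/service/parts-and-service-for-cobra-petrol-breakers/rodsandbits",
                List.of("/content/dam/products/en/rods.png"));
        refs.put("/content/products/zh-cn/construction-equipment/service/parts-and-service-for-pneumatic-rock-drills/rodsandbits",
                List.of("/content/dam/products/zh-cn/rods.png"));

        pagesWithEnImages(refs).forEach(System.out::println);
    }
}
```

A page flagged this way is only a candidate for review; a page can legitimately share an EN image while its text is fully translated.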


Aanchal Sikka


4 Replies


Level 7

Hi @aanchal-sikka 

If she wants a list of the English pages under a specific path, couldn't she use Query Builder with a full-text search? I understand that full-text search has limitations. However, if a specific property is set only for Chinese or only for English content, that could help.

I may be wrong, but I'm open to suggestions.
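
To make the Query Builder idea concrete, here is a minimal sketch that only builds a request URL for the Query Builder JSON servlet (`/bin/querybuilder.json`). The host `http://localhost:4502` and the property `jcr:content/jcr:language` are assumptions used for illustration. In the original question the language is already set to Chinese on the affected pages, so that particular property would not separate them; some other distinguishing property would be needed. A `fulltext` predicate could be added to the parameter map in the same way.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class QueryBuilderUrlSketch {

    /** Builds a Query Builder URL that lists cq:Page nodes under rootPath with property == value. */
    static String buildQuery(String host, String rootPath, String property, String value) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("path", rootPath);          // restrict the search to this subtree
        params.put("type", "cq:Page");         // only return pages
        params.put("property", property);      // property path relative to the page node
        params.put("property.value", value);   // value the property must have
        params.put("p.limit", "-1");           // return all hits instead of the default page size

        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return host + "/bin/querybuilder.json?" + query;
    }

    public static void main(String[] args) {
        // Host and property are placeholders; replace them with values from your environment.
        System.out.println(buildQuery(
                "http://localhost:4502",
                "/content/products/zh-cn/construction-equipment/service",
                "jcr:content/jcr:language",
                "en"));
    }
}
```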


Community Advisor

Hi @Vani1012 ,

Sure, I can help you with a Java program to identify pages with English content. Here’s how you can do it:

  1. Crawl the Directory: Fetch all pages under the specified directory.
  2. Fetch Page Content: For each URL, download the HTML content.
  3. Language Detection: Analyze the text content of each page to determine its language.
  4. List English Pages: Compile a list of pages where English content is detected.

Below is a Java program that demonstrates this approach:

Java Program

Step 1: Add Dependencies

You'll need the following libraries:

  • jsoup for HTML parsing
  • Apache Tika (tika-langdetect) for language detection

Add these dependencies to your pom.xml if you're using Maven:

 

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.14.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-langdetect-optimaize</artifactId>
        <version>2.1.0</version>
    </dependency>
</dependencies>

 

Step 2: Java Code

Here's the Java code to fetch pages, detect the language, and list pages with English content:

 

import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class EnglishContentDetector {

    public static void main(String[] args) throws IOException {
        // Placeholder host; replace with the actual base URL of your site.
        String baseUrl = "https://www.example.com";
        List<String> urls = List.of(
            "/content/products/zh-cn/construction-equipment/service/parts-and-service-for-cobra-petrol-breakers/rodsandbits",
            "/content/products/zh-cn/construction-equipment/service/parts-and-service-for-pneumatic-rock-drills/rodsandbits"
            // Add more URLs as needed
        );

        List<String> englishPages = new ArrayList<>();
        // Load the language models once and reuse the detector for every page.
        LanguageDetector detector = LanguageDetector.getDefaultLanguageDetector().loadModels();

        for (String url : urls) {
            String fullUrl = baseUrl + url;
            try {
                Document doc = Jsoup.connect(fullUrl).get();
                String text = doc.body().text();

                // detect() returns the single most likely language for the whole text.
                LanguageResult result = detector.detect(text);
                if ("en".equals(result.getLanguage())) {
                    englishPages.add(fullUrl);
                }
            } catch (IOException e) {
                // Skip pages that cannot be fetched instead of aborting the whole run.
                System.err.println("Skipping " + fullUrl + ": " + e.getMessage());
            }
        }

        System.out.println("Pages with English content:");
        for (String page : englishPages) {
            System.out.println(page);
        }
    }
}

 

Explanation

  1. Dependencies: We use jsoup for fetching and parsing HTML content and Apache Tika for language detection.
  2. Fetching URLs: The list urls should contain all URLs under the specified directory. This list can be populated using a web crawler or manually.
  3. Fetching Content: We send an HTTP GET request to each URL to fetch its content using Jsoup.connect(fullUrl).get().
  4. Language Detection: We use Apache Tika’s LanguageDetector to detect the language of the text content.
  5. Listing English Pages: If a page is detected to have English content, its URL is added to the englishPages list.
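
One caveat with step 4: a detector reports a single dominant language, so a page that mixes Chinese and English text may not be flagged. As a complementary heuristic (not part of Tika, and not something the original answer proposes), the share of Latin-script letters in the page text can be measured with the standard `Character.UnicodeScript` API. The class and method names below are hypothetical, and any threshold for "too much English" would have to be tuned against your own pages.

```java
public class ScriptRatio {

    /** Returns the fraction of letters in the text that are Latin script (0.0 if there are no letters). */
    static double latinLetterRatio(String text) {
        int latin = 0;
        int letters = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);   // advance by one full code point
            if (!Character.isLetter(cp)) {
                continue;                   // ignore digits, punctuation, whitespace
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN) {
                latin++;
            }
        }
        return letters == 0 ? 0.0 : (double) latin / letters;
    }

    public static void main(String[] args) {
        System.out.println(latinLetterRatio("Rods and bits"));
        System.out.println(latinLetterRatio("钎杆和钎头"));
    }
}
```

The text passed in would be the same `doc.body().text()` string used for language detection, so the two checks can run side by side.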

Notes:

  • Replace https://www.example.com with the actual base URL of your website.
  • Ensure you have permission to crawl the website.
  • Adjust the list of urls to include all pages under the specified directory.

This approach should help you identify pages with English content in the specified directory.





Level 7

Thanks, ChatGPT. I've been following your posts, and this is considered spamming. I am reporting you.

You should have a word with the administrators; this is getting out of hand. It's very disrespectful to the community, and you should really stop. Spamming is not good, and it also shows us that you are not a professional, because anyone can use ChatGPT to post answers to a question.