Broken Page References AEM


Problem Statement:

How do you get a list of all the broken references in AEM?


Requirement:

Get a list of all the broken references using MCP and produce a report.


Introduction:

MCP (Manage Controlled Processes, part of ACS AEM Commons) provides a Broken Reference Report out of the box, which can be used to list all the broken references in the content repository.

Broken Reference Report

It’s highly recommended to follow these guidelines when running this process:

  1. Run it during off hours
  2. Don’t run it at the root level
  3. Run it on 2nd-level or 3rd-level pages

How to run this process?

Provide the source path.

Provide the regex so that only references pointing to /content or /etc (that is, references into AEM) are considered.

You can also list properties to exclude, which speeds up the traversal of nodes.

If you want to verify broken links inside RTE fields or other HTML properties, check the deep check checkbox and provide the list of properties.
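
To make the regex step concrete, here is a minimal sketch of how such a pattern separates AEM-internal references from everything else. The pattern `^/(etc|content)/.+` and the class name `InternalLinkFilter` are assumptions for illustration; substitute whatever expression you actually enter in the regex field.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class InternalLinkFilter {

    // Assumed pattern: a path that starts with /etc/ or /content/ and has
    // at least one more character after the slash.
    static final Pattern INTERNAL = Pattern.compile("^/(etc|content)/.+");

    // Keeps only the values that match the pattern in full.
    public static List<String> internalOnly(List<String> values) {
        return values.stream()
                .filter(v -> INTERNAL.matcher(v).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "/content/wknd/us/en/about",
                "/etc/designs/wknd",
                "https://example.com/page",
                "Just some text");
        // prints [/content/wknd/us/en/about, /etc/designs/wknd]
        System.out.println(internalOnly(values));
    }
}
```

Only values that match in full are kept, so external URLs and plain text are never treated as candidate references.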

However, the above process has a few issues, notably:

  1. HTML properties are not working as expected

To fix this, we customize the process so that it checks HTML-level references using the jsoup API.

Add the following dependencies to your pom.xml (no version element is given here, so this assumes the versions are managed in a parent POM):

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.adobe.acs</groupId>
    <artifactId>acs-aem-commons-bundle</artifactId>
    <scope>provided</scope>
</dependency>

Copy the Broken Reference report code into your local project.

Then add the following code at line 207:

// Requires imports: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.select.Elements
if (htmlFields.contains(property.getKey())) {
    // For HTML properties, replace each raw value with the hrefs of its anchor tags
    stream = stream.flatMap(val -> {
        try {
            Document doc = Jsoup.parse(val);
            Elements anchors = doc.select("a");
            return anchors.stream().map(link -> link.attr("href"));
        } catch (Exception e) {
            // A value that cannot be parsed contributes no links instead of failing the run
            log.warn("Could not parse links from property value of {}", property.getKey(), e);
            return Stream.empty();
        }
    });
}
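
To see the shape of this change without an AEM instance, the sketch below reproduces the same stream pattern using only the JDK. It is an illustration under stated assumptions: the class `HtmlFieldLinks` and its methods are hypothetical, and a deliberately naive regular expression stands in for jsoup, so it only recognizes double-quoted `href` attributes and is not a substitute for a real HTML parser.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class HtmlFieldLinks {

    // Naive stand-in for jsoup (assumption): captures the value of href="..."
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Hypothetical helper: returns the candidate references held by one property.
    public static List<String> collect(String name, Object value, Set<String> htmlFields) {
        Stream<String> stream;
        if (value instanceof String[]) {
            stream = Arrays.stream((String[]) value);   // multi-valued property
        } else if (value instanceof String) {
            stream = Stream.of((String) value);         // single-valued property
        } else {
            stream = Stream.empty();                    // other types hold no paths
        }
        if (htmlFields.contains(name)) {
            // HTML field: swap each raw value for the links found inside it
            stream = stream.flatMap(HtmlFieldLinks::hrefs);
        }
        return stream.collect(Collectors.toList());
    }

    // Extracts every href value from one HTML fragment, in document order.
    public static Stream<String> hrefs(String html) {
        Matcher m = HREF.matcher(html);
        Stream.Builder<String> out = Stream.builder();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out.build();
    }

    public static void main(String[] args) {
        Set<String> htmlFields = Collections.singleton("text");
        String rte = "<p>See <a href=\"/content/wknd/us/en/missing\">this</a> and "
                + "<a href=\"https://example.com\">that</a>.</p>";
        // prints [/content/wknd/us/en/missing, https://example.com]
        System.out.println(collect("text", rte, htmlFields));
        // prints [/content/wknd/us/en/about]
        System.out.println(collect("linkURL", "/content/wknd/us/en/about", htmlFields));
    }
}
```

A property listed as an HTML field contributes one candidate per link rather than one candidate per value, which is why an RTE value containing several anchors can produce several report rows.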

When we run it on the WKND site, the result looks something like this:

Broken Reference Report