Broken Page References AEM


Problem Statement:

How to get the list of all the broken references in AEM?


Requirement:

Get a List of all the broken references using MCP and provide the report


Introduction:

OOTB we get a Broken reference report provided by MCP, which can be used to get all the broken references in the content repo.

Broken Refernce Report

It’s highly recommended to run this process during

  1. off hours
  2. Don’t run on the root level
  3. Run it on 2nd level or 3rd level pages

How to run this process?

Provide Source path

Provide the regex so that it will consider only the references which point to /content or /etc (points to AEM)

You can also provide exclude properties to improve the traversal of nodes.

If you want to verify any broken links in the RTE fields or properties, then check the deep check checkbox and provide the properties list.

But the above process has a few issues.

  1. Html properties are not working as expected

We need a few customizations to this process by making a few changes to check HTML level references by adding JSOUP API

Add the following dependencies to your POM.xml

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.adobe.acs</groupId>
    <artifactId>acs-aem-commons-bundle</artifactId>
    <scope>provided</scope>
</dependency>

Get the following Broken reference code into your local as shown below:

Add the following code as shown below:

if (htmlFields.contains(property.getKey())) {
            stream = stream.flatMap(val -> {
                try {
                    Document doc = Jsoup.parse(val);
                    Elements anchors = doc.select("a");
                    return anchors.stream().map(link -> link.attr("href"));
                } catch (Exception e) {
                    log.warn("Could not parse links from property value of {}", property.getKey(), e);
                    return Stream.empty();
                }
            });
        }
At Line number 207

When we run it on wknd site it would look something like this:

Broken Reference Report

Broken Asset References AEM


Problem Statement:

Get a List of all the Assets which are missing references


Requirement:

Get the list of broken asset references to unpublish and remove them repo to improve the system stability and performance.


Introduction:

How do assets get published?

  1.  The author uploads the images and publishes the assets
  2. Create a launcher and workflow which process assets metadata and publish the pages
  3. Whenever we publish any pages and if the page has references to assets, then during publishing, it asks to replicate the references as well.

What happens when the page is unpublished?

  1.  When the page is deactivated, assets referenced to the page will not be deactivated because this asset might have reference to the other pages hence out of the box assets won’t be deactivated.
  2. If we perform cleanup, deactivate and delete old pages, we might not be cleaning up assets related to this page.

Advantages of cleaning up old assets?

  1. Drastically reduces repository size
  2. Improves DAM Asset search
  3. Improves indexing

Get Publish Report using Assets Report:

  • Go to Tools -> Assets -> Reports as shown below:
Asset Reports
  • Click on create and click on Publish report
Select Publish report
  • Provide folder path and start date and end date
Add Report details
  • Select the columns as per requirement
Configure columns for the report
  • Finally, report will be ready with all the assets lists as shown below
Completed Reports
  • Download the report to see the final list of images
Example Report CSV file

If Images are unpublished then we can ask authors to review and delete them

If images are published but has no references to figure this out, we need a new process.

Broken Asset reference:

Add the following dependencies to your pom.xml

        <dependency>
            <groupId>org.jetbrains</groupId>
            <artifactId>annotations</artifactId>
            <version>17.0.0</version>
        </dependency>
        <dependency>
            <groupId>com.adobe.acs</groupId>
            <artifactId>acs-aem-commons-bundle</artifactId>
            <version>5.2.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.13.1</version>
            <scope>provided</scope>
        </dependency>

Create a new MCP process by calling the MCP service and providing the implementation class:

package com.mysite.core.mcp;

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import com.adobe.acs.commons.mcp.ProcessDefinitionFactory;
import com.day.cq.replication.Replicator;

@Component(service = ProcessDefinitionFactory.class)
public class BrokenAssetsFactory extends ProcessDefinitionFactory<BrokenAssets> {

    @Reference
    Replicator replicator;

    @Override
    public String getName() {
        return "Broken Asset References";
    }

    @Override
    public BrokenAssets createProcessDefinitionInstance() {
        return new BrokenAssets(replicator);
    }
}

Create an implementation class as shown below:

package com.mysite.core.mcp;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;
import javax.jcr.RepositoryException;
import org.apache.commons.collections4.ListUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.jackrabbit.util.Text;
import org.apache.sling.api.request.RequestParameter;
import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.PersistenceException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.jetbrains.annotations.NotNull;
import com.adobe.acs.commons.data.CompositeVariant;
import com.adobe.acs.commons.data.Spreadsheet;
import com.adobe.acs.commons.fam.ActionManager;
import com.adobe.acs.commons.mcp.ProcessDefinition;
import com.adobe.acs.commons.mcp.ProcessInstance;
import com.adobe.acs.commons.mcp.form.Description;
import com.adobe.acs.commons.mcp.form.FileUploadComponent;
import com.adobe.acs.commons.mcp.form.FormField;
import com.adobe.acs.commons.mcp.form.RadioComponent;
import com.adobe.acs.commons.mcp.model.GenericReport;
import com.adobe.acs.commons.mcp.model.ManagedProcess;
import com.day.cq.replication.ReplicationStatus;
import com.day.cq.replication.Replicator;

/**
 * Relocate Pages and/or Sites using a parallelized move process
 */
public class BrokenAssets extends ProcessDefinition {
	
	private static final String SOURCE_PATH = "Path";
	private static final String CONTAINS_QUERY = "CONTAINS(s.*, '%s') or ";
	private final GenericReport report = new GenericReport();
	List<EnumMap<ReportColumns, String>> reportData = Collections.synchronizedList(new ArrayList<>());
	
	List<String> assetsList = new LinkedList<>();
	
    public enum PublishMethod {
        @Description("Select this option to generate Broken References Report")
        BROKEN_REFERENCE_REPORT
    }
    
    Replicator replicatorService;
    
    @FormField(name = "Mapping Excel", component = FileUploadComponent.class)
    private RequestParameter mappingExcel;
    private Spreadsheet spreadsheet;
    
    @FormField(name = "Type",
            description = "Please select the oprporiate option to execute",
            component = RadioComponent.EnumerationSelector.class,
            required = true,
            options = {"vertical", "default=BROKEN_REFERENCE_REPORT"})
    public PublishMethod publishMethod = PublishMethod.BROKEN_REFERENCE_REPORT;

	@FormField(name="Content Path",
			description="Content Path for search results",
			hint="/content",
			options={"default=/content"})
	public String contentPath = "/content";
    
    @FormField(name="Chunk count",
            description="Max number of chunk the search results",
            hint="4500",
            options={"default=3000"})
    public int chunkCount = 3000;
    
    @FormField(name="Retries",
            description="Max number of retries per commit",
            hint="2",
            options={"default=2"})
    public int retryCount = 2;
    
    @FormField(name="Retry delay",
            description="Delay between retries (in milliseconds)",
            hint="500,1000,...",
            options={"default=500"})
    public int retryWait = 500;
    
    public BrokenAssets(Replicator replicator) {
    	replicatorService = replicator;
	}

    @Override
    public void init() throws RepositoryException {    	
    	if(null != mappingExcel && mappingExcel.getSize() > 0) {
    		try {
                // Read spreadsheet
        		spreadsheet = new Spreadsheet(mappingExcel, SOURCE_PATH).buildSpreadsheet();
        		spreadsheet.getDataRowsAsCompositeVariants().forEach(row -> assetsList.add(getString(row, SOURCE_PATH)) );
            } catch (IOException e) {
                throw new RepositoryException("Unable to process spreadsheet", e);
            }
    	}    	    	    	
    }

    ManagedProcess instanceInfo;

    @Override
    public void buildProcess(ProcessInstance instance, ResourceResolver rr) throws LoginException, RepositoryException {
        instanceInfo = instance.getInfo();
        switch (publishMethod.name().toLowerCase()) {
	        case "broken_reference_report":
	        	instance.getInfo().setDescription("Collecting references");
	        	instance.defineAction("Searching Refs", rr, this::collectRefernce);
	            break;
        }        
    }

    @Override
    public void storeReport(ProcessInstance instance, ResourceResolver rr) throws RepositoryException, PersistenceException {    	
        report.setRows(reportData, ReportColumns.class);
        report.persist(rr, instance.getPath() + "/jcr:content/report");
    }

    public enum ReportColumns {
    	ASSET_PATH
    }
            
    protected void collectRefernce(ActionManager manager) {
    	manager.deferredWithResolver(this::collectReferences); 
    }
    
    private void collectReferences(ResourceResolver resourceResolver) throws Exception {    	
		if(!assetsList.isEmpty()) {						 
			RetryUtils.withRetry(retryCount, retryWait, () -> {
			    List<List<String>> chunkedAssetList = ListUtils.partition(assetsList, chunkCount);					
			    int counter = 0;
			    while(!chunkedAssetList.isEmpty() && counter < chunkedAssetList.size()) {
			        List<String> chunk = chunkedAssetList.get(counter);
			        counter++;
			        @NotNull Iterator<Resource> resourceResults = buildSQLQueryAndFetchResults(resourceResolver, contentPath, chunk);
			        collectReferences(resourceResults, chunk);
			    }
			});	
			if(!assetsList.isEmpty()) {
				assetsList.stream().forEach(this::reportResult);
			}
		}	        			
    }

	private void collectReferences(@NotNull Iterator<Resource> resourceResults, List<String> chunk) {
		while(resourceResults.hasNext()) {
			Resource resultRes = resourceResults.next();
			if (null != resultRes) {
				resultRes.getValueMap().entrySet().forEach(property -> {
					if(!property.getKey().equalsIgnoreCase("dam:folderThumbnailPaths")) {
						Object prop = property.getValue();
						if (prop.getClass() == String[].class) {
							List<String> propertyValue = Arrays.asList((String[]) prop);
							if (!propertyValue.isEmpty()) {
								List<String> matchingAsset = chunk.stream().filter(
										assetPat -> propertyValue.stream().anyMatch(sam -> sam.contains(assetPat)))
										.collect(Collectors.toList());
								if(!matchingAsset.isEmpty()) {
									chunk.removeIf(matchingAsset::contains);
									assetsList.removeIf(matchingAsset::contains);
								}
							}
						} else if (prop.getClass() == String.class) {
							String propertyValue = (String) prop;
							if (StringUtils.isNotEmpty(propertyValue) ) {
								Optional<String> matchingAsset = chunk.stream().filter(propertyValue::contains).findAny();
								if(matchingAsset.isPresent()) {
									chunk.remove(matchingAsset.get());
									assetsList.remove(matchingAsset.get());
								}
							}
						}
					}
				});
			}
		}
	}

	private void reportResult(String path) {
		EnumMap<ReportColumns, String> row = new EnumMap<>(ReportColumns.class);
		row.put(ReportColumns.ASSET_PATH, path);		
		reportData.add(row);
	}	
	
	private @NotNull Iterator<Resource> buildSQLQueryAndFetchResults(ResourceResolver resolver, String contentPath, List<String> chunk) {
        String groupStr = chunk.stream().map(r -> String.format(CONTAINS_QUERY, Text.escapeIllegalXpathSearchChars(r).replaceAll("'", "''"))).collect(Collectors.joining());        
		String querySt = "SELECT * FROM [nt:base] AS s WHERE ISDESCENDANTNODE(["+contentPath+"]) and "+ StringUtils.substringBeforeLast(groupStr, "or");
		return resolver.findResources(querySt, "JCR-SQL2");
    }
	
	@SuppressWarnings({"rawtypes", "unchecked"})
    private String getString(Map<String, CompositeVariant> row, String path) {
        CompositeVariant v = row.get(path.toLowerCase(Locale.ENGLISH));
        if (v != null) {
            return (String) v.getValueAs(String.class);
        } else {
            return null;
        }
    }
}
package com.mysite.core.mcp;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RetryUtils {
    private static final Logger log = LoggerFactory.getLogger(RetryUtils.class);

    public static interface CallToRetry {
        void process() throws Exception;
    }

    public static boolean withRetry(int maxTimes, long intervalWait, CallToRetry call) throws Exception {
        if (maxTimes <= 0) {
            throw new IllegalArgumentException("Must run at least one time");
        }
        if (intervalWait <= 0) {
            throw new IllegalArgumentException("Initial wait must be at least 1");
        }
        Exception thrown = null;
        for (int i = 0; i < maxTimes; i++) {
            try {
                call.process();
                return true;
            } catch (Exception e) {
                thrown = e;
                log.info("Encountered failure on {} due to {}, attempt retry {} of {}", call.getClass().getName() , e.getMessage(), (i + 1), maxTimes, e);
            }
            try {
                Thread.sleep(intervalWait);
            } catch (InterruptedException wakeAndAbort) {
                break;
            }
        }
        throw thrown;
    }
}

Running Asset Reference Process:

After building the code you can see the new Process showing up in MCP

Borken Asset Refernce Process

Copy the Path column into the new Excel sheet as shown below

Path column into new excel file

Upload into the process and start to see all the images which are published yet unreferenced as shown below

Why does Chunk count?

Chunk count helps the SQL 2 query to group by the paths, which will be maxing 4500 and it won’t take more than that (configurable based on the environment). However basically, if we have 20000 / 4500 = 4.44 ~ 5 we will be running the query max five times to generate the below report

Share the report with the content authors team to validate if images are required if not plan to clean up using

After completing the process
Example report and we can download and share with Authors

Clean up Process:

Authors don’t have to unpublish and delete individual images, then can you use the below process to upload the excel sheet with all the approved image paths and upload it to the process to deactivate and delete

AEM Publish / UnPublish / Delete List of pages – MCP Process

AEM Asset (Repo) Cleanup – ACS Commons Renovator

Problem Statement:

How to clean up the growing repo? How to safely delete all the unwanted assets and pages.

Requirement:

Find the references of all the assets and pages and clean up unreferenced assets and unwanted pages.

Introduction:

You can call the below process as asset, pages references report.

Usually, with growing repo size, we usually do logs rotation and archiving, we also do some compactions (Revision cleanup).

What if we could remove some of the deactivated and unreferenced assets or pages?

How to find references of assets or pages?

Go to the following url https://{domain}/apps/acs-commons/content/manage-controlled-processes.html and click on Start Process and select Renovator process as shown below:

Start Process
Renovator Process

I am trying to check the references of all the assets under the following path:

Source path: /content/dam/wknd/en/activities

And select some random path into the Destination path: /content/dam/wknd/en/magazine

And please do make sure to check the Dry run and Detailed Report checkboxes, if not checked all the assets will be moved to the new folder i.e, /content/dam/wknd/en/magazine

Process fields selections

Once you start the process you would see the process take some time to run and you can click on the process and open the view or download the excel report as shown:

Process result page
View the results popup

Once downloaded delete the following columns:

Remove unwanted columns

You can see some of the rows have empty references and if you think these assets are no more required then you can remove them

Unreferenced rows

How to remove the unreferenced assets or pages safely?

Use the following resource to learn more about:

AEM Publish / UnPublish / Delete List of pages – MCP Process

You can run through the above steps on any of the folders and please make sure to avoid running on root folders or pages like: /content/dam or /content or home pages because would slow down the servers

For more information on how to use the renovator process for

AEM Bulk move pages – MCP