Delete Unreferenced Assets AEM

Problem Statement:

Delete all the assets which don’t have references to improve AEM performance in turn Indexes and search/query performance.

Introduction:

How do assets get published?

  1. The author uploads the images and publishes the assets
  2. Create a launcher and workflow which processes assets metadata and publishes the pages
  3. Whenever we publish any pages and if the page has references to assets, then during publishing, it asks to replicate the references as well.

What happens when the page is unpublished?

  1. When the page is deactivated, assets referenced to the page will not be deactivated because this asset might have reference to the other pages hence out of the box assets won’t be deactivated.
  2. If we perform cleanup, deactivate and delete old pages, we might not be cleaning up assets related to this page.

What advantages of cleaning up of old assets?

  1. Drastically reduces repository size
  2. Improves DAM Asset search
  3. Improves indexing

Generate Published Asset Report by visiting:

Go to Tools -> Assets -> Reports as shown below:

Asset Report Tool Section

Click on create and click on Publish report

Asset Publish Report

Provide folder path and start date and end date

Asset Report Configure Page

Select the columns as per the requirement

Asset Report Custom Column Page

Finally, the report will be ready with all the assets lists as shown below

Asset Report

Download the report to see the final list of images

DAM Report result

To generate the report let’s create a new AEM tool:

AEM OOTB comes with multiple tools in AEM and to access all the tools you need to navigate to the tool section and select the appropriate sections to perform all the operations

For example:
  1. AEM operations
  2. Managing templates
  3. Cloud configurations
  4. ACS Commons tools etc..

Tools are an essential part of AEM and avoid any dependency on Groovy scripts or any add scripts and can be managed or extended at any given point in time.

Broken Asset Report generates a report of all the unreferenced assets by running a Reference Search query across the repository every 30s (by default), you can update the scheduler expression based on your repository size.

The scheduler also checks for the CPU/HEAP size before triggering the reference search process and for more details on Throttled scheduler please refer to the link.

In order to create a perfect content backup package please use the Tool generator to give your tool name and descriptions: https://kiransg.com/2022/11/24/aem-tool-create-generate-tool-from-scratch

Please refer to my GitHub repository for working code on Broken Asset Reference:

Once the repository is built and deployed, you will be able to access the Broken Asset Reference report as shown below:

Broken Asset Reference Tool Section

You can select the report just now created by you in the drop down as shown below:

Select the Asset Report from dropdown

Provide scheduler expression as per your needs and select the result refresh interval for every 10s or as per your needs

You can see the results as shown below once the process is kicked off and it will also show the current row its processing and also CPU/Heap usage.

Reference Search Running status

For some reason, if your system CPU/Heap is throttling then from the backend it takes care of not running your scheduler or you can also manually unschedule the scheduler.

Once your system’s CPU comes back to normal you can go back and select the report and schedule again and report generation picks from the current row where it was left off.

Once the processing is complete click on the report name to download the generated report.

Reference Search Completed Status
Report Excel with Has reference Column

Cross verification:

  1. You can rerun the generated report on the MCP broken asset reference.
  2. Generate the Splunk (logs) results by running a query to get all the assets to call (/content/dam) requests on dispatcher/publisher from the past 1year or so.
  3. You can also reach out to the Analytics team, requesting image impressions (data on image usage) from the past 1 year or so.

Please provide your valuable feedback in the comments.

AEM Performance Optimization CPU/HEAP Status

Problem Statement:

As a developer or user, I would like to make an informed decision by fetching the AEM system CPU/Heap status before or while running a process.

Introduction:

Use cases for developers:

  1. Infinite loops – a coding error
  2. Garbage collection is not handled – unclosed streams
  3. An exception like out of bound issue
  4. Heap size issue – saving loads of data or declaring/manipulating too many strings

Java MX Bean is an API that provides detailed information on JVM CPU/MEM (Heap) status.

A platform MXBean is a managed bean that conforms to the JMX Instrumentation Specification and only uses a set of basic data types. A JMX management application and the platform MBeanServer can interoperate without requiring classes for MXBean specific data types. The data types being transmitted between the JMX connector server and the connector client are open types and this allows interoperation across versions. See the specification of MXBeans for details.

Each platform MXBean is a PlatformManagedObject and it has a unique ObjectName for registration in the platform MBeanServer as returned by the getObjectName method.

As developer before running or while running any bulk process or schedulers, it’s always better to get system information.

ThrottledTaskRunnerStats Service:

Create ThrottledTaskRunnerStats service as shown below:

package com.aem.operations.core.services;

import javax.management.InstanceNotFoundException;
import javax.management.ReflectionException;

/**
 * Private interface for exposing ThrottledTaskRunner stats
 * **/
public interface ThrottledTaskRunnerStats {
    /**
     * @return the % of CPU being utilized.
     * @throws InstanceNotFoundException
     * @throws ReflectionException
     */
    double getCpuLevel() throws InstanceNotFoundException, ReflectionException;

    /**
     * The % of memory being utilized.
     * @return
     */
    double getMemoryUsage();

    /***
     * @return the OSGi configured max allowed CPU utilization.
     */
    double getMaxCpu();

    /***
     * @return the OSGi configured max allowed Memory (heap) utilization.
     */
    double getMaxHeap();

    /**
     * @return the max number of threads ThrottledTaskRunner will use to execute the work.
     */
    int getMaxThreads();
}

ThrottledTaskRunnerImpl Service Implementation:

Create ThrottledTaskRunnerImpl service implementation as shown below:

package com.aem.operations.core.services.impl;

import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanException;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.ReflectionException;
import javax.management.openmbean.CompositeData;
import com.aem.operations.core.services.ThrottledTaskRunnerStats;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Modified;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Component(service = ThrottledTaskRunnerStats.class, immediate = true, name = "Throttled Task Runner Service Stats")
public class ThrottledTaskRunnerImpl implements ThrottledTaskRunnerStats {

    private static final Logger LOGGER = LoggerFactory.getLogger(ThrottledTaskRunnerImpl.class);
    private final MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
    private ObjectName osBeanName;
    private ObjectName memBeanName;

    @Activate
    @Modified
    protected void activate() {
        try {
            memBeanName = ObjectName.getInstance("java.lang:type=Memory");
            osBeanName = ObjectName.getInstance("java.lang:type=OperatingSystem");
        } catch (MalformedObjectNameException | NullPointerException ex) {
            LOGGER.error("Error getting OS MBean (shouldn't ever happen) {}", ex.getMessage());
        }
    }

    @Override
    public double getCpuLevel() throws InstanceNotFoundException, ReflectionException {
        // This method will block until CPU usage is low enough
        AttributeList list = mbs.getAttributes(osBeanName, new String[]{"ProcessCpuLoad"});

        if (list.isEmpty()) {
            LOGGER.error("No CPU stats found for ProcessCpuLoad");
            return -1;
        }

        Attribute att = (Attribute) list.get(0);
        return (Double) att.getValue();
    }

    @Override
    public double getMemoryUsage() {
        try {
            Object memoryusage = mbs.getAttribute(memBeanName, "HeapMemoryUsage");
            CompositeData cd = (CompositeData) memoryusage;
            long max = (Long) cd.get("max");
            long used = (Long) cd.get("used");
            return (double) used / (double) max;
        } catch (AttributeNotFoundException | InstanceNotFoundException | MBeanException | ReflectionException e) {
            LOGGER.error("No Memory stats found for HeapMemoryUsage", e);
            return -1;
        }
    }

    @Override
    public double getMaxCpu() {
        return 0.75;
    }

    @Override
    public double getMaxHeap() {
        return 0.85;
    }

    @Override
    public int getMaxThreads() {
        return Math.max(1, Runtime.getRuntime().availableProcessors()/2);
    }
}

CPU Status Servlet:

Create a CpuStatusServlet based on the path as shown below: 

package com.aem.operations.core.servlets;

import java.io.IOException;
import java.io.PrintWriter;
import java.text.MessageFormat;
import javax.management.InstanceNotFoundException;
import javax.management.ReflectionException;
import javax.servlet.Servlet;
import javax.servlet.ServletException;
import com.aem.operations.core.services.ThrottledTaskRunnerStats;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingAllMethodsServlet;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.gson.Gson;
import com.google.gson.JsonObject;

@Component(service = { Servlet.class }, property = { "sling.servlet.paths=" + CpuStatusServlet.RESOURCE_PATH,
        "sling.servlet.methods=POST" })
public class CpuStatusServlet extends SlingAllMethodsServlet{

    private static final Logger LOGGER = LoggerFactory.getLogger(CpuStatusServlet.class);

    private static final long serialVersionUID = 1L;
    public static final String RESOURCE_PATH = "/bin/cpustatust";
    private static final String MESSAGE_FORMAT = "{0,number,#%}";

    @Reference
    private transient ThrottledTaskRunnerStats ttrs;

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws ServletException, IOException {
        JsonObject jsonResponse = getSystemStats();
        try (PrintWriter out = response.getWriter()) {
            out.print(new Gson().toJson(jsonResponse));
        }
    }

    private JsonObject getSystemStats() {
        JsonObject json = new JsonObject();
        try {
            json.addProperty("cpu", MessageFormat.format(MESSAGE_FORMAT, ttrs.getCpuLevel()));
        } catch (InstanceNotFoundException | ReflectionException e) {
            LOGGER.error("Could not collect CPU stats {}", e.getMessage());
            json.addProperty("cpu", -1);
        }
        json.addProperty("mem", MessageFormat.format(MESSAGE_FORMAT, ttrs.getMemoryUsage()));
        json.addProperty("maxCpu", MessageFormat.format(MESSAGE_FORMAT, ttrs.getMaxCpu()));
        json.addProperty("maxMem", MessageFormat.format(MESSAGE_FORMAT, ttrs.getMaxHeap()));
        return json;
    }
}

Frontend:

Make a request to a servlet from frontend using AJAX call as shown below:

function getStatus(showStatus) {
   $.ajax({
      // add the servlet path
      url: "/bin/cpustatust",
      method: "GET",
      async: true,
      cache: false,
      contentType: false,
      processData: false
   }).done(function (data) {
      if (data) {
         data = JSON.parse(data);
         $("#table-body").append(`<tr>
						<td>${data.cpu}/${data.maxCpu}</td>
			            <td>${data.mem}/${data.maxMem}</td>
		           		 </tr>`);
      } else {
         $(".modal").hide();
         showStatus = false;
         ui.notify("Error", "Unable to get the status", "error");
      }
   }).fail(function (data) {
      $(".modal").hide();
      showStatus = false;
      if (data && data.responseJSON && data.responseJSON.message) {
         ui.notify("Error", data.responseJSON.message, "error");
      } else {
         //add error message
         ui.notify("Error", "Unable to get the status", "error");
      }
   });
   if (showStatus) {
      setTimeout(() => {
         emptyResults();
         getStatus(true);
      }, 2000);
   }
}
function emptyResults() {
   $("#table-body").empty();
}
<table id="table-data">
   <thead id="table-head">
   <tr>
      <th>CPU Usage</th>
      <th>Memory (Heap) Usage</th>
   </tr>
   </thead>
   <tbody id="table-body">
   </tbody>
</table>

The output of the status:

CPU Memory status result

For more information on Throttled Task runner-based performance optimization please visit:

  1. AEM Performance Optimization Scheduler
  2. AEM Performance Optimization Workflow
  3. AEM Performance Optimization Replication

AEM Performance Optimization Workflow

Problem Statement:

AEM Workflows allow you to automate a series of steps that are performed on (one or more) pages and/or assets.

How can we make sure workflow won’t impact AEM performance (CPU or Heap memory) / throttle the system?

Introduction:

AEM Workflows allow you to automate a series of steps that are performed on (one or more) pages and/or assets.

For example, when publishing, an editor has to review the content – before a site administrator activates the page. A workflow that automates this example notifies each participant when it is time to perform their required work:

  1. The author applies the workflow to the page.
  2. The editor receives a work item that indicates that they are required to review the page content. When finished, they indicate that their work item is complete.
  3. The site administrator then receives a work item that requests the activation of the page. When finished, they indicate that their work item is complete.

Typically:

  • Content authors apply workflows to pages as well as participate in workflows.
  • The workflows that you use are specific to the business processes of your organization.

ACS Commons Throttled Task Runner is built on Java management API for managing and monitoring the Java VM and can be used to pause tasks and terminate the tasks based on stats.

Throttled Task Runner (a managed thread pool) provides a convenient way to run many AEM-specific activities in bulk it basically checks for the Throttled Task Runner bean and gets current running stats of the actual work being done.

OSGi Configuration:

The Throttled Task Runner is OSGi configurable, but please note that changing configuration while work is being processed results in resetting the worker pool and can lose active work.

Throttled task runner OSGi

Max threads: Recommended not to exceed the number of CPU cores. Default 4.

Max CPU %: Used to throttle activity when CPU exceeds this amount. Range is 0..1; -1 means disable this check.

Max Heap %: Used to throttle activity when heap usage exceeds this amount. Range is 0..1; -1 means disable this check.

Cooldown time: Time to wait for CPU/MEM cooldown between throttle checks (in milliseconds)

Watchdog time: Maximum time allowed (in ms) per action before it is interrupted forcefully.

JMX MBeans

Throttled Task Runner MBean

This is the core worker pool. All action managers share the same task runner pool, at least in the current implementation. The task runner can be paused or halted entirely, throwing out any unfinished work.

Throttled task runner JMX
Throttled task runner JMX

How to use ACS Commons throttled task runner

Add the following dependency to your pom

<dependency>
    <groupId>com.adobe.acs</groupId>
    <artifactId>acs-aem-commons-bundle</artifactId>
    <version>5.0.4</version>
    <scope>provided</scope>
</dependency>

Create a custom workflow and call the service as shown below:

Throttled Workflow

inside the custom workflow process method check for the Low CPU and Low memory before starting your task to avoid performance impact on the system.

For all the bulk workflow executions on pages/assets, it’s recommended to use ACS commons bulk workflow.

For best practices on workflow please refer to the link

Broken Page References AEM


Problem Statement:

How to get the list of all the broken references in AEM?


Requirement:

Get a List of all the broken references using MCP and provide the report


Introduction:

OOTB we get a Broken reference report provided by MCP, which can be used to get all the broken references in the content repo.

Broken Refernce Report

It’s highly recommended to run this process during

  1. off hours
  2. Don’t run on the root level
  3. Run it on 2nd level or 3rd level pages

How to run this process?

Provide Source path

Provide the regex so that it will consider only the references which point to /content or /etc (points to AEM)

You can also provide exclude properties to improve the traversal of nodes.

If you want to verify any broken links in the RTE fields or properties, then check the deep check checkbox and provide the properties list.

But the above process has a few issues.

  1. Html properties are not working as expected

We need a few customizations to this process by making a few changes to check HTML level references by adding JSOUP API

Add the following dependencies to your POM.xml

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.adobe.acs</groupId>
    <artifactId>acs-aem-commons-bundle</artifactId>
    <scope>provided</scope>
</dependency>

Get the following Broken reference code into your local as shown below:

Add the following code as shown below:

if (htmlFields.contains(property.getKey())) {
            stream = stream.flatMap(val -> {
                try {
                    Document doc = Jsoup.parse(val);
                    Elements anchors = doc.select("a");
                    return anchors.stream().map(link -> link.attr("href"));
                } catch (Exception e) {
                    log.warn("Could not parse links from property value of {}", property.getKey(), e);
                    return Stream.empty();
                }
            });
        }
At Line number 207

When we run it on wknd site it would look something like this:

Broken Reference Report