Sanitizing HTML to Prevent XSS Attacks Using OWASP

Sanitizing user-generated HTML is essential for preventing XSS attacks in Java applications. Two widely used libraries for this are the OWASP Java HTML Sanitizer and JSoup. The OWASP sanitizer provides strict, policy-based control suited to high-security needs, while JSoup offers a simpler, flexible approach to general HTML cleanup through its Safelist class. In this article, we demonstrate how to use both libraries to safely process and render user input.

1. HTML Sanitization

Sanitization ensures that only safe, explicitly allowed HTML elements and attributes are retained from the input. This differs from escaping, which converts every markup character into its entity so that the input is displayed as literal text and no HTML is rendered at all. Sanitization keeps safe formatting (such as <b> and <p>) while stripping dangerous tags such as <script> and <iframe> and event-handler attributes such as onload or onclick.
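To make the contrast with escaping concrete, here is a minimal sketch in plain Java. The class HtmlEscapeSketch and its escape method are hypothetical names used only for illustration and are not part of either library discussed in this article.

```java
// Minimal sketch: escaping turns every markup character into literal text.
final class HtmlEscapeSketch {

    // Hypothetical helper (not from OWASP or JSoup): replaces the five
    // characters that are significant in HTML text and attribute values.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Both the harmless <b> and the dangerous <script> become plain text.
        System.out.println(escape("<b>Hi</b><script>alert('XSS')</script>"));
    }
}
```

Escaping treats the harmless <b> tag and the dangerous <script> tag the same way, which is why it cannot preserve user formatting. A sanitizer would keep the first and drop the second.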

The OWASP Java HTML Sanitizer library provides a powerful and customizable way to sanitize HTML input in Java applications. It is fast, secure, and highly configurable.

Maven Setup

Add the following to pom.xml to include the OWASP Sanitizer dependency.

        <dependency>
            <groupId>com.googlecode.owasp-java-html-sanitizer</groupId>
            <artifactId>owasp-java-html-sanitizer</artifactId>
            <version>20240325.1</version>
        </dependency>

This configuration includes the OWASP HTML sanitizer, which provides utilities to define and apply sanitization policies.

2. Using Predefined (Basic) OWASP Sanitizers

This class uses OWASP’s predefined sanitizers (FORMATTING, BLOCKS, LINKS) for general-purpose use.

import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

public class XssSanitizer {

    public static String applyBasicSanitizers(String html) {
        PolicyFactory policy = Sanitizers.FORMATTING.and(Sanitizers.BLOCKS).and(Sanitizers.LINKS);
        return policy.sanitize(html);
    }

    public static void main(String[] args) {
        String inputHtml = "<p>Hello</p> <script>alert('XSS')</script> <a href='http://example.com'>Link</a> <img src='x' onerror='stealCookies()'>";

        System.out.println(inputHtml);

        System.out.println("\n");
        System.out.println(applyBasicSanitizers(inputHtml));
    }
}

This class defines a method that sanitizes HTML input using a combination of predefined OWASP policies. The applyBasicSanitizers method constructs a PolicyFactory by chaining together Sanitizers.FORMATTING, Sanitizers.BLOCKS, and Sanitizers.LINKS. The FORMATTING policy allows inline formatting tags such as <b>, <i>, <em>, and <strong>, while the BLOCKS policy allows block-level elements such as <p>, <div>, headings (<h1> to <h6>), lists, and <blockquote>.

The LINKS policy enables the use of <a> tags with validated href attributes, preventing potentially malicious URLs. By combining these policies, the class ensures that safe and useful HTML elements are retained, while dangerous content like <script> tags and event handler attributes (e.g., onerror) are stripped out, effectively mitigating the risk of Cross-Site Scripting (XSS) attacks.

Example Response

<p>Hello</p>  <a href="http://example.com" rel="nofollow">Link</a>

As we can see, the <script> tag is removed, and the <img> tag is dropped together with its onerror attribute because none of the combined policies allows images. The allowed <p> and <a> elements are preserved, and the LINKS policy adds rel="nofollow" to the link.
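If images should survive sanitization, the predefined Sanitizers.IMAGES policy can be added to the same chain. The following is a minimal sketch under that assumption. The class name ImageSanitizerSketch is hypothetical, and the exact output is not shown here because it depends on the library version.

```java
import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

public class ImageSanitizerSketch {

    // Same chain as applyBasicSanitizers, extended with the predefined IMAGES policy
    public static String applySanitizersWithImages(String html) {
        PolicyFactory policy = Sanitizers.FORMATTING
                .and(Sanitizers.BLOCKS)
                .and(Sanitizers.LINKS)
                .and(Sanitizers.IMAGES);
        return policy.sanitize(html);
    }

    public static void main(String[] args) {
        String inputHtml = "<p>Hello</p> <img src='x' alt='demo' onerror='stealCookies()'>";

        // The <img> element is now allowed, but onerror is not an allowed
        // attribute in any of the combined policies, so it is still removed.
        System.out.println(applySanitizersWithImages(inputHtml));
    }
}
```

Adding a policy only widens what is allowed. Event-handler attributes stay excluded because no predefined policy lists them.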

3. Advanced: Custom Policy with HtmlPolicyBuilder

Sometimes we need more control than the default Sanitizers provide. For example, we might want to allow only <p>, <strong>, <em>, <a>, and <img> tags with specific attributes.

import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

public class CustomPolicyFactory {

    public static String sanitize(String html) {
        PolicyFactory policy = new HtmlPolicyBuilder()
                .allowElements("p", "strong", "em", "a", "img")
                .allowAttributes("href").onElements("a")
                .allowAttributes("src", "alt").onElements("img")
                .allowUrlProtocols("http", "https")
                .requireRelNofollowOnLinks()
                .toFactory();

        return policy.sanitize(html);
    }
    
    public static void main(String[] args) {
        String inputHtml = "<p>Hello</p> <script>alert('XSS')</script> <a href='http://example.com'>Link</a> <img src='x' onerror='stealCookies()'>";

        System.out.println(inputHtml);

        System.out.println("\n");
        System.out.println(sanitize(inputHtml));
    }
}

This class demonstrates how to define a strict HTML sanitization policy using the HtmlPolicyBuilder API. It allows only a specific subset of safe HTML tags, including <p>, <strong>, <em>, <a>, and <img>, making it suited for scenarios where we want fine-grained control over which elements are permitted. It explicitly allows essential attributes such as href on <a> tags and src and alt on <img> tags, while restricting URL protocols to only http and https to prevent JavaScript injection.

Additionally, it enforces rel="nofollow" on all links to discourage search engines from following potentially user-submitted URLs. This approach ensures that all unsafe tags like <script> and dangerous attributes like onerror are removed, preserving only content that is safe and semantically meaningful.
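The idea behind restricting URL protocols can be illustrated with a small plain-Java sketch. This is not how the OWASP library implements allowUrlProtocols. The class UrlSchemeCheckSketch and the method hasAllowedScheme are hypothetical, and the sketch rejects relative URLs, which is a simplification and not necessarily what the library does.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a simplified allowlist check on the URL scheme.
final class UrlSchemeCheckSketch {

    private static final Set<String> ALLOWED_SCHEMES = Set.of("http", "https");

    // Returns true only for absolute URLs whose scheme is on the allowlist.
    static boolean hasAllowedScheme(String url) {
        try {
            String scheme = new URI(url.trim()).getScheme();
            return scheme != null
                    && ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Anything that cannot be parsed is treated as unsafe.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasAllowedScheme("http://example.com"));
        System.out.println(hasAllowedScheme("javascript:alert('XSS')"));
    }
}
```

The design choice worth noting is the allowlist: every scheme that is not explicitly permitted is rejected, so schemes such as javascript: or data: never need to be enumerated.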

4. Unit Testing

To verify that the sanitizer functions correctly, create unit tests that cover a range of input scenarios.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

public class CustomPolicyFactoryTest {

    @Test
    void testSanitizeScript() {
        String input = "<p>Hello</p> <script>alert('XSS')</script> <a href='http://example.com'>Link</a> <img src='x' onerror='stealCookies()'>";
        String expected = "<p>Hello</p>  <a href=\"http://example.com\" rel=\"nofollow\">Link</a> <img src=\"x\" />";
        String actual = CustomPolicyFactory.sanitize(input);
        assertEquals(expected, actual);
    }
    
}

This test validates that the sanitizer strips script tags while preserving allowed formatting. We can add more test cases to cover other attack vectors and tags.
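A few additional cases might look like the sketch below. It assumes JUnit 5 and the CustomPolicyFactory class shown earlier. The test class and method names are hypothetical, and the assertions check only for the presence or absence of fragments rather than exact output, since exact serialization can vary between library versions.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class CustomPolicyFactoryExtraTest {

    @Test
    void javascriptUrlIsRemoved() {
        // The policy allows only http and https, so the href must not survive
        String actual = CustomPolicyFactory.sanitize("<a href=\"javascript:alert('XSS')\">Click</a>");
        assertFalse(actual.contains("javascript:"));
        assertTrue(actual.contains("Click"));
    }

    @Test
    void eventHandlerAttributeIsRemoved() {
        // onclick is not an allowed attribute on any element in the policy
        String actual = CustomPolicyFactory.sanitize("<p onclick=\"steal()\">Text</p>");
        assertFalse(actual.contains("onclick"));
        assertTrue(actual.contains("Text"));
    }

    @Test
    void disallowedElementIsRemoved() {
        // <iframe> is not in allowElements, while <em> is
        String actual = CustomPolicyFactory.sanitize("<iframe src=\"http://evil.example\"></iframe><em>ok</em>");
        assertFalse(actual.contains("<iframe"));
        assertTrue(actual.contains("<em>ok</em>"));
    }
}
```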

5. Alternative HTML Sanitizers in Java

While the OWASP Java HTML Sanitizer is the most robust solution for security-focused sanitization, other tools in the Java ecosystem serve useful purposes depending on your specific needs. One notable option is JSoup’s Cleaner.

5.1 Sanitizing HTML with JSoup Cleaner

JSoup is a widely used Java library for parsing, manipulating, and sanitizing HTML. It provides an intuitive API for fetching web content, parsing and extracting data, and modifying HTML using familiar DOM methods, CSS selectors, and XPath selectors.

Maven Dependency for JSoup

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.21.1</version>
</dependency>

(Check for the latest version on Maven Central)

Example: JSoup Cleaner

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupSanitizerExample {

    public static String sanitize(String html) {
        // Use a predefined Safelist
        return Jsoup.clean(html, Safelist.basic());
    }

    public static void main(String[] args) {
        String inputHtml = "<p>Hello <b>world</b><script>alert('xss')</script><img src='http://img.com' onerror='stealCookies()'></p>";
        String cleaned = sanitize(inputHtml);

        System.out.println("Original HTML:\n" + inputHtml);
        System.out.println("\nSanitized Output with JSoup:\n" + cleaned);
    }
}

In this example, we use JSoup’s Safelist.basic() to sanitize an input string containing formatting tags (<p>, <b>), a dangerous <script> tag, and an <img> tag with an onerror event handler.

The Safelist.basic() policy is designed to allow common formatting elements such as <b>, <i>, <strong>, <em>, <p>, <ul>, <ol>, <li>, and safe <a> tags with href attributes. It removes any JavaScript event attributes like onclick, strips out <script> elements entirely, and disallows <img> tags unless you are using Safelist.basicWithImages().

In the output of JsoupSanitizerExample, the <script> element and the <img> tag (together with its onerror attribute) are removed, and only the safe content remains.

Figure: Example output of Java HTML sanitization with JSoup Safelist.basic() to prevent XSS attacks
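When images need to be kept, Safelist.basicWithImages() can be used in place of Safelist.basic(). The following is a minimal sketch of that swap. The class name JsoupImagesSketch is hypothetical, and the exact output is omitted because formatting can differ between JSoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupImagesSketch {

    public static String sanitizeKeepingImages(String html) {
        // basicWithImages() extends basic() with <img> and a small set of
        // image attributes such as src and alt; event handlers are not included
        return Jsoup.clean(html, Safelist.basicWithImages());
    }

    public static void main(String[] args) {
        String inputHtml = "<p>Hello <b>world</b><script>alert('xss')</script>"
                + "<img src='http://img.com' onerror='stealCookies()'></p>";

        System.out.println(sanitizeKeepingImages(inputHtml));
    }
}
```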

JSoup’s Safelist allows customization through the addTags() and addAttributes() methods, which extend the base behavior by allowing additional HTML elements and attributes that are not included in the predefined safelists like basic() or relaxed().

Below is an example demonstrating how to use addTags() to allow new elements and addAttributes() to safely allow custom attributes.

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupCustomPolicyExample {

    public static void main(String[] args) {
        String inputHtml = "<p>Hello <mark>highlighted</mark> text</p>"
                + "<a href='http://example.com' target='_blank'>Visit</a>"
                + "<span style='color:red;'>Styled</span>";

        // Start with the basic safelist
        Safelist customSafelist = Safelist.basic()
                // Allow <mark>; <span> is already part of basic(), so listing it again is harmless
                .addTags("mark", "span")
                // Allow 'style' on <span> and 'target' on <a>
                .addAttributes("span", "style")
                .addAttributes("a", "target");

        String sanitizedHtml = Jsoup.clean(inputHtml, customSafelist);

        System.out.println("Original HTML:\n" + inputHtml);
        System.out.println("\nSanitized Output:\n" + sanitizedHtml);
    }
}

In this example, Safelist.basic() serves as the base sanitization policy, and we extend it by allowing additional elements and attributes. The addTags("mark", "span") call adds support for the <mark> tag, which is not included in the default basic list; <span> is already allowed by basic(), but without any attributes. The addAttributes("span", "style") call allows inline styles on <span> elements, while addAttributes("a", "target") permits target="_blank" on anchor tags, which is otherwise stripped because the attribute is not on the safelist.

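This customization is useful when you want to selectively allow non-standard formatting or attributes for richer user input, while still maintaining basic XSS protection by excluding dangerous tags and JavaScript-based attributes. Keep in mind that JSoup does not inspect the CSS inside an allowed style attribute, so allow it only when the styling itself is trusted or filtered separately.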
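Besides cleaning, JSoup can also check whether an HTML fragment already conforms to a safelist, which can be handy for rejecting input at form-validation time instead of silently rewriting it. The sketch below uses Jsoup.isValid for this. The class name JsoupValidationSketch and the helper isAcceptable are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupValidationSketch {

    // Hypothetical helper: true when the body fragment contains only tags
    // and attributes that Safelist.basic() allows
    static boolean isAcceptable(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String safe = "<p>Hello <b>world</b></p>";
        String unsafe = "<p onclick='steal()'>Hello</p><script>alert('xss')</script>";

        System.out.println("safe input acceptable?   " + isAcceptable(safe));
        System.out.println("unsafe input acceptable? " + isAcceptable(unsafe));
    }
}
```

Validation and cleaning can be combined: reject or flag input that fails the check, and still pass everything through Jsoup.clean before rendering.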

6. Conclusion

In this article, we explored how to sanitize HTML input in Java to prevent XSS (Cross-Site Scripting) attacks using both the OWASP Java HTML Sanitizer and the JSoup library. We demonstrated how OWASP’s sanitizer provides strong security through predefined policies and through custom policies built with HtmlPolicyBuilder, allowing fine-grained control over allowed tags, attributes, and URL protocols. We also showed how to use JSoup’s Safelist to remove unsafe elements and attributes while supporting common formatting needs, and how to extend it using addTags() and addAttributes() for richer but still safe HTML input.

By leveraging these libraries, developers can confidently clean and render user-generated content while maintaining protection against XSS vulnerabilities.

Omozegie Aziegbe

Omos Aziegbe is a technical writer and web/application developer with a BSc in Computer Science and Software Engineering from the University of Bedfordshire. Specializing in Java enterprise applications with the Jakarta EE framework, Omos also works with HTML5, CSS, and JavaScript for web development. As a freelance web developer, Omos combines technical expertise with research and writing on topics such as software engineering, programming, web application development, computer science, and technology.