Class Parser

java.lang.Object
eu.maveniverse.domtrip.Parser

public class Parser extends Object
A lossless XML parser that preserves all formatting information including whitespace, comments, attribute quote styles, and entity encoding.

The Parser class is responsible for converting XML text into DomTrip's internal node tree representation. Unlike traditional XML parsers that normalize content and lose formatting information, this parser meticulously preserves every aspect of the original XML formatting to enable perfect round-trip processing.

Parsing Features:

  • Whitespace Preservation - Maintains all whitespace exactly as written
  • Automatic Whitespace Normalization - Stores whitespace-only runs between elements as element properties instead of creating whitespace-only Text nodes
  • Attribute Formatting - Preserves quote styles, order, and spacing
  • Comment Preservation - Keeps all XML comments in their original positions
  • Entity Preservation - Maintains entity references in their original form
  • Processing Instructions - Preserves PIs including XML declarations
  • CDATA Sections - Maintains CDATA boundaries and content

Parsing Process:

The parser uses a stack-based approach to build the XML tree:

  1. Tokenizes the input XML character by character
  2. Identifies XML constructs (elements, comments, text, etc.)
  3. Preserves original formatting information for each construct
  4. Automatically normalizes whitespace-only content to element properties
  5. Builds a complete node tree with parent-child relationships
  6. Maintains modification flags on nodes so that formatting can be preserved selectively
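
The stack-based approach described in the steps above can be illustrated with a deliberately small sketch. It is not DomTrip's implementation: MiniNode and MiniTreeBuilder are hypothetical names introduced here, and the sketch handles only elements and text (no attributes, comments, CDATA sections, or formatting capture).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical node type for illustration only; not a DomTrip class.
class MiniNode {
    final String name;   // element name, or null for a text node
    final String text;   // text content, or null for an element
    final List<MiniNode> children = new ArrayList<>();

    MiniNode(String name, String text) {
        this.name = name;
        this.text = text;
    }
}

// Hypothetical builder: the stack always holds the chain of currently open elements.
class MiniTreeBuilder {
    static MiniNode build(String xml) {
        MiniNode root = new MiniNode("#document", null);
        Deque<MiniNode> open = new ArrayDeque<>();
        open.push(root);
        int pos = 0;
        while (pos < xml.length()) {
            if (xml.charAt(pos) == '<') {
                int end = xml.indexOf('>', pos);
                if (end < 0) {
                    throw new IllegalArgumentException("Unclosed tag at position " + pos);
                }
                String tag = xml.substring(pos + 1, end);
                if (tag.startsWith("/")) {
                    // End tag: it must match the innermost open element, which is then closed.
                    if (!open.peek().name.equals(tag.substring(1).trim())) {
                        throw new IllegalArgumentException("Mismatched end tag at position " + pos);
                    }
                    open.pop();
                } else if (tag.endsWith("/")) {
                    // Self-closing element: attach it to the parent, nothing stays open.
                    String name = tag.substring(0, tag.length() - 1).trim();
                    open.peek().children.add(new MiniNode(name, null));
                } else {
                    // Start tag: attach to the parent and make it the new innermost element.
                    MiniNode element = new MiniNode(tag.trim(), null);
                    open.peek().children.add(element);
                    open.push(element);
                }
                pos = end + 1;
            } else {
                // Character data runs until the next tag or the end of input.
                int next = xml.indexOf('<', pos);
                if (next < 0) {
                    next = xml.length();
                }
                open.peek().children.add(new MiniNode(null, xml.substring(pos, next)));
                pos = next;
            }
        }
        return root;
    }
}
```

Calling MiniTreeBuilder.build("<a><b>hi</b><c/></a>") yields a document node whose single child a has two children, b (containing the text "hi") and c.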

Whitespace Normalization:

The parser automatically normalizes whitespace during parsing to ensure a clean tree structure:

  • No Whitespace-Only Text Nodes - Whitespace between elements is captured in element properties
  • Mixed Content Preservation - Text nodes with actual content preserve their whitespace
  • Lossless Round-Trip - All whitespace is preserved for perfect XML reconstruction
  • Element Properties - Whitespace stored in precedingWhitespace, innerPrecedingWhitespace, etc.
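
As a rough illustration of the idea above, the following sketch folds whitespace-only runs into a property of the next element, so the tree contains no whitespace-only text nodes and serialization still reproduces the input. All types here (SketchNode, SketchText, SketchElement, FoldedContent, WhitespaceFolding) and the trailingWhitespace field are hypothetical; only the precedingWhitespace name echoes the property mentioned above, and the real classes may store and expose it differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model types for illustration only; not DomTrip classes.
interface SketchNode {
    String serialize();
}

class SketchText implements SketchNode {
    final String content;

    SketchText(String content) {
        this.content = content;
    }

    public String serialize() {
        return content;
    }
}

class SketchElement implements SketchNode {
    final String name;
    String precedingWhitespace = "";   // whitespace found before the element's tag

    SketchElement(String name) {
        this.name = name;
    }

    public String serialize() {
        return precedingWhitespace + "<" + name + "/>";
    }
}

class FoldedContent {
    final List<SketchNode> nodes = new ArrayList<>();
    String trailingWhitespace = "";    // hypothetical: whitespace after the last node

    String serialize() {
        StringBuilder out = new StringBuilder();
        for (SketchNode node : nodes) {
            out.append(node.serialize());
        }
        return out.append(trailingWhitespace).toString();
    }
}

class WhitespaceFolding {
    // Tokens are either empty elements written as "<name/>" or runs of character data.
    static FoldedContent fold(List<String> tokens) {
        FoldedContent result = new FoldedContent();
        StringBuilder pending = new StringBuilder();
        for (String token : tokens) {
            if (token.startsWith("<")) {
                // Attach any pending whitespace to the element instead of creating a text node.
                SketchElement element = new SketchElement(token.substring(1, token.length() - 2));
                element.precedingWhitespace = pending.toString();
                pending.setLength(0);
                result.nodes.add(element);
            } else if (token.trim().isEmpty()) {
                // Whitespace-only run: hold it until we know what follows.
                pending.append(token);
            } else {
                // Text with real content keeps its whitespace as part of the text node.
                result.nodes.add(new SketchText(pending + token));
                pending.setLength(0);
            }
        }
        result.trailingWhitespace = pending.toString();
        return result;
    }
}
```

Because every whitespace run ends up either in a property or inside a content-bearing text node, serialize() returns the concatenation of the original tokens unchanged.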

Error Handling:

The parser provides detailed error information for malformed XML:

  • Precise error positions within the source text
  • Descriptive error messages for common XML problems
  • Context information to help locate and fix issues
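
When reporting an error position to a user, a line and column are often easier to act on than a raw offset. The helper below is a minimal sketch under the assumption that the position is a zero-based character offset into the source string, which this page does not confirm. ErrorLocation is a hypothetical name, not part of DomTrip.

```java
// Hypothetical helper for illustration only; not part of DomTrip.
class ErrorLocation {
    // Converts a zero-based character offset into a 1-based {line, column} pair.
    // Offsets past the end of the text are clamped to the end.
    static int[] lineAndColumn(String xml, int position) {
        int line = 1;
        int column = 1;
        int limit = Math.min(position, xml.length());
        for (int i = 0; i < limit; i++) {
            if (xml.charAt(i) == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
        }
        return new int[] {line, column};
    }
}
```

For the text "<a>\n  <b>\n</a>", offset 6 points at the opening bracket of the b element, which the helper reports as line 2, column 3.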

Usage:

Parser parser = new Parser();
try {
    // Parse from String
    Document document = parser.parse(xmlString);

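    // Parse from InputStream with encoding detection
    Document document2 = parser.parse(inputStream);

    // Parse from InputStream with fallback encoding
    // (a stream can be read only once, so this must be a fresh stream,
    // not the one already consumed above)
    Document document3 = parser.parse(otherInputStream, "UTF-8");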

    // Use the parsed document
} catch (DomTripException e) {
    // Handle parsing errors
    System.err.println("Parse error at position " + e.position() + ": " + e.getMessage());
}
  • Constructor Details

    • Parser

      public Parser()
      Creates a new Parser instance with default settings.

      No initialization is needed here because the parser state (xml, position, length) is initialized at the start of each parse(String) call.

  • Method Details

    • parse

      public Document parse(InputStream inputStream) throws DomTripException
      Parses XML from an InputStream with automatic encoding detection.

      This method automatically detects the character encoding by:

      1. Checking for a Byte Order Mark (BOM)
      2. Reading the XML declaration to extract the encoding attribute
      3. Falling back to UTF-8 if no encoding is specified

      The resulting Document will have its encoding property set to the detected or declared encoding.
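
      The three detection steps above can be sketched as follows. This is an illustration of the general technique, not the code this method runs: EncodingSniffer is a hypothetical name, the 200-byte look-ahead is an arbitrary choice, and the declaration check only works for ASCII-compatible encodings (UTF-32 BOMs and UTF-16 documents without a BOM are ignored here).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of DomTrip.
class EncodingSniffer {
    // Matches the encoding pseudo-attribute of an XML declaration at the start of the input.
    private static final Pattern DECLARED_ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    static Charset detect(byte[] head, Charset fallback) {
        // Step 1: a Byte Order Mark decides the encoding outright.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        // Step 2: read the declaration as Latin-1, which maps every byte to one char,
        // and look for a declared encoding that this JVM supports.
        String prefix = new String(head, 0, Math.min(head.length, 200), StandardCharsets.ISO_8859_1);
        Matcher matcher = DECLARED_ENCODING.matcher(prefix);
        if (matcher.find() && Charset.isSupported(matcher.group(1))) {
            return Charset.forName(matcher.group(1));
        }
        // Step 3: nothing detected, so use the caller's fallback.
        return fallback;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

      Passing StandardCharsets.UTF_8 as the fallback mirrors the behavior described for this overload; the overloads that take a default encoding correspond to passing a different fallback in step 3.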

      Parameters:
      inputStream - the InputStream containing XML data
      Returns:
      a Document containing the parsed XML with preserved formatting
      Throws:
      DomTripException - if the XML is malformed or cannot be parsed, or if an I/O error occurs
    • parse

      public Document parse(InputStream inputStream, String defaultEncoding) throws DomTripException
      Parses XML from an InputStream with encoding detection and fallback.

      This method attempts to detect the character encoding by:

      1. Checking for a Byte Order Mark (BOM)
      2. Reading the XML declaration to extract the encoding attribute
      3. Using the provided default encoding if detection fails

      The resulting Document will have its encoding property set to the detected, declared, or default encoding.

      Parameters:
      inputStream - the InputStream containing XML data
      defaultEncoding - the encoding name to use if detection fails
      Returns:
      a Document containing the parsed XML with preserved formatting
      Throws:
      DomTripException - if the XML is malformed or cannot be parsed, or if an I/O error occurs
    • parse

      public Document parse(InputStream inputStream, Charset defaultCharset) throws DomTripException
      Parses XML from an InputStream with encoding detection and fallback.

      This method attempts to detect the character encoding by:

      1. Checking for a Byte Order Mark (BOM)
      2. Reading the XML declaration to extract the encoding attribute
      3. Using the provided default charset if detection fails

      The resulting Document will have its encoding property set to the detected, declared, or default encoding.

      Parameters:
      inputStream - the InputStream containing XML data
      defaultCharset - the charset to use if detection fails
      Returns:
      a Document containing the parsed XML with preserved formatting
      Throws:
      DomTripException - if the XML is malformed or cannot be parsed, or if an I/O error occurs
    • parse

      public Document parse(String xml) throws DomTripException
      Parses an XML string into a lossless XML document tree.

      This method performs complete XML parsing while preserving all formatting information including whitespace, comments, attribute styles, and entity encoding. The resulting Document can be used for lossless round-trip editing.

      Parameters:
      xml - the XML string to parse
      Returns:
      a Document containing the parsed XML with preserved formatting
      Throws:
      DomTripException - if the XML is malformed or cannot be parsed