Owl.Data ANSI Escape Sequence Conversion

Overview

The Owl.Data module converts ANSI escape sequences into an internal tag-based data representation. The conversion turns raw chardata containing escape sequences into Owl.Tag structs that hold both the content and its styling. Callers no longer need to track ANSI sequence state by hand, and styled terminal output can be composed as ordinary nested data.

Internal Data Representation

The Owl.Tag Structure

The internal representation uses the Owl.Tag struct:

%Owl.Tag{
  sequences: [Owl.Data.sequence()], # List of ANSI sequences (atoms, binaries, or hyperlink tuples)
  data: data # The actual content
}

Each tag wraps content with a list of sequences representing styling attributes like colors, text effects, and hyperlinks. Sequences can be:

Atoms for named attributes (:red, :underline, :bright)
Binaries for extended color codes
Tuples for hyperlinks ({:hyperlink, url})

Recursive Type Definition

The Owl.Data.t() type is defined recursively to support nested, composable structures:

@type t :: [binary() | non_neg_integer() | t() | Owl.Tag.t(t())] | Owl.Tag.t(t()) | binary()

Key recursive elements:

Lists can contain t() itself, allowing arbitrary nesting of data structures
Owl.Tag.t(t()) wraps data of type t(), enabling tags to contain other tags
Tags are themselves valid t() values, making them first-class composable elements
Includes non_neg_integer() to support charlists (lists of integer codepoints)

This recursive definition mirrors Elixir's IO.chardata but extends it with first-class styling support, enabling unlimited nesting like Owl.Data.tag(["Hello ", Owl.Data.tag("world", :green), "!"], :red).
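The recursive shape can be illustrated with a small walker. This is a minimal sketch, not Owl's code: the module name NestedDataSketch and the {:tag, data, sequences} tuple are hypothetical stand-ins for Owl.Tag, chosen so the snippet runs with the standard library alone. It returns each piece of text together with the sequences in scope at that point.

```elixir
defmodule NestedDataSketch do
  # Hypothetical stand-in for Owl.Tag: {:tag, data, sequences}.
  # Returns a flat list of {text, sequences_in_scope} pairs.
  def leaves(data, scope \\ [])

  # A binary is a leaf: report it with the sequences currently in scope.
  def leaves(binary, scope) when is_binary(binary), do: [{binary, scope}]

  # An integer codepoint is a leaf too (charlist element).
  def leaves(codepoint, scope) when is_integer(codepoint),
    do: [{<<codepoint::utf8>>, scope}]

  # A tag adds its sequences to the scope and recurses into its data.
  def leaves({:tag, inner, sequences}, scope), do: leaves(inner, scope ++ sequences)

  # A list recurses into every element with the same scope.
  def leaves(list, scope) when is_list(list),
    do: Enum.flat_map(list, &leaves(&1, scope))
end

data = {:tag, ["Hello ", {:tag, "world", [:green]}, "!"], [:red]}
IO.inspect(NestedDataSketch.leaves(data))
# => [{"Hello ", [:red]}, {"world", [:red, :green]}, {"!", [:red]}]
```

Each clause corresponds to one alternative of the type: binary, integer, tag, or list.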

The Owl.Data.Sequence Module

The Owl.Data.Sequence module handles the low-level work of identifying and grouping ANSI escape sequences. It's an internal module that translates between raw escape codes and Owl's structured representation.

Identifying Escape Sequences

The parse_many/1 function handles two types of ANSI sequences:

CSI (Control Sequence Introducer) sequences starting with "\e[" - used for colors and text effects
OSC (Operating System Command) sequences starting with "\e]8" - used for hyperlinks

The parse/1 function identifies individual sequences through pattern matching:

Named color and effect sequences (:red, :underline, :bright, etc.)
Hyperlink sequences
256-color sequences ("\e[38;5;" for foreground, "\e[48;5;" for background)
RGB color sequences ("\e[38;2;" for foreground, "\e[48;2;" for background)

Grouping Escape Sequences

ANSI sequences often contain multiple parameters separated by semicolons. The Sequence module groups these parameters into complete semantic units.

extract_csi_attributes/1 parses CSI parameters by splitting on semicolons and extracting integers. For example, "\e[31;42m" becomes [31, 42].

chunk_csi_attributes/1 groups related attributes into complete sequences:

256-color patterns: [38, 5, n] or [48, 5, n] (foreground/background with color index)
RGB patterns: [38, 2, r, g, b] or [48, 2, r, g, b] (foreground/background with RGB values)
Individual attributes: Formatted as standalone sequences

This grouping ensures that multi-parameter sequences like true color codes are kept together rather than being split into meaningless individual numbers.
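The two steps described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's source: CsiSketch, extract/1, and chunk/1 are hypothetical names, and the input is assumed to be a well-formed CSI sequence ending in "m" with at least one numeric parameter.

```elixir
defmodule CsiSketch do
  # "\e[31;42m" -> [31, 42]
  def extract("\e[" <> rest) do
    rest
    |> String.trim_trailing("m")
    |> String.split(";")
    |> Enum.map(&String.to_integer/1)
  end

  # 256-color: [38 | 48, 5, n] stays together.
  def chunk([layer, 5, n | rest]) when layer in [38, 48],
    do: [[layer, 5, n] | chunk(rest)]

  # RGB: [38 | 48, 2, r, g, b] stays together.
  def chunk([layer, 2, r, g, b | rest]) when layer in [38, 48],
    do: [[layer, 2, r, g, b] | chunk(rest)]

  # Anything else is a standalone attribute.
  def chunk([attr | rest]), do: [[attr] | chunk(rest)]
  def chunk([]), do: []
end

"\e[4;38;2;166;226;46;48;2;33;39;112m"
|> CsiSketch.extract()
|> CsiSketch.chunk()
|> IO.inspect()
# => [[4], [38, 2, 166, 226, 46], [48, 2, 33, 39, 112]]
```

Without the chunking step, the value 2 in the RGB marker could be misread as the standalone "faint" attribute.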

Sequence Types

The module categorizes sequences by type:

Colors: :foreground and :background (8 basic colors plus light variants)
Text effects: :blink, :intensity, :underline, :italic, :overlined, :inverse, :reverse
Hyperlinks: :hyperlink for OSC 8 sequences

This categorization enables the conversion algorithm to track which sequences are active and handle conflicts (e.g., when multiple foreground colors are specified).
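One way to picture conflict handling is a map keyed by sequence type, where a new value for a type replaces the previous one. The sketch below is an assumption about the general idea, not Owl's implementation: OpenTagsSketch, its tiny type table, and apply_sequence/2 are hypothetical, and sequences outside the table raise a FunctionClauseError.

```elixir
defmodule OpenTagsSketch do
  # Hypothetical, deliberately small type table.
  def type(seq) when seq in [:red, :green, :blue, :default_color], do: :foreground

  def type(seq) when seq in [:red_background, :green_background, :default_background],
    do: :background

  def type(seq) when seq in [:underline, :no_underline], do: :underline

  # :reset closes everything.
  def apply_sequence(_open, :reset), do: %{}

  # A "default" value closes only its own type.
  def apply_sequence(open, seq)
      when seq in [:default_color, :default_background, :no_underline],
      do: Map.delete(open, type(seq))

  # Any other sequence replaces the active value of its type.
  def apply_sequence(open, seq), do: Map.put(open, type(seq), seq)
end

open = Enum.reduce([:red, :underline, :green], %{}, &OpenTagsSketch.apply_sequence(&2, &1))
IO.inspect(open)
# => %{foreground: :green, underline: :underline}
```

Because :red and :green share the :foreground key, the later color wins and only one foreground is ever active.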

Conversion Process: from_chardata/1

The from_chardata/1 function is the main entry point for converting raw ANSI sequences into tagged structures:

def from_chardata(data) do
  data =
    Regex.split(~r/(\e\[(\d+;)*\d+m)|(\e\]8;.*?;.*?\e\\)/, IO.chardata_to_string(data),
      include_captures: true,
      trim: true
    )
  {data, _open_tags} = do_from_chardata(data, %{})
  data
end

The conversion process:

Normalizes input using IO.chardata_to_string/1 (handles strings, charlists, and mixed data)
Splits input using a regex that captures both CSI and OSC sequences
Recursively processes the split data with do_from_chardata/2
Returns the structured tagged data

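The include_captures: true option keeps the matched escape sequences in the split output, so they can be processed alongside the text content, and trim: true drops the empty strings that would otherwise appear when the input starts or ends with a sequence.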
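The effect of the split can be checked in isolation with the standard library. The snippet reuses the regex shown in from_chardata/1; the sample input is made up for illustration.

```elixir
regex = ~r/(\e\[(\d+;)*\d+m)|(\e\]8;.*?;.*?\e\\)/

input = "\e[31mHello\e[0m world"

# include_captures keeps the escape sequences; trim drops the empty
# leading string produced by the match at position 0.
parts = Regex.split(regex, input, include_captures: true, trim: true)

IO.inspect(parts)
# => ["\e[31m", "Hello", "\e[0m", " world"]
```

Text and escape sequences come out as alternating list elements in their original order, which is the form the recursive step consumes.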

Recursive Construction

The do_from_chardata/2 Algorithm

The do_from_chardata/2 function processes data recursively using pattern matching with multiple clauses:

Base and trivial cases:

Binary strings are parsed for sequences or tagged based on active sequences
Empty lists return immediately to avoid unnecessary structure
Single-element lists unwrap and recurse to simplify the output

Recursive case for lists:

The list processing clause follows a classic recursive pattern:

Recursively processes the head element
Recursively processes the tail with the updated state
Attempts to merge adjacent tags with identical sequences
Returns the combined result with updated state

This recursive approach naturally handles arbitrarily nested structures while maintaining consistent state throughout the traversal.

State Management

The algorithm maintains an open_tags map that tracks which sequences are currently active. This state is threaded through all recursive calls to:

Properly nest tags based on active sequences
Handle sequence updates - :reset clears all tags, default values remove specific tag types
Enable intelligent merging of adjacent identical tags

The state threading ensures that when a sequence like "\e[31m" (red) is encountered, all subsequent content is wrapped in a red tag until a reset or different color is encountered.
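State threading of this kind can be sketched on an already split token list. This is a simplified illustration, not the library's algorithm: ThreadStateSketch is a hypothetical module, it knows only three escape codes, it tracks a single foreground color instead of a map of open tags, and it uses {:tag, data, sequences} tuples in place of Owl.Tag.

```elixir
defmodule ThreadStateSketch do
  # Hypothetical lookup limited to three codes.
  @codes %{"\e[31m" => :red, "\e[32m" => :green, "\e[0m" => :reset}

  # Returns {converted_data, final_state}; the state is the active color or nil.
  def convert(tokens), do: do_convert(tokens, nil)

  defp do_convert([], open), do: {[], open}

  defp do_convert([token | rest], open) do
    case Map.fetch(@codes, token) do
      # A reset clears the state for everything that follows.
      {:ok, :reset} ->
        do_convert(rest, nil)

      # A color replaces the state for everything that follows.
      {:ok, color} ->
        do_convert(rest, color)

      # Plain text: wrap it using the current state, then process the
      # tail with the same state.
      :error ->
        {tail, final} = do_convert(rest, open)
        {[wrap(token, open) | tail], final}
    end
  end

  defp wrap(text, nil), do: text
  defp wrap(text, color), do: {:tag, text, [color]}
end

IO.inspect(ThreadStateSketch.convert(["\e[31m", "Hello", "\e[0m", " world"]))
# => {[{:tag, "Hello", [:red]}, " world"], nil}
```

The escape tokens never appear in the output; they only change the state that later text is wrapped with.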

Merging Neighboring Identical Tags

One of the key optimizations in Owl.Data is the automatic merging of adjacent tags with identical sequences. This creates cleaner, more efficient data structures.

The Merging Logic

Tag merging is implemented in do_from_chardata/2 and checks two patterns:

Pattern 1: Two consecutive tags with identical sequences

{%Owl.Tag{data: p1, sequences: s}, %Owl.Tag{data: p2, sequences: s}} ->
  data =
    if is_list(p2) do
      [p1 | p2]
    else
      [p1, p2]
    end

  {tag(data, s), open_tags}

Pattern 2: Tag followed by a list starting with a tag with identical sequences

{%Owl.Tag{data: p1, sequences: s}, [%Owl.Tag{data: p2, sequences: s} | rest]} ->
  data =
    if is_list(p2) do
      [p1 | p2]
    else
      [p1, p2]
    end

  {[tag(data, s) | rest], open_tags}

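Both patterns verify that the sequences fields are exactly identical by binding the same variable s in both tags: in Elixir, a variable that appears twice in one pattern must match the same value in both places. The data is then combined into a single tag that wraps the merged content with the shared sequences.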
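The merging rule can be tried out on a flat list with a right fold. This is a sketch under assumptions rather than the library's code: MergeSketch is a hypothetical module, and {:tag, data, sequences} tuples stand in for Owl.Tag structs.

```elixir
defmodule MergeSketch do
  # Walks the list from the right so each element meets the already
  # merged result of everything after it.
  def merge(items), do: List.foldr(items, [], &merge_pair/2)

  # The repeated variable `s` only matches when both sequence lists are equal.
  defp merge_pair({:tag, d1, s}, [{:tag, d2, s} | rest]) do
    data = if is_list(d2), do: [d1 | d2], else: [d1, d2]
    [{:tag, data, s} | rest]
  end

  # Different sequences, or a non-tag neighbour: keep the element as is.
  defp merge_pair(item, acc), do: [item | acc]
end

[
  {:tag, "#", [:red]},
  {:tag, " ", [:red]},
  {:tag, "Owl", [:red]},
  {:tag, "!", [:green]}
]
|> MergeSketch.merge()
|> IO.inspect()
# => [{:tag, ["#", " ", "Owl"], [:red]}, {:tag, "!", [:green]}]
```

The is_list/1 check keeps the merged data flat: a new element is prepended to an existing list instead of producing nested pairs.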

When Merging Occurs

Tag merging happens:

During from_chardata/1 operation when reconstructing from external sources
When the recursive traversal encounters adjacent tags with the same sequence list
Only when sequences are exactly identical (same elements, same order)

The merging does not occur during the reverse operation (to_chardata/1), which means adjacent identical tags will generate redundant ANSI escape sequences in the output.

Optimization Benefits

Reduced memory overhead: A single tag wraps all related content instead of multiple separate tag structures
Simpler data structure: The resulting structure is easier to inspect and debug
More semantic representation: Logically groups content that shares formatting, making the intent clearer

For example, when processing output from an external program that outputs "\e[31mHello\e[0m\e[31m?!\e[0m", the merging optimization recognizes that both "Hello" and "?!" share the same red color and combines them into a single tag.

Before/After Conversion Examples

Example 1: Basic Red Text Merging

Before merging:

[Owl.Data.tag("Hello", :red), Owl.Data.tag("?!", :red)]

After merging:

Owl.Data.tag(["Hello", "?!"], :red)

The two separate red tags are merged into a single tag containing both strings.

Example 2: TrueColor Merging

Before merging:

[
  Owl.Data.tag("#", Owl.TrueColor.color(253, 151, 31)),
  Owl.Data.tag(" ", Owl.TrueColor.color(253, 151, 31)),
  Owl.Data.tag("Owl", Owl.TrueColor.color(253, 151, 31))
]

After merging:

Owl.Data.tag(["#", " ", "Owl"], Owl.TrueColor.color(253, 151, 31))

Three separate tags using the same TrueColor orange are merged into one, significantly simplifying the structure.

Example 3: ANSI Sequence Conversion

From the test suite:

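# Basic conversion (passing true forces escape codes to be emitted even when
# ANSI is disabled, for example outside a TTY)
[:red, "hello"] |> IO.ANSI.format(true) |> Owl.Data.from_chardata()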
# => Owl.Data.tag("hello", :red)

# Multiple attributes in one sequence
Owl.Data.from_chardata("\e[31;42mHello\e[0m")
# => Owl.Data.tag("Hello", [:red, :green_background])

# True color support with multiple attributes
Owl.Data.from_chardata("\e[4;38;2;166;226;46;48;2;33;39;112mHello\e[0m")
# => Owl.Data.tag("Hello", [
#      :underline,
#      Owl.TrueColor.color(166, 226, 46),
#      Owl.TrueColor.color_background(33, 39, 112)
#    ])

These examples demonstrate how raw ANSI sequences are parsed into structured tags, with multiple attributes correctly grouped together.

Example 4: Nested Tag Construction

From tests:

Owl.Data.tag(
  [
    "hi",
    "a\nand new",
    " line ",
    Owl.Data.tag(" hey\n aloha", :red), # Nested tag
    "!!"
  ],
  :green
)

This example shows nested tags where red text is embedded within green text. The recursive structure naturally represents this hierarchy, and operations like splitting or slicing preserve the nesting.

Charlist Handling Improvements

Owl.Data provides robust support for charlists (lists of integer codepoints), treating them as first-class data types alongside strings. This improves compatibility with Elixir's standard IO.chardata and enables seamless handling of mixed data types.

Native Integer Support

The type definition explicitly includes non_neg_integer(), making charlists first-class citizens alongside strings and tags. This design choice ensures that any valid IO.chardata can be processed through Owl.Data without special handling.

Automatic Character Conversion

Integer codepoints are automatically converted to UTF-8 strings when processed by internal functions:

defp do_chunk_by(value, chunk_acc, chunk_fun, acc, acc_sequences) when is_integer(value) do
  do_chunk_by(<<value::utf8>>, chunk_acc, chunk_fun, acc, acc_sequences)
end

This pattern appears throughout the module, ensuring integers are handled correctly during:

Length calculation
Untagging operations
Chunking and splitting operations
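The same conversion idiom can be shown on a length calculation. This is a minimal sketch, not Owl's length function: LengthSketch and text_length/1 are hypothetical names, and {:tag, data, sequences} tuples stand in for Owl.Tag.

```elixir
defmodule LengthSketch do
  def text_length(binary) when is_binary(binary), do: String.length(binary)

  # An integer codepoint is turned into a UTF-8 binary and handled
  # by the binary clause above.
  def text_length(codepoint) when is_integer(codepoint),
    do: text_length(<<codepoint::utf8>>)

  # Styling does not contribute to the length.
  def text_length({:tag, data, _sequences}), do: text_length(data)

  def text_length(list) when is_list(list),
    do: list |> Enum.map(&text_length/1) |> Enum.sum()
end

IO.inspect(LengthSketch.text_length(["ab", ~c"cd", {:tag, [?e, "f"], [:red]}]))
# => 6
```

Routing integers through the binary clause means that only one clause needs to know how text is measured.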

Seamless ANSI Integration

The from_chardata/1 function uses IO.chardata_to_string/1 to normalize input at the start of conversion. This allows charlists mixed with ANSI escape sequences to be parsed into tagged structures without any special handling.
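The normalization step itself is plain standard library behavior and can be observed directly. The input below is an invented mix of binaries, a charlist, and a bare codepoint.

```elixir
mixed = ["\e[31m", ~c"Hello", ?!, "\e[0m"]

# Binaries, charlists, and single codepoints collapse into one binary,
# so the escape sequences end up contiguous with the text around them.
normalized = IO.chardata_to_string(mixed)

IO.inspect(normalized == "\e[31mHello!\e[0m")
# => true
```

After this step the input is an ordinary string, whatever mix of binaries and codepoints it started as.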

Charlist Examples

From tests:

# Converting charlists with ANSI codes
Owl.Data.from_chardata(["\e[31m", ~c"Hello"]) == Owl.Data.tag("Hello", :red)

# Splitting charlists
Owl.Data.split(~c"hello", "e") == ["h", ["l", "l", "o"]]

# Untagging preserves charlist structure
Owl.Data.tag([72, 101, 108, 108, 111], :red) |> Owl.Data.untag()
# => ~c"Hello"

These examples show that charlists can be mixed with ANSI sequences, that splitting returns the codepoints as single-character strings, and that untagging returns the original charlist unchanged.

Benefits

Compatibility: Accepts standard Elixir IO.chardata values without requiring callers to convert them first
Preservation: Tagging and untagging leave charlists unchanged
Consistency: Operations like splitting, slicing, and length calculation work uniformly across strings, charlists, and mixed data
ANSI sequence handling: Charlists can be seamlessly mixed with escape sequences and converted to tagged representations

This design ensures that code accepting IO.chardata can be easily extended to use Owl.Data's tagged representation without breaking compatibility with existing charlists.