Visualization and Reporting Enhancements

The upgraded HTML evaluation reports in insideLLMs provide an interactive dashboard for analyzing experiment results. Each report is a single HTML file that embeds the experiment data, the visualizations, and a responsive UI, so it can be shared and opened in any browser. When Plotly is loaded from a CDN rather than embedded, viewing the charts requires network access.

Dashboard Features

The report dashboard includes six interactive Plotly charts, each accessible via tabbed navigation: accuracy comparison, latency distribution, performance radar, performance heatmap, token usage by model, and result status breakdown. These charts enable rapid comparison of models and probes across key metrics, helping to identify strengths, weaknesses, and outliers in experiment results. Each chart updates dynamically in response to filtering and theming changes, so the charts always reflect the same subset of data as the results table [source].

A summary statistics grid at the top of the report displays aggregate metrics, including the number of experiments, total results, success rate, average accuracy, average latency, total tokens, and counts of unique models and probes. Below the charts, side-by-side model comparison cards present per-model metrics for accuracy, latency, success rate, and token usage, supporting quick visual benchmarking.
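The aggregation behind the summary grid can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the record fields (`latency_ms`, `tokens`, and so on) are hypothetical stand-ins for the ExperimentResult schema, and the averages are assumed to be unweighted means over experiments.

```python
from statistics import mean

# Hypothetical per-experiment records; the field names are assumptions,
# not the actual ExperimentResult schema.
experiments = [
    {"model": "model-a", "probe": "logic", "total": 10, "success": 9,
     "accuracy": 0.80, "latency_ms": 120.0, "tokens": 1500},
    {"model": "model-b", "probe": "logic", "total": 10, "success": 10,
     "accuracy": 0.90, "latency_ms": 200.0, "tokens": 1800},
    {"model": "model-a", "probe": "bias", "total": 20, "success": 17,
     "accuracy": 0.70, "latency_ms": 100.0, "tokens": 2700},
]


def summarize(records):
    """Aggregate the metrics shown in the summary grid."""
    total_results = sum(r["total"] for r in records)
    total_success = sum(r["success"] for r in records)
    return {
        "experiments": len(records),
        "total_results": total_results,
        # Guard against division by zero when there are no results.
        "success_rate": total_success / total_results if total_results else 0.0,
        # Assumption: unweighted mean across experiments.
        "avg_accuracy": mean(r["accuracy"] for r in records) if records else 0.0,
        "avg_latency_ms": mean(r["latency_ms"] for r in records) if records else 0.0,
        "total_tokens": sum(r["tokens"] for r in records),
        "unique_models": len({r["model"] for r in records}),
        "unique_probes": len({r["probe"] for r in records}),
    }


summary = summarize(experiments)
```

The per-model comparison cards could be derived the same way by grouping the records by model before calling `summarize`.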

Filtering, Pagination, and Sorting

The report provides filtering controls for model, probe, provider, and free-text search. These filters update the results table, model comparison cards, and visualizations in real time. The results table supports column sorting by clicking headers and paginates results (20 per page) with navigation controls. Filtering and sorting work together, allowing users to focus on specific subsets of results or compare particular models and probes [source].
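The report performs these steps in the browser, but the filter, sort, and paginate pipeline is easy to mirror in Python when post-processing exported data. The sketch below is an assumption about the logic, not the report's own code: the row fields are hypothetical, and the free-text search is assumed to be a case-insensitive substring match.

```python
PAGE_SIZE = 20  # the report shows 20 rows per page


def filter_rows(rows, model=None, probe=None, provider=None, search=""):
    """Keep rows that match every active filter.

    The free-text search is treated as a case-insensitive substring
    match against all field values (an assumption).
    """
    needle = search.strip().lower()
    kept = []
    for row in rows:
        if model and row["model"] != model:
            continue
        if probe and row["probe"] != probe:
            continue
        if provider and row["provider"] != provider:
            continue
        if needle and not any(needle in str(v).lower() for v in row.values()):
            continue
        kept.append(row)
    return kept


def sort_rows(rows, column, descending=False):
    """Sort by one column, as clicking a table header would."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)


def paginate(rows, page, page_size=PAGE_SIZE):
    """Return the rows for a 1-based page number and the page count."""
    page_count = max(1, -(-len(rows) // page_size))  # ceiling division
    page = min(max(page, 1), page_count)  # clamp out-of-range pages
    start = (page - 1) * page_size
    return rows[start:start + page_size], page_count


# Hypothetical data: 50 rows across three models and two probes.
rows = [
    {"model": f"model-{i % 3}", "probe": "logic" if i % 2 else "bias",
     "provider": "provider-x", "accuracy": i / 50}
    for i in range(50)
]
visible = sort_rows(filter_rows(rows, probe="logic"), "accuracy", descending=True)
first_page, page_count = paginate(visible, page=1)
```

Filtering before sorting and paginating keeps the page count consistent with the filtered subset, which matches the behavior described above.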

Theming

A dark/light mode toggle in the header switches the report's color scheme. The selected theme is persisted in localStorage and updates both the CSS and Plotly chart backgrounds and fonts for optimal readability. This ensures the report is accessible and visually consistent in different environments and for different user preferences [source].
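The theme logic runs in the browser, but its shape can be mirrored in Python for illustration. In this sketch the colour values and helper names are assumptions; `paper_bgcolor`, `plot_bgcolor`, and `font` are standard Plotly layout attributes, and the `stored` argument stands in for the value read back from localStorage.

```python
# Colour values are illustrative assumptions, not the report's palette.
THEME_LAYOUTS = {
    "light": {
        "paper_bgcolor": "#ffffff",
        "plot_bgcolor": "#ffffff",
        "font": {"color": "#1f2937"},
    },
    "dark": {
        "paper_bgcolor": "#111827",
        "plot_bgcolor": "#111827",
        "font": {"color": "#f3f4f6"},
    },
}


def resolve_theme(stored):
    """Use the persisted theme if it is valid, otherwise default to light."""
    return stored if stored in THEME_LAYOUTS else "light"


def toggle_theme(theme):
    """Switch between the two themes, as the header toggle does."""
    return "dark" if theme == "light" else "light"


def chart_layout(theme):
    """Layout overrides to apply to every chart for the given theme."""
    return THEME_LAYOUTS[resolve_theme(theme)]


current = resolve_theme(None)    # nothing persisted yet
current = toggle_theme(current)  # user clicks the toggle
layout = chart_layout(current)
```

Keeping the per-theme chart settings in one mapping means the page styles and every chart can be updated from a single source when the toggle fires.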

Export Options

Export buttons in the header allow users to download the embedded experiment data as JSON or CSV files. The CSV export includes key metrics such as model, provider, probe, category, total, success, accuracy, precision, recall, F1, latency, and tokens. This facilitates further analysis in external tools or sharing results with collaborators [source].
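For readers who want to reproduce a similar CSV outside the browser, the following sketch writes records with the columns listed above using the standard `csv` module. The exact header spelling and field names in the real export are assumptions, as is the choice to leave missing metrics blank.

```python
import csv
import io

# Column names follow the metrics listed above; the exact header
# spelling in the real export is an assumption.
CSV_COLUMNS = ["model", "provider", "probe", "category", "total", "success",
               "accuracy", "precision", "recall", "f1", "latency", "tokens"]


def to_csv(records):
    """Serialize records to CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical record without precision, recall, or F1 values.
csv_text = to_csv([
    {"model": "model-a", "provider": "provider-x", "probe": "logic",
     "category": "reasoning", "total": 10, "success": 9, "accuracy": 0.9,
     "latency": 120.5, "tokens": 1500},
])
```

A fixed column list keeps the header stable across exports even when some experiments lack a metric, which makes the files easier to concatenate in external tools.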

Results Table and Details

The results table presents experiment outcomes with sortable columns for model, probe, category, total, success, accuracy, latency, and tokens. Each row can be expanded to reveal detailed metrics (precision, recall, F1, error rate) and sample results, supporting in-depth inspection of individual experiments. The table is fully searchable and paginated, making it practical to navigate large experiment sets.
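As a reminder of what the detailed metrics mean, here is a short sketch using the standard definitions of precision, recall, and F1. How insideLLMs derives these values is not stated here, so the function name, its inputs, and the definition of error rate as errored results divided by total results are all assumptions.

```python
def detail_metrics(tp, fp, fn, errors, total):
    """Precision, recall, F1, and error rate, guarding zero denominators.

    tp, fp, fn are true-positive, false-positive, and false-negative
    counts; errors is the number of results that failed to complete.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    error_rate = errors / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "error_rate": error_rate}


metrics = detail_metrics(tp=8, fp=2, fn=2, errors=1, total=20)
```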

Generating Reports

To generate an upgraded interactive HTML evaluation report, use the create_interactive_html_report function in Python, or the CLI after running experiments. For example:

from insideLLMs.visualization import create_interactive_html_report

# experiments: list of ExperimentResult objects
create_interactive_html_report(
    experiments,
    title="My LLM Evaluation Report",
    save_path="report.html",
    include_raw_results=True,
    include_individual_results=True,
    embed_plotly_js=False,  # False loads Plotly from a CDN (smaller file); True embeds it for fully offline use
)

Or from the CLI:

insidellms harness harness.yaml --run-dir ./runs/baseline
insidellms report ./runs/baseline

The generated report.html will be fully interactive if Plotly is installed; otherwise, a static HTML report is produced [source].
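To find out in advance which kind of report an environment will produce, you can check whether Plotly is importable. This is a sketch using only the standard library; the helper name is hypothetical, and it assumes the library's fallback depends simply on whether the `plotly` package can be found.

```python
import importlib.util


def expected_report_mode():
    """Return "interactive" if Plotly is importable, else "static"."""
    return "interactive" if importlib.util.find_spec("plotly") else "static"


print(f"Reports generated in this environment should be {expected_report_mode()}.")
```

Running this check in CI before generating reports can catch a missing Plotly installation early, rather than discovering a static report after the fact.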

Customization

At generation time, you can customize the report by setting parameters such as title, save_path, include_raw_results, include_individual_results, and embed_plotly_js. For example, you may exclude the raw results table or individual expandable results for a more compact report. In the UI, users can interactively filter, sort, and paginate results, switch themes, and export data.
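One way to keep these choices in one place is a small helper that builds the keyword arguments for a full or compact report. The helper names and the `compact` and `offline` flags are hypothetical; only the parameter names passed to create_interactive_html_report come from the documentation above.

```python
def report_options(compact=False, offline=False):
    """Build keyword arguments for create_interactive_html_report.

    compact=True drops the raw results table and the expandable
    per-result details; offline=True embeds Plotly in the file.
    """
    return {
        "title": "My LLM Evaluation Report",
        "save_path": "report_compact.html" if compact else "report.html",
        "include_raw_results": not compact,
        "include_individual_results": not compact,
        "embed_plotly_js": offline,
    }


def generate(experiments, **flags):
    """Generate a report from a list of ExperimentResult objects."""
    # Imported lazily so report_options can be used without insideLLMs installed.
    from insideLLMs.visualization import create_interactive_html_report
    return create_interactive_html_report(experiments, **report_options(**flags))
```

For example, `generate(experiments, compact=True, offline=True)` would request a compact report that does not depend on a CDN.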

Interpreting Reports and Supporting Analysis

The interactive dashboard supports comprehensive analysis by enabling users to:

- Compare model performance across multiple metrics and probes using visualizations and comparison cards.
- Drill down into individual experiment results for detailed inspection.
- Filter and search results to focus on specific models, probes, or error cases.
- Export data for further analysis or sharing.
- Switch between dark and light themes for accessibility and presentation.

These features make it easy to identify trends, regressions, and outliers, and to communicate findings with stakeholders. The self-contained HTML format ensures that both the visualizations and underlying data are portable and reproducible [source].

Example Workflow

1. Run your experiments and collect results.
2. Generate the interactive HTML report using the CLI or Python API.
3. Open the report in a browser to explore summary statistics, charts, and detailed results.
4. Use filtering, sorting, and theming to focus your analysis.
5. Export data as needed for further processing or sharing.

Together, these features make the upgraded HTML evaluation reports a single place to analyze LLM experiment results and share them with others.