Strengthening confidence in LLM-generated responses with TrustMe: a package for evaluating semantic proximities between responses generated by the NaileR package

remi.mahmoud@institut-agro.fr
Rencontres R 2025
Joint work with S. Lê

Rémi Mahmoud

2025-05-20

What today’s talk is about

Strengthening confidence in LLM-generated responses with TrustMe: a package for evaluating semantic proximities between responses generated by the NaileR package


NaileR: leveraging the power of LLMs to guide factorial analysis

NaileR, what for?

Context: a survey about 8 types of beards, in which 64 respondents gave words to describe each beard

library(NaileR)
data(beard)
head(beard, n = 8)
# A tibble: 8 × 2
  Stimuli Description               
  <chr>   <chr>                     
1 B1      elegant;confident;arrogant
2 B2      young;sporty;energetic    
3 B3      mature;hipster;photo model
4 B4      untidy;gamer;student      
5 B5      small;married;jokester    
6 B6      laid-back;bon vivant;messy
7 B7      ordinary;tidy             
8 B8      Gaulish;artist;passionate 
data(beard_cont)
words <- c("arrogant", "elegant",
           "untidy", "ordinary")
head(beard_cont[,words], n = 8)
   arrogant elegant untidy ordinary
B1        1       5      1        0
B2        0       1      5        0
B3        0       0      7        0
B4        0       2      6        2
B5        0       0      0        0
B6        0       0      6        0
B7        0       0      6        1
B8        0       0      0        0

NaileR, what for?

intro_beard <- 'A survey was conducted about beards
and 8 types of beards were described.
In the data that follow, beards are named B1 to B8.'

# text pre-processing
intro_beard <- gsub('\n', ' ', intro_beard) |>  stringr::str_squish()

# request
req_beard <- 'Please give a name to each beard
and summarize what makes this beard unique.'
req_beard <- gsub('\n', ' ', req_beard) |>  stringr::str_squish()

res_beard <- nail_descfreq(beard_cont, introduction = intro_beard, 
                           request = req_beard, generate = TRUE)
cat(res_beard$response)
Based on the provided data, I will give a name to each beard and summarize what makes it unique:

**B1: The Sophisticate**
This beard is characterized by being neat, clean, classic, modern, elegant, confident, Parisian bobo, and educated. It's a stylish and refined beard that exudes sophistication.

**B2: The Charmer**
This beard is young, sporty, charming, seductive, haughty, fashionable, and active. It's a youthful and energetic beard that oozes charm and charisma.

**B3: The Hipster Lumberjack**
This beard is hipster, lumberjack, strong, and checkered-shirt-inspired. It's a rugged and edgy beard that blends alternative styles with outdoor enthusiasm.

**B4: The Everyman**
This beard is fat, busy, sluggish, geeky, simple, dad-like, and common. It's an unassuming and down-to-earth beard that represents the average person.

**B5: The Goatee Geek**
This beard features a goatee, corniness, fifties-inspired style, glasses-wearing, sneakiness, slimness, and quirky scientist vibes. It's a nerdy and charming beard with a hint of playfulness.

**B6: The Fussy Cut**
This beard is cut, thirties-inspired, poorly cut, and picky. It's a fussbudget beard that values precision and neatness over anything else.

**B7: The Outdated Shy Guy**
This beard is shy, old-fashioned, ridiculous, creepy, outdated, and beardless (or at least, attempting to be). It's an awkward and introverted beard that may not fit in with the times.

**B8: The Bohemian Biker**
This beard is boorish, Mexican-inspired, biker-like, old, artistic, vintage, Italian-inspired, and trucker-inspired. It's a free-spirited and rugged beard that embodies the spirit of adventure and nonconformity.

NaileR, how? (under the hood)

  1. Describes relationships between variables (quali/quali, quanti/quali, quanti/quanti) using FactoMineR functions (descfreq, catdes, condes, etc.)

  2. Provides the results of step 1 to an LLM run locally via Ollama

A piece of code inside NaileR

ppt = glue("# Introduction
            {introduction}
            # Task
            {request}
            # Data
            {get_sentences_descfreq(res_df, isolate.groups = isolate.groups)}")

res_llm = ollamar::generate(model = model, prompt = ppt, output = 'df')
res_llm$prompt = ppt
return(res_llm)
  • Clear documentation on the GitHub repo of the package

  • NaileR leverages the power of LLMs to describe the results via a binding with Ollama

(Dis)advantages of Ollama:

  • Runs an LLM locally
  • Multiple choices of LLMs (Llama 3, Mistral, etc.)
  • Keeps results confidential (nothing leaves the machine)

Semantic proximity to quantify uncertainty


LLMs are inherently stochastic

  1. Project the (tokenized) text into a high-dimensional space, taking the context into account (embedding)

  2. Relationships between tokens are learned

  3. Token prediction (the next token is sampled from a probability distribution)

LLMs: how to measure uncertainty?

  • Main idea: if an LLM gives semantically different responses when the same prompt is called several times, then the answer is uncertain

A three-step approach proposed by Lin et al. (2024)

Similarity between a set of responses

  • Given a set of responses \((s_1, ..., s_m)\)

  • Embed every response using an embedding model (BERT, etc.) \(\Rightarrow \ S_i \in \mathbb{R}^d\)

  • \(\forall i,j \in \{1,...,m\}: \ A \in \mathbb{M}_m, \ a_{ij} = sim(s_i, s_j) = \frac{\langle S_i, S_j\rangle}{\lVert S_i \rVert \lVert S_j \rVert}\)

  • Consider A as an adjacency matrix of a graph

\[ \left(\begin{array}{ccccccc} & s_1 & s_2 & s_3 & s_4 & s_5 & s_6 & s_7 \\ s_1 & 1.00 & 0.49 & -0.07 & -0.02 & 0.02 & -0.04 & 0.07 \\ s_2 & 0.49 & 1.00 & 0.28 & 0.14 & 0.12 & -0.02 & 0.04 \\ s_3 & -0.07 & 0.28 & 1.00 & 0.04 & 0.05 & -0.02 & -0.03 \\ s_4 & -0.02 & 0.14 & 0.04 & 1.00 & 0.64 & -0.03 & -0.02 \\ s_5 & 0.02 & 0.12 & 0.05 & 0.64 & 1.00 & 0.02 & 0.00 \\ s_6 & -0.04 & -0.02 & -0.02 & -0.03 & 0.02 & 1.00 & 0.43 \\ s_7 & 0.07 & 0.04 & -0.03 & -0.02 & 0.00 & 0.43 & 1.00 \\ \end{array}\right) \]

  • Question: how many communities (2? 3?)?
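As a minimal sketch of the construction above (not TrustMe's actual internals), the adjacency matrix \(A\) can be built from an \(m \times d\) embedding matrix in a few lines of R; here the matrix `E` is random stand-in data for real sentence embeddings:

```r
# Sketch: build the similarity/adjacency matrix A from embeddings.
# E stands in for real sentence embeddings (one row per response s_i);
# any embedding model (BERT, etc.) would produce such a matrix.
set.seed(42)
m <- 7    # number of responses
d <- 384  # embedding dimension
E <- matrix(rnorm(m * d), nrow = m)

# Normalize each row, so that A[i, j] = <S_i, S_j> / (||S_i|| ||S_j||)
E_norm <- E / sqrt(rowSums(E^2))
A <- E_norm %*% t(E_norm)  # cosine similarities; diag(A) = 1
```

With random embeddings the off-diagonal entries hover near 0; real responses with shared meaning yield the block structure shown in the matrix above.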

Uncertainty scores

  • From a weighted adjacency matrix \(W\), derive the normalized Laplacian

  • \(L = I - D^{-\frac{1}{2}}WD^{-\frac{1}{2}}, \ \text{with } D\) the degree matrix (the diagonal matrix of summed incident edge weights)

  • \(L\) represents how much each node differs from the others, weighted by its degree (the “number” of incident edges)

  • \(L\) has good spectral properties

  • Let \(v_1, ..., v_m\) denote the \(m\) eigenvectors of \(L\), with associated eigenvalues \(\lambda_1, ...,\lambda_m\)

We can now derive several metrics

Uncertainty and confidence metrics on a graph-based representation of responses:

| Metric | Formula | Meaning | Value |
|---|---|---|---|
| Spectral uncertainty \(U_{\text{eigV}}\) | \(\sum_k \max(0,\ 1 - \lambda_k)\) | Continuous estimate of the number of communities in the graph | 2.42 |
| Graph degree uncertainty \(U_{\text{Deg}}\) | \(\frac{\text{trace}(mI - D)}{m^2}\) | Average pairwise distance | 0.75 |
| Degree-based confidence \(C_{\text{Deg}}(s_j)\) | \(\frac{D_{j,j}}{m}\) | Local confidence of response \(s_j\) | \(C_{\text{Deg}}(\text{Feline}) = 0.08\) |
| Eccentricity-based confidence \(C_{\text{Ecc}}(s_j)\) | \(-\lVert v_j \rVert_2\) | Distance of response \(s_j\) from the graph center; more negative means less typical | \(C_{\text{Ecc}}(\text{Parrot}) = -1.26\) |
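The Laplacian-based metrics above can be reproduced by hand on a toy adjacency matrix (illustrative values, not the talk's example; variable names are my own):

```r
# Toy weighted adjacency matrix W: two clear groups of responses
W <- matrix(c(1.0, 0.9, 0.1, 0.1,
              0.9, 1.0, 0.1, 0.1,
              0.1, 0.1, 1.0, 0.8,
              0.1, 0.1, 0.8, 1.0), nrow = 4, byrow = TRUE)
m <- nrow(W)

# Degree matrix and normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
deg  <- rowSums(W)
Dm12 <- diag(1 / sqrt(deg))
L <- diag(m) - Dm12 %*% W %*% Dm12
lambda <- eigen(L, symmetric = TRUE)$values

U_eigV <- sum(pmax(0, 1 - lambda))  # close to 2: two communities in W
U_deg  <- sum(m - deg) / m^2        # trace(mI - D) / m^2
C_deg  <- deg / m                   # per-response degree-based confidence
```

With this two-block `W`, `U_eigV` lands near 2, matching the intuition that two semantic "communities" of responses signal an uncertain answer.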

  • How to adapt this framework to the NaileR package?

TrustMe: evaluate semantic proximities between responses generated by the NaileR package


Three consistencies to evaluate

What are the sources of uncertainty?

Internal-Model consistency: how does a single LLM generate different responses to the same prompt?

Cross-Model consistency: how do different LLMs behave on the same prompt?

Data-model consistency: do sibling prompts, generated from subsamples of the same dataset, yield similar responses?

TrustMe: main functions

As a starter:

library(NaileR)
# prompt generation
res_beard <- nail_descfreq(beard_cont,
                           introduction = intro_beard,
                           request = req_beard,
                           generate = FALSE) # generate = FALSE to just get a prompt

Internal-Model consistency: trustme_models(prompt, models = "llama3", num_repeats = m)

library(TrustMe)
res_trustme_internal_model <- trustme_models(res_beard,
                              models = c("llama3"), num_repeats = 10)

Cross-Model consistency: trustme_models(prompt, models = c("llama2","llama3","llama3.1", "mistral"), num_repeats = 1)

res_trustme_models <- trustme_models(res_beard,
                              models = c("llama2", "llama3", "llama3.1","mistral"))

Data-model consistency: trustme_prompts(prompt, model = "llama3")

library(NaileR)

prompt_list <- list()
num_repeats <- 5

# Generate a prompt list based on a subsampling of the data
for (i in 1:num_repeats) {
  prompt <- nail_catdes(don_clust_waste,
                        num.var = ncol(don_clust_waste),
                        introduction = intro_waste,
                        request = req_waste,
                        quali.sample = 0.35,
                        quanti.sample = 1,
                        drop.negative = FALSE,
                        generate = FALSE,
                        proba = 0.25)

  # Store with a unique name
  prompt_list[[paste0("nail_catdes_", i)]] <- prompt
}

res_trustme_data_model <- trustme_prompts(prompt_list,
                              models = c("llama3"))

TrustMe: output

names(res_trustme_models)
[1] "responses"      "central_answer" "pca"            "pca_plot"      
  • responses: a named list with all the responses
  • central_answer: the “central” response (based on a partition around medoids of the embedded responses)
  • pca: the result of a PCA computed on the embeddings of all responses
  • pca_plot: the first two principal components of the embedded responses
res_trustme_models$pca_plot
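The medoid idea behind central_answer can be sketched with cluster::pam on toy embeddings (illustrative only, not TrustMe's code; emb, fit, and central_idx are hypothetical names):

```r
library(cluster)  # provides pam(): partitioning around medoids

# Toy "embedded responses": two loose groups of 5 points each
set.seed(1)
emb <- rbind(matrix(rnorm(10, mean = 0), nrow = 5),
             matrix(rnorm(10, mean = 3), nrow = 5))

# With k = 1, the single medoid is the point minimizing total
# dissimilarity to all others, i.e. the most "central" response
fit <- pam(emb, k = 1)
central_idx <- fit$id.med  # row index of the central embedding
```

In TrustMe the same principle applies to the embedded LLM responses, so the medoid's text is reported as the central answer.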

TrustMe: uncertainty computations

uncertainty <- compute_uncertainty_metrics(vec_text = unlist(res_trustme_models$responses))
names(uncertainty)
[1] "laplacian_normalized" "similarity_matrix"    "UeigV"                "eigen_values"         "Udeg"                 "Cdeg"                 "Uecc"                 "Cecc"                
[9] "W"                   
| Beard | llama2 | llama3 | mistral |
|---|---|---|---|
| B1 | The Parisian Bobo | The Classic | Neat & Classic |
| B2 | The Young Adult | The Young Charmer | Young and Sporty beard |
| B3 | The Hipster | The Hipster Lumberjack | Hipster Beard |
cowplot::plot_grid(plotlist = plot_uncertainty(uncertainty))

Perspectives

Development version available on GitHub (RemiMahmoud/TrustMe):

devtools::install_github("RemiMahmoud/TrustMe")

Some things need to be dug into further:

  1. Use of Natural Language Inference (NLI) to measure similarity

  2. Make computations quicker (repeated local LLM calls via Ollama are compute- and time-intensive!)

  3. CRAN submission