Strengthening confidence in LLM-generated responses with TrustMe: a package for evaluating semantic proximities between responses generated by the NaileR package

remi.mahmoud@institut-agro.fr
Rencontres R 2025
Joint work with S. Lê

Rémi Mahmoud

2025-05-20

What today’s talk is about

Strengthening confidence in LLM-generated responses with TrustMe: a package for evaluating semantic proximities between responses generated by the NaileR package


NaileR: leveraging the power of LLMs to guide factorial analysis

NaileR, what for?

Context: a survey about 8 types of beards, in which 64 respondents gave words to describe each beard

library(NaileR)
data(beard)
head(beard, n = 8)
# A tibble: 8 × 2
  Stimuli Description               
  <chr>   <chr>                     
1 B1      elegant;confident;arrogant
2 B2      young;sporty;energetic    
3 B3      mature;hipster;photo model
4 B4      untidy;gamer;student      
5 B5      small;married;jokester    
6 B6      laid-back;bon vivant;messy
7 B7      ordinary;tidy             
8 B8      Gaulish;artist;passionate 
data(beard_cont)
words <- c("arrogant", "elegant",
           "untidy", "ordinary")
head(beard_cont[,words], n = 8)
   arrogant elegant untidy ordinary
B1        1       5      1        0
B2        0       1      5        0
B3        0       0      7        0
B4        0       2      6        2
B5        0       0      0        0
B6        0       0      6        0
B7        0       0      6        1
B8        0       0      0        0

NaileR, what for?

intro_beard <- 'A survey was conducted about beards
and 8 types of beards were described.
In the data that follow, beards are named B1 to B8.'

# text pre-processing
intro_beard <- gsub('\n', ' ', intro_beard) |>  stringr::str_squish()

# request
req_beard <- 'Please give a name to each beard
and summarize what makes this beard unique.'
req_beard <- gsub('\n', ' ', req_beard) |>  stringr::str_squish()

res_beard <- nail_descfreq(beard_cont, introduction = intro_beard, 
                           request = req_beard, generate = TRUE)
cat(res_beard$response)
Based on the provided data, I will give a name to each beard and summarize what makes it unique:

**B1: The Sophisticate**
This beard is characterized by being neat, clean, classic, modern, elegant, confident, Parisian bobo, and educated. It's a stylish and refined beard that exudes sophistication.

**B2: The Charmer**
This beard is young, sporty, charming, seductive, haughty, fashionable, and active. It's a youthful and energetic beard that oozes charm and charisma.

**B3: The Hipster Lumberjack**
This beard is hipster, lumberjack, strong, and checkered-shirt-inspired. It's a rugged and edgy beard that blends alternative styles with outdoor enthusiasm.

**B4: The Everyman**
This beard is fat, busy, sluggish, geeky, simple, dad-like, and common. It's an unassuming and down-to-earth beard that represents the average person.

**B5: The Goatee Geek**
This beard features a goatee, corniness, fifties-inspired style, glasses-wearing, sneakiness, slimness, and quirky scientist vibes. It's a nerdy and charming beard with a hint of playfulness.

**B6: The Fussy Cut**
This beard is cut, thirties-inspired, poorly cut, and picky. It's a fussbudget beard that values precision and neatness over anything else.

**B7: The Outdated Shy Guy**
This beard is shy, old-fashioned, ridiculous, creepy, outdated, and beardless (or at least, attempting to be). It's an awkward and introverted beard that may not fit in with the times.

**B8: The Bohemian Biker**
This beard is boorish, Mexican-inspired, biker-like, old, artistic, vintage, Italian-inspired, and trucker-inspired. It's a free-spirited and rugged beard that embodies the spirit of adventure and nonconformity.

NaileR, how? (under the hood)

  1. Describes relationships between variables (quali/quali, quanti/quali, quanti/quanti) using FactoMineR functions (descfreq, catdes, condes, etc.)

  2. Provides the results of step 1 to an LLM run locally via Ollama

A piece of code inside NaileR

ppt = glue("# Introduction
            {introduction}
            # Task
            {request}
            # Data
            {get_sentences_descfreq(res_df, isolate.groups = isolate.groups)}")

res_llm = ollamar::generate(model = model, prompt = ppt, output = 'df')
res_llm$prompt = ppt
return(res_llm)
  • Clear documentation on the GitHub repo of the package

  • NaileR leverages the power of LLMs to describe the results via a binding with Ollama

(Dis)advantages of Ollama:

  • Runs an LLM locally
  • Multiple choices of LLMs (Llama 3, Mistral, etc.)
  • Keeps results confidential (nothing leaves the machine)

Semantic proximity to quantify uncertainty


LLMs are inherently stochastic

  1. Project the (tokenized) text into a high-dimensional space, taking the context into account (embedding)

  2. Relationships between tokens are learned

  3. Token prediction (the next token is sampled from a probability distribution)

LLMs: how to measure uncertainty?

  • Main idea: if an LLM gives semantically different responses when the same prompt is called several times, then the answer is uncertain

A three-step approach proposed by Lin et al. (2024)

Similarity between a set of responses

  • Given a set of responses \((s_1, ..., s_m)\)

  • Embed every response using an embedding model (BERT, etc.) \(\Rightarrow \ S_i \in \mathbb{R}^d\)

  • \(\forall i,j \in \{1,...,m\}: \ A \in \mathbb{M}_m, \ a_{ij} = sim(s_i, s_j) = \frac{\langle S_i, S_j\rangle}{\lVert S_i \rVert \lVert S_j \rVert}\)

  • Consider A as an adjacency matrix of a graph

\[ \left(\begin{array}{ccccccc} & s_1 & s_2 & s_3 & s_4 & s_5 & s_6 & s_7 \\ s_1 & 1.00 & 0.49 & -0.07 & -0.02 & 0.02 & -0.04 & 0.07 \\ s_2 & 0.49 & 1.00 & 0.28 & 0.14 & 0.12 & -0.02 & 0.04 \\ s_3 & -0.07 & 0.28 & 1.00 & 0.04 & 0.05 & -0.02 & -0.03 \\ s_4 & -0.02 & 0.14 & 0.04 & 1.00 & 0.64 & -0.03 & -0.02 \\ s_5 & 0.02 & 0.12 & 0.05 & 0.64 & 1.00 & 0.02 & 0.00 \\ s_6 & -0.04 & -0.02 & -0.02 & -0.03 & 0.02 & 1.00 & 0.43 \\ s_7 & 0.07 & 0.04 & -0.03 & -0.02 & 0.00 & 0.43 & 1.00 \\ \end{array}\right) \]

  • Question: how many communities (2? 3?)?
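As a minimal sketch of the construction above (not TrustMe's actual internals), the adjacency matrix \(A\) can be built from an \(m \times d\) embedding matrix in a few lines of R; here the matrix `E` is random stand-in data for real sentence embeddings:

```r
# Sketch: build the similarity/adjacency matrix A from embeddings.
# E stands in for real sentence embeddings (one row per response s_i);
# any embedding model (BERT, etc.) would produce such a matrix.
set.seed(42)
m <- 7    # number of responses
d <- 384  # embedding dimension
E <- matrix(rnorm(m * d), nrow = m)

# Normalize each row, so that A[i, j] = <S_i, S_j> / (||S_i|| ||S_j||)
E_norm <- E / sqrt(rowSums(E^2))
A <- E_norm %*% t(E_norm)  # cosine similarities; diag(A) = 1
```

With random embeddings the off-diagonal entries hover near 0; real responses with shared meaning yield the block structure shown in the matrix above.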

Uncertainty scores

  • From a weighted adjacency matrix \(W\), derive the normalized Laplacian

  • \(L = I - D^{-\frac{1}{2}}WD^{-\frac{1}{2}}, \ \text{with } D\) the degree matrix (the diagonal matrix of summed incident edge weights)

  • \(L\) represents how much each node differs from the others, weighted by its degree (the “number” of incident edges)

  • \(L\) has good spectral properties

  • Let \(v_1, ..., v_m\) denote the \(m\) eigenvectors of \(L\), with associated eigenvalues \(\lambda_1, ...,\lambda_m\)

We can now derive several metrics

Uncertainty and confidence metrics on a graph-based representation of responses:

| Metric | Formula | Meaning | Value |
|---|---|---|---|
| Spectral uncertainty \(U_{\text{eigV}}\) | \(\sum_k \max(0,\ 1 - \lambda_k)\) | Continuous estimate of the number of communities in the graph | 2.42 |
| Graph degree uncertainty \(U_{\text{Deg}}\) | \(\frac{\text{trace}(mI - D)}{m^2}\) | Average pairwise distance | 0.75 |
| Degree-based confidence \(C_{\text{Deg}}(s_j)\) | \(\frac{D_{j,j}}{m}\) | Local confidence of response \(s_j\) | \(C_{\text{Deg}}(\text{Feline}) = 0.08\) |
| Eccentricity-based confidence \(C_{\text{Ecc}}(s_j)\) | \(-\lVert v_j \rVert_2\) | Distance of response \(s_j\) from the graph center; more negative means less typical | \(C_{\text{Ecc}}(\text{Parrot}) = -1.26\) |
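The Laplacian-based metrics above can be reproduced by hand on a toy adjacency matrix (illustrative values, not the talk's example; variable names are my own):

```r
# Toy weighted adjacency matrix W: two clear groups of responses
W <- matrix(c(1.0, 0.9, 0.1, 0.1,
              0.9, 1.0, 0.1, 0.1,
              0.1, 0.1, 1.0, 0.8,
              0.1, 0.1, 0.8, 1.0), nrow = 4, byrow = TRUE)
m <- nrow(W)

# Degree matrix and normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
deg  <- rowSums(W)
Dm12 <- diag(1 / sqrt(deg))
L <- diag(m) - Dm12 %*% W %*% Dm12
lambda <- eigen(L, symmetric = TRUE)$values

U_eigV <- sum(pmax(0, 1 - lambda))  # close to 2: two communities in W
U_deg  <- sum(m - deg) / m^2        # trace(mI - D) / m^2
C_deg  <- deg / m                   # per-response degree-based confidence
```

With this two-block `W`, `U_eigV` lands near 2, matching the intuition that two semantic "communities" of responses signal an uncertain answer.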

  • How to adapt this framework to the NaileR package?

TrustMe: evaluate semantic proximities between responses generated by the NaileR package


Three consistencies to evaluate

What are the sources of uncertainty?

Internal-Model consistency: how does a single LLM generate different responses to the same prompt?

Cross-Model consistency: how do different LLMs behave on the same prompt?

Data-model consistency: do sibling prompts, generated from subsamples of the same dataset, yield similar responses?

TrustMe: main functions

As a starter:

library(NaileR)
# prompt generation
res_beard <- nail_descfreq(beard_cont,
                           introduction = intro_beard,
                           request = req_beard,
                           generate = FALSE) # generate = FALSE to just get a prompt

Internal-Model consistency: trustme_models(prompt, models = "llama3", num_repeats = m)

library(TrustMe)
res_trustme_internal_model <- trustme_models(res_beard,
                              models = c("llama3"), num_repeats = 10)

Cross-Model consistency: trustme_models(prompt, models = c("llama2","llama3","llama3.1", "mistral"), num_repeats = 1)

res_trustme_models <- trustme_models(res_beard,
                              models = c("llama2", "llama3", "llama3.1","mistral"))

Data-model consistency: trustme_prompts(prompt, model = "llama3")

library(NaileR)

prompt_list <- list()
num_repeats <- 5

# Generate a prompt list based on a subsampling of the data
for (i in 1:num_repeats) {
  prompt <- nail_catdes(don_clust_waste,
                        num.var = ncol(don_clust_waste),
                        introduction = intro_waste,
                        request = req_waste,
                        quali.sample = 0.35,
                        quanti.sample = 1,
                        drop.negative = FALSE,
                        generate = FALSE,
                        proba = 0.25)

  # Store with a unique name
  prompt_list[[paste0("nail_catdes_", i)]] <- prompt
}

res_trustme_data_model <- trustme_prompts(prompt_list,
                              models = c("llama3"))

TrustMe: output

names(res_trustme_models)
[1] "responses"      "central_answer" "pca"            "pca_plot"      
  • responses: a named list with all the responses
  • central_answer: the “central” response (based on a partition around medoids of the embedded responses)
  • pca: the result of a PCA computed on the embeddings of all responses
  • pca_plot: the first two principal components of the embedded responses
res_trustme_models$pca_plot
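The medoid idea behind central_answer can be sketched with cluster::pam on toy embeddings (illustrative only, not TrustMe's code; emb, fit, and central_idx are hypothetical names):

```r
library(cluster)  # provides pam(): partitioning around medoids

# Toy "embedded responses": two loose groups of 5 points each
set.seed(1)
emb <- rbind(matrix(rnorm(10, mean = 0), nrow = 5),
             matrix(rnorm(10, mean = 3), nrow = 5))

# With k = 1, the single medoid is the point minimizing total
# dissimilarity to all others, i.e. the most "central" response
fit <- pam(emb, k = 1)
central_idx <- fit$id.med  # row index of the central embedding
```

In TrustMe the same principle applies to the embedded LLM responses, so the medoid's text is reported as the central answer.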

TrustMe: uncertainty computations

uncertainty <- compute_uncertainty_metrics(vec_text = unlist(res_trustme_models$responses))
names(uncertainty)
[1] "laplacian_normalized" "similarity_matrix"    "UeigV"                "eigen_values"         "Udeg"                 "Cdeg"                 "Uecc"                 "Cecc"                
[9] "W"                   
| Beard | llama2 | llama3 | mistral |
|---|---|---|---|
| B1 | The Parisian Bobo | The Classic | Neat & Classic |
| B2 | The Young Adult | The Young Charmer | Young and Sporty beard |
| B3 | The Hipster | The Hipster Lumberjack | Hipster Beard |
cowplot::plot_grid(plotlist = plot_uncertainty(uncertainty))

Perspectives

Development version available on GitHub (RemiMahmoud/TrustMe):

devtools::install_github("RemiMahmoud/TrustMe")

Some things need to be dug into further:

  1. Use of Natural Language Inference (NLI) to measure similarity

  2. Make computations quicker (repeated local LLM calls via Ollama are compute- and time-intensive!)

  3. CRAN submission