Skip to content

add diff diagnostics#300

Open
avehtari wants to merge 59 commits intomasterfrom
diff-diagnostics
Open

add diff diagnostics#300
avehtari wants to merge 59 commits intomasterfrom
diff-diagnostics

Conversation

@avehtari
Copy link
Copy Markdown
Member

@avehtari avehtari commented Aug 31, 2025

Description

The corresponding issue #299

In this PR we are making changes to the output structure returned by loo_compare(). The main changes are a transition from a matrix to a data frame and the addition of several new columns. Here is an example of the updated output structure:

> loo_compare(w1, w2)
  model elpd_diff se_diff p_worse diag_diff diag_elpd
 model1       0.0     0.0      NA                    
 model2      -4.1     0.1    1.00   N < 100          
 
Diagnostic flags present.
See ?`loo-glossary` (sections `diag_diff` and `diag_elpd`)
or https://mc-stan.org/loo/reference/loo-glossary.html.

More context on the motivation behind these changes can be found in the updated case study "Uncertainty in Bayesian LOO-CV Model Comparison".

TODOs

  • implement changes and corresponding tests
  • update NEWS.md
  • fix all reverse dependency issues (see Comment)

@avehtari avehtari requested review from jgabry and n-kall August 31, 2025 17:00
@avehtari
Copy link
Copy Markdown
Member Author

@jgabry , we discussed this with @n-kall, and I added an option to the print function so that the new information is not shown by default. Would you have time to look at this?

Copy link
Copy Markdown
Member

@jgabry jgabry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think this is a good idea. It's possible that this doesn't actually break any packages that depend on loo. I can check by running reverse dependency checks once the errors in the code are fixed (see individual comments).

  • The tests need to be updated, but we can wait to do that until we check that it doesn't break any reverse dependencies

  • We should put some information about this in main loo_compare documentation, not only the print method

  • I actually think it would be good to print these probabilities by default. What was the reason you turned it off by default? (Actually right now the printing will error by default, but once fixed it will be off by default). I don't think the printing should affect backwards compatibility (or am I missing something?). It would be adding more columns could potentially affect something, right? Or was there a different reason you turned off printing the probabilities by default?

Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
avehtari and others added 2 commits October 17, 2025 20:22
Co-authored-by: Jonah Gabry <jgabry@gmail.com>
@avehtari
Copy link
Copy Markdown
Member Author

Ok, let's change them to be printed by default.

I had the column name as p_worse so that it will also tell direction. Then I considered to switching to pnorm as it tells that it's based on the normal approximation, and then did get betweeb

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Oct 17, 2025

Setting simplify=FALSE when calling the print method also errors due to a weird interaction with the p_worse. But I think we can get rid of the simplify argument. I've never seen anyone use it and I doubt it will break any packages (we can find out by reverse dependency checks). I will push a commit fixing this.

@avehtari
Copy link
Copy Markdown
Member Author

I would love to get rid of simplify argument to make it difficult to print looic. I have seen people using it. A quick search shows e.g. Doing Bayesian Data Analysis in brms and the tidyverse, https://www.statistical-rethinking.robitalec.ca/week-8.html, https://www.andreashandel.com/posts/2022-02-24-longitudinal-multilevel-bayes-3/, https://dirmeier.github.io/rstansequential/index.html. But if it's not used by packages, we can just drop it (and always simplify)

Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
diag_pnorm[khat_diff > 0.5] <- paste0("khat_diff > 0.5")
}
rownames(comp) <- rnms
comp <- cbind(data.frame(elpd_diff = elpd_diff, se_diff = se_diff,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also changes the returned object to a data frame when it used to be a matrix. That could potentially cause reverse dependency issues, but we'll see when I run the checks. This is a necessary change, though, since we're mixing numeric columns with text columns for the diagnostic.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 84.12698% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.70%. Comparing base (85db294) to head (b54fa3e).

Files with missing lines Patch % Lines
R/loo_compare.R 83.87% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #300      +/-   ##
==========================================
- Coverage   92.75%   92.70%   -0.06%     
==========================================
  Files          31       31              
  Lines        2983     3029      +46     
==========================================
+ Hits         2767     2808      +41     
- Misses        216      221       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Oct 17, 2025

I fixed the tests so this is passing now. I will run reverse dependency checks over the weekend hopefully. I just put one question for you @avehtari in the review comments about how to label to diagnostic.

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Oct 17, 2025

Starting a comment to keep track of reverse dependency issues. The reverse dependency checks are still running (they will take a while). I will update here as I discover new issues.

  • BAMBI
    • Error in main function due to new return type being a data frame instead of a matrix
    • For this particular error I was able to put a fix in loo_compare (it had to do with row/col names) so no PR is needed.
  • BayesERtools
    • Test error due to new return type being a data frame instead of a matrix
    • I forked the repo and fixed the test. Will wait to make a PR until we merge this PR into loo.
  • flocker
    • Test error due to new return type being a data frame instead of a matrix (and with additional columns)
    • I made a fork and fixed the test. Will wait to make a PR.
  • hbamr
    • uses simplify=FALSE in a vignette but not in any package functions
    • I made a fork and removed the use of simplify. Will wait to make a PR.
  • rstanarm
    • has some custom print stuff that assumed a matrix
    • I was actually able to get around this in loo_compare itself so no PR is necessary

@avehtari
Copy link
Copy Markdown
Member Author

  1. About the diagnostic messages
    Should we use
    N<100
    |elpd_diff|<4
    khat_diff > 0.5
    or
    N<100, small data
    |elpd_diff|<4, similar predictions
    khat_diff > 0.5, possible outliers
    ...
    Maybe the shorter ones are better?

  2. If one column name is p_worse, is diag_pnorm ok? Or should it be diag_p_worse, or just diag, diag_diff, or diagnostic?

  3. Since we are switching to using data.frame, can we have model names as column and not as row names? I have been using tinytable for pretty printing, but then I need the model name as a column.

  4. Illustration of how the output looks now (+ tinlytable) at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html

@avehtari
Copy link
Copy Markdown
Member Author

avehtari commented Nov 7, 2025

We discussed this PR in our group meeting and I think at this point we should drop k_diff diagnostic as 1) it may produce too many false positives with typical data sizes, and 2) it's not directly connected to the LOO uncertainty paper point about model misspecification. Although it is likely to be useful, we need more experiments to be able to provide better advise how to interpret it and what additional checks can be done. We can keep the other diagnostics at this time.

@avehtari
Copy link
Copy Markdown
Member Author

I have updated a case study using this PR https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html

There are now more explanations on how to interpret the diagnostics

@florence-bockting
Copy link
Copy Markdown

florence-bockting commented Apr 1, 2026

After a discussion with @avehtari, we agreed on the following next steps for this PR:

  • add an explanation of the diagnostics and their relationships into loo-glossary
  • include a link with reference to loo-glossary for further information to the print method
  • go over case-study and check whether it is self-contained (i.e., can be read without having read the paper first)

@florence-bockting
Copy link
Copy Markdown

I added additional sections in the loo-glossary and added a user message in the loo_compare output with reference to the loo-glossary as follows:

  model elpd_diff se_diff p_worse diag_diff diag_elpd
 model1       0.0     0.0      NA                    
 model2      -4.1     0.1    1.00   N < 100          

Diagnostic flags present. See ?`loo-glossary` (sections `diag_diff` and `diag_elpd`)
or https://mc-stan.org/loo/reference/loo-glossary.html.

Copy link
Copy Markdown
Member Author

@avehtari avehtari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent loo-glossary text. Just one comment about Bengio and Grandvalet reference

Comment thread R/loo-glossary.R Outdated
Comment thread R/loo_compare.R Outdated
@jgabry
Copy link
Copy Markdown
Member

jgabry commented Apr 16, 2026

@florence-bockting @avehtari Thank you in advance for doing the new round of reverse dependency checks.
@florence-bockting please also officially add yourself to the DESCRIPTION file as part of this PR! Or separately from this PR, whatever you prefer.

@florence-bockting
Copy link
Copy Markdown

florence-bockting commented Apr 19, 2026

Re-running reverse dependency checks. In the following a list of packages where issues occur:

  • BAMBI 2.3.6
    • extracted model names as rownames which changed now due to new "model" column
    • Public PR: PR#5
  • BayesERtools 0.2.4
    • extracted model names as rownames which changed now due to new "model" column
    • Public PR: PR#20
  • clmstan 0.1.1
    • checked in tests for is.matrix; changed to is.data.frame in case "model" column exists
    • Public PR: PR#1
  • disbayes 1.1.0
    • uses in test rownames for extracting model; changed to use of column value if column "model" exists
    • Public PR: PR#3
  • flocker 1.0.0
    • test for rownames; tutorial shows output of loo_compare as comment -> changed corresponding test and output comment
    • Public PR: PR#116
  • ubms 1.2.8
    • assumes a type matrix for extracting models; update to switch between data.frame and matrix depending whether "model" column exists
    • Public PR: PR#88

@avehtari: I have now worked out fixes for all affected libraries which are currently still local. When I understand the process correctly, then we can now merge this PR into master and I will make the PRs in the respective packages public. Or some other thoughts on this?

@florence-bockting
Copy link
Copy Markdown

@VisruthSK The touchstone build process fails in this repo. Can you please have a look at this? Thank you.

@github-actions
Copy link
Copy Markdown

This is how benchmark results would change (along with a 95% confidence interval in relative change) if b54fa3e is merged into master:

  • ✔️loo_function: 1.8s -> 1.8s [-0.4%, +0.59%]
  • ✔️loo_matrix: 1.33s -> 1.33s [-0.31%, +0.83%]
    Further explanation regarding interpretation and methodology can be found in the documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants