120 changes: 120 additions & 0 deletions docs/src/high_dimension.rst
@@ -4,3 +4,123 @@
===========================
Inference in high dimension
===========================

Naive inference in high dimension is ill-posed
----------------------------------------------

In some cases, data represent high-dimensional measurements of some phenomenon of interest (e.g. imaging or genotyping). These problems share two characteristics: they are very high-dimensional and their features are correlated. Both aspects are clearly detrimental to conditional inference, making it both expensive and powerless:

* Expensive: most learners have quadratic or cubic complexity in the number of features. Moreover, per-feature inference generally entails a loop over features.
* Powerless: as dimensionality and correlation increase, it becomes harder and harder to isolate the contribution of each variable, meaning that conditional inference is ill-posed.
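
For instance, with the :math:`40 \times 40` image used below (:math:`p = 1600` features), a cubic-cost procedure already requires on the order of :math:`1600^3 \approx 4 \times 10^{9}` operations, before even looping over features for per-feature inference.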

This is illustrated in the example below, where the Desparsified Lasso (:class:`hidimstat.DesparsifiedLasso`) struggles
to identify relevant features. We need some data to start::

>>> n_samples = 100
>>> shape = (40, 40)
>>> roi_size = 4 # size of the edge of the four predictive regions

>>> # generate the data
>>> from hidimstat._utils.scenario import multivariate_simulation_spatial
>>> X_init, y, beta, epsilon = multivariate_simulation_spatial(
...     n_samples, shape, roi_size, signal_noise_ratio=10., smooth_X=1)

Then we perform inference on this data using the Desparsified Lasso::

>>> from hidimstat.desparsified_lasso import DesparsifiedLasso
>>> dlasso = DesparsifiedLasso().fit(X_init, y)
>>> dlasso.importance(X_init, y)  # compute importance scores and associated corrected p-values

>>> # compute the estimated support
>>> import numpy as np
>>> alpha = .05  # significance level of the statistical test
>>> selected_dl = dlasso.pvalues_ < alpha / (shape[0] * shape[1])  # Bonferroni correction
>>> true_support = beta > 0
>>> print(f'Desparsified Lasso selected {np.sum(selected_dl * true_support)} features among {np.sum(true_support)}')
Desparsified Lasso selected 19 features among 64
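
The selection rule above is a Bonferroni correction: with :math:`p = 40 \times 40 = 1600` features and :math:`\alpha = 0.05`, each p-value is compared to :math:`\alpha / p \approx 3.1 \times 10^{-5}`, which controls the probability of reporting any false positive (the family-wise error rate) at level :math:`\alpha`.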


Feature Grouping and its shortcomings
-------------------------------------

As discussed earlier, feature grouping is a meaningful solution to deal with such cases: it reduces the number of features to condition on, and generally also decreases the level of correlation between features.

.. seealso::

* The :ref:`Grouping documentation <grouping>`


As hinted in :footcite:t:`meinshausen2009pvalues`, an efficient way to deal with such configurations is to take the per-group average of the features: this leads to a *reduced design*. After inference, all the features in a given group obtain the p-value of the group representative. When the inference engine is the Desparsified Lasso, the resulting method is called Clustered Desparsified Lasso, or :class:`hidimstat.CluDL`.

Using the same example as previously, we start by defining a clustering method that will perform the grouping. For image data, Ward clustering is a good default choice, because it takes into account the neighborhood structure of the pixels, which avoids creating spatially incoherent clusters::

>>> from sklearn.feature_extraction import image
>>> from sklearn.cluster import FeatureAgglomeration
>>> n_clusters = 200
>>> connectivity = image.grid_to_graph(n_x=shape[0], n_y=shape[1])
>>> ward = FeatureAgglomeration(
...     n_clusters=n_clusters, connectivity=connectivity, linkage="ward")
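
To see what the *reduced design* looks like, the clustering transform can be applied directly; :class:`hidimstat.CluDL` does this internally, so the explicit call below is only illustrative::

>>> X_reduced = ward.fit_transform(X_init)  # per-cluster averages of the features
>>> X_reduced.shape
(100, 200)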

Equipped with this clustering, we can run CluDL::

>>> from hidimstat import CluDL
>>> from sklearn.linear_model import LassoCV
>>> cludl = CluDL(clustering=ward, desparsified_lasso=DesparsifiedLasso(estimator=LassoCV()))
>>> cludl.fit_importance(X_init, y)
>>> # compute the estimated support
>>> selected_cdl = cludl.fwer_selection(alpha, n_tests=n_clusters)
>>> print(f'Clustered Desparsified Lasso selected {np.sum(selected_cdl * true_support)} features among {np.sum(true_support)}')
Clustered Desparsified Lasso selected 51 features among 64


Note that inference is also much faster on this compressed representation.

The issue is that very high-dimensional data (biological measurements, images, etc.) do not come with any canonical grouping structure. Hence, one has to rely on a grouping obtained from the data, typically with a clustering technique. However, the resulting clusters bring some undesirable randomness: feeding in slightly different data would lead to different clusters. Since there is no globally optimal clustering, the wiser solution is to *average* the results across clusterings. Since it may not be a good idea to average p-values directly, an alternative *ensembling* or *aggregation* strategy is used instead. When the inference engine is the Desparsified Lasso, the resulting method is called Ensemble of Clustered Desparsified Lasso, or :class:`hidimstat.EnCluDL`.
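
To give an intuition of what such an aggregation can look like, here is a minimal sketch of the quantile-aggregation rule of :footcite:t:`meinshausen2009pvalues`, where each clustering contributes one p-value per feature; this is only an illustration of the principle, not the exact rule implemented in :class:`hidimstat.EnCluDL`::

>>> # schematic gamma-quantile aggregation of p-values over several clusterings
>>> # (illustration only; EnCluDL's internal aggregation may differ)
>>> def aggregate_pvalues(pvalue_matrix, gamma=0.5):
...     # pvalue_matrix has shape (n_clusterings, n_features)
...     return np.minimum(1.0, np.quantile(pvalue_matrix / gamma, gamma, axis=0))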

The behavior is illustrated here::

>>> from hidimstat import EnCluDL

>>> # Ensemble of Clustered Desparsified Lasso (EnCluDL)
>>> encludl = EnCluDL(
...     clustering=ward, desparsified_lasso=DesparsifiedLasso(estimator=LassoCV()),
...     n_bootstraps=20, random_state=0)
>>> encludl.fit_importance(X_init, y)
>>> selected_ecdl = encludl.fwer_selection(alpha, n_tests=n_clusters)
>>> print(f'Ensemble of Clustered Desparsified Lasso selected {np.sum(selected_ecdl * true_support)} features among {np.sum(true_support)}')
Ensemble of Clustered Desparsified Lasso selected 57 features among 64

.. topic:: **Full example**

See the following example for a complete script running this analysis:
:ref:`sphx_glr_generated_gallery_examples_plot_2D_simulation_example.py`

What type of control does this Ensemble of Clustered inference come with?
--------------------------------------------------------------------------

Ensemble of Clustered Inference is not a local method, so control cannot be guaranteed at each feature location (e.g. each brain site) in isolation.
The notion of a false positive must therefore be adapted to the non-local character of the inference performed.
Thus, we introduce the concept of a :math:`\delta`-false positive:
a detection is a :math:`\delta`-false positive if it lies at a distance greater than :math:`\delta` from the support, i.e. the set of true positives.
What is controlled is then the :math:`\delta`-FWER, i.e. the probability of reporting at least one :math:`\delta`-false positive.
In other words, with high probability, EnCluDL only reports detections at a distance less than :math:`\delta` from the true support.

What is :math:`\delta`? It is the diameter of the clusters used in the CluDL procedure.
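
Schematically, writing :math:`S` for the true support, :math:`\widehat{S}` for the set of reported detections, and :math:`\alpha` for the target level, this guarantee can be summarized as follows (see the reference below for the precise statement):

.. math::

   \delta\text{-FWER} = \mathbb{P}\left( \exists j \in \widehat{S} \;:\; \mathrm{dist}(j, S) > \delta \right) \leq \alpha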


The details of the method and the underlying guarantees are described in :footcite:t:`chevalier2022spatially`.


.. topic:: **Other examples**

See the following example for an application to the analysis of fMRI data:
:ref:`sphx_glr_generated_gallery_examples_plot_fmri_data_example.py`

See the following example for an illustration on digit classification with MNIST:
:ref:`sphx_glr_generated_gallery_examples_plot_digits.py`


References
----------
.. footbibliography::


4 changes: 1 addition & 3 deletions src/hidimstat/desparsified_lasso.py
@@ -216,7 +216,6 @@ def fit(self, X, y):
         X_ = X
         y_ = y
         self.n_samples_, n_features = X_.shape
-
         try:
             check_is_fitted(self.estimator)
         except NotFittedError:
@@ -233,9 +232,8 @@ def fit(self, X, y):
         # use the cross-validation for define the best alpha of Lasso
         self.estimator.set_params(n_jobs=self.n_jobs)
         self.estimator.fit(X_, y_)
-
         # Lasso regression and noise standard deviation estimation
-        self.sigma_hat_ = memory.cache(reid, ignore=["n_jobs"])(
+        self.sigma_hat_ = reid(
             self.estimator.coef_,  # estimated support of the variable importance
             self.estimator.predict(X_) - y_,  # compute the residual,
             tolerance=self.tolerance_reid,
9 changes: 3 additions & 6 deletions src/hidimstat/ensemble_clustered_inference.py
@@ -113,10 +113,7 @@ def fit(self, X, y):
 
         # Desparsified lasso inference
         self.desparsified_lasso.random_state = self.random_state
-        self.desparsified_lasso_ = memory.cache(self.desparsified_lasso.fit)(
-            X_reduced, y
-        )
-
+        self.desparsified_lasso_ = self.desparsified_lasso.fit(X_reduced, y)
         return self
 
     def importance(self, X=None, y=None):
@@ -384,8 +381,8 @@ def fit(self, X, y):
 
         self.clustering_desparsified_lassos_ = Parallel(n_jobs=self.n_jobs)(
             delayed(self._joblib_fit_one)(
-                desparsified_lasso=self.desparsified_lasso,
-                clustering=self.clustering,
+                desparsified_lasso=clone(self.desparsified_lasso),
+                clustering=clone(self.clustering),
                 cluster_boostrap_size=self.cluster_boostrap_size,
                 bootstrap_groups=self.bootstrap_groups,
                 X=X,
2 changes: 1 addition & 1 deletion test/test_base_variable_importance.py
@@ -461,7 +461,7 @@ def test_fwer_selection():
     power_list = []
     n_features = 100
     target_fdr = 0.1
-    test_tol = 0.05
+    test_tol = 0.1
 
     for _ in range(100):
         vim = BaseVariableImportance()
4 changes: 2 additions & 2 deletions test/test_ensemble_clustered_inference.py
@@ -220,8 +220,8 @@ def test_cludl_temporal():
         )
         fdp_list.append(fdp)
         power_list.append(power)
-    assert np.mean(power_list) >= 0.5 - test_tol
-    assert np.mean(fdp_list) <= alpha + test_tol
+    assert np.mean(power_list) >= 0.5
+    assert np.mean(fdp_list) <= 2 * alpha
 
 
 def test_encludl_temporal():