Feat(MachineLearning/PAC): PAC Learning Definitions and Sample Complexity Lower Bounds #483
Open
SamuelSchlesinger wants to merge 7 commits into leanprover:main from
Conversation
Introduce the core PAC learning model: ConceptClass, LabeledSample, Learner, hypothesisError, sampleOf, and seenElements. Adds module index entry and bibliography references for [EHKV1989] and [Valiant1984].
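The core definitions might look roughly like the following Lean sketch. The names (`ConceptClass`, `LabeledSample`, `Learner`, `hypothesisError`, `sampleOf`) come from the commit message; the exact signatures here are my assumptions, not the PR's code:

```lean
import Mathlib

open MeasureTheory

variable {X : Type*} [MeasurableSpace X]

/-- A concept class is a set of concepts, i.e. subsets of the domain `X`. -/
abbrev ConceptClass (X : Type*) := Set (Set X)

/-- A labeled sample of size `m`: each of the `m` draws is a point
together with its Boolean label. -/
abbrev LabeledSample (X : Type*) (m : ℕ) := Fin m → X × Bool

/-- A learner maps a labeled sample to a hypothesis (a subset of `X`). -/
abbrev Learner (X : Type*) (m : ℕ) := LabeledSample X m → Set X

/-- Label the sample points `xs` according to the target concept `c`. -/
def sampleOf (c : Set X) [DecidablePred (· ∈ c)] {m : ℕ} (xs : Fin m → X) :
    LabeledSample X m := fun i => (xs i, decide (xs i ∈ c))

/-- The error of hypothesis `h` against target `c` under distribution `μ`:
the measure of their symmetric difference. -/
noncomputable def hypothesisError (μ : Measure X) (h c : Set X) : ℝ≥0∞ :=
  μ (symmDiff h c)
```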
Define SetShatters and vcDim for concept classes. Bridge to Mathlib's Finset.Shatters via finsetShatters_iff_setShatters.
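A plausible shape for these definitions, sketched in Lean (the definitions are my assumptions; the PR's versions, and the exact trace convention matching Mathlib's `Finset.Shatters`, may differ):

```lean
/-- Sketch: `C` shatters `s` if every subset of `s` is the trace of some
concept in `C` on `s`. -/
def SetShatters {X : Type*} (C : Set (Set X)) (s : Set X) : Prop :=
  ∀ t ⊆ s, ∃ c ∈ C, c ∩ s = t

/-- Sketch: the VC dimension as a supremum in `ℕ∞` over shattered finite sets. -/
noncomputable def vcDim {X : Type*} (C : Set (Set X)) : ℕ∞ :=
  ⨆ (s : Finset X) (_ : SetShatters C ↑s), (s.card : ℕ∞)
```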
Reusable lemmas for the EHKV proof: Bernoulli's inequality, product measure support, seenElements measurability, and sample agreement.
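Bernoulli's inequality is the analytic ingredient here: for a real $x \ge -1$ and a natural number $n$,

$$(1 + x)^n \ge 1 + nx.$$

In lower-bound arguments of this kind it is applied with $x$ negative, bounding $(1-\alpha)^m \ge 1 - m\alpha$, which turns a product over $m$ independent draws into a bound linear in $m$. (Mathlib's `one_add_mul_le_pow` proves this under the weaker hypothesis $-2 \le x$.)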
Combinatorial core of the EHKV lower bound: an involution on concepts pairs each with its complement on unseen points, showing at least half the concepts force large error on any bad sample.
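The pairing argument runs roughly as follows (my paraphrase of the standard EHKV argument; the PR's formal statement may be organized differently). Fix a shattered set $S$ and a sample whose seen points form $\mathrm{seen} \subseteq S$. Pair each concept $c$ with its flip on the unseen points:

$$c' = c \,\triangle\, (S \setminus \mathrm{seen}).$$

Because $S$ is shattered, $c \mapsto c'$ is an involution on the concepts restricted to $S$; moreover $c$ and $c'$ label the seen points identically while disagreeing on every unseen point. Hence for any hypothesis $h$,

$$\operatorname{err}_\mu(h, c) + \operatorname{err}_\mu(h, c') \ge \mu(S \setminus \mathrm{seen}),$$

so at least one concept in each pair, i.e. at least half of all concepts, forces error at least $\mu(S \setminus \mathrm{seen})/2$ on that sample.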
Construct the discrete probability measure on the shattered set used in the EHKV proof: heavy mass on one point, uniform on the rest.
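This is the standard EHKV construction (the constants below are my recollection of the paper; the PR's exact values may differ). On a shattered set $\{x_0, x_1, \dots, x_{d-1}\}$, put

$$\mu(\{x_0\}) = 1 - 8\varepsilon, \qquad \mu(\{x_i\}) = \frac{8\varepsilon}{d-1} \quad (1 \le i \le d-1).$$

The heavy point makes a sample of size $m$ likely to miss many of the light points, while the light points together carry enough mass ($8\varepsilon$) that mislabeling them costs more than $\varepsilon$ in error.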
Assemble the full EHKV proof: Markov bound on bad samples, involution pairing, and adversarial measure yield the lower bound m ≥ (VCdim(C) - 1) / (32ε) for any (ε, δ)-learner.
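The final statement might be packaged along these lines. This is a hypothetical shape only: `IsPACLearner` stands in for whatever predicate the PR uses for an $(\varepsilon, \delta)$-learner, and the coercions and side conditions are guesses:

```lean
-- Hypothetical statement shape; names, bundling, and hypotheses are assumptions.
theorem ehkv_sample_lower_bound
    {X : Type*} [MeasurableSpace X] (C : Set (Set X)) {d m : ℕ} {ε δ : ℝ}
    (hd : (d : ℕ∞) ≤ vcDim C)     -- C shatters some set of size d
    (hε : 0 < ε) (hδ : 0 < δ)
    (hL : ∃ L : Learner X m, IsPACLearner C L ε δ) :
    ((d : ℝ) - 1) / (32 * ε) ≤ m := by
  sorry
```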
Author
If this is too much to review in one pass, I am happy to split it into a sequence of PRs that build it up. I personally think reviewing commit by commit works well, but I am happy to do whatever folks prefer. I am making a proof-golfing pass right now; I hope at least some of these lemmas can be simplified.
- Replace verbose `show ... from by rw [...]` with direct `← lemma` rewrites
- Use `tauto` to close the symmetric case analysis in `hypothesisError_eq_of_inter_eq`
- Simplify `sampleOf_eq_of_agree` to a one-line `simp`
- Shorten the `SetShatters.subset` proof using `▸` and `id`
- Replace `Pi.one_apply` for indicator simplification in `adversarialMeasure_singleton`
- Use dot notation `.le`/`.trans` over `le_of_lt`/`le_trans`
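For instance, the first item turns a proof-term detour into a single rewrite. A toy illustration (not code from the PR):

```lean
-- Verbose: restate the goal with `show ... from`, then rewrite.
example (a b c : ℕ) (h : b = a) : a + c = b + c :=
  show a + c = b + c from by rw [← h]

-- Golfed: rewrite directly with the reversed hypothesis.
example (a b c : ℕ) (h : b = a) : a + c = b + c := by rw [← h]
```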
This PR defines the Probably Approximately Correct (PAC) learning model. It then formalizes the information-theoretic lower bound on sample complexity for PAC learning proved in "A general lower bound on the number of examples needed for learning" (Andrzej Ehrenfeucht, David Haussler, Michael Kearns, Leslie Valiant, 1989).
The PR is organized so that it can be reviewed commit by commit.