Skip to content

Feat(MachineLearning/PAC): PAC Learning Definitions and Sample Complexity Lower Bounds#483

Open
SamuelSchlesinger wants to merge 7 commits intoleanprover:mainfrom
SamuelSchlesinger:feat/pac-learning-sample-complexity
Open

Feat(MachineLearning/PAC): PAC Learning Definitions and Sample Complexity Lower Bounds#483
SamuelSchlesinger wants to merge 7 commits intoleanprover:mainfrom
SamuelSchlesinger:feat/pac-learning-sample-complexity

Conversation

@SamuelSchlesinger
Copy link
Copy Markdown

This PR defines the Probably Approximately Correct (PAC) learning model. Then, it formalizes the information theoretic lower bound on sample complexity for PAC learning implemented in "A general lower bound on the number of examples needed for learning" (Andrzej Ehrenfeucht, David Haussler, Michael Kearns, Leslie Valiant, 1989).

Organized the PR such that it can be reviewed commit by commit.

Introduce the core PAC learning model: ConceptClass, LabeledSample,
Learner, hypothesisError, sampleOf, and seenElements. Adds module
index entry and bibliography references for [EHKV1989] and [Valiant1984].
Define SetShatters and vcDim for concept classes. Bridge to Mathlib's
Finset.Shatters via finsetShatters_iff_setShatters.
Reusable lemmas for the EHKV proof: Bernoulli's inequality, product
measure support, seenElements measurability, and sample agreement.
Combinatorial core of the EHKV lower bound: an involution on concepts
pairs each with its complement on unseen points, showing at least half
the concepts force large error on any bad sample.
Construct the discrete probability measure on the shattered set used
in the EHKV proof: heavy mass on one point, uniform on the rest.
Assemble the full EHKV proof: Markov bound on bad samples, involution
pairing, and adversarial measure yield the lower bound
m ≥ (VCdim(C) - 1) / (32ε) for any (ε, δ)-learner.
@SamuelSchlesinger
Copy link
Copy Markdown
Author

If this is too much to review in one pass, I am happy to make a sequence of PRs to build it up. I think the review commit by commit works personally, but I am happy to oblige whatever folks would prefer.

Making a pass of proof golfing right now, I hope some of these lemmas can be simplified at least somewhat.

- Replace verbose `show ... from by rw [...]` with direct `← lemma` rewrites
- Use `tauto` to close symmetric case analysis in hypothesisError_eq_of_inter_eq
- Simplify sampleOf_eq_of_agree to one-line `simp`
- Shorten SetShatters.subset proof using `▸` and `id`
- Replace `Pi.one_apply` for indicator simplification in adversarialMeasure_singleton
- Use dot notation `.le`/`.trans` over `le_of_lt`/`le_trans`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant