Commit eb06aae

DOC-5831 updated spec (2nd opinion from ChatGPT instead of Claude)
1 parent 753a9a2 commit eb06aae

1 file changed: +111 -0 lines changed

build/jupyterize/SPECIFICATION.md

Lines changed: 111 additions & 0 deletions
@@ -189,6 +189,37 @@ elif step_name:
**See**: [Language-Specific Features](#language-specific-features) section for detailed implementation.

### 9. Unwrapping Patterns: Single-line vs Multi-line, and Dedenting (Based on Implementation Experience)

Several non-obvious details surfaced during implementation. Handling them explicitly reduced bugs and rework:

- Pattern classes and semantics
  - Single-line patterns: When `start_pattern == end_pattern`, treat the pattern as “remove this line only”. Examples: `public class X {` or `public void Run() {` on one line.
  - Multi-line patterns: When `start_pattern != end_pattern`, the pattern covers a range: the start line, every line up to the end line, and the end line itself. By default the whole range is removed. To strip only the wrapper lines and preserve the inner code, combine the pattern with a “keep content” option (`keep_content` in the configuration).
  - Anchor patterns with `^` to avoid over-matching. Prefer `re.match` (which matches only at the start of the string) over `re.search` (which matches anywhere in it).

- Wrappers split across cells
  - Real C# files often split wrappers across lines/blocks (e.g., class name on line N, `{` or `}` on later lines). Because parsing splits code into preamble/step cells, wrapper open/close tokens may land in separate cells.
  - Practical approach: Use separate, simple patterns to remove opener lines (class/method declarations with `{` either on the same line or on the next line) and a generic pattern to remove solitary closing braces in any cell.

- Order of operations inside cell creation
  1) Apply unwrapping patterns (in the order listed in configuration)
  2) Dedent code (e.g., `textwrap.dedent`) so content previously nested inside wrappers aligns to column 0
  3) Strip trailing whitespace (e.g., `rstrip()`)
  4) Skip empty cells

- Dedent all cells when unwrapping is enabled
  - Even if a particular cell didn’t change after unwrapping, its content may still be indented because it originated inside a method or class in the source file. Dedent ALL cells whenever `unwrap_patterns` is configured for the language.

- Logging for traceability
  - Emit a `DEBUG` log entry for each applied pattern (including, e.g., the pattern `type`) to simplify diagnosing regex issues.

- Safety tips for patterns
  - Anchor patterns with `^` and keep them specific; avoid overly greedy constructs.
  - Keep patterns minimal and composable (e.g., separate `class_opening`, `method_opening`, `closing_braces`).
  - Validate patterns at startup, or wrap pattern application in `try`/`except`, so that a malformed regex produces a warning and processing continues.

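
The single-line and multi-line semantics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual `unwrap_code`: the helper name `remove_pattern`, the logger name, and the line-by-line scanning strategy are all hypothetical, and the multi-line branch drops the whole range (it does not implement a “keep content” mode). Each line is tested with `re.match`, so a pattern can only match from the start of a line. The two calls at the end show a single-line class opener and a class declaration whose `{` sits on the next line.

```python
import logging
import re

logger = logging.getLogger("jupyterize.sketch")  # hypothetical logger name


def remove_pattern(code, start_pattern, end_pattern, label="pattern"):
    """Remove the lines covered by one unwrap pattern (illustrative helper)."""
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    single_line = start_pattern == end_pattern

    kept = []
    inside = False  # True while skipping a multi-line range
    for line in code.splitlines():
        if inside:
            # Drop everything up to and including the end line.
            if end_re.match(line):
                inside = False
            continue
        if start_re.match(line):  # re.match only matches at the start of the line
            logger.debug("applied %s to line %r", label, line)
            if not single_line:
                inside = True
            continue
        kept.append(line)
    return "\n".join(kept)


# Single-line pattern: start and end are identical, so only that line is removed.
opener = r"^\s*public\s+class\s+\w+.*\{\s*$"
print(remove_pattern("public class X {\n    var x = 1;\n}", opener, opener))

# Multi-line pattern: removes the declaration line through the lone "{" line.
print(remove_pattern("public class X\n{\n    var x = 1;\n}",
                     r"^\s*public\s+class\s+\w+", r"^\s*\{\s*$"))
```
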
---

## Code Quality Patterns
@@ -802,6 +833,86 @@ public class SyncLandingExample {
- Harder to maintain
- Breaks existing examples

### Configuration Schema and Semantics (Implementation-Proven)

- Location: `build/jupyterize/jupyterize_config.json`
- Keys: Lowercased language names (`"c#"`, `"python"`, `"node.js"`, ...)
- Structure per language:
  - `boilerplate`: Array of strings (each becomes a line in the first code cell)
  - `unwrap_patterns`: Array of pattern objects with fields:
    - `type` (string): Human-readable label used in logs
    - `pattern` (regex string): Start condition (anchored with `^` recommended)
    - `end_pattern` (regex string): End condition
    - `keep_content` (bool):
      - `true` → remove wrapper start/end lines, keep the inner content (useful for `{ ... }` ranges)
      - `false` → remove the matching line(s) entirely
        - If `pattern == end_pattern` → remove only the single matching line
        - If `pattern != end_pattern` → remove from first match through end match, inclusive
    - `description` (optional): Intent for maintainers

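
One way to honor the schema above while tolerating bad entries is to compile every pattern once at startup. The sketch below is an illustration under assumptions: `compile_unwrap_patterns`, the logger name, the returned dictionary layout, and the `False` default for a missing `keep_content` are hypothetical and not taken from the implementation. An entry with a malformed regex or a missing required field produces a warning and is skipped, so the remaining patterns still apply.

```python
import logging
import re

logger = logging.getLogger("jupyterize.sketch")  # hypothetical logger name


def compile_unwrap_patterns(pattern_configs):
    """Compile pattern objects, skipping entries that cannot be used."""
    compiled = []
    for cfg in pattern_configs:
        label = cfg.get("type", "<unnamed>")
        try:
            start_re = re.compile(cfg["pattern"])
            end_re = re.compile(cfg["end_pattern"])
        except KeyError as exc:
            logger.warning("pattern %s is missing field %s; skipping", label, exc)
            continue
        except re.error as exc:
            logger.warning("pattern %s has a malformed regex (%s); skipping", label, exc)
            continue
        compiled.append({
            "type": label,
            "start": start_re,
            "end": end_re,
            # Identical start/end strings mean "remove this line only".
            "single_line": cfg["pattern"] == cfg["end_pattern"],
            # Assumed default: treat a missing keep_content as False.
            "keep_content": bool(cfg.get("keep_content", False)),
        })
    return compiled


patterns = compile_unwrap_patterns([
    {"type": "closing_braces", "pattern": r"^\s*\}\s*$",
     "end_pattern": r"^\s*\}\s*$", "keep_content": False},
    {"type": "broken", "pattern": r"^\s*(", "end_pattern": r"^\s*(",
     "keep_content": False},  # unbalanced parenthesis: rejected with a warning
])
print([p["type"] for p in patterns])
```
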
Minimal example (C#) reflecting patterns that worked in practice:

```json
{
  "c#": {
    "boilerplate": [
      "#r \"nuget: NRedisStack, 0.12.0\"",
      "#r \"nuget: StackExchange.Redis, 2.6.122\""
    ],
    "unwrap_patterns": [
      { "type": "class_single_line", "pattern": "^\\s*public\\s+class\\s+\\w+.*\\{\\s*$", "end_pattern": "^\\s*public\\s+class\\s+\\w+.*\\{\\s*$", "keep_content": false },
      { "type": "class_opening", "pattern": "^\\s*public\\s+class\\s+\\w+", "end_pattern": "^\\s*\\{\\s*$", "keep_content": false },
      { "type": "method_single_line", "pattern": "^\\s*public\\s+void\\s+Run\\(\\).*\\{\\s*$", "end_pattern": "^\\s*public\\s+void\\s+Run\\(\\).*\\{\\s*$", "keep_content": false },
      { "type": "method_opening", "pattern": "^\\s*public\\s+void\\s+Run\\(\\)", "end_pattern": "^\\s*\\{\\s*$", "keep_content": false },
      { "type": "closing_braces", "pattern": "^\\s*\\}\\s*$", "end_pattern": "^\\s*\\}\\s*$", "keep_content": false }
    ]
  }
}
```

Notes:
- Listing order matters because patterns are applied in the order listed. Apply openers before generic closers (as above) to avoid accidentally stripping desired content.
- Keep patterns intentionally narrow and anchored to reduce false positives.

### Runtime Order of Operations (within create_cells)

1) Load `lang_config = load_language_config(language)`
2) If present, insert a boilerplate cell first
3) For each parsed block:
   - Apply `unwrap_code(code, language)` (sequentially over `unwrap_patterns`)
   - Dedent with `textwrap.dedent(code)` whenever unwrapping is configured for the language
   - `rstrip()` to remove trailing whitespace
   - Skip cell if now empty
4) Add step metadata if available

> Note: When language-specific features are enabled, prefer the extended signature `create_cells(parsed_blocks, language)` and the runtime order defined in the Language-Specific Features section (boilerplate → unwrap → dedent → rstrip → skip empty). The simplified example above illustrates the core cell construction only.

This order ensures wrapper removal doesn’t leave code over-indented and avoids generating spurious empty cells.

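
The runtime order above can be sketched as a small function. This is not the real `create_cells`: the name `build_cells`, its parameters, the block dictionary keys (`code`, `step_name`), the `step` metadata key, and the cell layout are assumptions chosen to keep the example self-contained. Unwrapping is passed in as a callable instead of being looked up from configuration, and the sample C# lines are made up.

```python
import textwrap


def build_cells(parsed_blocks, boilerplate=None, unwrap=None):
    """Build cell dictionaries: boilerplate, unwrap, dedent, rstrip, skip empty."""
    cells = []
    if boilerplate:
        # The boilerplate cell goes first when the language configures one.
        cells.append({"source": "\n".join(boilerplate), "metadata": {}})

    for block in parsed_blocks:
        code = block["code"]
        if unwrap is not None:
            code = unwrap(code)           # remove wrapper lines
            code = textwrap.dedent(code)  # dedent every cell, changed or not
        code = code.rstrip()              # strip trailing whitespace
        if not code:
            continue                      # skip cells that became empty
        metadata = {}
        if block.get("step_name"):
            metadata["step"] = block["step_name"]  # hypothetical metadata key
        cells.append({"source": code, "metadata": metadata})
    return cells


def drop_closing_braces(code):
    """Stand-in unwrapper: remove lines that contain only a closing brace."""
    return "\n".join(line for line in code.splitlines() if line.strip() != "}")


blocks = [
    {"code": "        var db = muxer.GetDatabase();\n", "step_name": "connect"},
    {"code": "    }\n}\n"},  # only wrapper residue: becomes empty and is skipped
]
for cell in build_cells(blocks,
                        boilerplate=['#r "nuget: NRedisStack, 0.12.0"'],
                        unwrap=drop_closing_braces):
    print(repr(cell["source"]), cell["metadata"])
```
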
### Testing Checklist (Language-Specific)

- Boilerplate
  - First cell is boilerplate for languages with `boilerplate` configured
  - Languages without `boilerplate` configured do not get a boilerplate cell
- Unwrapping
  - Class and method wrappers (single-line and multi-line) are removed
  - Closing braces are removed wherever they appear
  - Inner content remains and is dedented to column 0
- Robustness
  - Missing configuration file → proceed without boilerplate/unwrapping
  - Malformed regex → warn and continue; no crash
  - Real repository example file converts correctly end-to-end

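
The first robustness item (a missing configuration file must not stop conversion) could be satisfied by a loader along these lines. `load_language_config` is named in this specification, but the body below, the `config_path` parameter, the logger name, and the choice to return an empty dictionary are assumptions made for illustration.

```python
import json
import logging

logger = logging.getLogger("jupyterize.sketch")  # hypothetical logger name

DEFAULT_CONFIG_PATH = "build/jupyterize/jupyterize_config.json"


def load_language_config(language, config_path=DEFAULT_CONFIG_PATH):
    """Return the config for `language`, or {} when none is available."""
    try:
        with open(config_path, encoding="utf-8") as handle:
            config = json.load(handle)
    except FileNotFoundError:
        logger.warning("config file %s not found; continuing without "
                       "boilerplate or unwrapping", config_path)
        return {}
    # Keys are lowercased language names, so normalize the lookup.
    return config.get(language.lower(), {})


lang_config = load_language_config("C#", config_path="does/not/exist.json")
print(lang_config.get("boilerplate", []), lang_config.get("unwrap_patterns", []))
```

With an empty dictionary, callers can use `.get("boilerplate", [])` and `.get("unwrap_patterns", [])` without special-casing the missing file.
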
### Edge Cases and Gotchas

- Wrappers split across cells: rely on separate opener and generic `}` patterns
- Dedent all cells when unwrapping is enabled (not only those that changed)
- Anchoring with `^` is crucial to avoid removing mid-line braces in string literals or comments
- Apply patterns in a safe order: openers before closers
- Tabs vs spaces: `textwrap.dedent` removes only the leading whitespace common to every line and treats tabs and spaces as different characters, so mixed indentation may not dedent; prefer spaces in examples

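
Two of the gotchas above are easy to check directly with the standard library. The snippet below is only an illustration, and the sample C# lines are made up. It shows that the anchored closing-brace pattern from the configuration example ignores a brace inside a string literal, and that `textwrap.dedent` removes nothing when one line is indented with a tab and another with spaces.

```python
import re
import textwrap

closing_brace = re.compile(r"^\s*\}\s*$")
literal_line = 'Console.WriteLine("}");'

# Anchored pattern: only a line consisting of a lone brace matches.
lone_brace_matches = bool(closing_brace.match("    }"))
literal_matches = bool(closing_brace.match(literal_line))

# An unanchored search for a brace would also hit the string literal.
unanchored_hits_literal = bool(re.search(r"\}", literal_line))

# Consistent space indentation: the common prefix is removed.
spaces = "        var a = 1;\n        var b = 2;"
dedented = textwrap.dedent(spaces)

# One tab-indented line and one space-indented line share no common prefix.
mixed = "\tvar a = 1;\n        var b = 2;"
mixed_unchanged = textwrap.dedent(mixed) == mixed

print(lone_brace_matches, literal_matches, unanchored_hits_literal)
print(dedented)
print(mixed_unchanged)
```
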
### Recommended Implementation Strategy

**Phase 1: Boilerplate Injection** (High Priority)
