Commit 3bc4ed7

fix: deduplicate env-var helper, add --tag flag, and tidy docs/tests
Refactor
* Deduplicate `_get_env_var` by moving it to `utils/config_utils.py`.
* Remove redundant `local_path` parameter from `_process_file`.

Fix
* Add missing `tag` parameter to `_async_main`, `main`, and `_CLIArgs`.
* Introduce the missing `--tag` CLI flag.

Docs and consistency
* Update `README.md` for `markdownlint` compliance and other minor tweaks.
* Add missing argument docs to `_async_main` docstring.
* Re-order global variables in `config.py` for consistency.
* Swap the order of `include_patterns` and `ignore_patterns` in `parse_query` and `ingest_async`.
* Tidy docstrings for `_async_main`, `IngestionQuery`, `parse_query`, `ingest_async`, and `ingest`.

Tests
* Temporarily disable `[tool.ruff.lint.isort]` due to conflict with the `isort` pre-commit hook.
* Add new arguments to `expected` in `test_parse_query_without_host`.
* Run `pre-commit` hooks.
1 parent 3c95405 commit 3bc4ed7

File tree

13 files changed

+159
-132
lines changed

README.md

Lines changed: 22 additions & 20 deletions
@@ -135,7 +135,7 @@ By default, the digest is written to a text file (`digest.txt`) in your current
 - Use `--output/-o <filename>` to write to a specific file.
 - Use `--output/-o -` to output directly to `STDOUT` (useful for piping to other tools).
 
-Configure processing limits:
+### 🔧 Configure processing limits
 
 ```bash
 # Set higher limits for large repositories
@@ -157,32 +157,34 @@ See more options and usage details with:
 gitingest --help
 ```
 
-### 🔧 Configuration via Environment Variables
+### Configuration via Environment Variables
 
 You can configure various limits and settings using environment variables. All configuration environment variables start with the `GITINGEST_` prefix:
 
-**File Processing Configuration:**
-- `GITINGEST_MAX_FILE_SIZE` - Maximum size of a single file to process (default: 10485760 bytes, 10MB)
-- `GITINGEST_MAX_FILES` - Maximum number of files to process (default: 10000)
-- `GITINGEST_MAX_TOTAL_SIZE_BYTES` - Maximum size of output file (default: 524288000 bytes, 500MB)
-- `GITINGEST_MAX_DIRECTORY_DEPTH` - Maximum depth of directory traversal (default: 20)
-- `GITINGEST_DEFAULT_TIMEOUT` - Default operation timeout in seconds (default: 60)
-- `GITINGEST_OUTPUT_FILE_NAME` - Default output filename (default: "digest.txt")
-- `GITINGEST_TMP_BASE_PATH` - Base path for temporary files (default: system temp directory)
+#### File Processing Configuration
 
-**Server Configuration (for self-hosting):**
-- `GITINGEST_MAX_DISPLAY_SIZE` - Maximum size of content to display in UI (default: 300000 bytes)
-- `GITINGEST_DELETE_REPO_AFTER` - Repository cleanup timeout in seconds (default: 3600, 1 hour)
-- `GITINGEST_MAX_FILE_SIZE_KB` - Maximum file size for UI slider in KB (default: 102400, 100MB)
-- `GITINGEST_MAX_SLIDER_POSITION` - Maximum slider position in UI (default: 500)
+- `GITINGEST_MAX_FILE_SIZE` - Maximum size of a single file to process *(default: 10485760 bytes, 10 MB)*
+- `GITINGEST_MAX_FILES` - Maximum number of files to process *(default: 10000)*
+- `GITINGEST_MAX_TOTAL_SIZE_BYTES` - Maximum size of output file *(default: 524288000 bytes, 500 MB)*
+- `GITINGEST_MAX_DIRECTORY_DEPTH` - Maximum depth of directory traversal *(default: 20)*
+- `GITINGEST_DEFAULT_TIMEOUT` - Default operation timeout in seconds *(default: 60)*
+- `GITINGEST_OUTPUT_FILE_NAME` - Default output filename *(default: "digest.txt")*
+- `GITINGEST_TMP_BASE_PATH` - Base path for temporary files *(default: system temp directory)*
 
-**Example usage:**
+#### Server Configuration (for self-hosting)
+
+- `GITINGEST_MAX_DISPLAY_SIZE` - Maximum size of content to display in UI *(default: 300000 bytes)*
+- `GITINGEST_DELETE_REPO_AFTER` - Repository cleanup timeout in seconds *(default: 3600, 1 hour)*
+- `GITINGEST_MAX_FILE_SIZE_KB` - Maximum file size for UI slider in kB *(default: 102400, 100 MB)*
+- `GITINGEST_MAX_SLIDER_POSITION` - Maximum slider position in UI *(default: 500)*
+
+#### Example usage
 
 ```bash
 # Configure for large scientific repositories
 export GITINGEST_MAX_FILES=50000
-export GITINGEST_MAX_FILE_SIZE=20971520 # 20MB
-export GITINGEST_MAX_TOTAL_SIZE_BYTES=1073741824 # 1GB
+export GITINGEST_MAX_FILE_SIZE=20971520 # 20 MB
+export GITINGEST_MAX_TOTAL_SIZE_BYTES=1073741824 # 1 GB
 
 gitingest https://github.com/some/large-repo
 ```
@@ -219,9 +221,9 @@ summary, tree, content = ingest("https://github.com/username/repo-with-submodule
 # Configure limits programmatically
 summary, tree, content = ingest(
     "https://github.com/username/large-repo",
-    max_file_size=20 * 1024 * 1024, # 20MB per file
+    max_file_size=20 * 1024 * 1024, # 20 MB per file
     max_files=50000, # 50k files max
-    max_total_size_bytes=1024**3, # 1GB total
+    max_total_size_bytes=1024**3, # 1 GB total
     max_directory_depth=30 # 30 levels deep
 )
 ```
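
The byte counts in the README hunks are binary multiples (the "MB" figures are 1024 × 1024 bytes). As a rough illustration only, the sketch below derives the example values and builds an environment mapping from them. The helper name `build_gitingest_env` is hypothetical and not part of gitingest; only the variable names and numbers come from the README.

```python
import os

MIB = 1024 * 1024  # the README's "MB" figures are binary megabytes


def build_gitingest_env(base=None):
    """Return a copy of `base` (default: os.environ) with the README's example limits.

    Hypothetical helper for illustration; it is not part of gitingest.
    """
    limits = {
        "GITINGEST_MAX_FILES": 50_000,
        "GITINGEST_MAX_FILE_SIZE": 20 * MIB,        # 20 MB
        "GITINGEST_MAX_TOTAL_SIZE_BYTES": 1024**3,  # 1 GB
    }
    env = dict(os.environ if base is None else base)
    # Environment variables are always strings, so convert before exporting.
    env.update({name: str(value) for name, value in limits.items()})
    return env
```

A mapping built this way could be passed as the `env=` argument of `subprocess.run` when launching the CLI from a script.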

pyproject.toml

Lines changed: 3 additions & 3 deletions
@@ -97,9 +97,9 @@ per-file-ignores = { "tests/**/*.py" = ["S101"] } # Skip the "assert used" warni
 [tool.ruff.lint.pylint]
 max-returns = 10
 
-[tool.ruff.lint.isort]
-order-by-type = true
-case-sensitive = true
+# [tool.ruff.lint.isort]
+# order-by-type = true
+# case-sensitive = true
 
 [tool.pycln]
 all = true

src/gitingest/cli.py

Lines changed: 18 additions & 6 deletions
@@ -9,7 +9,7 @@
 import click
 from typing_extensions import Unpack
 
-from gitingest.config import MAX_FILE_SIZE, MAX_FILES, MAX_TOTAL_SIZE_BYTES, MAX_DIRECTORY_DEPTH, OUTPUT_FILE_NAME
+from gitingest.config import MAX_DIRECTORY_DEPTH, MAX_FILE_SIZE, MAX_FILES, MAX_TOTAL_SIZE_BYTES, OUTPUT_FILE_NAME
 from gitingest.entrypoint import ingest_async
 
 
@@ -22,6 +22,7 @@ class _CLIArgs(TypedDict):
     exclude_pattern: tuple[str, ...]
     include_pattern: tuple[str, ...]
     branch: str | None
+    tag: str | None
     include_gitignored: bool
     include_submodules: bool
     token: str | None
@@ -63,6 +64,7 @@ class _CLIArgs(TypedDict):
     help="Shell-style patterns to include.",
 )
 @click.option("--branch", "-b", default=None, help="Branch to clone and ingest")
+@click.option("--tag", default=None, help="Tag to clone and ingest")
 @click.option(
     "--include-gitignored",
     is_flag=True,
@@ -119,7 +121,7 @@ def main(**cli_kwargs: Unpack[_CLIArgs]) -> None:
         $ gitingest --include-pattern "*.js" --exclude-pattern "node_modules/*"
 
     Private repositories:
-        $ gitingest https://github.com/user/private-repo -t ghp_token
+        $ gitingest https://github.com/user/private-repo --token ghp_token
         $ GITHUB_TOKEN=ghp_token gitingest https://github.com/user/private-repo
 
     Include submodules:
@@ -139,6 +141,7 @@ async def _async_main(
     exclude_pattern: tuple[str, ...] | None = None,
     include_pattern: tuple[str, ...] | None = None,
     branch: str | None = None,
+    tag: str | None = None,
     include_gitignored: bool = False,
     include_submodules: bool = False,
     token: str | None = None,
@@ -156,21 +159,29 @@ async def _async_main(
         A directory path or a Git repository URL.
     max_size : int
         Maximum file size in bytes to ingest (default: 10 MB).
+    max_files : int
+        Maximum number of files to ingest (default: 10,000).
+    max_total_size : int
+        Maximum total size of output file in bytes (default: 500 MB).
+    max_directory_depth : int
+        Maximum depth of directory traversal (default: 20).
     exclude_pattern : tuple[str, ...] | None
         Glob patterns for pruning the file set.
     include_pattern : tuple[str, ...] | None
         Glob patterns for including files in the output.
     branch : str | None
-        Git branch to ingest. If ``None``, the repository's default branch is used.
+        Git branch to clone and ingest (default: the default branch).
+    tag : str | None
+        Git tag to clone and ingest. If ``None``, no tag is used.
     include_gitignored : bool
-        If ``True``, also ingest files matched by ``.gitignore`` or ``.gitingestignore`` (default: ``False``).
+        If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
     include_submodules : bool
         If ``True``, recursively include all Git submodules within the repository (default: ``False``).
     token : str | None
         GitHub personal access token (PAT) for accessing private repositories.
         Can also be set via the ``GITHUB_TOKEN`` environment variable.
     output : str | None
-        The path where the output file will be written (default: ``digest.txt`` in current directory).
+        The path where the output file is written (default: ``digest.txt`` in current directory).
         Use ``"-"`` to write to ``stdout``.
 
     Raises
@@ -197,9 +208,10 @@ async def _async_main(
             max_files=max_files,
             max_total_size_bytes=max_total_size,
             max_directory_depth=max_directory_depth,
-            include_patterns=include_patterns,
             exclude_patterns=exclude_patterns,
+            include_patterns=include_patterns,
             branch=branch,
+            tag=tag,
             include_gitignored=include_gitignored,
             include_submodules=include_submodules,
             token=token,
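
The `tag` fix touches three places that have to agree: the `TypedDict` describing the parsed options, the `main` wrapper that unpacks it, and the coroutine that receives the keyword arguments. The following is a minimal, self-contained sketch of that forwarding pattern, not the project's code. `CLIArgs`, `async_main`, and the returned string are hypothetical stand-ins, and the option set is trimmed to three keys.

```python
from __future__ import annotations

import asyncio
from typing import TypedDict

try:  # available in the standard library from Python 3.11
    from typing import Unpack
except ImportError:  # older interpreters need the backport
    from typing_extensions import Unpack


class CLIArgs(TypedDict, total=False):
    """Hypothetical, trimmed-down counterpart of `_CLIArgs`."""

    source: str
    branch: str | None
    tag: str | None


async def async_main(source: str, branch: str | None = None, tag: str | None = None) -> str:
    """Stand-in for `_async_main`: just report what it was asked to ingest."""
    return f"{source} branch={branch} tag={tag}"


def main(**cli_kwargs: Unpack[CLIArgs]) -> str:
    # Every key in CLIArgs must also be a parameter of async_main; otherwise
    # the call below raises TypeError for the unexpected keyword argument.
    return asyncio.run(async_main(**cli_kwargs))
```

Calling `main(source="repo", tag="v1.0")` forwards the tag unchanged. If `tag` were missing from the coroutine signature, the same call would fail at the `**cli_kwargs` expansion.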

src/gitingest/config.py

Lines changed: 7 additions & 25 deletions
@@ -1,34 +1,16 @@
 """Configuration file for the project."""
 
-import os
 import tempfile
 from pathlib import Path
 
-# Helper function to get environment variables with type conversion
-def _get_env_var(key: str, default, cast_func=None):
-    """Get environment variable with GITINGEST_ prefix and optional type casting."""
-    env_key = f"GITINGEST_{key}"
-    value = os.environ.get(env_key)
-
-    if value is None:
-        return default
-
-    if cast_func:
-        try:
-            return cast_func(value)
-        except (ValueError, TypeError):
-            print(f"Warning: Invalid value for {env_key}: {value}. Using default: {default}")
-            return default
-
-    return value
+from gitingest.utils.config_utils import _get_env_var
 
-# Configuration with environment variable support
-MAX_FILE_SIZE = _get_env_var("MAX_FILE_SIZE", 10 * 1024 * 1024, int) # Maximum size of a single file to process (10 MB)
-MAX_DIRECTORY_DEPTH = _get_env_var("MAX_DIRECTORY_DEPTH", 20, int) # Maximum depth of directory traversal
-MAX_FILES = _get_env_var("MAX_FILES", 10_000, int) # Maximum number of files to process
-MAX_TOTAL_SIZE_BYTES = _get_env_var("MAX_TOTAL_SIZE_BYTES", 500 * 1024 * 1024, int) # Maximum size of output file (500 MB)
-DEFAULT_TIMEOUT = _get_env_var("DEFAULT_TIMEOUT", 60, int) # seconds
+MAX_FILE_SIZE = _get_env_var("MAX_FILE_SIZE", 10 * 1024 * 1024, int) # Max file size to process in bytes (10 MB)
+MAX_FILES = _get_env_var("MAX_FILES", 10_000, int) # Max number of files to process
+MAX_TOTAL_SIZE_BYTES = _get_env_var("MAX_TOTAL_SIZE_BYTES", 500 * 1024 * 1024, int) # Max output file size (500 MB)
+MAX_DIRECTORY_DEPTH = _get_env_var("MAX_DIRECTORY_DEPTH", 20, int) # Max depth of directory traversal
 
-OUTPUT_FILE_NAME = _get_env_var("OUTPUT_FILE_NAME", "digest.txt")
+DEFAULT_TIMEOUT = _get_env_var("DEFAULT_TIMEOUT", 60, int) # Default timeout for git operations in seconds
 
+OUTPUT_FILE_NAME = _get_env_var("OUTPUT_FILE_NAME", "digest.txt")
 TMP_BASE_PATH = Path(_get_env_var("TMP_BASE_PATH", tempfile.gettempdir())) / "gitingest"

src/gitingest/entrypoint.py

Lines changed: 29 additions & 23 deletions
@@ -25,8 +25,8 @@ async def ingest_async(
     max_files: int | None = None,
     max_total_size_bytes: int | None = None,
     max_directory_depth: int | None = None,
-    include_patterns: str | set[str] | None = None,
     exclude_patterns: str | set[str] | None = None,
+    include_patterns: str | set[str] | None = None,
     branch: str | None = None,
     tag: str | None = None,
     include_gitignored: bool = False,
@@ -43,17 +43,23 @@ async def ingest_async(
     Parameters
     ----------
     source : str
-        The source to analyze, which can be a URL (for a Git repository) or a local directory path.
+        A directory path or a Git repository URL.
     max_file_size : int
-        Maximum allowed file size for file ingestion. Files larger than this size are ignored (default: 10 MB).
-    include_patterns : str | set[str] | None
-        Pattern or set of patterns specifying which files to include. If ``None``, all files are included.
+        Maximum file size in bytes to ingest (default: 10 MB).
+    max_files : int | None
+        Maximum number of files to ingest (default: 10,000).
+    max_total_size_bytes : int | None
+        Maximum total size of output file in bytes (default: 500 MB).
+    max_directory_depth : int | None
+        Maximum depth of directory traversal (default: 20).
     exclude_patterns : str | set[str] | None
-        Pattern or set of patterns specifying which files to exclude. If ``None``, no files are excluded.
+        Glob patterns for pruning the file set.
+    include_patterns : str | set[str] | None
+        Glob patterns for including files in the output.
     branch : str | None
-        The branch to clone and ingest (default: the default branch).
+        Git branch to clone and ingest (default: the default branch).
     tag : str | None
-        The tag to clone and ingest. If ``None``, no tag is used.
+        Git tag to clone and ingest. If ``None``, no tag is used.
     include_gitignored : bool
         If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
     include_submodules : bool
@@ -62,7 +68,7 @@ async def ingest_async(
         GitHub personal access token (PAT) for accessing private repositories.
         Can also be set via the ``GITHUB_TOKEN`` environment variable.
     output : str | None
-        File path where the summary and content should be written.
+        File path where the summary and content are written.
         If ``"-"`` (dash), the results are written to ``stdout``.
         If ``None``, the results are not written to a file.
 
@@ -84,8 +90,8 @@ async def ingest_async(
         max_total_size_bytes=max_total_size_bytes,
         max_directory_depth=max_directory_depth,
         from_web=False,
-        include_patterns=include_patterns,
         ignore_patterns=exclude_patterns,
+        include_patterns=include_patterns,
         token=token,
     )
 
@@ -110,8 +116,8 @@ def ingest(
     max_files: int | None = None,
     max_total_size_bytes: int | None = None,
     max_directory_depth: int | None = None,
-    include_patterns: str | set[str] | None = None,
     exclude_patterns: str | set[str] | None = None,
+    include_patterns: str | set[str] | None = None,
     branch: str | None = None,
     tag: str | None = None,
     include_gitignored: bool = False,
@@ -128,23 +134,23 @@ def ingest(
     Parameters
     ----------
     source : str
-        The source to analyze, which can be a URL (for a Git repository) or a local directory path.
+        A directory path or a Git repository URL.
     max_file_size : int
-        Maximum allowed file size for file ingestion. Files larger than this size are ignored (default: 10 MB).
+        Maximum file size in bytes to ingest (default: 10 MB).
     max_files : int | None
-        Maximum number of files to process. If ``None``, uses the default from config (default: 10,000).
+        Maximum number of files to ingest (default: 10,000).
     max_total_size_bytes : int | None
-        Maximum total size of all files to process in bytes. If ``None``, uses the default from config (default: 500 MB).
+        Maximum total size of output file in bytes (default: 500 MB).
     max_directory_depth : int | None
-        Maximum depth of directory traversal. If ``None``, uses the default from config (default: 20).
-    include_patterns : str | set[str] | None
-        Pattern or set of patterns specifying which files to include. If ``None``, all files are included.
+        Maximum depth of directory traversal (default: 20).
     exclude_patterns : str | set[str] | None
-        Pattern or set of patterns specifying which files to exclude. If ``None``, no files are excluded.
+        Glob patterns for pruning the file set.
+    include_patterns : str | set[str] | None
+        Glob patterns for including files in the output.
     branch : str | None
-        The branch to clone and ingest (default: the default branch).
+        Git branch to clone and ingest (default: the default branch).
     tag : str | None
-        The tag to clone and ingest. If ``None``, no tag is used.
+        Git tag to clone and ingest. If ``None``, no tag is used.
     include_gitignored : bool
         If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
     include_submodules : bool
@@ -153,7 +159,7 @@ def ingest(
         GitHub personal access token (PAT) for accessing private repositories.
         Can also be set via the ``GITHUB_TOKEN`` environment variable.
     output : str | None
-        File path where the summary and content should be written.
+        File path where the summary and content are written.
         If ``"-"`` (dash), the results are written to ``stdout``.
         If ``None``, the results are not written to a file.
 
@@ -177,8 +183,8 @@ def ingest(
             max_files=max_files,
             max_total_size_bytes=max_total_size_bytes,
             max_directory_depth=max_directory_depth,
-            include_patterns=include_patterns,
             exclude_patterns=exclude_patterns,
+            include_patterns=include_patterns,
             branch=branch,
             tag=tag,
             include_gitignored=include_gitignored,

src/gitingest/ingestion.py

Lines changed: 6 additions & 7 deletions
@@ -5,7 +5,6 @@
 from pathlib import Path
 from typing import TYPE_CHECKING
 
-from gitingest.config import MAX_DIRECTORY_DEPTH, MAX_FILES, MAX_TOTAL_SIZE_BYTES
 from gitingest.output_formatter import format_node
 from gitingest.schemas import FileSystemNode, FileSystemNodeType, FileSystemStats
 from gitingest.utils.ingestion_utils import _should_exclude, _should_include
@@ -113,7 +112,7 @@ def _process_node(node: FileSystemNode, query: IngestionQuery, stats: FileSystem
             if sub_path.stat().st_size > query.max_file_size:
                 print(f"Skipping file {sub_path}: would exceed max file size limit")
                 continue
-            _process_file(path=sub_path, parent_node=node, stats=stats, local_path=query.local_path, query=query)
+            _process_file(path=sub_path, parent_node=node, stats=stats, query=query)
         elif sub_path.is_dir():
             child_directory_node = FileSystemNode(
                 name=sub_path.name,
@@ -167,7 +166,7 @@ def _process_symlink(path: Path, parent_node: FileSystemNode, stats: FileSystemS
     parent_node.file_count += 1
 
 
-def _process_file(path: Path, parent_node: FileSystemNode, stats: FileSystemStats, local_path: Path, query: IngestionQuery) -> None:
+def _process_file(path: Path, parent_node: FileSystemNode, stats: FileSystemStats, query: IngestionQuery) -> None:
     """Process a file in the file system.
 
     This function checks the file's size, increments the statistics, and reads its content.
@@ -181,8 +180,6 @@ def _process_file(path: Path, parent_node: FileSystemNode, stats: FileSystemStat
         The dictionary to accumulate the results.
     stats : FileSystemStats
         Statistics tracking object for the total file count and size.
-    local_path : Path
-        The base path of the repository or directory being processed.
     query : IngestionQuery
         The query object containing the limit configurations.
 
@@ -193,7 +190,9 @@ def _process_file(path: Path, parent_node: FileSystemNode, stats: FileSystemStat
 
     file_size = path.stat().st_size
     if stats.total_size + file_size > query.max_total_size_bytes:
-        print(f"Skipping file {path}: would exceed total size limit")
+        print(
+            f"Skipping file {path}: would exceed total size limit ({query.max_total_size_bytes / 1024 / 1024:.1f}MB)",
+        )
         return
 
     stats.total_files += 1
@@ -204,7 +203,7 @@ def _process_file(path: Path, parent_node: FileSystemNode, stats: FileSystemStat
         type=FileSystemNodeType.FILE,
         size=file_size,
         file_count=1,
-        path_str=str(path.relative_to(local_path)),
+        path_str=str(path.relative_to(query.local_path)),
         path=path,
         depth=parent_node.depth + 1,
     )
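
Dropping `local_path` works because the query object already carries both the base path and the size limit. The sketch below illustrates the resulting check under simplifying assumptions. `Stats`, `Query`, and `process_file` are hypothetical stand-ins for `FileSystemStats`, `IngestionQuery`, and `_process_file`. The file size is passed in rather than read from disk, `PurePosixPath` is used so the result does not depend on the host OS, and the `total_size` update is assumed from the docstring rather than shown in the diff.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Stats:
    """Hypothetical stand-in for FileSystemStats."""

    total_files: int = 0
    total_size: int = 0


@dataclass
class Query:
    """Hypothetical stand-in for IngestionQuery (only the fields used here)."""

    local_path: PurePosixPath
    max_total_size_bytes: int


def process_file(path: PurePosixPath, file_size: int, stats: Stats, query: Query) -> str | None:
    """Return the path relative to the query root, or None if the file is skipped."""
    if stats.total_size + file_size > query.max_total_size_bytes:
        limit_mb = query.max_total_size_bytes / 1024 / 1024
        print(f"Skipping file {path}: would exceed total size limit ({limit_mb:.1f}MB)")
        return None

    stats.total_files += 1
    stats.total_size += file_size  # assumed bookkeeping; not visible in the hunk
    # The base path now comes from the query instead of a separate argument.
    return str(path.relative_to(query.local_path))
```

With a 100-byte limit, a first 60-byte file is accepted and a second one is skipped, leaving the running totals untouched by the skipped file.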
