split-python4gpt is a Python tool designed to process and reorganize large Python projects into minified, type-annotated, and token-limited files. This is particularly useful for preparing Python codebases for analysis or processing by Large Language Models (LLMs) like OpenAI’s GPT series, allowing them to handle the data in manageable chunks.
## What is split-python4gpt?
It’s a command-line and programmatic tool that takes a Python file or an entire project directory as input and performs several operations:
- Uses `pytype` to infer type hints and add them to your code.
- Minifies the code with `python-minifier`, with granular control over various minification aspects (removing docstrings, comments, annotations, renaming variables, etc.).
- Optionally replaces the bodies of large functions and classes with `...` and a concise, AI-generated summary (requires an OpenAI API key).
- Splits the processed code into token-limited chunks.

## Who is it for?
## Why is it useful?
LLMs impose token limits on their context windows, and large codebases rarely fit within them. split-python4gpt breaks down large codebases into chunks that fit these limits.

Key features:

- Processes a single Python file or all `.py` files in a project.
- Runs `pytype` to add type annotations.
- Minifies code via `python-minifier` with numerous configurable options:
  - Remove docstrings (`mini_docs`).
  - Rename global names (`mini_globs`) and local names (`mini_locs`).
  - Hoist literal statements (`mini_lits`).
  - Remove annotations (`mini_annotations`).
  - Remove `assert` and debugging statements (`mini_asserts`, `mini_debug`).
  - Combine imports (`mini_imports`).
  - Remove the `object` base from classes (`mini_obj`).
  - Remove `pass` statements (`mini_pass`).
  - Convert positional to keyword arguments (`mini_posargs`).
  - Remove explicit `return None` (`mini_retnone`).
  - Remove the shebang line (`mini_shebang`).
- Optionally replaces large function/class bodies with `...` and a short summary generated via an OpenAI model (e.g., `gpt-3.5-turbo`).
- Uses `tiktoken` to count tokens (compatible with OpenAI models) and splits the combined, processed code from all input files into multiple output files, ensuring each part stays below a specified token limit.
- Generates type stubs (`.pyi` files).

## Prerequisites
- An OpenAI API key is required for the summarization feature. Set it as an environment variable: `export OPENAI_API_KEY="your_api_key_here"`.
- `pytype` is used for type inference. While it is listed as a dependency, ensure it is correctly installed and accessible in your environment, especially if you use virtual environments or specific Python versions. split-python4gpt looks for a Python executable matching the version it is configured for (default 3.10, e.g., `python3.10`).
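You can confirm that the interpreter lookup will succeed (assuming the default 3.10 configuration) with:

```bash
# Check that a python3.10 executable is on PATH (the default version
# split-python4gpt looks for; adjust if you configured another).
python3.10 --version
```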
## Installation

Use our installation script for the easiest setup:

```bash
curl -sSL https://raw.githubusercontent.com/twardoch/split-python4gpt/main/scripts/install.sh | bash
```
This script will automatically detect your system and choose the best installation method (pip or binary).
Alternatively, install from PyPI. Optionally create and activate a virtual environment first:

```bash
python3.10 -m venv .venv
source .venv/bin/activate
```

Then install split-python4gpt with pip:

```bash
pip install split-python4gpt
```
This will also install its dependencies: fire, tiktoken, python-minifier, pytype, and simpleaichat.
Alternatively, download the latest binary for your platform from the releases page:
- `mdsplit4gpt-linux-x86_64`
- `mdsplit4gpt-macos-x86_64`
- `mdsplit4gpt-windows-x86_64.exe`

Make the binary executable and move it to a directory in your `PATH`:
```bash
# Linux/macOS
chmod +x mdsplit4gpt-linux-x86_64
mv mdsplit4gpt-linux-x86_64 ~/.local/bin/mdsplit4gpt

# Windows
# Simply run the .exe file or add it to your PATH
```
For developers, or if you want the latest features, install from source:
```bash
git clone https://github.com/twardoch/split-python4gpt.git
cd split-python4gpt
./scripts/install-dev.sh
```
## Usage

split-python4gpt can be used both as a command-line tool and programmatically in your Python scripts.
The primary command is `mdsplit4gpt`:

```bash
mdsplit4gpt [PATH_OR_FOLDER] [OPTIONS]
```
### Key Arguments & Options
- `path_or_folder` (str | Path): Path to the input Python file or folder.
- `--out` (str | Path | None): Output folder for the processed files. Defaults to the input folder (modifies files in place if not set).
- `--pyis` (str | Path | None): Directory for storing generated `.pyi` files (type stubs from pytype). Defaults to the output folder.
- `--types` (bool, default: True): Infer types using pytype. Set `--types=False` to disable.
- `--mini` (bool, default: True): Minify the Python scripts. Set `--mini=False` to disable.

Minification options (these apply when `--mini` is True, unless noted otherwise):
- `--mini_docs` (bool): Remove docstrings.
- `--mini_globs` (bool, default: False): Rename global names.
- `--mini_locs` (bool, default: False): Rename local names.
- `--mini_lits` (bool): Hoist literal statements. (Note: the python-minifier default for this is False, but split-python4gpt defaults it to True via its main function's argument default; the `PyTypingMinifier` class itself uses `hoist_literals=False` internally for minify calls if not overridden.)
- `--mini_annotations` (bool): Remove annotations.
- `--mini_asserts` (bool): Remove asserts.
- `--mini_debug` (bool): Remove debugging statements.
- `--mini_imports` (bool): Combine imports.
- `--mini_obj` (bool): Remove the `object` base from classes.
- `--mini_pass` (bool): Remove `pass` statements.
- `--mini_posargs` (bool): Convert positional to keyword arguments.
- `--mini_retnone` (bool): Remove explicit `return None` statements.
- `--mini_shebang` (bool): Remove the shebang line. (Set `--mini_shebang=False` to preserve it.)
LLM splitting options are handled by the `PyLLMSplitter` class (implicitly used by `mdsplit4gpt`): the CLI does not expose `gptok_model`, `gptok_limit`, or `gptok_threshold` yet. These are currently hardcoded or have defaults in `PyLLMSplitter`. For custom LLM splitting parameters, programmatic usage is recommended.
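For instance, combining several of the minification flags documented above into a single invocation (an illustrative command; the flag values are arbitrary):

```bash
mdsplit4gpt my_project/ --out processed/ --mini_globs=True --mini_locs=True --mini_docs=False
```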
### Example Usage

Process a single file, `my_script.py`, writing results to `output_dir`:
```bash
mdsplit4gpt my_script.py --out output_dir
```
This will create `output_dir/my_script.py` (processed) and `output_dir/split4gpt/split1.py` (and potentially more splits).
Process the folder `my_project/`, disable type inference, keep docstrings, and write output to `processed_project/`:
```bash
mdsplit4gpt my_project/ --out processed_project/ --types=False --mini_docs=False
```
This will create `processed_project/my_project/...` (processed files) and `processed_project/my_project/split4gpt/split1.py`, etc.
### Programmatic Usage

You can use the core classes `PyTypingMinifier` and `PyLLMSplitter` directly in your Python code for more control.
```python
from pathlib import Path

from split_python4gpt import PyLLMSplitter  # or PyTypingMinifier for just types/minification

# Ensure OPENAI_API_KEY is set as an environment variable
# if using the summarization features:
# import os
# os.environ["OPENAI_API_KEY"] = "your_api_key"

# Initialize the splitter; you can specify gptok_model,
# gptok_limit, and gptok_threshold here.
splitter = PyLLMSplitter(
    gptok_model="gpt-3.5-turbo",
    gptok_limit=4000,
    gptok_threshold=200,  # code sections over this token count may be summarized
)

input_path = "path/to/your/python_project_or_file"
output_dir = "path/to/output_directory"
pyi_dir = "path/to/pyi_files_directory"  # can be the same as output_dir

# Process the Python code. Minifier options can be passed as
# keyword arguments, e.g. remove_literal_statements=False.
processed_file_paths = splitter.process_py(
    py_path_or_folder=input_path,
    out_py_folder=output_dir,
    pyi_folder=pyi_dir,
    types=True,  # enable type inference
    mini=True,   # enable minification
    # Minifier options:
    remove_literal_statements=True,  # equivalent to mini_docs=True
    rename_globals=False,
    # ... other minifier options from python-minifier ...
)

# Write the split files for LLM consumption; this creates a
# 'split4gpt' subdirectory inside output_dir.
splitter.write_splits()

print(f"Processed files: {processed_file_paths}")
print(f"LLM splits written to: {Path(output_dir) / 'split4gpt'}")
```
## How It Works

The tool operates in several stages:
1. **Input discovery**: If a folder is given, all `*.py` files within that folder are collected.
2. **Setup** (`PyTypingMinifier.init_folders`, `PyTypingMinifier.init_code_data`):
   - The output and `.pyi` (type stub) directories are resolved and created if they don't exist.
   - Input files are copied into the output folder when `out` is different from the input path.
3. **Type inference and minification** (`PyTypingMinifier.process_py`, which calls `infer_types` and `minify`):
   - If `types=True`, pytype is invoked as a subprocess for the current file.
   - pytype generates a `.pyi` stub file.
   - The `.pyi` file is then merged back into the Python source code using `pytype.tools.merge_pyi`.
   - Errors during pytype execution are caught and a warning is logged; processing continues.
   - If `mini=True`, the (potentially type-annotated) code is passed to python-minifier.
4. **Summarization** (`PyLLMSplitter.process_py_code`):
   - This stage applies when `PyLLMSplitter` is used (which is the case for the `mdsplit4gpt` CLI tool).
   - For each top-level function (`FunctionDef`) or class (`ClassDef`):
     - Its token count is measured with `tiktoken`.
     - If it exceeds `gptok_threshold` (default 128):
       - `PyBodySummarizer` (an `ast.NodeTransformer`) is invoked. It attempts to generate a concise summary of the function/class body using `simpleaichat` (which calls an OpenAI GPT model), and the body is replaced with that summary and `...`.
5. **Splitting** (`PyLLMSplitter.write_splits`):
   - A `# File: <original_filepath>` comment is written before the sections of each new file.
   - Whenever adding the next section would exceed `gptok_limit` (default based on `gptok_model`, e.g., 4096 for gpt-3.5-turbo):
     - A new `splitN.py` file (e.g., `split1.py`, `split2.py`) is started in a `split4gpt` subdirectory within the main output folder.
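To make stages 4 and 5 concrete, a generated split file might look roughly like this (an illustrative sketch; the file name, function names, and summary wording are hypothetical, and only the `# File:` comment format is documented above):

```python
# File: my_project/parser.py

def parse_config(path: str) -> dict:
    """Reads the config file at `path` and returns it as a dict (AI-generated summary)."""
    ...

class ConfigValidator:
    """Validates parsed config dicts against the expected schema (AI-generated summary)."""
    ...
```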
### Output Structure

- **Processed `.py` files**: If an `out` directory is specified, minified/type-annotated versions of your original Python files are placed there, maintaining the original directory structure. If `out` is not specified, the original files are modified in place (use with caution!).
- **`.pyi` files**: If a `pyis` directory is specified (it defaults to the `out` directory), pytype generates `.pyi` stub files there (typically within a `.pytype/pyi/` subfolder structure).
- **`split4gpt` directory**: Inside the `out` directory (or the input directory if `out` is not set), a `split4gpt` subdirectory is created. It contains the `splitN.py` files, which are the final chunks intended for LLMs.
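Assuming the single-file example from above (`mdsplit4gpt my_script.py --out output_dir`), the resulting layout might look like this (illustrative; the exact `.pytype` subfolder structure depends on pytype):

```
output_dir/
├── my_script.py        # processed (type-annotated, minified) copy
├── .pytype/
│   └── pyi/            # generated .pyi type stubs
└── split4gpt/
    ├── split1.py       # token-limited chunks for LLMs
    └── split2.py       # ...more splits if needed
```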
### Core Classes

- **`PyTypingMinifier`**: Handles folder setup, input copying, and `.pyi` stubs; wraps pytype for type inference and python-minifier for code minification.
- **`PyBodySummarizer`** (an `ast.NodeTransformer`): Used by `PyLLMSplitter`. Visits `FunctionDef` and `ClassDef` nodes in an AST and replaces oversized bodies with `...` and an AI-generated docstring summary.
- **`PyLLMSplitter`** (subclass of `PyTypingMinifier`):
  - Uses `tiktoken` to count tokens accurately for OpenAI models.
  - Invokes `PyBodySummarizer` to condense oversized code elements.
  - Splits the processed code into files (`splitN.py`) based on `gptok_limit`.
  - Uses `simpleaichat` to interact with the OpenAI API for the summarization feature.
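As a minimal sketch of the token counting that drives the summarization and splitting decisions (this uses `tiktoken` directly; how `PyLLMSplitter` wires it up internally is an assumption):

```python
import tiktoken

# Look up the tokenizer for the target model and count tokens.
# split-python4gpt compares counts like this against
# gptok_threshold and gptok_limit.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
source = "def add(a: int, b: int) -> int:\n    return a + b\n"
print(len(encoding.encode(source)))  # token count for this snippet
```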
## Contributing

Contributions are welcome! Please follow these guidelines:

Create a branch with `git checkout -b feature/your-feature-name` or `git checkout -b fix/your-bug-fix`, then set up the development environment:

```bash
./scripts/install-dev.sh
```
Once your environment is set up, run the tests:
```bash
# Run all tests
./scripts/build-and-test.sh

# Run with coverage
./scripts/build-and-test.sh --with-coverage

# Run performance tests
./scripts/build-and-test.sh --with-performance

# Run individual test categories
pytest -v                    # all tests
pytest -v -m performance     # performance tests only
pytest -v tests/test_cli.py  # CLI tests only
```
Additional guidelines:

- Code is formatted with `black` and `isort`; `flake8` is used for linting.
- `pre-commit` hooks run automatically before commits.
- Add tests for your changes in the `tests/` directory.
- Push your branch with `git push origin feature/your-feature-name` and open a pull request against the `main` branch of the original repository.

## Releases

This project uses git-tag-based semantic versioning with automated releases:
```bash
./scripts/release.sh 1.2.3
```
This script creates a new tagged release.
Available scripts:

- `./scripts/install-dev.sh` - Development environment setup
- `./scripts/build-and-test.sh` - Comprehensive testing
- `./scripts/release.sh <version>` - Create a new release
- `./scripts/get_version.py` - Get the current version
- `./scripts/validate_tag.py <version>` - Validate version format

## License

This project is licensed under the Apache License 2.0. See the LICENSE.txt file for details.
This project was scaffolded using PyScaffold.