split-python4gpt


split-python4gpt is a Python tool designed to process and reorganize large Python projects into minified, type-annotated, and token-limited files. This is particularly useful for preparing Python codebases for analysis or processing by Large Language Models (LLMs) like OpenAI’s GPT series, allowing them to handle the data in manageable chunks.

Overview

What is split-python4gpt?

It’s a command-line and programmatic tool that takes a Python file or an entire project directory as input and performs several operations:

  1. Type Inference: Optionally integrates with pytype to infer type hints and add them to your code.
  2. Minification: Optionally minifies the Python code using python-minifier, with granular control over various minification aspects (removing docstrings, comments, annotations, renaming variables, etc.).
  3. Code Summarization: For functions or classes exceeding a certain token threshold, their bodies can be replaced with ... and a concise, AI-generated summary (requires an OpenAI API key).
  4. Splitting for LLMs: The processed code (potentially from multiple files) is then split into smaller text files, each respecting a specified token limit, making it suitable for LLMs with context window constraints (the token-counting snippet after this list shows how such limits are measured).
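
As a rough illustration of why splitting is needed, you can check how many tokens a file occupies with tiktoken, the same tokenizer library this tool uses internally (the model name below is just an example):

import tiktoken
from pathlib import Path

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
source = Path("my_script.py").read_text()
print(len(enc.encode(source)), "tokens")  # compare against your model's context window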

Who is it for?

Why is it useful?

Features

Installation

Prerequisites:

Use our installation script for the easiest setup:

curl -sSL https://raw.githubusercontent.com/twardoch/split-python4gpt/main/scripts/install.sh | bash

This script will automatically detect your system and choose the best installation method (pip or binary).

Manual Installation Options

Option 1: Install from PyPI (Python Package)

  1. It is recommended to install the tool in a virtual environment:
    python3.10 -m venv .venv
    source .venv/bin/activate
    
  2. Install split-python4gpt using pip:
    pip install split-python4gpt
    

    This will also install its dependencies: fire, tiktoken, python-minifier, pytype, and simpleaichat.

Option 2: Download Pre-built Binary

Download the latest binary for your platform from the releases page:

Make the binary executable and move it to a directory in your PATH:

# Linux/macOS
chmod +x mdsplit4gpt-linux-x86_64
mv mdsplit4gpt-linux-x86_64 ~/.local/bin/mdsplit4gpt

# Windows
# Simply run the .exe file or add it to your PATH

Option 3: Install from Source

For developers or if you want the latest features:

git clone https://github.com/twardoch/split-python4gpt.git
cd split-python4gpt
./scripts/install-dev.sh

Usage

split-python4gpt can be used both as a command-line tool and programmatically in your Python scripts.

Command-Line Interface (CLI)

The primary command is mdsplit4gpt.

mdsplit4gpt [PATH_OR_FOLDER] [OPTIONS]

Key Arguments & Options:

Example Usage:

  1. Process a single file, minify and infer types, output to output_dir:
    mdsplit4gpt my_script.py --out output_dir
    

    This will create output_dir/my_script.py (processed) and output_dir/split4gpt/split1.py (and potentially more splits).

  2. Process an entire project in my_project/, disable type inference, keep docstrings, output to processed_project/:
    mdsplit4gpt my_project/ --out processed_project/ --types=False --mini_docs=False
    

    This will create processed_project/my_project/... (processed files) and processed_project/my_project/split4gpt/split1.py, etc.

Programmatic Usage

You can use the core classes PyTypingMinifier and PyLLMSplitter directly in your Python code for more control.

from pathlib import Path
from split_python4gpt import PyLLMSplitter # Or PyTypingMinifier for just types/minification

# Ensure OPENAI_API_KEY is set as an environment variable if using summarization features
# import os
# os.environ["OPENAI_API_KEY"] = "your_api_key"

# Initialize the splitter
# You can specify gptok_model, gptok_limit, gptok_threshold here
splitter = PyLLMSplitter(
    gptok_model="gpt-3.5-turbo",
    gptok_limit=4000,
    gptok_threshold=200 # Code sections over this token count might be summarized
)

input_path = "path/to/your/python_project_or_file"
output_dir = "path/to/output_directory"
pyi_dir = "path/to/pyi_files_directory" # Can be the same as output_dir

# Process the Python code
# minify_options can be passed as kwargs, e.g., remove_literal_statements=False
processed_file_paths = splitter.process_py(
    py_path_or_folder=input_path,
    out_py_folder=output_dir,
    pyi_folder=pyi_dir,
    types=True,  # Enable type inference
    mini=True,   # Enable minification
    # Minifier options:
    remove_literal_statements=True, # Equivalent to mini_docs=True
    rename_globals=False,
    # ... other minifier options from python-minifier ...
)

# Write the split files for LLM consumption
splitter.write_splits() # This will create a 'split4gpt' subdirectory in output_dir

print(f"Processed files: {processed_file_paths}")
print(f"LLM splits written to: {Path(output_dir) / 'split4gpt'}")

Technical Deep Dive

How it Works

The tool operates in several stages:

  1. File Discovery:
    • If a single file path is provided, it’s processed.
    • If a folder path is provided, it recursively finds all *.py files within that folder.
  2. Initialization (PyTypingMinifier.init_folders, PyTypingMinifier.init_code_data):
    • Input, output, and .pyi (type stub) directories are resolved and created if they don’t exist.
    • Original files are copied to the output directory if out is different from the input path.
    • Data structures are prepared to hold code content and paths.
  3. Processing per file (PyTypingMinifier.process_py which calls infer_types and minify):
    • Type Inference (optional):
      • If types=True, pytype is invoked as a subprocess for the current file.
      • pytype generates a .pyi stub file.
      • The content of this .pyi file is then merged back into the Python source code using pytype.tools.merge_pyi.
      • Errors during pytype execution are caught, and a warning is logged; processing continues.
    • Minification (optional):
      • If mini=True, the (potentially type-annotated) code is passed to python-minifier.
      • Various minification options (passed from the CLI or programmatic call) control the minifier’s behavior (e.g., removing docstrings, renaming variables).
  4. Code Summarization and Sectioning for LLMs (PyLLMSplitter.process_py_code):
    • This step occurs after the initial type inference and minification if PyLLMSplitter is used (which is the case for the mdsplit4gpt CLI tool).
    • The code of each file is parsed into an Abstract Syntax Tree (AST).
    • Top-level nodes (imports, variable assignments, functions, classes) are processed.
    • For each function (FunctionDef) or class (ClassDef):
      • Its source code is minified (again, with docstrings preserved temporarily for summarization context).
      • Its token count is calculated using tiktoken.
      • If the token count exceeds gptok_threshold (default 128):
        • The PyBodySummarizer (an ast.NodeTransformer) is invoked.
        • PyBodySummarizer attempts to generate a concise summary of the function/class body using simpleaichat (which calls an OpenAI GPT model).
        • The original body of the function/class is replaced in the AST with this summary (as a docstring) and an ellipsis (...).
        • The modified AST node (with summarized body) is then converted back to minified source code.
    • The file is thus broken down into a list of “sections,” each being a string of minified Python code (e.g., an import block, a variable assignment, a function definition, a summarized function definition). Each section has its token count. (A simplified sketch of this sectioning step appears after this list.)
  5. Splitting for LLMs (PyLLMSplitter.write_splits):
    • All processed sections from all input files are collected.
    • The tool iterates through these sections, prepending a # File: <original_filepath> comment before the sections of each new file.
    • It accumulates sections into a “portion” of text, keeping track of the current token size.
    • If adding the next section (plus its file header if it’s from a new file) would exceed gptok_limit (default based on gptok_model, e.g., 4096 for gpt-3.5-turbo):
      • The current portion is written to a new file: splitN.py (e.g., split1.py, split2.py) in a split4gpt subdirectory within the main output folder.
      • A new portion is started.
    • Any remaining portion is written to a final split file. (A simplified sketch of this packing logic also appears after this list.)
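
The sectioning and summarization of step 4 can be illustrated with a short, self-contained sketch. This is not the project's actual PyBodySummarizer, which asks an OpenAI model (via simpleaichat) to write the summary; here a fixed placeholder stands in for the summary text, but the AST transformation, token counting with tiktoken, and minification with python-minifier follow the same idea:

import ast

import python_minifier
import tiktoken

ENC = tiktoken.encoding_for_model("gpt-3.5-turbo")
TOKEN_THRESHOLD = 128  # mirrors the documented default gptok_threshold

class BodyStubber(ast.NodeTransformer):
    """Replace oversized function bodies with a placeholder docstring and `...`."""

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        if len(ENC.encode(ast.unparse(node))) > TOKEN_THRESHOLD:
            # The real PyBodySummarizer generates this summary with an OpenAI model.
            node.body = [
                ast.Expr(value=ast.Constant(value="AI-generated summary would go here.")),
                ast.Expr(value=ast.Constant(value=...)),
            ]
        return node

def sectionize(source: str) -> list[tuple[str, int]]:
    """Split a module into minified top-level sections, each with its token count."""
    tree = BodyStubber().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    sections = []
    for top_level in tree.body:
        code = python_minifier.minify(ast.unparse(top_level))
        sections.append((code, len(ENC.encode(code))))
    return sections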
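
Step 5 is essentially a greedy packing problem. A simplified version follows, assuming each section is a (file path, code, token count) tuple such as sectionize() above could produce once tagged with its source file; unlike the real implementation, it does not re-emit the # File: header when a file's sections continue into a new split.

from pathlib import Path

def write_splits(sections, out_dir: Path, token_limit: int = 4096) -> None:
    """Greedily pack (filepath, code, tokens) sections into splitN.py files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    portion, portion_tokens, split_no = [], 0, 1
    previous_file = None
    for filepath, code, tokens in sections:
        # Start a new split file when the next section would push it over the limit.
        if portion and portion_tokens + tokens > token_limit:
            (out_dir / f"split{split_no}.py").write_text("\n".join(portion))
            portion, portion_tokens, split_no = [], 0, split_no + 1
        if filepath != previous_file:
            portion.append(f"# File: {filepath}")
            previous_file = filepath
        portion.append(code)
        portion_tokens += tokens  # the real tool also counts the header's tokens
    if portion:
        (out_dir / f"split{split_no}.py").write_text("\n".join(portion))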

Output Structure:

Core Components

Contributing

Contributions are welcome! Please follow these guidelines:

Development Setup

  1. Fork the repository on GitHub.
  2. Create a new branch for your feature or bug fix: git checkout -b feature/your-feature-name or git checkout -b fix/your-bug-fix.
  3. Set up the development environment:
    ./scripts/install-dev.sh
    

    This script will:

    • Create a virtual environment with Python 3.10
    • Install the package in development mode
    • Install all testing and development dependencies
    • Set up pre-commit hooks

Development Workflow

  1. Make your changes.
  2. Run tests and checks:
    # Run all tests
    ./scripts/build-and-test.sh
        
    # Run with coverage
    ./scripts/build-and-test.sh --with-coverage
        
    # Run performance tests
    ./scripts/build-and-test.sh --with-performance
        
    # Run individual test categories
    pytest -v                    # All tests
    pytest -v -m performance     # Performance tests only
    pytest -v tests/test_cli.py  # CLI tests only
    
  3. Code quality standards:
    • Code is formatted with black
    • Imports are sorted with isort
    • Follow PEP 8 guidelines
    • flake8 is used for linting
    • pre-commit hooks run automatically before commits
  4. Add tests for your changes in the tests/ directory.
  5. Commit your changes with a clear and descriptive commit message.
  6. Push your branch to your fork: git push origin feature/your-feature-name.
  7. Create a Pull Request (PR) against the main branch of the original repository.

Release Process

This project uses git-tag-based semantic versioning with automated releases:

  1. For maintainers creating releases:
    ./scripts/release.sh 1.2.3
    

    This script will:

    • Validate the version format
    • Run comprehensive tests
    • Update the changelog
    • Create and push a git tag
    • Trigger GitHub Actions for automated release
  2. Automated CI/CD:
    • On every push/PR: Tests run on Linux, macOS, and Windows
    • On git tags: Full release pipeline creates:
      • PyPI package publication
      • Multi-platform binary builds
      • GitHub release with assets
      • Automated changelog generation
  3. Available scripts:
    • ./scripts/install-dev.sh - Development environment setup
    • ./scripts/build-and-test.sh - Comprehensive testing
    • ./scripts/release.sh <version> - Create a new release
    • ./scripts/get_version.py - Get current version
    • ./scripts/validate_tag.py <version> - Validate version format

License

This project is licensed under the Apache License 2.0. See the LICENSE.txt file for details.

Authors

This project was scaffolded using PyScaffold.