Development

Contributing to Cerebrate File development

Getting Started
1. Prerequisites
2. Setting Up Development Environment
Project Structure
Development Workflow
Coding Standards
Testing Guidelines
1. Test Structure
2. Integration Tests
Performance Optimization
1. Profiling
2. Memory Optimization
Release Process
Contributing Guidelines
Debugging Tips
Resources
License

Getting Started

Prerequisites

Python 3.9+ (recommended: 3.12)
uv package manager
Git
GitHub account (for contributions)

Setting Up Development Environment

Clone the repository:

git clone https://github.com/twardoch/cerebrate-file.git
cd cerebrate-file

Create virtual environment:

uv venv --python 3.12
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies:

uv pip install -e .
uv add --dev pytest pytest-cov pytest-mock rich loguru

Set up pre-commit hooks (optional but recommended):
```
uv add --dev pre-commit
pre-commit install
```

Project Structure

cerebrate-file/
├── src/
│   └── cerebrate_file/
│       ├── __init__.py         # Package initialization
│       ├── cli.py              # CLI interface
│       ├── api_client.py       # Cerebras API client
│       ├── cerebrate_file.py   # Core processing logic
│       ├── chunking.py         # Chunking strategies
│       ├── config.py           # Configuration management
│       ├── constants.py        # Constants and defaults
│       ├── models.py           # Data models
│       ├── recursive.py        # Recursive processing
│       ├── tokenizer.py        # Token counting
│       ├── ui.py              # UI components
│       └── utils.py           # Utility functions
├── tests/
│   ├── test_api_client.py     # API client tests
│   ├── test_chunking.py       # Chunking tests
│   ├── test_cli.py            # CLI tests
│   ├── test_config.py         # Configuration tests
│   ├── test_integration.py    # Integration tests
│   ├── test_recursive.py      # Recursive processing tests
│   └── test_ui.py             # UI component tests
├── docs/                      # Documentation (Jekyll)
├── examples/                  # Example scripts
├── pyproject.toml             # Package configuration
├── README.md                  # Project README
├── CHANGELOG.md               # Version history
├── LICENSE                    # Apache 2.0 license
└── .gitignore                 # Git ignore rules

Development Workflow

1. Create a Feature Branch

git checkout -b feature/your-feature-name

2. Make Changes

Follow the coding standards and guidelines below.

3. Write Tests

Every new feature must have tests:

# tests/test_your_feature.py
import pytest
from cerebrate_file.your_module import your_function

def test_your_function():
    """Test basic functionality."""
    result = your_function("input")
    assert result == "expected"

def test_your_function_edge_case():
    """Test edge cases."""
    with pytest.raises(ValueError):
        your_function(None)

4. Run Tests

# Run all tests
python -m pytest

# Run with coverage
python -m pytest --cov=cerebrate_file --cov-report=term-missing

# Run specific test file
python -m pytest tests/test_your_feature.py -xvs

# Run tests in watch mode
uvx pytest-watch

5. Check Code Quality

# Format code
uvx ruff format src/ tests/

# Lint code
uvx ruff check src/ tests/ --fix

# Type checking
uvx mypy src/cerebrate_file

# Security scan
uvx bandit -r src/

6. Update Documentation

Update relevant documentation in docs/
Update README.md if needed
Add to CHANGELOG.md

7. Commit Changes

git add .
git commit -m "feat: add your feature description"

Use conventional commits:

feat: New feature
fix: Bug fix
docs: Documentation
style: Formatting
refactor: Code restructuring
test: Tests
chore: Maintenance

8. Push and Create PR

git push origin feature/your-feature-name

Then create a pull request on GitHub.

Coding Standards

Python Style Guide

Follow PEP 8 with these specifics:

# this_file: src/cerebrate_file/example.py
"""Module docstring describing purpose."""

from typing import Optional, List, Dict
from pathlib import Path

# Constants in UPPER_CASE
DEFAULT_CHUNK_SIZE = 32000
MAX_RETRIES = 3

class ExampleClass:
    """Class docstring with description."""

    def __init__(self, param: str) -> None:
        """Initialize with parameter."""
        self.param = param

    def process(self, data: str) -> str:
        """
        Process data with clear description.

        Args:
            data: Input data to process

        Returns:
            Processed result

        Raises:
            ValueError: If data is invalid
        """
        if not data:
            raise ValueError("Data cannot be empty")

        # Clear comment explaining logic
        result = self._transform(data)
        return result

    def _transform(self, data: str) -> str:
        """Private method with underscore prefix."""
        return data.upper()

Type Hints

Always use type hints:

from typing import Optional, List, Dict, Union, Tuple

def process_files(
    paths: List[Path],
    options: Optional[Dict[str, str]] = None
) -> Tuple[List[str], List[str]]:
    """Process multiple files."""
    successes: List[str] = []
    failures: List[str] = []

    for path in paths:
        try:
            result = process_single(path, options or {})
            successes.append(result)
        except Exception as e:
            failures.append(str(e))

    return successes, failures

Error Handling

Use specific exceptions:

class CerebrateError(Exception):
    """Base exception for cerebrate-file."""
    pass

class ConfigurationError(CerebrateError):
    """Configuration related errors."""
    pass

class APIError(CerebrateError):
    """API related errors."""
    pass

def validate_config(config: Dict) -> None:
    """Validate configuration."""
    if not config.get("api_key"):
        raise ConfigurationError("API key is required")

    if config.get("chunk_size", 0) < 1000:
        raise ConfigurationError("Chunk size must be at least 1000")

Logging

Use loguru for logging:

from loguru import logger

def process_document(path: str, verbose: bool = False) -> str:
    """Process document with logging."""
    if verbose:
        logger.enable("cerebrate_file")
    else:
        logger.disable("cerebrate_file")

    logger.debug(f"Processing {path}")

    try:
        result = do_processing(path)
        logger.info(f"Successfully processed {path}")
        return result
    except Exception as e:
        logger.error(f"Failed to process {path}: {e}")
        raise

Documentation

Write comprehensive docstrings:

def complex_function(
    input_data: str,
    chunk_size: int = 32000,
    strategy: str = "markdown"
) -> List[str]:
    """
    Split input data into processable chunks.

    This function takes large input data and splits it into smaller
    chunks suitable for processing by the AI model. It maintains
    context between chunks using overlap samples.

    Args:
        input_data: The raw input text to be chunked
        chunk_size: Maximum size of each chunk in tokens (default: 32000)
        strategy: Chunking strategy - 'text', 'semantic', 'markdown', or 'code'

    Returns:
        List of text chunks ready for processing

    Raises:
        ValueError: If input_data is empty or strategy is invalid
        TokenLimitError: If a single unit exceeds chunk_size

    Examples:
        >>> chunks = complex_function("Long text...", chunk_size=16000)
        >>> len(chunks)
        3

        >>> chunks = complex_function("# Markdown", strategy="markdown")
        >>> chunks[0].startswith("#")
        True

    Note:
        The actual chunk size may be slightly smaller than specified
        to avoid breaking in the middle of sentences or code blocks.
    """

Testing Guidelines

Test Structure

import pytest
from unittest.mock import Mock, patch
from cerebrate_file.module import function_to_test

class TestFeatureName:
    """Test suite for feature."""

    def setup_method(self):
        """Set up test fixtures."""
        self.test_data = "sample"

    def test_normal_case(self):
        """Test normal operation."""
        result = function_to_test(self.test_data)
        assert result == "expected"

    def test_edge_case(self):
        """Test edge cases."""
        assert function_to_test("") == ""
        assert function_to_test(None) is None

    @pytest.mark.parametrize("input,expected", [
        ("test1", "result1"),
        ("test2", "result2"),
        ("test3", "result3"),
    ])
    def test_multiple_cases(self, input, expected):
        """Test multiple scenarios."""
        assert function_to_test(input) == expected

    def test_error_handling(self):
        """Test error conditions."""
        with pytest.raises(ValueError, match="Invalid input"):
            function_to_test("invalid")

    @patch('cerebrate_file.module.external_function')
    def test_with_mock(self, mock_func):
        """Test with mocked dependencies."""
        mock_func.return_value = "mocked"
        result = function_to_test("input")
        mock_func.assert_called_once_with("input")
        assert result == "mocked"

Integration Tests

# tests/test_integration.py
import tempfile
from pathlib import Path
from cerebrate_file import process_document

def test_end_to_end_processing():
    """Test complete processing pipeline."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create test file
        input_file = Path(tmpdir) / "test.md"
        input_file.write_text("# Test\nContent")

        # Process
        output_file = Path(tmpdir) / "output.md"
        process_document(
            input_data=str(input_file),
            output_data=str(output_file),
            prompt="Add emoji"
        )

        # Verify
        assert output_file.exists()
        content = output_file.read_text()
        assert "Test" in content

Performance Optimization

Profiling

import cProfile
import pstats
from io import StringIO

def profile_function():
    """Profile function performance."""
    profiler = cProfile.Profile()
    profiler.enable()

    # Code to profile
    result = expensive_function()

    profiler.disable()
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative')
    stats.print_stats(10)
    print(stream.getvalue())

    return result

Memory Optimization

from memory_profiler import profile

@profile
def memory_intensive_function():
    """Monitor memory usage."""
    # Process in chunks to reduce memory
    for chunk in generate_chunks(large_data):
        process_chunk(chunk)
        del chunk  # Explicit cleanup

def generate_chunks(data, chunk_size=1000):
    """Generator to avoid loading all data."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

Release Process

1. Update Version

Edit pyproject.toml:

[project]
version = "1.1.0"

2. Update Changelog

Add to CHANGELOG.md:

## [1.1.0] - 2024-01-15

### Added
- New feature description

### Changed
- Modified behavior

### Fixed
- Bug fixes

3. Run Tests

python -m pytest --cov=cerebrate_file

4. Build Package

uv build

5. Test Package

uv pip install dist/cerebrate_file-1.1.0-py3-none-any.whl
cerebrate-file --version

6. Tag Release

git tag -a v1.1.0 -m "Release version 1.1.0"
git push origin v1.1.0

7. Publish to PyPI

uv publish

Contributing Guidelines

Code of Conduct

Be respectful and inclusive
Welcome newcomers
Focus on constructive feedback
Report inappropriate behavior

Pull Request Process

Fork the repository
Create feature branch
Write tests for new code
Ensure all tests pass
Update documentation
Submit pull request

Pull Request Template

## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Tests pass locally
- [ ] Added new tests
- [ ] Coverage maintained

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-reviewed code
- [ ] Updated documentation
- [ ] Added to CHANGELOG.md

Debugging Tips

Using pdb

import pdb

def debug_function(data):
    """Debug with pdb."""
    pdb.set_trace()  # Breakpoint
    result = process(data)
    return result

Verbose Logging

from loguru import logger
import sys

# Configure detailed logging
logger.remove()
logger.add(
    sys.stderr,
    format="<green>{time:HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan>:<cyan>{line}</cyan> - <level>{message}</level>",
    level="DEBUG"
)

Environment Variables

# Enable debug mode
export CEREBRATE_DEBUG=1
export LOGURU_LEVEL=DEBUG

# Run with debugging
python -m cerebrate_file.cli --verbose

Resources

Documentation

Tools

uv - Fast Python package manager
ruff - Fast Python linter
mypy - Static type checker
pre-commit - Git hook framework

Community

License

By contributing, you agree that your contributions will be licensed under the Apache 2.0

Development

Table of contents

Getting Started

Prerequisites

Setting Up Development Environment

Project Structure

Development Workflow

1. Create a Feature Branch

2. Make Changes

3. Write Tests

4. Run Tests

5. Check Code Quality

6. Update Documentation

7. Commit Changes

8. Push and Create PR

Coding Standards

Python Style Guide

Type Hints

Error Handling

Logging

Documentation

Testing Guidelines

Test Structure

Integration Tests

Performance Optimization

Profiling

Memory Optimization

Release Process

1. Update Version

2. Update Changelog

3. Run Tests

4. Build Package

5. Test Package

6. Tag Release

7. Publish to PyPI

Contributing Guidelines

Code of Conduct

Pull Request Process

Pull Request Template

Debugging Tips

Using pdb

Verbose Logging

Environment Variables

Resources

Documentation

Tools

Community

License