Easy cloud compute for Comind (Coming Soon)

May 1, 2025

Claude summary: This post introduces a new way to run Comind in the cloud using Modal and vLLM that will soon be available. We’re creating simple deployment scripts that will handle API key management, container warming, and server deployment. If you don’t have access to powerful GPUs locally, you will soon be able to easily deploy Comind to Modal’s cloud infrastructure with just a few commands.

Cameron’s note

Hey – I’m trying something different here.

I vibe code Comind quite a bit. I’m going to have Claude write up our changes in this (and future) devlogs. Claude and I are going to write this together, hence the use of “we” and “us”. The tone is also occasionally stranger than I would write, but that’s AI shit for you.

Lots of this is boring and slop-y but it contains accurate information that will be useful later.

Running LLMs in the Cloud for Comind

A big issue with Comind is that it requires a set of structured output features that are not supported by commercial providers, so you have to run the model yourself. Most people don’t have a giant GPU like I do, so I wanted to provide a simple way to run the model in the cloud.

I’ve chosen Modal, a straightforward service for deploying a vLLM server. vLLM is probably the best inference server available and has tight integration with several structured output libraries. An additional benefit of Modal is that the deployed server can be accessed by any client that supports the OpenAI API.
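
As a concrete example of the structured output this relies on, vLLM’s OpenAI-compatible server supports guided JSON decoding. Here’s a rough sketch, assuming the endpoint from later in this post; the schema and prompt are just illustrative, not Comind’s actual ones:

from openai import OpenAI

client = OpenAI(
    api_key="comind-api-key",
    base_url="https://YOUR_WORKSPACE--comind-vllm-inference-serve-phi4.modal.run/v1",
)

# A toy schema; Comind's real schemas are more involved
schema = {
    "type": "object",
    "properties": {"concepts": {"type": "array", "items": {"type": "string"}}},
    "required": ["concepts"],
}

response = client.chat.completions.create(
    model="microsoft/Phi-4",
    messages=[{"role": "user", "content": "List three concepts related to memory."}],
    # vLLM's OpenAI-compatible server accepts guided decoding options via extra_body
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)  # JSON constrained to the schema above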

Upcoming Modal Deployment Process

I’m currently addressing some compatibility issues with Modal. Here’s the planned deployment process that will be available soon:

Option 1: Quick Interactive Setup (Recommended)

You’ll be able to use our deployment helper script that will guide you through the process:

# Deploy with interactive setup
python modal_deploy.py deploy

# Create a secret without deploying
python modal_deploy.py create-secret

# Check deployment status
python modal_deploy.py status

This will guide you through selecting models to deploy and handle container warming automatically.
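
For a rough idea of what the helper will do under the hood, here’s a sketch (not the final script); the subcommands mostly wrap the plain Modal CLI, and model selection plus warming are omitted here:

import argparse
import subprocess

def main():
    parser = argparse.ArgumentParser(description="Comind Modal deployment helper (sketch)")
    parser.add_argument("command", choices=["deploy", "create-secret", "status"])
    args = parser.parse_args()

    if args.command == "create-secret":
        key = input("API key to store in the comind-api-key secret: ")
        subprocess.run(["modal", "secret", "create", "comind-api-key", f"api_key={key}"], check=True)
    elif args.command == "deploy":
        subprocess.run(["modal", "deploy", "modal_inference.py"], check=True)
    elif args.command == "status":
        subprocess.run(["modal", "app", "list"], check=True)

if __name__ == "__main__":
    main()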

Manual Deployment

If you prefer to set things up manually, you’ll be able to:

  1. Create a Secret (recommended):

    modal secret create comind-api-key api_key="your-api-key-here"
  2. Deploy the application:

    modal deploy modal_inference.py
  3. Keep containers warm:

    python modal_deploy.py warm

Troubleshooting

If you encounter deployment errors once this feature is released:

  1. Secret not found: You’ll be able to either:

    • Create the secret as shown above
    • Continue without a secret (a default API key will be used)
  2. Deprecation warnings: These are informational notices about upcoming Modal API changes and shouldn’t affect functionality.

  3. Authorization errors: You’ll need to make sure your client is using the same API key as your server:

    @app.function(
        # Other parameters...
        secrets=[api_key_secret],
    )
  4. Access via environment variables:

    def get_api_key():
        import os
        return os.environ.get("api_key", "comind-api-key")

This approach avoids errors like AttributeError: 'Secret' object has no attribute 'get'.

Using the API

After deployment, you’ll have OpenAI-compatible endpoints for each model:

from openai import OpenAI

client = OpenAI(
    api_key="comind-api-key",  # Must match the server configuration
    base_url="https://YOUR_WORKSPACE--comind-vllm-inference-serve-phi4.modal.run/v1"
)

response = client.chat.completions.create(
    model="microsoft/Phi-4",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)

For testing, you’ll be able to use the included client script:

python modal_client.py --workspace YOUR_WORKSPACE --prompt "Tell me about Comind"
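
If you’re curious, the client script will be roughly along these lines (a sketch; the real modal_client.py may differ):

import argparse
from openai import OpenAI

def main():
    parser = argparse.ArgumentParser(description="Minimal test client for the Modal vLLM server")
    parser.add_argument("--workspace", required=True, help="Your Modal workspace name")
    parser.add_argument("--prompt", default="Hello, how are you?")
    parser.add_argument("--api-key", default="comind-api-key")
    args = parser.parse_args()

    base_url = f"https://{args.workspace}--comind-vllm-inference-serve-phi4.modal.run/v1"
    client = OpenAI(api_key=args.api_key, base_url=base_url)

    # vLLM serves a single model, so take the first (and only) entry
    model_id = client.models.list().data[0].id
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": args.prompt}],
    )
    print(response.choices[0].message.content)

if __name__ == "__main__":
    main()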

Configuring Comind

To connect your Comind instance to the Modal servers, update your .env file:

# LLM server info
COMIND_LLM_SERVER_URL="https://YOUR_WORKSPACE_NAME--comind-vllm-inference-serve-model.modal.run/v1/"
COMIND_LLM_SERVER_API_KEY="comind-api-key"

# Embedding server info
COMIND_EMBEDDING_SERVER_URL="https://YOUR_WORKSPACE_NAME--comind-vllm-inference-embeddings.modal.run/v1/"
COMIND_EMBEDDING_SERVER_API_KEY="comind-api-key"
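
To sanity-check that these values are picked up, here’s a minimal sketch that loads them with python-dotenv and builds a client (how Comind itself reads its config may differ):

import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # read the .env file in the current directory

llm_client = OpenAI(
    api_key=os.environ["COMIND_LLM_SERVER_API_KEY"],
    base_url=os.environ["COMIND_LLM_SERVER_URL"],
)
print(llm_client.models.list().data[0].id)  # should print the deployed model name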

The inference server will use the API key comind-api-key by default. You’ll be able to change this in the modal_inference.py script:

# Configuration options (can be modified as needed)
MINUTES = 60  # seconds
VLLM_PORT = 8000
API_KEY = "comind-api-key"  # Replace with a secret for production use

After deployment, you’ll be able to access your models through the OpenAI client:

from openai import OpenAI

client = OpenAI(
    api_key="comind-api-key",  # Must match API_KEY in modal_inference.py
    base_url="https://YOUR_WORKSPACE--comind-vllm-inference-serve-phi4.modal.run/v1"
)

# vLLM only supports one model at a time, so we need to get the first one
model_id = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "What is Comind?"}]
)

(it will not know what Comind is for sure)

I’m also working on a simple client script that you’ll be able to use to test the inference server:

python modal_client.py --workspace YOUR_WORKSPACE --prompt "Tell me about Comind"

The initial version will only support Phi-4; PRs to add more models will be welcome!

Keep in mind that the server will have a warmup time, so it may take a while for it to boot up.
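
If you’d rather not guess when it’s ready, a small poll loop works; this sketch assumes the openai package and the same endpoint and key as above:

import time

import openai
from openai import OpenAI

client = OpenAI(
    api_key="comind-api-key",
    base_url="https://YOUR_WORKSPACE--comind-vllm-inference-serve-phi4.modal.run/v1",
)

# Poll until the vLLM server responds; cold starts can take several minutes
for attempt in range(60):
    try:
        models = client.models.list()
        print("Server is up, serving:", models.data[0].id)
        break
    except (openai.APIConnectionError, openai.APIStatusError):
        print(f"Still warming up (attempt {attempt + 1}), retrying in 10s...")
        time.sleep(10)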

Addressing Cold Start Problems

One common issue with Modal and similar serverless platforms is cold start time - the delay when a new container needs to be initialized. The planned solutions include:

  1. Keeping containers warm by adding these parameters to Modal functions:
@app.function(
    image=vllm_image,
    gpu="A10G",
    volumes={...},
    min_containers=1,  # Keep at least one container warm at all times
    buffer_containers=1,  # Provision one extra container when active
    scaledown_window=10 * MINUTES,  # Wait longer before scaling down
)
  2. Warming containers immediately after deployment:
import sys

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "deploy":
        app.deploy()
        # Keep at least one container warm right after deploying
        serve_phi4.keep_warm(1)
  3. Scheduling warm container adjustments based on time of day:
@app.function(schedule=modal.Cron("0 * * * *"))
def adjust_warm_containers():
    """Adjust warm containers based on time of day."""
    from datetime import datetime, timezone

    hour = datetime.now(timezone.utc).hour
    if 13 <= hour < 23:  # rough peak window (UTC); tune to your own usage
        # During peak hours, keep more warm
        serve_phi4.keep_warm(2)
    else:
        # During off-peak, keep at least one
        serve_phi4.keep_warm(1)

Preventing Authorization Errors

To prevent authorization errors, you’ll need to make sure of the following (a quick sanity check follows the list):

  1. The API key in your client matches the server:

    # In modal_inference.py
    API_KEY = "comind-api-key"  
    
    # In your client
    client = OpenAI(
        api_key="comind-api-key",  # MUST MATCH
        base_url="https://..."
    )
  2. The endpoint URL is correct and includes /v1 at the end.

  3. No typos in either the API key or URL.
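
A quick way to check all three at once is to list models and look at which error (if any) comes back. A sketch, assuming the openai package:

import openai
from openai import OpenAI

client = OpenAI(
    api_key="comind-api-key",  # must match the server's API key
    base_url="https://YOUR_WORKSPACE--comind-vllm-inference-serve-phi4.modal.run/v1",
)

try:
    print("OK, serving:", client.models.list().data[0].id)
except openai.AuthenticationError:
    print("API key mismatch between client and server")
except openai.APIConnectionError:
    print("Could not reach the endpoint; check the URL (including the trailing /v1)")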

Easier Deployment with modal_deploy.py

I’m also creating a deployment helper script that will simplify the process and solve cold start issues automatically:

# Deploy and keep containers warm in one step
python modal_deploy.py deploy

# Just warm up existing containers anytime
python modal_deploy.py warm

# Check the status of your deployments
python modal_deploy.py status

This script will automatically keep containers warm after deployment and show you your endpoint URLs based on your Modal workspace name, providing a much better experience than the manual deployment process.
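
The endpoint URLs follow a predictable pattern, so deriving them from your workspace name is simple; a sketch (the exact suffixes depend on how the functions in modal_inference.py are named):

def endpoint_urls(workspace: str) -> dict[str, str]:
    """Build the Modal endpoint URLs for a workspace, using the pattern from this post."""
    base = f"https://{workspace}--comind-vllm-inference"
    return {
        "phi4": f"{base}-serve-phi4.modal.run/v1",
        "embeddings": f"{base}-embeddings.modal.run/v1",
    }

print(endpoint_urls("YOUR_WORKSPACE"))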

Securing API Keys

Instead of hardcoding API keys (which is never a good idea), the upcoming version will use Modal’s built-in secret management:

# Create a secure API key (run this once)
modal secret create comind-api-key api_key="your-secure-key-here"

The deployment script will automatically check if this secret exists and create it with a default value if needed. This will provide three benefits:

  1. Your API key won’t be stored in source code
  2. You’ll be able to rotate keys without changing code
  3. The same key will be consistently used across all services

In your client code, you’ll use this same key:

client = OpenAI(
    api_key="your-secure-key-here",  # Same value from your Modal secret
    base_url="https://YOUR_WORKSPACE--comind-vllm-inference-serve-phi4.modal.run/v1"
)

This will eliminate the “unauthorized” errors that happen when keys don’t match between client and server.

Handling Modal Secrets Properly

A note on Modal secrets handling: if you’ve hit errors like AttributeError: 'Secret' object has no attribute 'get' or TypeError: _App.function() got an unexpected keyword argument 'env', the upcoming deployment code accounts for both.

Modal’s Secret API works differently than initially expected. Here’s how Modal secrets will be implemented:

  1. Creating a secret:

    # Create a secret from a dictionary
    api_key_secret = modal.Secret.from_dict({"api_key": "comind-api-key"})
    
    # Or reference an existing named secret
    api_key_secret = modal.Secret.from_name("comind-api-key")
  2. Passing the secret to functions:

    @app.function(
        image=vllm_image,
        secrets=[api_key_secret],  # Pass the secret as a list
        # other parameters...
    )
    def serve_phi4():
        # function code...
  3. Accessing the secret in functions:

    def get_api_key():
        """Get the API key from the environment."""
        import os
        # Modal automatically injects secret values as environment variables
        return os.environ.get("api_key", "comind-api-key")

The secret values will be injected as environment variables in your container, so you’ll access them with os.environ. This pattern will be implemented in all the Modal functions in our codebase.

Current Development Status

As we prepare to release this feature, we’re working on resolving several issues with the Modal interface:

  1. API Integration Issues: We’re developing solutions for consistent responses when connecting Comind instances to Modal-hosted inference servers, focusing on how API endpoints handle various request formats.

  2. Container Warmup Reliability: We’re fine-tuning container management logic to minimize cold start delays, ensuring a more responsive experience.

  3. Authentication Edge Cases: We’re implementing more robust error handling for authentication between clients and servers to make troubleshooting easier.

Once these issues are resolved, we’ll release the Modal cloud deployment feature. We look forward to your feedback and contributions when this functionality becomes available.

(yo this guy is very self-serious)

– Cameron