# Document Operations

The Health Universe A2A SDK exposes a document client through the `context.document_client` property, which lets your agents work with documents in the current thread. This guide covers the core operations: listing, reading, writing, updating, searching, and tracking processing status.

## Overview

The document client provides access to documents stored in the Health Universe platform, with support for:

* **Listing** documents in the current thread
* **Reading** document content (both raw and extracted text)
* **Writing** new documents
* **Updating** existing documents
* **Searching** across documents (text and semantic search)
* **Processing status** tracking for extraction pipelines

## Basic Document Operations

### Listing Documents

List the documents available in the current thread, optionally filtering by role or including hidden documents:

```python
from health_universe_a2a import Agent, AgentContext

class DocumentProcessor(Agent):
    def get_agent_name(self) -> str:
        return "Document Processor"
    
    def get_agent_description(self) -> str:
        return "Processes and analyzes documents"
    
    async def process_message(self, message: str, context: AgentContext) -> str:
        # List all documents
        docs = await context.document_client.list_documents()
        
        # List only source documents (user uploads)
        source_docs = await context.document_client.list_documents(role="source")
        
        # List only artifacts (agent outputs)
        artifacts = await context.document_client.list_documents(role="artifact")
        
        # Include hidden documents
        all_docs = await context.document_client.list_documents(include_hidden=True)
        
        return f"Found {len(docs)} documents ({len(source_docs)} sources, {len(artifacts)} artifacts)"
```

### Reading Document Content

Once you have documents, you can read their content in several ways:

```python
async def process_message(self, message: str, context: AgentContext) -> str:
    docs = await context.document_client.list_documents(role="source")
    
    if not docs:
        return "No source documents found"
    
    doc = docs[0]
    
    # Read raw document as bytes (for binary files like PDFs)
    raw_content = await context.document_client.download(doc.id)
    
    # Read text documents directly
    if doc.filename.endswith(('.txt', '.csv', '.json', '.md')):
        text_content = await context.document_client.download_text(doc.id)
        print(f"Text content: {text_content[:100]}...")
    
    # Read platform-extracted text (for PDFs, DOCX, images)
    try:
        extracted_text = await context.document_client.download_extracted(doc.id)
        print(f"Extracted markdown: {extracted_text[:200]}...")
    except ValueError as e:
        print(f"Extraction not available: {e}")
    
    return f"Processed document: {doc.name}"
```

### Filtering Documents

Use the `filter_by_name()` method for quick document filtering:

```python
async def process_message(self, message: str, context: AgentContext) -> str:
    # Find documents with "protocol" in the name
    protocols = await context.document_client.filter_by_name("protocol")
    
    # Find JSON files
    json_docs = await context.document_client.filter_by_name(".json")
    
    # Find reports
    reports = await context.document_client.filter_by_name("report")
    
    return f"Found {len(protocols)} protocols, {len(json_docs)} JSON files, {len(reports)} reports"
```
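
Because these calls return plain lists, results can be narrowed further in ordinary Python. The following is a minimal sketch that groups documents by file extension. `Doc` is a hypothetical stand-in for the SDK's document objects, limited to the `id`, `name`, and `filename` attributes used in this guide, and `group_by_extension` is an illustrative helper, not part of the SDK.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Doc:
    """Hypothetical stand-in for the SDK's document objects."""
    id: str
    name: str
    filename: str


def group_by_extension(docs: list[Doc]) -> dict[str, list[Doc]]:
    """Group documents by lowercase file extension ('' when there is none)."""
    groups: dict[str, list[Doc]] = {}
    for doc in docs:
        ext = PurePosixPath(doc.filename).suffix.lower()
        groups.setdefault(ext, []).append(doc)
    return groups
```

Normalizing the extension to lowercase means `report.PDF` and `report.pdf` land in the same group.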

## Writing and Updating Documents

### Creating New Documents

The `write()` method handles the complete upload process automatically:

```python
import json

async def process_message(self, message: str, context: AgentContext) -> str:
    # Analyze some data
    analysis_results = {
        "summary": "Analysis complete",
        "score": 0.95,
        "recommendations": ["Action 1", "Action 2"]
    }
    
    # Write JSON results
    await context.document_client.write(
        "Analysis Results",
        json.dumps(analysis_results, indent=2),
        filename="analysis.json"
    )
    
    # Write markdown report
    markdown_report = """# Analysis Report

## Summary
The analysis has been completed successfully.

## Key Findings
- Score: 0.95
- High confidence in results
- Two recommendations identified

## Recommendations
1. Action 1
2. Action 2
"""
    
    await context.document_client.write(
        "Clinical Analysis Report",
        markdown_report,
        filename="report.md"
    )
    
    # Write binary data (e.g., generated PDF)
    # pdf_bytes = generate_pdf_report(analysis_results)
    # await context.document_client.write(
    #     "PDF Report",
    #     pdf_bytes,
    #     filename="report.pdf"
    # )
    
    return "Analysis complete! Results saved as JSON and markdown."
```
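
When the JSON and markdown outputs describe the same results, it may be simpler to derive the report text from the results dictionary than to maintain a separate string literal. A minimal sketch, assuming the `summary`, `score`, and `recommendations` keys from the example above (`render_report` is a hypothetical helper, not an SDK function):

```python
def render_report(results: dict) -> str:
    """Build a markdown report from an analysis-results dictionary."""
    lines = [
        "# Analysis Report",
        "",
        "## Summary",
        str(results["summary"]),
        "",
        "## Key Findings",
        f"- Score: {results['score']}",
        "",
        "## Recommendations",
    ]
    # Number the recommendations starting from 1
    lines += [
        f"{i}. {rec}"
        for i, rec in enumerate(results["recommendations"], start=1)
    ]
    return "\n".join(lines) + "\n"
```

The returned string can be passed to `write()` in place of a hand-written report, which keeps the two outputs from drifting apart.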

### Updating Existing Documents

Update existing documents to create new versions:

```python
import json

async def process_message(self, message: str, context: AgentContext) -> str:
    # Find existing results document
    results_docs = await context.document_client.filter_by_name("results")
    
    if results_docs:
        doc = results_docs[0]
        
        # Update with new data
        updated_results = {
            "timestamp": "2024-01-15T10:30:00Z",
            "version": 2,
            "data": "Updated analysis results"
        }
        
        await context.document_client.update(
            doc.id,
            json.dumps(updated_results, indent=2),
            comment="Updated with latest analysis"
        )
        
        return f"Updated {doc.name} with a new version"
    
    return "No results document found to update"
```

## Document Search

### Text Search

Perform keyword-based searches across document content:

```python
async def process_message(self, message: str, context: AgentContext) -> str:
    # Search for specific terms
    results = await context.document_client.search("medication dosage", limit=5)
    
    findings = []
    for result in results:
        findings.append(f"Found in {result.document_name} (chunk {result.chunk_index}): {result.content[:100]}...")
    
    return f"Found {len(results)} matches:\n" + "\n".join(findings)
```

### Semantic Search

Use AI-powered semantic search to find conceptually related content:

```python
async def process_message(self, message: str, context: AgentContext) -> str:
    # Semantic search finds related concepts, not just exact keywords
    results = await context.document_client.semantic_search(
        "What are the contraindications for this treatment?",
        max_results=3,
        similarity_threshold=0.6
    )
    
    findings = []
    for result in results:
        findings.append(
            f"[{result.similarity:.2f}] {result.document_name}: {result.content[:150]}..."
        )
    
    return f"Found {len(results)} relevant sections:\n" + "\n".join(findings)
```
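
Semantic search can return several chunks from the same document. If only the strongest match per document is wanted, the results can be post-processed in plain Python. This is a sketch under assumptions: `Hit` is a hypothetical stand-in carrying only the `document_name`, `similarity`, and `content` fields used above, and `best_hit_per_document` is an illustrative helper rather than an SDK function.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for a semantic search result."""
    document_name: str
    similarity: float
    content: str


def best_hit_per_document(hits: list[Hit]) -> list[Hit]:
    """Keep the highest-similarity hit per document, best match first."""
    best: dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.document_name)
        if current is None or hit.similarity > current.similarity:
            best[hit.document_name] = hit
    return sorted(best.values(), key=lambda h: h.similarity, reverse=True)
```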

## Processing Status and Waiting

### Checking Processing Status

Monitor document extraction status:

```python
async def process_message(self, message: str, context: AgentContext) -> str:
    docs = await context.document_client.list_documents(role="source")
    
    status_summary = []
    for doc in docs:
        status = await context.document_client.get_processing_status(doc.id)
        status_summary.append(
            f"{doc.name}: {status.status} "
            f"({'ready' if status.is_ready else 'processing'})"
        )
    
    return "Document status:\n" + "\n".join(status_summary)
```

### Waiting for Processing

Wait for documents to be fully processed before proceeding:

```python
async def process_message(self, message: str, context: AgentContext) -> str:
    await context.update_progress("Waiting for documents to be processed...", 0.1)
    
    try:
        # Wait up to 5 minutes for all documents to be ready
        statuses = await context.document_client.wait_for_ready(timeout=300.0)
        
        ready_count = sum(1 for s in statuses if s.is_ready)
        total_count = len(statuses)
        
        await context.update_progress(f"Documents ready: {ready_count}/{total_count}", 0.5)
        
        # Process documents that are ready
        for status in statuses:
            if status.is_ready:
                extracted = await context.document_client.download_extracted(status.document_id)
                # Process the extracted text...
        
        return f"Successfully processed {ready_count} documents"
        
    except TimeoutError:
        return "Some documents are still processing. Please try again later."
```
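
For custom readiness conditions, a generic poll-with-timeout loop is one option. The sketch below illustrates the general pattern only; it is not a description of how `wait_for_ready()` is implemented, and `poll_until` with its parameters is hypothetical.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until(
    check: Callable[[], Awaitable[bool]],
    timeout: float,
    interval: float = 1.0,
) -> None:
    """Call `check` every `interval` seconds until it returns True.

    Raises TimeoutError if the next wait would pass the deadline.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while not await check():
        if loop.time() + interval > deadline:
            raise TimeoutError("condition not met before timeout")
        await asyncio.sleep(interval)
```

A `check` coroutine could, for example, call `get_processing_status(doc.id)` and return `status.is_ready` for a single document of interest.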

## Complete Example: Document Analysis Agent

Here's a comprehensive example that combines multiple document operations:

```python
from health_universe_a2a import Agent, AgentContext
import json

class DocumentAnalyzer(Agent):
    def get_agent_name(self) -> str:
        return "Document Analyzer"
    
    def get_agent_description(self) -> str:
        return "Analyzes uploaded documents and generates insights"
    
    async def process_message(self, message: str, context: AgentContext) -> str:
        await context.update_progress("Starting document analysis...", 0.1)
        
        # List source documents
        docs = await context.document_client.list_documents(role="source")
        if not docs:
            return "No documents found to analyze"
        
        await context.update_progress(f"Found {len(docs)} documents to analyze", 0.2)
        
        analysis_results = {
            "documents_analyzed": len(docs),
            "findings": [],
            "summary": ""
        }
        
        # Analyze each document
        for i, doc in enumerate(docs):
            progress = 0.2 + (0.6 * (i + 1) / len(docs))
            await context.update_progress(f"Analyzing {doc.name}...", progress)
            
            try:
                # Get extracted text
                content = await context.document_client.download_extracted(doc.id)
                
                # Perform analysis (simplified)
                word_count = len(content.split())
                char_count = len(content)
                
                analysis_results["findings"].append({
                    "document": doc.name,
                    "word_count": word_count,
                    "char_count": char_count,
                    "content_preview": (content[:200] + "...") if len(content) > 200 else content
                })
                
            except ValueError:
                # Document not ready or extraction failed
                analysis_results["findings"].append({
                    "document": doc.name,
                    "status": "extraction_not_available"
                })
        
        # Generate summary
        total_words = sum(f.get("word_count", 0) for f in analysis_results["findings"])
        analysis_results["summary"] = f"Analyzed {len(docs)} documents with {total_words} total words"
        
        await context.update_progress("Saving analysis results...", 0.9)
        
        # Save results as a new document
        await context.document_client.write(
            "Document Analysis Results",
            json.dumps(analysis_results, indent=2),
            filename="analysis_results.json"
        )
        
        await context.update_progress("Analysis complete!", 1.0)
        
        return f"Analysis complete! Processed {len(docs)} documents. Results saved to analysis_results.json."

if __name__ == "__main__":
    agent = DocumentAnalyzer()
    agent.serve()
```
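
The progress arithmetic in the loop above maps the per-document loop onto the 0.2 to 0.8 band of the overall progress bar. A small helper can make that mapping explicit. This is a sketch; `scaled_progress` and its defaults are illustrative and not part of the SDK.

```python
def scaled_progress(done: int, total: int, start: float = 0.2, end: float = 0.8) -> float:
    """Map `done` out of `total` items onto the progress band [start, end]."""
    if total <= 0:
        # Nothing to process: report the end of the band
        return end
    fraction = min(max(done / total, 0.0), 1.0)
    return start + (end - start) * fraction
```

Inside the loop, `scaled_progress(i + 1, len(docs))` gives the same value as `0.2 + (0.6 * (i + 1) / len(docs))`, up to floating-point rounding.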

## Best Practices

1. **Error Handling**: Always handle cases where documents might not be ready or extraction fails:

   ```python
   try:
       content = await context.document_client.download_extracted(doc.id)
   except ValueError as e:
       print(f"Could not extract text: {e}")
       # Fall back to raw content or skip
   ```
2. **Progress Updates**: Keep users informed during long document processing:

   ```python
   for i, doc in enumerate(docs):
       await context.update_progress(f"Processing {doc.name}...", i / len(docs))
   ```
3. **Resource Management**: The document client is automatically managed, but you can close it explicitly if needed:

   ```python
   # Usually not needed - handled automatically
   await context.document_client.close()
   ```
4. **Check Document Types**: Verify document types before processing:

   ```python
   if doc.filename.endswith('.pdf'):
       # Use extracted text for PDFs
       content = await context.document_client.download_extracted(doc.id)
   elif doc.filename.endswith('.csv'):
       # Read CSV directly
       content = await context.document_client.download_text(doc.id)
   ```
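
The type check in item 4 can be centralized so that every call site picks the same read method. A minimal sketch, assuming the text extensions used earlier in this guide; the image extensions are illustrative guesses, since the guide only says "images", and `choose_reader` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Extensions read directly as text (taken from the reading example)
TEXT_EXTENSIONS = {".txt", ".csv", ".json", ".md"}
# Extensions assumed to go through platform extraction (image types are a guess)
EXTRACTED_EXTENSIONS = {".pdf", ".docx", ".png", ".jpg", ".jpeg"}


def choose_reader(filename: str) -> str:
    """Return the name of the document-client method to try first."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext in TEXT_EXTENSIONS:
        return "download_text"
    if ext in EXTRACTED_EXTENSIONS:
        return "download_extracted"
    # Unknown types fall back to the raw bytes
    return "download"
```

The returned name can be resolved with `getattr(context.document_client, choose_reader(doc.filename))` and then awaited with `doc.id`.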

Together, these operations cover the document workflows an agent typically needs in Health Universe: listing and reading inputs, writing and versioning outputs, searching content, and waiting on extraction before processing.
