Document to Text Function

24 October 2025, 16:12

The documentToText function converts a hypermedia document, including all of its embeds, to a plain-text representation. It resolves embeds recursively: block embeds are replaced with the content they reference, and inline embeds are replaced with the name of the referenced document.

Overview

Location: @shm/shared/document-to-text
Purpose: Generate a plain-text version of a document with its embeds resolved
Use Cases:
- Text fragment rendering with inline embeds resolved
- Document search indexing
- Export to plain text
- Content preview generation

API

Function Signature

async function documentToText({
  documentId,
  grpcClient,
  options = {},
}: {
  documentId: UnpackedHypermediaId
  grpcClient: GRPCClient
  options: DocumentToTextOptions
}): Promise<string>

Options

interface DocumentToTextOptions {
  maxDepth?: number              // Maximum embed depth (default: 10)
  resolveInlineEmbeds?: boolean  // Replace inline embeds with doc names (default: true)
  lineBreaks?: boolean           // Add line breaks between blocks (default: true)
}
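
One way the documented defaults could be applied is to fill in each missing field before processing starts. This is a minimal sketch, not the actual implementation: resolveOptions and DEFAULT_OPTIONS are hypothetical names, and the interface is repeated only to keep the block self-contained.

```typescript
interface DocumentToTextOptions {
  maxDepth?: number
  resolveInlineEmbeds?: boolean
  lineBreaks?: boolean
}

// Defaults copied from the option comments above.
const DEFAULT_OPTIONS: Required<DocumentToTextOptions> = {
  maxDepth: 10,
  resolveInlineEmbeds: true,
  lineBreaks: true,
}

// Hypothetical helper: returns a fully populated options object.
// Using ?? (rather than object spread) also covers fields that are
// explicitly set to undefined.
function resolveOptions(
  options: DocumentToTextOptions = {},
): Required<DocumentToTextOptions> {
  return {
    maxDepth: options.maxDepth ?? DEFAULT_OPTIONS.maxDepth,
    resolveInlineEmbeds:
      options.resolveInlineEmbeds ?? DEFAULT_OPTIONS.resolveInlineEmbeds,
    lineBreaks: options.lineBreaks ?? DEFAULT_OPTIONS.lineBreaks,
  }
}
```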

Features

1. Hierarchical Block Processing

Processes document blocks depth-first:

- Paragraphs: Extract text content
- Headings: Extract heading text
- Code blocks: Include code content
- Buttons: Extract button labels from attributes.name
- Images/Videos/Files: Include captions
- Embeds: Recursively fetch and include content
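
The depth-first traversal can be sketched as follows. SketchBlock is a simplified, hypothetical block shape (the real block types carry more fields), embeds are left out, and the separators ("\n" with lineBreaks, a single space without) are assumptions rather than documented behavior.

```typescript
// Simplified, hypothetical block shape used only for this sketch.
interface SketchBlock {
  type: 'Paragraph' | 'Heading' | 'Code' | 'Button' | 'Image' | 'Video' | 'File'
  text?: string // paragraph/heading text, code content, or media caption
  attributes?: {name?: string} // button label lives in attributes.name
  children?: SketchBlock[]
}

// Picks the text a single block contributes, by block type.
function blockText(block: SketchBlock): string {
  if (block.type === 'Button') return block.attributes?.name ?? ''
  return block.text ?? ''
}

// Depth-first, pre-order: a block's own text is emitted before its children.
function blocksToText(blocks: SketchBlock[], lineBreaks = true): string {
  const parts: string[] = []
  const visit = (block: SketchBlock): void => {
    const text = blockText(block)
    if (text) parts.push(text)
    for (const child of block.children ?? []) visit(child)
  }
  blocks.forEach(visit)
  // Assumed separators; the real function may join differently.
  return parts.join(lineBreaks ? '\n' : ' ')
}
```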

2. Inline Embed Resolution

Replaces invisible character markers (U+FEFF) with document names:

- Detects inline embed annotations
- Fetches referenced document
- Replaces marker with @DocumentName

Example (the invisible marker is written as \uFEFF so that it is visible here):

"Check out \uFEFF this post!" → "Check out @Alice's Guide this post!"

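The marker replacement can be sketched as below. Several things here are assumptions made for illustration: the InlineEmbed shape, the idea that annotations arrive in the same order as the markers, the made-up link value in the usage note, and the empty-string fallback for unresolved documents. Name lookup uses an in-memory Map so the sketch stays synchronous; the real function fetches the referenced document.

```typescript
// Hypothetical annotation shape: one entry per marker, in text order.
interface InlineEmbed {
  link: string // reference to the embedded document
}

// Replaces the n-th U+FEFF marker with "@" + the name of the n-th embed's
// document. `names` stands in for fetching the referenced document.
function resolveInlineEmbeds(
  text: string,
  embeds: InlineEmbed[],
  names: Map<string, string>,
): string {
  let index = 0
  return text.replace(/\uFEFF/g, () => {
    const embed = embeds[index++]
    const name = embed ? names.get(embed.link) : undefined
    // Assumed fallback: drop the marker when the document cannot be resolved.
    return name ? `@${name}` : ''
  })
}
```

With a names map containing 'hm://alice/guide' → "Alice's Guide", calling resolveInlineEmbeds('Check out \uFEFF this post!', [{link: 'hm://alice/guide'}], names) yields the string from the example above.
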
3. Block Embed Resolution

Recursively fetches and includes embedded documents:

- Full document embeds
- Block-specific embeds (blockRef)
- Block range embeds (blockRef with range)

4. Fragment Support

Handles blockRef and blockRange:

- #blockId: Returns only that block's content
- #blockId[start:end]: Returns only children within range
- Respects parent-child relationships
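
The two fragment forms shown above could be parsed along these lines. parseFragment and ParsedFragment are hypothetical names, and the grammar (non-negative integer bounds, no brackets inside the block id) is an assumption; the parser used by the real code may accept more forms.

```typescript
interface ParsedFragment {
  blockId: string
  range?: {start: number; end: number}
}

// Parses "#blockId" or "#blockId[start:end]"; returns null for anything else.
function parseFragment(fragment: string): ParsedFragment | null {
  const match = /^#([^\[\]#]+)(?:\[(\d+):(\d+)\])?$/.exec(fragment)
  if (!match) return null
  const [, blockId, start, end] = match
  if (!blockId) return null
  // No bracket suffix: the fragment refers to the whole block.
  if (start === undefined || end === undefined) return {blockId}
  return {blockId, range: {start: Number(start), end: Number(end)}}
}
```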

5. Safety Features

- Circular reference detection: Tracks visited documents
- Depth limiting: Prevents infinite recursion
- Error handling: Graceful fallbacks for missing content
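
The three safeguards can be combined in one recursive walk, sketched below. SketchDoc, FetchDoc, and expand are hypothetical names, and the sketch makes its own choices: only the current embed path is tracked (so a document embedded in two separate branches still appears twice), the root counts as depth 0, and missing or failing documents contribute an empty string.

```typescript
// Hypothetical document shape: own text plus ids of embedded documents.
interface SketchDoc {
  text: string
  embeds: string[]
}

type FetchDoc = (id: string) => Promise<SketchDoc | undefined>

async function expand(
  id: string,
  fetchDoc: FetchDoc,
  maxDepth = 10,
  depth = 0,
  visited: ReadonlySet<string> = new Set<string>(),
): Promise<string> {
  // Depth limiting: stop descending once the limit is exceeded.
  if (depth > maxDepth) return ''
  // Circular reference detection: skip documents already on the current path.
  if (visited.has(id)) return ''
  let doc: SketchDoc | undefined
  try {
    doc = await fetchDoc(id)
  } catch {
    doc = undefined
  }
  // Error handling: missing content contributes nothing instead of throwing.
  if (!doc) return ''
  const path = new Set(visited).add(id)
  const parts = [doc.text]
  for (const embedId of doc.embeds) {
    const child = await expand(embedId, fetchDoc, maxDepth, depth + 1, path)
    if (child) parts.push(child)
  }
  return parts.join('\n')
}
```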

Cross-Platform Integration

The function is available in both desktop and web apps through the document content context:

Desktop App

The desktop provider has direct access to grpcClient and calls documentToText itself:

const {getDocumentText} = useDocContentContext()

const text = await getDocumentText(documentId, {
  lineBreaks: false,
  resolveInlineEmbeds: true,
})

Web App

The web provider calls the API endpoint at /hm/api/document-text, which runs documentToText on the server:

const {getDocumentText} = useDocContentContext()

// Same API, but fetches from server
const text = await getDocumentText(documentId, {
  maxDepth: 5,
  resolveInlineEmbeds: true,
})

Implementation Details

Architecture

Desktop:
  Component → useDocContentContext() → documentToText(grpcClient) → Text

Web:
  Component → useDocContentContext() → API /hm/api/document-text → documentToText(grpcClient) → Text

Key Files

- frontend/packages/shared/src/document-to-text.ts: Core implementation
- frontend/packages/shared/src/document-content-types.ts: Context interface
- frontend/apps/desktop/src/pages/document-content-provider.tsx: Desktop provider
- frontend/apps/web/app/doc-content-provider.tsx: Web provider
- frontend/apps/web/app/routes/hm.api.document-text.tsx: Web API endpoint

Usage Examples

Basic Usage

import {documentToText, hmId} from '@shm/shared'
import {grpcClient} from './grpc-client'

const documentId = hmId('account123', {path: ['my-doc']})
const text = await documentToText({
  documentId,
  grpcClient,
  options: {},
})
console.log(text)

With Options

// Compact text without line breaks
const compactText = await documentToText({
  documentId,
  grpcClient,
  options: {
    lineBreaks: false,
    maxDepth: 5,
    resolveInlineEmbeds: true,
  },
})

// Without inline embed resolution (keep original text)
const rawText = await documentToText({
  documentId,
  grpcClient,
  options: {
    resolveInlineEmbeds: false,
  },
})

In React Components

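import {useEffect, useState} from 'react'

function MyComponent({docId}: {docId: UnpackedHypermediaId}) {
  const {getDocumentText} = useDocContentContext()
  const [text, setText] = useState('')

  useEffect(() => {
    // Ignore a result that arrives after docId changed or the component unmounted
    let cancelled = false
    // Optional call: does nothing if the context does not provide getDocumentText
    getDocumentText?.(docId, {lineBreaks: false})
      .then((result) => {
        if (!cancelled) setText(result)
      })
      .catch(console.error)
    return () => {
      cancelled = true
    }
  }, [docId, getDocumentText])

  return <pre>{text}</pre>
}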

Testing

17 comprehensive tests covering:

- Basic text extraction
- Inline embed resolution
- Block embed processing
- Nested structures
- Circular reference detection
- Max depth handling
- BlockRef/BlockRange fragments
- Button and heading extraction
- LineBreaks option

Run tests:

NODE_ENV=test yarn workspace @shm/shared test run document-to-text

Performance Considerations

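- Caching: Consider caching results for frequently accessed documents
- Depth limiting: Lower maxDepth for large document trees to bound how many nested embeds are fetched
- Inline embeds: Setting resolveInlineEmbeds to false skips fetching the referenced documents, which reduces work
- Async: The function is async, and deep embed trees require more fetches, so callers should expect it to take longer on them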
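
The caching suggestion could look like the wrapper below. This is a sketch under stated assumptions: withCache and TextLoader are hypothetical names, the caller is expected to build a string key from the document id and options, and there is no invalidation, so cached text goes stale when a document or any of its embeds changes.

```typescript
type TextLoader = (key: string) => Promise<string>

// Wraps a loader so that repeated calls with the same key share one promise.
// Concurrent calls for the same key also reuse the in-flight request.
function withCache(load: TextLoader): TextLoader {
  const cache = new Map<string, Promise<string>>()
  return (key) => {
    const hit = cache.get(key)
    if (hit) return hit
    const pending = load(key).catch((error: unknown) => {
      // Do not keep failures cached, so a later call can retry.
      cache.delete(key)
      throw error
    })
    cache.set(key, pending)
    return pending
  }
}
```

A loader passed to withCache would typically call documentToText (or getDocumentText) for the document identified by the key.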