Hosted onifebitcoin.orgvia theHypermedia Protocol

The documentToText function converts hypermedia documents (with all their embeds) to plain text representation. It recursively resolves all inline and block embeds, replacing them with their actual content.

Overview

  • Location: @shm/shared/document-to-text

  • Purpose: Generate plain text version of documents with resolved embeds

  • Use Cases:

    • Text fragment rendering with inline embeds resolved

    • Document search indexing

    • Export to plain text

    • Content preview generation

API

Function Signature

async function documentToText({
  documentId,
  grpcClient,
  options = {},
}: {
  documentId: UnpackedHypermediaId
  grpcClient: GRPCClient
  options: DocumentToTextOptions
}): Promise<string>

Options

interface DocumentToTextOptions {
  maxDepth?: number              // Maximum embed depth (default: 10)
  resolveInlineEmbeds?: boolean  // Replace inline embeds with doc names (default: true)
  lineBreaks?: boolean           // Add line breaks between blocks (default: true)
}

Features

1. Hierarchical Block Processing

Processes document blocks depth-first:

  • Paragraphs: Extract text content

  • Headings: Extract heading text

  • Code blocks: Include code content

  • Buttons: Extract button labels from attributes.name

  • Images/Videos/Files: Include captions

  • Embeds: Recursively fetch and include content

2. Inline Embed Resolution

Replaces invisible character markers (U+FEFF) with document names:

  • Detects inline embed annotations

  • Fetches referenced document

  • Replaces marker with @DocumentName

Example:

"Check out this post!" → "Check out @Alice's Guide this post!"

3. Block Embed Resolution

Recursively fetches and includes embedded documents:

  • Full document embeds

  • Block-specific embeds (blockRef)

  • Block range embeds (blockRef with range)

4. Fragment Support

Handles blockRef and blockRange:

  • #blockId - Returns only that block's content

  • #blockId[start:end] - Returns only children within range

  • Respects parent-child relationships

5. Safety Features

  • Circular reference detection: Tracks visited documents

  • Depth limiting: Prevents infinite recursion

  • Error handling: Graceful fallbacks for missing content

Cross-Platform Integration

The function is available in both desktop and web apps through the document content context:

Desktop App

Direct access to grpcClient:

const {getDocumentText} = useDocContentContext()

const text = await getDocumentText(documentId, {
  lineBreaks: false,
  resolveInlineEmbeds: true,
})

Web App

API endpoint at /hm/api/document-text:

const {getDocumentText} = useDocContentContext()

// Same API, but fetches from server
const text = await getDocumentText(documentId, {
  maxDepth: 5,
  resolveInlineEmbeds: true,
})

Implementation Details

Architecture

Desktop:
  Component → useDocContentContext() → documentToText(grpcClient) → Text

Web:
  Component → useDocContentContext() → API /hm/api/document-text → documentToText(grpcClient) → Text

Key Files

  • frontend/packages/shared/src/document-to-text.ts - Core implementation

  • frontend/packages/shared/src/document-content-types.ts - Context interface

  • frontend/apps/desktop/src/pages/document-content-provider.tsx - Desktop provider

  • frontend/apps/web/app/doc-content-provider.tsx - Web provider

  • frontend/apps/web/app/routes/hm.api.document-text.tsx - Web API endpoint

Usage Examples

Basic Usage

import {documentToText, hmId} from '@shm/shared'
import {grpcClient} from './grpc-client'

const documentId = hmId('account123', {path: ['my-doc']})
const text = await documentToText({
  documentId,
  grpcClient,
  options: {},
})
console.log(text)

With Options

// Compact text without line breaks
const compactText = await documentToText({
  documentId,
  grpcClient,
  options: {
    lineBreaks: false,
    maxDepth: 5,
    resolveInlineEmbeds: true,
  },
})

// Without inline embed resolution (keep original text)
const rawText = await documentToText({
  documentId,
  grpcClient,
  options: {
    resolveInlineEmbeds: false,
  },
})

In React Components

function MyComponent({docId}: {docId: UnpackedHypermediaId}) {
  const {getDocumentText} = useDocContentContext()
  const [text, setText] = useState('')

  useEffect(() => {
    getDocumentText?.(docId, {lineBreaks: false})
      .then(setText)
      .catch(console.error)
  }, [docId, getDocumentText])

  return <pre>{text}</pre>
}

Testing

17 comprehensive tests covering:

  • Basic text extraction

  • Inline embed resolution

  • Block embed processing

  • Nested structures

  • Circular reference detection

  • Max depth handling

  • BlockRef/BlockRange fragments

  • Button and heading extraction

  • LineBreaks option

Run tests:

NODE_ENV=test yarn workspace @shm/shared test run document-to-text

Performance Considerations

  • Caching: Consider caching results for frequently accessed documents

  • Depth limiting: Use maxDepth option for large document trees

  • Inline embeds: Disabling resolveInlineEmbeds improves performance

  • Async: Function is async and may take time for deep embed trees

Related Documentation

  • Text Fragment Rendering

  • Document Blocks

  • Document Linking

Do you like what you are reading? Subscribe to receive updates.

Unsubscribe anytime