Peter.Morton bf6f2eebd6 Added new feature for document export, including API contracts, data model, implementation plan, and tests. Updated related configurations and instructions.

2026-03-10 16:25:09 -05:00

Technical Research: Document Export API Route

Feature: 002-document-export
Date: 2026-03-09
Purpose: Research technical patterns and best practices for implementing Google Drive document export functionality

Research Areas

1. Google Drive Files.get API - Metadata Retrieval

Decision: Use Google Drive API v3 files.get endpoint with specific field selection

Rationale:

Google Drive API v3 provides files.get endpoint: GET https://www.googleapis.com/drive/v3/files/{fileId}
Field selection via fields query parameter reduces response size and improves performance
Required fields: id,name,mimeType,exportLinks
exportLinks returns map of available export formats for Google Workspace documents
Native files (PDFs, images) don't have exportLinks field

Implementation Pattern:

// In proxy.js - Google Drive API call (runs inside an async route handler)
const axios = require('axios');

const metadataUrl = `https://www.googleapis.com/drive/v3/files/${documentId}`;
const params = {
  fields: 'id,name,mimeType,exportLinks',
  supportsAllDrives: true  // Support shared drives
};
const response = await axios.get(metadataUrl, {
  params,
  headers: { Authorization: `Bearer ${accessToken}` }
});
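
The pattern above interpolates documentId straight into the URL path. A minimal sketch of one way to guard that step, assuming the ID arrives from an untrusted route parameter; buildMetadataRequest and DRIVE_FILES_URL are hypothetical names, not part of the existing proxy:

```javascript
// Hypothetical helper: assembles the files.get metadata request for one document.
const DRIVE_FILES_URL = 'https://www.googleapis.com/drive/v3/files';

function buildMetadataRequest(documentId, accessToken) {
  if (typeof documentId !== 'string' || documentId.length === 0) {
    throw new TypeError('documentId must be a non-empty string');
  }
  return {
    // Percent-encode the ID so characters such as "/" or "?" cannot alter the path
    url: `${DRIVE_FILES_URL}/${encodeURIComponent(documentId)}`,
    params: { fields: 'id,name,mimeType,exportLinks', supportsAllDrives: true },
    headers: { Authorization: `Bearer ${accessToken}` }
  };
}
```

The returned object can be split into the url and options arguments of axios.get, for example axios.get(req.url, { params: req.params, headers: req.headers }).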

Alternatives Considered:

files.export endpoint directly → Rejected: Requires knowing export format upfront, can't query available formats
files.list with query → Rejected: Less efficient, requires additional parsing

References:

Google Drive API v3 Files.get: https://developers.google.com/drive/api/reference/rest/v3/files/get
Field selection: https://developers.google.com/drive/api/guides/fields-parameter

2. Export Format Selection Strategy

Decision: Priority-based format selection (Markdown > HTML > PDF) with fallback to native file streaming

Rationale:

Google Workspace documents (Docs, Sheets, Slides) provide exportLinks map: {"text/plain": "url", "text/html": "url", ...}
Markdown (text/x-markdown) is most portable for downstream content processing
HTML fallback provides rich formatting when Markdown unavailable
PDF fallback covers documents that offer neither Markdown nor HTML; Docs, Sheets, and Slides all export to PDF
Native PDFs streamed directly using files.get with alt=media parameter

Implementation Pattern:

// Format priority order
const EXPORT_FORMATS = [
  { mimeType: 'text/x-markdown', extension: 'md' },
  { mimeType: 'text/html', extension: 'html' },
  { mimeType: 'application/pdf', extension: 'pdf' }
];

// Selection logic
function selectExportFormat(exportLinks) {
  for (const format of EXPORT_FORMATS) {
    if (exportLinks && exportLinks[format.mimeType]) {
      return {
        url: exportLinks[format.mimeType],
        contentType: format.mimeType,
        extension: format.extension
      };
    }
  }
  return null;  // No export links available
}
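
The selection logic returns null when no export link matches, which is where the fallback to native streaming begins. The sketch below shows one way the whole decision could be expressed as a single function. planExport and the kind values are hypothetical names, and the format list is repeated only so the example is self-contained:

```javascript
// Same priority order as the document: Markdown, then HTML, then PDF.
const EXPORT_FORMATS = [
  { mimeType: 'text/x-markdown', extension: 'md' },
  { mimeType: 'text/html', extension: 'html' },
  { mimeType: 'application/pdf', extension: 'pdf' }
];

// Hypothetical helper: decides how a file should be served from its metadata.
function planExport(metadata) {
  const links = metadata.exportLinks || {};
  for (const format of EXPORT_FORMATS) {
    if (links[format.mimeType]) {
      return {
        kind: 'export',
        url: links[format.mimeType],
        contentType: format.mimeType,
        extension: format.extension
      };
    }
  }
  // No usable export link: only native PDFs are streamed as-is
  if (metadata.mimeType === 'application/pdf') {
    return { kind: 'native', contentType: 'application/pdf', extension: 'pdf' };
  }
  return { kind: 'unsupported' };
}
```

A route handler could then switch on kind: fetch url for 'export', call files.get with alt=media for 'native', and return the unsupported-mimetype error otherwise.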

Alternatives Considered:

User-specified format via query parameter → Rejected: Out of scope per spec, adds complexity
Always export as PDF → Rejected: Markdown preferred for content processing
Try all formats in parallel → Rejected: Unnecessary, increases API calls

3. Native PDF File Streaming

Decision: Use Google Drive API files.get with alt=media parameter for direct file content download

Rationale:

Native PDF files (mimeType: application/pdf) don't have exportLinks
files.get with alt=media returns raw file bytes as response body
Response is streamed directly to client (no buffering in proxy)
Efficient for large files up to 10MB limit

Implementation Pattern:

// For native PDFs (no exportLinks)
if (metadata.mimeType === 'application/pdf' && !metadata.exportLinks) {
  const fileUrl = `https://www.googleapis.com/drive/v3/files/${documentId}`;
  const response = await axios.get(fileUrl, {
    params: { alt: 'media', supportsAllDrives: true },  // Match the metadata call so shared-drive files resolve
    headers: { Authorization: `Bearer ${accessToken}` },
    responseType: 'stream'  // Stream response
  });

  // Sanitize as in section 4 so quotes, control characters, or non-ASCII text cannot break the header
  const safeName = metadata.name.replace(/[^a-zA-Z0-9_. -]/g, '_');

  // Pipe stream to client
  res.setHeader('Content-Type', 'application/pdf');
  res.setHeader('Content-Disposition', `inline; filename="${safeName}.pdf"`);
  response.data.pipe(res);
}
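
Native PDFs in Drive are often already named with a .pdf suffix, so appending the extension unconditionally can produce names like report.pdf.pdf. A small sketch of one way to avoid that; withExtension is a hypothetical helper, not something the proxy currently defines:

```javascript
// Hypothetical helper: appends the extension only when the name lacks it.
// The comparison is case-insensitive, so "Report.PDF" is left unchanged.
function withExtension(name, extension) {
  const suffix = `.${extension}`;
  return name.toLowerCase().endsWith(suffix.toLowerCase())
    ? name
    : `${name}${suffix}`;
}
```

The result would replace the `${...}.pdf` concatenation when building the Content-Disposition filename.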

Alternatives Considered:

Buffer entire file in memory → Rejected: Inefficient for large files, increases memory usage
Download and re-upload → Rejected: Unnecessary overhead, adds latency

References:

Google Drive API files.get with alt=media: https://developers.google.com/drive/api/guides/manage-downloads

4. Content-Disposition Header Format

Decision: Use inline; filename="[name].[ext]" format for Content-Disposition header

Rationale:

inline disposition allows browser to display content (PDFs, HTML) in-browser
Filename parameter provides sensible default if user saves file
RFC 6266 compliant format
Filename includes extension matching export format (.md, .html, .pdf)

Implementation Pattern:

// Generate Content-Disposition header
function generateContentDisposition(filename, extension) {
  // Sanitize filename: remove special characters, limit length
  const sanitized = filename
    .replace(/[^a-zA-Z0-9_. -]/g, '_')  // Replace special chars (hyphen last so it is read literally)
    .substring(0, 255);  // Limit length
  
  return `inline; filename="${sanitized}.${extension}"`;
}

// Usage
res.setHeader('Content-Disposition', generateContentDisposition(metadata.name, 'md'));

Alternatives Considered:

attachment disposition → Rejected: Forces download, prevents in-browser viewing
No filename parameter → Rejected: Browser uses document ID as filename (poor UX)
RFC 8187 filename* encoding for Unicode (the HTTP profile of RFC 2231, referenced by RFC 6266) → Deferred: Simple ASCII sanitization sufficient for MVP
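
For reference, the deferred Unicode option could look roughly like the sketch below. It is not part of the MVP decision. It keeps the ASCII-sanitized filename as a fallback and adds a percent-encoded filename* parameter, which RFC 6266 allows alongside filename. Both function names are hypothetical:

```javascript
// Percent-encode a value for use in an extended header parameter (filename*).
// encodeURIComponent leaves ' ( ) * unescaped, so those are encoded by hand.
function encodeExtendedValue(value) {
  return encodeURIComponent(value).replace(
    /['()*]/g,
    (c) => '%' + c.charCodeAt(0).toString(16).toUpperCase()
  );
}

// Hypothetical variant of generateContentDisposition that preserves Unicode names.
function contentDispositionWithUnicode(name, extension) {
  const full = `${name}.${extension}`;
  // ASCII fallback for clients that ignore filename*
  const asciiFallback = full.replace(/[^a-zA-Z0-9_. -]/g, '_').substring(0, 255);
  return `inline; filename="${asciiFallback}"; filename*=UTF-8''${encodeExtendedValue(full)}`;
}
```

Clients that understand filename* use the Unicode name; others fall back to the sanitized ASCII one.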

References:

RFC 6266 Content-Disposition: https://datatracker.ietf.org/doc/html/rfc6266

5. Error Handling & HTTP Status Codes

Decision: Map Google Drive API errors to appropriate HTTP status codes with descriptive messages

Rationale:

Google Drive API returns structured error responses with reason codes
Map to standard HTTP status codes for consistent client experience
Plain text error messages for simplicity (no JSON wrapper needed)

Implementation Pattern:

// Error mapping
const ERROR_MAP = {
  'notFound': { status: 404, message: 'Document not found' },
  'authError': { status: 401, message: 'Unauthorized' },
  'forbidden': { status: 401, message: 'Unauthorized' },
  'insufficientPermissions': { status: 401, message: 'Unauthorized' },
  'rateLimitExceeded': { status: 502, message: 'Bad Gateway - Google Drive API unavailable' },
  'backendError': { status: 502, message: 'Bad Gateway - Google Drive API unavailable' }
};

// Error handler
function handleDriveError(error) {
  const reason = error.response?.data?.error?.errors?.[0]?.reason;
  const mapped = ERROR_MAP[reason] || { status: 500, message: 'Export failed - unable to retrieve document content' };
  
  return {
    status: mapped.status,
    message: mapped.message
  };
}

Additional Error Scenarios:

Document > 10MB: Check Content-Length header, return HTTP 413
Timeout > 30s: Use axios timeout option, return HTTP 504
Unsupported mimetype: Check mimeType, return HTTP 403
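
The timeout scenario never reaches the reason-code lookup in handleDriveError, because a timed-out request has no response body. A minimal sketch of a pre-check that could run first, assuming axios defaults, under which timeouts carry the code ECONNABORTED (or ETIMEDOUT when transitional.clarifyTimeoutError is enabled). classifyTransportError, the 504 message text, and the 502 mapping for other no-response failures are assumptions, not taken from the spec:

```javascript
// Hypothetical pre-check: handles errors that carry no Drive API response.
// Returns null when a response exists, so the reason-code mapping can take over.
function classifyTransportError(error) {
  // axios timeout: ECONNABORTED by default, ETIMEDOUT with clarifyTimeoutError
  if (error.code === 'ECONNABORTED' || error.code === 'ETIMEDOUT') {
    return { status: 504, message: 'Gateway Timeout' };
  }
  // Assumption: any other failure without a response is treated as an upstream outage
  if (!error.response) {
    return { status: 502, message: 'Bad Gateway - Google Drive API unavailable' };
  }
  return null;
}
```

In a catch block this might be used as: const mapped = classifyTransportError(err) || handleDriveError(err).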

Alternatives Considered:

JSON error responses → Rejected: Plain text simpler per spec assumptions
Retry logic → Rejected: Out of scope per spec
Detailed error messages → Rejected: Security concern, could leak internal details

6. Request Timeout & Size Limits

Decision: Implement 30-second timeout with axios and 10MB size validation via Content-Length header

Rationale:

axios supports timeout option for all requests
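Content-Length header, when Google Drive sends it, is available before the body streams; it can be absent (for example with chunked transfer encoding), in which case this check cannot run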
Early validation prevents downloading oversized files
Timeout prevents hanging requests from blocking proxy

Implementation Pattern:

// Timeout configuration
const TIMEOUT_MS = 30000;  // 30 seconds
const MAX_SIZE_BYTES = 10 * 1024 * 1024;  // 10 MB

// Request with timeout; stream the body so it is not downloaded before the size check
const response = await axios.get(url, {
  timeout: TIMEOUT_MS,
  responseType: 'stream',
  headers: { Authorization: `Bearer ${accessToken}` }
});

// Size validation (a missing Content-Length parses as 0 and passes this check)
const contentLength = parseInt(response.headers['content-length'] || '0', 10);
if (contentLength > MAX_SIZE_BYTES) {
  response.data.destroy();  // Abort the upstream download
  return res.status(413).send('Payload Too Large');
}

Alternatives Considered:

Progressive timeout (short for metadata, long for content) → Rejected: Adds complexity, 30s sufficient
No size validation → Rejected: Could stream partial files, poor UX
Configurable limits → Rejected: Hard-coded per spec, no need for configuration
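
The Content-Length check only works when the header is present. One possible complement, sketched here as an assumption rather than a spec requirement, is to count bytes as they pass through and abort once the limit is exceeded. createByteBudget and createSizeLimiter are hypothetical names:

```javascript
const { Transform } = require('node:stream');

const MAX_SIZE_BYTES = 10 * 1024 * 1024; // 10 MB, the limit used in this section

// Returns a function that adds each chunk's size to a running total and
// reports whether the total is still within the budget.
function createByteBudget(maxBytes = MAX_SIZE_BYTES) {
  let seen = 0;
  return (chunk) => {
    seen += chunk.length;
    return seen <= maxBytes;
  };
}

// Transform stream that passes data through unchanged and fails
// with an error once the byte budget is exceeded.
function createSizeLimiter(maxBytes = MAX_SIZE_BYTES) {
  const withinBudget = createByteBudget(maxBytes);
  return new Transform({
    transform(chunk, _encoding, callback) {
      if (withinBudget(chunk)) {
        callback(null, chunk);
      } else {
        callback(Object.assign(new Error('Payload Too Large'), { status: 413 }));
      }
    }
  });
}
```

The limiter would sit between the Drive response stream and res. By the time it trips, headers have usually been sent, so the practical effect is an aborted response rather than a clean 413.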

7. Streaming vs Buffering

Decision: Stream export content directly from Google Drive to client without buffering

Rationale:

axios supports streaming via responseType: 'stream'
Node.js streams allow piping directly to HTTP response
No memory overhead for file contents (only metadata buffered)
Efficient for documents approaching 10MB limit

Implementation Pattern:

// Stream response
const exportResponse = await axios.get(exportUrl, {
  headers: { Authorization: `Bearer ${accessToken}` },
  responseType: 'stream',
  timeout: TIMEOUT_MS
});

// Set headers
res.setHeader('Content-Type', contentType);
res.setHeader('Content-Disposition', contentDisposition);

// Pipe stream
exportResponse.data.pipe(res);

// Handle stream errors
exportResponse.data.on('error', (err) => {
  if (!res.headersSent) {
    res.status(500).send('Export failed - unable to retrieve document content');
  } else {
    res.destroy(err);  // Headers already sent: abort so the client does not hang
  }
});

Alternatives Considered:

Buffer entire response → Rejected: Increases memory usage, adds latency
Chunked encoding → Not needed: Google Drive provides Content-Length
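
As a variation on the pipe-plus-error-handler pattern, Node's stream pipeline helper destroys both streams when either side fails, which also covers a client that disconnects mid-download. A minimal sketch follows. streamToClient is a hypothetical name, and res only needs setHeader plus the usual writable-stream interface:

```javascript
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: sets the response headers, then streams upstream into res.
// The returned promise rejects if either stream errors; pipeline destroys both
// streams in that case, so neither the upstream request nor the response is left open.
async function streamToClient(upstream, res, { contentType, contentDisposition }) {
  res.setHeader('Content-Type', contentType);
  res.setHeader('Content-Disposition', contentDisposition);
  await pipeline(upstream, res);
}
```

A caller would pass exportResponse.data as upstream and catch the rejection to log the failure. Once headers are sent, the status code can no longer be changed.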

Summary of Technical Decisions

Area	Decision	Rationale
Metadata API	files.get with field selection	Minimal response size, single API call
Format Selection	Priority order: Markdown > HTML > PDF	Most portable to least portable
Native PDFs	files.get with alt=media streaming	Efficient, no conversion needed
Headers	Content-Disposition: inline with filename	Browser rendering + save support
Error Mapping	Google Drive errors → HTTP status codes	Consistent client experience
Timeouts	30s axios timeout	Prevents hanging requests
Size Limits	10MB via Content-Length validation	Early rejection, no partial downloads
Streaming	Direct pipe from Google Drive to client	Memory efficient, low latency

All decisions align with constitution principles (monolithic architecture, simplicity, YAGNI) and specification requirements.
