Quickstart Guide: Google Drive HTTP Proxy Adapter

Feature: 001-drive-proxy-adapter
Date: 2026-03-07
Version: 1.0.0


Overview

The Google Drive HTTP Proxy Adapter is a Node.js application that generates XML sitemaps of Google Drive documents. It exposes a single HTTP endpoint (/sitemap.xml) that queries the Google Drive API and returns a sitemap listing every accessible document, each linked as {BASE_URL}/documents/{fileId}.

Key Features:

  • Service Account authentication (JWT-based, no user interaction)
  • Sitemap protocol compliant (50,000 URL limit enforced)
  • FIFO request queuing (sequential processing)
  • Configurable Drive API filters
  • Plain text logging to stdout/stderr

Prerequisites

  1. Node.js: v18.0.0 or later (LTS version recommended)
  2. Google Cloud Project: With Drive API enabled
  3. Service Account: JSON key file with Drive API access
  4. Network Access: Connectivity to googleapis.com

Installation

1. Clone Repository

git clone <repository-url>
cd google-drive-content-adapter

2. Install Dependencies

npm install

Dependencies:

  • googleapis@^140.0.0 - Official Google API client for Node.js

Configuration

1. Service Account Setup

Create Service Account (Google Cloud Console):

  1. Navigate to IAM & Admin > Service Accounts
  2. Click "Create Service Account"
  3. Name: drive-sitemap-adapter (or your choice)
  4. Grant role: none required if the adapter only accesses the Service Account's own Drive
  5. Click "Create Key" → Choose JSON format → Download key file

Enable Drive API:

  1. Navigate to APIs & Services > Library
  2. Search for "Google Drive API"
  3. Click "Enable"

Grant Access (if accessing user drives):

  • Share Drive folders/files with Service Account email (xxx@project.iam.gserviceaccount.com)
  • OR configure domain-wide delegation (for Google Workspace organizations, formerly G Suite)

2. Environment Variables

Create a .env file in the project root (or set the environment variables directly):

# REQUIRED: Service Account credentials (inline JSON)
GOOGLE_SERVICE_ACCOUNT_KEY='{"type":"service_account","project_id":"your-project","private_key_id":"...","private_key":"-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n","client_email":"xxx@project.iam.gserviceaccount.com","client_id":"...","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_x509_cert_url":"..."}'

# OPTIONAL: Server configuration
PORT=3000                                    # Default: 3000
BASE_URL=http://localhost:3000               # Default: http://localhost:3000

# OPTIONAL: Drive API query filter
DRIVE_QUERY="trashed = false"                # Default: "trashed = false"

Important Notes:

  • GOOGLE_SERVICE_ACCOUNT_KEY must be a single-line JSON string (escape newlines in private_key)
  • BASE_URL should match your production domain for sitemap URLs
  • DRIVE_QUERY uses the Drive API search query syntax (the q parameter of files.list)
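
One way to produce the single-line value is to round-trip the downloaded key file through JSON.parse and JSON.stringify, which collapses the whitespace and writes the newlines inside private_key as escaped \n sequences. This is a minimal sketch, not part of the adapter: the function name toSingleLineKey and the shortened sample key are hypothetical.

```javascript
// Hypothetical helper (not part of the adapter): collapse a pretty-printed
// Service Account key file into the single-line JSON expected by
// GOOGLE_SERVICE_ACCOUNT_KEY.
function toSingleLineKey(keyFileText) {
  // JSON.parse validates the input; JSON.stringify emits one line and
  // writes real newlines inside private_key as the two characters "\n".
  return JSON.stringify(JSON.parse(keyFileText));
}

// Example with a shortened, fake key file (not a usable credential).
const sampleKeyFile = `{
  "type": "service_account",
  "private_key": "-----BEGIN PRIVATE KEY-----\\nABC\\n-----END PRIVATE KEY-----\\n",
  "client_email": "xxx@project.iam.gserviceaccount.com"
}`;

console.log(toSingleLineKey(sampleKeyFile));
```

In practice the input would be the contents of the downloaded key file, and the output would be wrapped in single quotes when placed in .env.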

3. Configuration Files

config/config.js: Server settings (read from env vars, with defaults)

export default {
  server: {
    // Environment variables are strings; convert PORT to a number.
    port: Number(process.env.PORT) || 3000,
    baseUrl: process.env.BASE_URL || 'http://localhost:3000'
  }
};

config/settings.js: Drive API configuration

export default {
  drive: {
    query: process.env.DRIVE_QUERY || "trashed = false",
    // nextPageToken must be requested explicitly, or pagination cannot continue.
    fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    pageSize: 1000,
    scope: 'https://www.googleapis.com/auth/drive.readonly'
  }
};

To customize the Drive API filter, set the DRIVE_QUERY env var or change the default in config/settings.js.
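
The pageSize and fields settings imply a paginated files.list loop. The sketch below shows one way such a loop could look; it is not the adapter's actual code. The names collectFiles, listPage, and maxFiles are hypothetical, and listPage is injected so the loop can be exercised without network access.

```javascript
// Sketch only: collect every page of results, failing once the count
// exceeds maxFiles. `listPage` is an assumed async function that takes
// { q, fields, pageSize, pageToken } and resolves to
// { files: [...], nextPageToken?: string }.
async function collectFiles(listPage, settings, maxFiles = 50000) {
  const files = [];
  let pageToken;
  do {
    const page = await listPage({
      q: settings.query,
      // `fields` must include nextPageToken, or the API omits it.
      fields: settings.fields,
      pageSize: settings.pageSize,
      pageToken,
    });
    files.push(...(page.files || []));
    if (files.length > maxFiles) {
      throw new RangeError(`more than ${maxFiles} documents matched`);
    }
    pageToken = page.nextPageToken;
  } while (pageToken);
  return files;
}
```

With the googleapis client, listPage could be a thin wrapper such as `(params) => drive.files.list(params).then((res) => res.data)`, where drive is an authenticated Drive v3 client.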


Usage

Start Server (Development)

npm run dev

Output:

[2026-03-07T10:00:00.000Z] [INFO] Server configuration loaded: port=3000, baseUrl=http://localhost:3000
[2026-03-07T10:00:00.100Z] [INFO] Service Account authenticated: xxx***@project.iam.gserviceaccount.com
[2026-03-07T10:00:00.200Z] [INFO] HTTP server listening on port 3000
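
The log lines above follow a "[timestamp] [LEVEL] message" layout with a partially masked Service Account email. A formatter along these lines would produce that shape; this is an illustrative sketch, and formatLogLine, maskEmail, and the three-character masking rule are assumptions rather than the contents of src/logger.js.

```javascript
// Format one plain-text log line: "[ISO timestamp] [LEVEL] message".
function formatLogLine(level, message, now = new Date()) {
  return `[${now.toISOString()}] [${level}] ${message}`;
}

// Assumed masking rule: keep at most the first three characters of the
// local part, then "***", then the domain.
function maskEmail(email) {
  const at = email.indexOf('@');
  if (at < 0) return '***';
  return `${email.slice(0, Math.min(3, at))}***${email.slice(at)}`;
}

const masked = maskEmail('xxx@project.iam.gserviceaccount.com');
console.log(formatLogLine('INFO', `Service Account authenticated: ${masked}`));
```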

Start Server (Production)

npm start

Request Sitemap

Using curl:

curl http://localhost:3000/sitemap.xml

Expected Response (200 OK):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost:3000/documents/1A2B3C4D5E6F7G8H</loc>
    <lastmod>2026-03-07</lastmod>
  </url>
  <url>
    <loc>http://localhost:3000/documents/9I0J1K2L3M4N5O6P</loc>
    <lastmod>2026-03-05</lastmod>
  </url>
</urlset>
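
A response like the one above can be assembled from Drive file records with a few lines of string handling. The sketch below is one possible approach under stated assumptions: escapeXml and buildSitemap are hypothetical names, the real src/xml-utils.js may differ, and lastmod is taken as the date part of the Drive modifiedTime value.

```javascript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build a sitemap document from Drive file records ({ id, modifiedTime }).
function buildSitemap(files, baseUrl) {
  const entries = files.map((file) => {
    const loc = `${baseUrl}/documents/${encodeURIComponent(file.id)}`;
    // modifiedTime is an RFC 3339 timestamp; keep only the YYYY-MM-DD part.
    const lastmod = file.modifiedTime.slice(0, 10);
    return [
      '  <url>',
      `    <loc>${escapeXml(loc)}</loc>`,
      `    <lastmod>${lastmod}</lastmod>`,
      '  </url>',
    ].join('\n');
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}

console.log(buildSitemap(
  [{ id: '1A2B3C4D5E6F7G8H', modifiedTime: '2026-03-07T09:30:00.000Z' }],
  'http://localhost:3000'
));
```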

Testing

Run All Tests

npm test

Test Suites:

  • tests/unit/ - Unit tests for Drive client, auth, sitemap generator, queue
  • tests/integration/ - End-to-end endpoint tests for /sitemap.xml
  • tests/contract/ - XML sitemap schema validation tests

Run Specific Test Suite

npm run test:unit          # Unit tests only
npm run test:integration   # Integration tests only
npm run test:contract      # Contract tests only

API Reference

Endpoint: GET /sitemap.xml

Description: Generate XML sitemap of all accessible Google Drive documents.

Request:

GET /sitemap.xml HTTP/1.1
Host: example.com

Success Response (200 OK):

HTTP/1.1 200 OK
Content-Type: application/xml; charset=utf-8
Content-Length: {size}

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- up to 50,000 URL entries -->
</urlset>

Error Responses:

  • 404 Not Found - Invalid endpoint (only /sitemap.xml supported)
  • 413 Payload Too Large - More than 50,000 documents in Drive
  • 429 Too Many Requests - Rate limit exceeded (includes Retry-After header)
  • 401 Unauthorized - Authentication failed
  • 503 Service Unavailable - Drive API unavailable
  • 500 Internal Server Error - Unexpected error

Note: All error responses have an empty body; the status code (and, for 429, the Retry-After header) carries the only information.
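
An empty-body error response can be written with the standard writeHead and end methods of Node's http.ServerResponse. The helper below is a hypothetical illustration of that contract, not the adapter's handler; sendEmptyError and retryAfterSeconds are assumed names.

```javascript
// Send a status-only error response. For 429, optionally attach a
// Retry-After header expressed in seconds.
function sendEmptyError(res, statusCode, retryAfterSeconds) {
  const headers = { 'Content-Length': '0' };
  if (statusCode === 429 && retryAfterSeconds !== undefined) {
    headers['Retry-After'] = String(retryAfterSeconds);
  }
  res.writeHead(statusCode, headers);
  res.end(); // no body
}
```

Inside an http.createServer callback this would be called as `sendEmptyError(res, 404)` for any path other than /sitemap.xml.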

See contracts/sitemap-xml-schema.md for full API contract.


Architecture

Project Structure

google-drive-content-adapter/
├── src/
│   ├── server.js        # HTTP server entry point
│   ├── proxy.js         # Monolithic route handler (sitemap logic)
│   ├── logger.js        # Logging module (console.js alias)
│   ├── auth.js          # Service Account JWT authentication
│   └── xml-utils.js     # XML generation utilities
├── config/
│   ├── config.js        # Server configuration (port, baseUrl)
│   └── settings.js      # Drive API filter configuration
├── tests/
│   ├── unit/            # Unit tests
│   ├── integration/     # Integration tests
│   └── contract/        # Contract tests
├── specs/               # Feature specifications and planning docs
│   └── 001-drive-proxy-adapter/
│       ├── spec.md
│       ├── plan.md
│       ├── research.md
│       ├── data-model.md
│       ├── quickstart.md (this file)
│       └── contracts/
│           └── sitemap-xml-schema.md
├── package.json
└── README.md

Request Flow

1. Client → GET /sitemap.xml
2. Server → Create RequestContext (ID, timestamp)
3. Server → Enqueue request (FIFO queue)
4. Queue → Process request (sequential, one at a time)
5. Proxy → Authenticate with Service Account JWT
6. Proxy → Query Drive API files.list() (paginate if >1000 docs)
7. Proxy → Check count ≤ 50,000
8. Proxy → Transform Documents to SitemapEntries
9. Proxy → Generate XML sitemap
10. Server → Return 200 + XML (or error status)
11. Queue → Process next request
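
Steps 3, 4, and 11 describe a FIFO queue that handles one request at a time. A promise chain is one compact way to get that behavior. The class below is a sketch under that assumption, and SequentialQueue is a hypothetical name; the adapter's own queue may be implemented differently.

```javascript
// Sketch of a FIFO queue that runs async tasks strictly one at a time.
class SequentialQueue {
  constructor() {
    // Tail of the promise chain; each new task is appended after it.
    this.tail = Promise.resolve();
  }

  // Schedule `task` (a function returning a promise) after all earlier
  // tasks. Returns a promise for this task's own result.
  enqueue(task) {
    const result = this.tail.then(() => task());
    // Swallow failures on the chain so one failed request does not
    // block the requests queued behind it.
    this.tail = result.catch(() => {});
    return result;
  }
}
```

Each incoming request would be wrapped in a task and passed to enqueue, so a slow sitemap generation delays later requests rather than running alongside them.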

Troubleshooting

1. Fatal Error: Invalid Service Account Credentials

Error:

[2026-03-07T10:00:00.000Z] [ERROR] FATAL: Invalid client_email in Service Account credentials

Solution:

  • Check GOOGLE_SERVICE_ACCOUNT_KEY env var is valid JSON
  • Ensure client_email field ends with .gserviceaccount.com
  • Ensure private_key field starts with -----BEGIN PRIVATE KEY-----
  • Verify no extra escaping/quotes in JSON string
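
The checks in this list can be run locally before starting the server. The function below is a rough sketch of such a pre-flight check; validateServiceAccountKey and its messages are hypothetical and are not the adapter's own validation code.

```javascript
// Return null when the key looks plausible, otherwise a short description
// of the first problem found. Only checks shape, not whether Google
// accepts the credentials.
function validateServiceAccountKey(raw) {
  let key;
  try {
    key = JSON.parse(raw);
  } catch {
    return 'value is not valid JSON';
  }
  if (key === null || typeof key !== 'object') {
    return 'value is not a JSON object';
  }
  if (typeof key.client_email !== 'string' ||
      !key.client_email.endsWith('.gserviceaccount.com')) {
    return 'client_email must end with .gserviceaccount.com';
  }
  if (typeof key.private_key !== 'string' ||
      !key.private_key.startsWith('-----BEGIN PRIVATE KEY-----')) {
    return 'private_key must start with -----BEGIN PRIVATE KEY-----';
  }
  return null;
}

// Prints null or a problem description for the current environment.
console.log(validateServiceAccountKey(process.env.GOOGLE_SERVICE_ACCOUNT_KEY || ''));
```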

2. Fatal Error: Port Already in Use

Error:

[2026-03-07T10:00:00.000Z] [ERROR] FATAL: Unable to bind to port 3000 (EADDRINUSE)

Solution:

  • Change PORT env var to different port (e.g., 8080)
  • OR stop other process using port 3000: lsof -ti:3000 | xargs kill

3. 401 Unauthorized Response

Cause: Service Account token refresh failed

Solution:

  • Verify Service Account has Drive API access (share folders with service account email)
  • Check Drive API is enabled in Google Cloud Console
  • Ensure scope is correct: https://www.googleapis.com/auth/drive.readonly

4. 413 Payload Too Large Response

Cause: Google Drive contains more than 50,000 documents

Solution:

  • Adjust DRIVE_QUERY to filter documents (e.g., by folder, date, file type)
  • Example: DRIVE_QUERY="'folder-id' in parents and trashed = false"

5. 429 Too Many Requests Response

Cause: Drive API rate limit exceeded

Solution:

  • Wait for time specified in Retry-After response header (seconds)
  • Reduce request frequency
  • Check your project's Drive API quota in the Google Cloud Console
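
On the client side, honoring Retry-After might look like the sketch below. It assumes the header carries a number of seconds, as described above, and uses the global fetch available in Node.js 18 and later; retryDelayMs and fetchSitemapWithRetry are hypothetical names.

```javascript
// Convert a Retry-After value in seconds into a delay in milliseconds.
// Falls back to `fallbackMs` when the header is missing or not a number.
function retryDelayMs(retryAfterHeader, fallbackMs = 1000) {
  const seconds = Number.parseInt(retryAfterHeader, 10);
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : fallbackMs;
}

// Hypothetical client: request the sitemap, waiting and retrying on 429.
// Returns the last response received, whatever its status.
async function fetchSitemapWithRetry(url, maxAttempts = 3) {
  let response;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    response = await fetch(url);
    if (response.status !== 429) break;
    if (attempt < maxAttempts) {
      const delay = retryDelayMs(response.headers.get('retry-after'));
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  return response;
}
```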

6. 503 Service Unavailable Response

Cause: Google Drive API is temporarily unavailable

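Solution:

  • Wait and retry the request; the failure is upstream in the Drive API, not in the adapter's configuration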


Performance Tips

1. Optimize Drive Query Filter

Default (all files):

DRIVE_QUERY="trashed = false"

Filter by folder:

DRIVE_QUERY="'folder-id' in parents and trashed = false"

Filter by date:

DRIVE_QUERY="modifiedTime > '2026-01-01T00:00:00' and trashed = false"

Filter by MIME type:

DRIVE_QUERY="mimeType = 'application/pdf' and trashed = false"

See Drive API search query syntax for more options.
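
When the filter is assembled from variables rather than typed by hand, quoting matters: Drive queries wrap string values in single quotes and escape embedded quotes with a backslash. The helper below is a small sketch of that idea; buildDriveQuery, quoteDriveValue, and the option names are hypothetical.

```javascript
// Wrap a value in single quotes for a Drive query, escaping backslashes
// and single quotes with a backslash.
function quoteDriveValue(value) {
  const escaped = String(value).replace(/\\/g, '\\\\').replace(/'/g, "\\'");
  return `'${escaped}'`;
}

// Combine optional filters with "and"; always exclude trashed files.
function buildDriveQuery({ folderId, mimeType, modifiedAfter } = {}) {
  const clauses = [];
  if (folderId) clauses.push(`${quoteDriveValue(folderId)} in parents`);
  if (mimeType) clauses.push(`mimeType = ${quoteDriveValue(mimeType)}`);
  if (modifiedAfter) clauses.push(`modifiedTime > ${quoteDriveValue(modifiedAfter)}`);
  clauses.push('trashed = false');
  return clauses.join(' and ');
}

console.log(buildDriveQuery({ mimeType: 'application/pdf' }));
```

The resulting string can be exported as DRIVE_QUERY before starting the server.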


2. Adjust BASE_URL for Production

Development:

BASE_URL=http://localhost:3000

Production:

BASE_URL=https://your-domain.com

This ensures sitemap URLs point to the correct domain.


3. Monitor Memory Usage

Check memory usage (production):

node --inspect src/server.js
# Then open chrome://inspect in Chrome and attach to the process

Expected: <256MB under normal load (<10 concurrent requests)
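
For a lighter check that does not require attaching a debugger, the process can report its own memory with the built-in process.memoryUsage(). This is a minimal sketch; memorySnapshotMb is a hypothetical name, and how often to log it is left to the operator.

```javascript
// Summarize resident set size and heap usage in megabytes (one decimal).
function memorySnapshotMb(usage = process.memoryUsage()) {
  const toMb = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;
  return { rssMb: toMb(usage.rss), heapUsedMb: toMb(usage.heapUsed) };
}

const snapshot = memorySnapshotMb();
console.log(`memory: rss=${snapshot.rssMb}MB heapUsed=${snapshot.heapUsedMb}MB`);
```

Comparing rssMb against the 256MB figure above gives a quick signal without interrupting the server.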


Security Best Practices

  1. Never commit Service Account JSON key file to version control
  2. Use environment variables for all sensitive configuration
  3. Restrict Service Account permissions to minimum required (readonly scope)
  4. Monitor logs for unauthorized access attempts
  5. Use HTTPS in production (configure reverse proxy like nginx)
  6. Filter credentials from logs (private_key field never logged)

Deployment

Dockerfile:

FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["npm", "start"]

Build and run:

docker build -t drive-sitemap-adapter .
docker run -p 3000:3000 \
  -e GOOGLE_SERVICE_ACCOUNT_KEY='{"type":"service_account",...}' \
  -e BASE_URL=https://your-domain.com \
  drive-sitemap-adapter

Cloud Platforms

Google Cloud Run:

gcloud run deploy drive-sitemap-adapter \
  --source . \
  --set-env-vars BASE_URL=https://your-domain.com \
  --set-secrets GOOGLE_SERVICE_ACCOUNT_KEY=service-account-key:latest

AWS ECS / Fargate: Use environment variables in task definition

Heroku: Set environment variables via Heroku CLI or dashboard


Support

For issues or questions, refer to:

  1. This quickstart guide
  2. Feature specification (spec.md) for requirements
  3. Research document (research.md) for technical decisions
  4. Contract documentation (contracts/) for API details

Version History

Version   Date         Changes
1.0.0     2026-03-07   Initial quickstart guide