Reading path: this is the full WP25 whitepaper. For a shorter reader-facing guide, start with the blog guide. Browse the series at HotelByte Whitepapers.

Zero-Downtime Runtime & Deployment

**HotelByte Technical Whitepaper**

**Version 2.0**

Executive Summary

Assumed audience: platform engineers, enterprise architects, integration owners, and technical reviewers evaluating the engineering and operational controls of a hotel distribution platform.

TL;DR: Zero downtime is a runtime discipline: graceful shutdown, connection draining, rollout evidence, and rollback readiness.

HotelByte is a global hotel API distribution platform that processes search, availability, booking, and financial transactions across multiple continents. Any service interruption, whether from a planned deployment or an unplanned failure, directly affects revenue for integration partners and end travelers. To minimize this risk, HotelByte implements a custom Master/Worker process model with signal-driven graceful restart, automated health validation, and a gated deployment pipeline that enforces verification at every stage.

This whitepaper describes the runtime architecture, deployment lifecycle, and operational controls that ensure platform updates and process recoveries occur without dropping in-flight requests or breaking active supplier sessions. It is intended for enterprise customers, security auditors, and integration partners who require transparency into HotelByte’s operational resilience and change-management posture.

Scope

This document covers the HotelByte zero-downtime runtime and deployment system:

Master/Worker process orchestration (api/master.go, api/process_mode.go)
Signal-driven graceful restart protocol (SIGUSR1, SIGTERM)
Periodic health checking and crash recovery with port reconciliation
Cross-platform process mode adaptation
Production deployment automation (build/deploy/prod.sh)
Nginx active health checks and upstream traffic management
Network adaptation and infrastructure resilience
CI/CD quality gates and canary validation

It does not cover application-level error handling, supplier adapter resilience, or data-layer failover strategies, which are addressed in separate whitepapers.

Objectives

Zero-Downtime Deployments — Replace running service instances with updated binaries without terminating active TCP connections or in-progress API requests.
Self-Healing Runtime — Detect unexpected worker process failure and automatically recover with bounded retry, port availability verification, and state reconciliation.
Verified Change Promotion — Require passing health checks, E2E validation, and version parity between staging and production before any production deployment proceeds.
Platform-Agnostic Development — Maintain identical business logic across Linux production hosts and macOS/Windows developer workstations without code branching.
Observable and Reversible — Emit structured deployment logs, maintain versioned backups, and support automated rollback when post-deployment validation fails.

Design Principles

Graceful Transitions

Every process lifecycle transition (start, restart, shutdown) is treated as a coordinated handoff rather than an abrupt state change. When a new worker process starts, the platform waits until it positively reports readiness before routing traffic to it. When an old worker is retired, it receives a termination signal and a bounded grace period (30 seconds by default) to complete in-flight requests. This principle ensures that transient state, such as open HTTP connections, active supplier sessions, and in-progress bookings, is not cut off abruptly during a planned transition.

Fail-Safe Defaults

All safety-critical timeouts have conservative defaults. New worker readiness polling expires at 120 seconds to prevent indefinite blocking. Old worker graceful shutdown is capped at 30 seconds, after which force-termination ensures the port is released. Port availability checks after unexpected exit wait up to 30 seconds before permitting restart. These bounds guarantee that the system always converges to a known state, even under adverse conditions.
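The bounded waits described above can be sketched in Go. This is a minimal illustration, not the code in api/master.go: the constant names and the `waitPortFree` helper are hypothetical, and probing by attempting a bind is an assumption about how "port availability" might be checked.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Timeouts quoted in the text. The constant names are illustrative only.
const (
	readinessTimeout   = 120 * time.Second // new worker must report ready within this window
	shutdownGrace      = 30 * time.Second  // old worker gets this long before force-termination
	portReleaseTimeout = 30 * time.Second  // wait for the port to free up after an unexpected exit
)

// waitPortFree probes addr by trying to bind it. It returns nil as soon as a
// bind succeeds, or an error once the deadline has passed, so the caller
// always gets an answer within roughly `timeout`.
func waitPortFree(addr string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		ln, err := net.Listen("tcp", addr)
		if err == nil {
			ln.Close() // release at once; the replacement worker binds it itself
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("port %s still busy after %s: %w", addr, timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Hold a port to simulate a crashed worker whose socket is not released yet.
	busy, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	addr := busy.Addr().String()

	// Release the port shortly afterwards.
	go func() {
		time.Sleep(200 * time.Millisecond)
		busy.Close()
	}()

	if err := waitPortFree(addr, 50*time.Millisecond, portReleaseTimeout); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("port is free, safe to start a replacement worker")
}
```

A bind probe leaves a small window between the check and the worker's own bind, so a real supervisor would still need to handle a bind failure at worker start.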

Verification at Every Gate

No change reaches production without traversing multiple independent verification layers. Static analysis (golangci-lint), change-size limits, incremental test coverage thresholds, blocking E2E tests, and UAT canary validation each act as a gate. A failure at any layer halts promotion, preventing defective code from entering the production runtime.

Platform Abstraction

The Master/Worker model is a Linux production primitive, but developers on macOS and Windows should not be burdened with emulation complexity. The platform auto-detects the operating system and defaults to Worker mode on non-Linux hosts, enabling local development with the same binary that runs in production. Explicit flags override the default when testing Master behavior locally.
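A sketch of how such a default could be resolved follows. It assumes the decision depends only on the operating system and an optional override; the `-mode` flag name and the `resolveMode` function are hypothetical and are not taken from api/process_mode.go.

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

type processMode string

const (
	modeMaster processMode = "master"
	modeWorker processMode = "worker"
)

// resolveMode picks the process mode. An explicit override wins; otherwise
// Linux defaults to Master and every other OS defaults to Worker.
func resolveMode(goos, override string) (processMode, error) {
	switch override {
	case "":
		if goos == "linux" {
			return modeMaster, nil
		}
		return modeWorker, nil
	case string(modeMaster), string(modeWorker):
		return processMode(override), nil
	default:
		return "", fmt.Errorf("unknown process mode %q", override)
	}
}

func main() {
	// "-mode" is a hypothetical flag name used only for this sketch.
	override := flag.String("mode", "", "force process mode: master or worker")
	flag.Parse()

	mode, err := resolveMode(runtime.GOOS, *override)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("os=%s mode=%s\n", runtime.GOOS, mode)
}
```

Passing the OS name as a parameter instead of reading `runtime.GOOS` inside the function keeps the rule testable on any workstation.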

Runtime Architecture

The HotelByte runtime is organized into three layers: the Master/Worker process layer, the health check layer, and the deployment layer.

Layer 1 — Master/Worker Process Layer

On Linux production hosts, the service binary starts in Master mode. The Master process is a lightweight supervisor with four responsibilities:

Worker Lifecycle Management — The Master starts the initial worker process, forwards startup arguments (including configuration overrides), and sets environment variables that identify the worker’s parent relationship.
Signal Coordination — The Master listens for SIGUSR1 (graceful restart) and for SIGTERM and SIGINT (shutdown). Signal handling is platform-specific: Unix systems support the full protocol, while on Windows the protocol degrades gracefully.
Graceful Restart Orchestration — Upon receiving SIGUSR1, the Master starts a new worker, polls its /ready endpoint at 1-second intervals for up to 120 seconds, and signals the old worker to exit only after the new worker reports ready. This overlap ensures at least one healthy worker is always listening on the service port.
Crash Recovery — If the worker exits unexpectedly, the Master classifies the exit reason (graceful, unexpected exit code, signal kill, or wait error), waits for TCP cleanup and port release, and starts a replacement worker. Port availability is verified with active probing before binding is attempted.

The Worker process executes the full business HTTP service. It is unaware of the Master except through standard environment variables. This separation of concerns keeps the business runtime simple while the Master handles operational complexity.
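The readiness gate in the graceful restart step can be sketched as a polling loop. This is an illustration under assumptions: `waitReady` and `fakeWorker` are hypothetical names, the fake worker exists only to make the example runnable, and what the Master does when the timeout expires is not stated in this paper, so the fallback shown in `main` is a guess.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// waitReady polls url every interval until it answers HTTP 200. It gives up
// with an error once timeout has elapsed.
func waitReady(url string, interval, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("worker not ready after %s: %w", timeout, ctx.Err())
		case <-ticker.C: // next polling round
		}
	}
}

// fakeWorker stands in for a freshly started worker: it answers 503 until
// warmUp has elapsed, then 200.
func fakeWorker(warmUp time.Duration) *httptest.Server {
	readyAt := time.Now().Add(warmUp)
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if time.Now().Before(readyAt) {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	newWorker := fakeWorker(1500 * time.Millisecond)
	defer newWorker.Close()

	// Interval and timeout as stated in the text: 1 second, up to 120 seconds.
	if err := waitReady(newWorker.URL+"/ready", time.Second, 120*time.Second); err != nil {
		// Assumption: on timeout the old worker keeps serving.
		fmt.Println("keeping the old worker:", err)
		return
	}
	fmt.Println("new worker is ready; only now signal the old worker to exit")
}
```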

Layer 2 — Health Check Layer

Health verification operates at two frequencies:

Periodic Process Health — Every 10 seconds, the Master checks whether the current worker PID is still alive in the process table. If the worker has disappeared without notifying the monitor goroutine, the Master starts a replacement immediately.
Readiness-Based Traffic Admission — During graceful restart, the new worker must respond with HTTP 200 from its /ready endpoint before the old worker receives a shutdown signal. The endpoint reports application-level health: database connectivity, cache availability, and essential background workers. Until /ready succeeds, the new worker is invisible to the Nginx upstream.

In addition, Nginx fronting the application runs the nginx_upstream_check_module with a 3-second active check interval. If a worker becomes unhealthy at the application level, Nginx removes it from the upstream pool before the Master even detects the failure, providing defense in depth.
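A /ready endpoint of the kind described above can be sketched as a handler that aggregates dependency probes. The `check` type, the check names, and the response bodies are assumptions made for illustration; only the 200 and 503 status codes come from this paper.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// errDown is a placeholder failure used by the demo below.
var errDown = errors.New("dependency unavailable")

// check is one dependency probe; a nil result means healthy.
type check struct {
	name  string
	probe func() error
}

// readyHandler answers 200 only when every check passes. Otherwise it
// answers 503 and lists the failing checks.
func readyHandler(checks []check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var failed []string
		for _, c := range checks {
			if err := c.probe(); err != nil {
				failed = append(failed, c.name+": "+err.Error())
			}
		}
		if len(failed) > 0 {
			w.WriteHeader(http.StatusServiceUnavailable)
			fmt.Fprintln(w, "not ready:", strings.Join(failed, "; "))
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ready")
	}
}

// statusFor runs the handler once in memory and returns the status code.
func statusFor(checks []check) int {
	rec := httptest.NewRecorder()
	readyHandler(checks)(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	return rec.Code
}

func main() {
	ok := func() error { return nil }
	down := func() error { return errDown }

	fmt.Println(statusFor([]check{{"database", ok}, {"cache", ok}, {"background-workers", ok}}))   // 200
	fmt.Println(statusFor([]check{{"database", ok}, {"cache", down}, {"background-workers", ok}})) // 503
}
```

Because both the Master's readiness poll and an upstream health check only need the status code, a single handler like this could serve both consumers.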

Layer 3 — Deployment Layer

The deployment layer automates multi-server rollout with built-in safety controls:

Server Role Detection — The deployment script auto-classifies each target as load balancer, application, or database host and applies the correct configuration subset.
Pre-Deployment Backup — Before any change, the current binary and configuration are backed up with timestamped directories. SSL key material is preserved with appropriate privilege escalation. Backups are retained with automatic rotation.
Network Quality Assessment — Prior to file transfer, the script evaluates packet loss and latency to each target. Poor network conditions are flagged before they can cause partial deployment states.
Graceful Restart Trigger — On each application host, the deployment sends SIGUSR1 to the running Master process, initiating the zero-downtime handoff described above.
Post-Deployment Validation — After restart, the script probes the /ready and /ping endpoints via both localhost (through an SSH tunnel) and external HTTP/HTTPS paths, with retry logic. A failure at this stage is recorded with its details and counts toward the outcome classification.
Auto-Rollback — If non-disaster-recovery nodes exceed the configured failure threshold, the deployment outcome is classified as ROLLBACK and the platform reverts to the backed-up version.
Gray Deployments — Traffic can be shifted incrementally using header-based gray routing. A gray cluster receives a configurable percentage of production traffic (e.g., 1%) before full promotion.
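The timestamped, rotated backups in the list above can be illustrated with a small selection function. The production logic lives in a shell script (build/deploy/prod.sh); this Go sketch only shows the idea, and the directory naming scheme and retention count are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// backupName builds a directory name whose lexical order matches time order.
// The "backup-" prefix and the layout are illustrative.
func backupName(t time.Time) string {
	return "backup-" + t.UTC().Format("20060102-150405")
}

// expired returns the backups to delete so that only the newest `keep` remain.
// The input slice is not modified.
func expired(names []string, keep int) []string {
	if keep < 0 {
		keep = 0
	}
	sorted := append([]string(nil), names...)
	sort.Strings(sorted) // oldest first, thanks to the timestamp layout
	if len(sorted) <= keep {
		return nil
	}
	return sorted[:len(sorted)-keep]
}

func main() {
	start := time.Date(2024, 1, 1, 3, 0, 0, 0, time.UTC)
	var names []string
	for i := 0; i < 5; i++ {
		names = append(names, backupName(start.Add(time.Duration(i)*24*time.Hour)))
	}
	fmt.Println("existing:", names)
	fmt.Println("delete:  ", expired(names, 3)) // the two oldest of the five
}
```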

Deployment Lifecycle

A standard production deployment follows this lifecycle:

Pre-Flight Gates — The operator invokes make deploy-gate-check, which runs blocking E2E tests against the target environment. These tests validate critical booking flows, search aggregation, and financial transaction integrity.
UAT Canary Gate — For production promotion, make deploy-prod-gated first verifies that UAT is running the same commit intended for production. If version parity is absent, UAT is updated first. Blocking E2E tests then run against UAT as a live canary.
Backup Phase — On each production host, the current binary and configuration are archived.
Binary Propagation — The new binary is transferred to application hosts and placed under supervisor control.
Signal-Driven Restart — The deployment script sends SIGUSR1 to each Master process, triggering the graceful restart protocol.
Health Validation — The script polls /ready until HTTP 200 is returned, with a maximum wait of 120 seconds.
Nginx Reload — Load balancer nodes reload Nginx configuration to pick up any upstream changes. The configuration is validated with nginx -t before the reload is applied.
Outcome Classification — Results are classified as SUCCESS, PARTIAL_SUCCESS, or ROLLBACK based on failure counts across non-backup servers.
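The final classification step can be sketched as a pure function over per-host results. The exact rule is an assumption: this paper states only that the outcome depends on failure counts across non-backup servers and a configured threshold, so the boundary between PARTIAL_SUCCESS and ROLLBACK below, the `nodeResult` type, and the host names are illustrative.

```go
package main

import "fmt"

// nodeResult records post-deployment validation for one host.
type nodeResult struct {
	host     string
	isBackup bool // backup node, excluded from the failure count
	failed   bool // validation failed on this host
}

const (
	outcomeSuccess  = "SUCCESS"
	outcomePartial  = "PARTIAL_SUCCESS"
	outcomeRollback = "ROLLBACK"
)

// classify counts failures on non-backup hosts and maps the count to an
// outcome. maxFailures is the tolerated number of failed non-backup hosts.
func classify(results []nodeResult, maxFailures int) string {
	failures := 0
	for _, r := range results {
		if r.failed && !r.isBackup {
			failures++
		}
	}
	switch {
	case failures == 0:
		return outcomeSuccess
	case failures <= maxFailures:
		return outcomePartial
	default:
		return outcomeRollback
	}
}

func main() {
	results := []nodeResult{
		{host: "app-1"},
		{host: "app-2", failed: true},
		{host: "app-3", failed: true},
		{host: "backup-1", isBackup: true, failed: true},
	}
	fmt.Println(classify(results, 1)) // ROLLBACK: two non-backup failures exceed a threshold of 1
	fmt.Println(classify(results, 2)) // PARTIAL_SUCCESS: two failures are within a threshold of 2
}
```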

Implemented Control Summary

Control	Customer Value
Master/Worker Process Model	A lightweight supervisor ensures a healthy worker is always listening on the service port, isolating business logic from operational lifecycle concerns.
SIGUSR1 Graceful Restart	Zero-downtime binary replacement: new workers prove readiness before old workers are retired, eliminating in-flight request drops during deployments.
120-Second Readiness Polling	Conservative timeout prevents premature traffic cutover to a worker that has not finished initializing database pools or cache warm-up.
30-Second Graceful Shutdown	Old workers receive a bounded grace period to complete active requests; force-termination ensures cleanup if a worker hangs.
10-Second Periodic Health Check	The Master detects silent worker disappearance within one polling interval and replaces the process without operator intervention, complementing the faster 3-second Nginx upstream checks.
Crash Recovery with Port Reconciliation	After unexpected worker exit, the system waits for TCP cleanup and actively verifies port availability before restart, preventing bind conflicts.
Cross-Platform Process Mode	Linux defaults to Master mode for production supervision; macOS and Windows default to Worker mode for frictionless local development.
Nginx Active Health Check (`nginx_upstream_check_module`)	3-second interval upstream probing removes unhealthy workers from rotation independently of application signals, adding a network-layer safety net.
Pre-Deployment Backup	Every deployment creates timestamped, rotated backups of the binary and configuration, enabling sub-minute rollback if post-deployment validation fails.
SSH Keep-Alive & Retry	Long-running deployments use persistent SSH connections with heartbeat packets and automatic retry on transient network failures.
Network Quality Pre-Check	Packet-loss and latency assessment before file transfer prevents partial deployment states on degraded network paths.
Gray Deployment with Header Routing	New versions can receive a configurable traffic percentage (e.g., 1%) before full promotion, limiting blast radius of undetected regressions.
Auto-Rollback on Failure	When non-backup nodes exceed failure tolerance, the deployment automatically reverts to the prior version, preventing a bad release from saturating the fleet.
`golangci-lint` CI Gate	Static analysis catches common Go defects, anti-patterns, and security issues before code is eligible for deployment.
400-Line Change Limit	Enforces small, reviewable diffs that reduce cognitive load and regression risk per deployment unit.
50% Incremental Coverage Gate	Every pull request must cover at least half of its new code with tests, limiting how much untested logic can enter the release pipeline.
Blocking E2E Gate (`deploy-gate-check`)	Deployment is blocked until end-to-end tests pass against the target environment, ensuring core business flows remain intact.
UAT Canary Gate (`deploy-prod-gated`)	Production deployment requires UAT version parity and passing E2E validation, providing a live production-like canary before customer-facing rollout.

Auditability

External reviewers and enterprise customers can verify HotelByte’s zero-downtime controls through the following mechanisms:

Structured Deployment Logs — Every deployment emits timestamped, color-coded logs that record SSH connectivity tests, network quality metrics, backup paths, graceful restart signals, health-check results, and final outcome classification. These logs are retained for post-hoc audit.
Master Process Logs — The Master emits explicit log lines for worker start, worker exit reason classification, readiness polling status, graceful restart progress, and force-termination decisions. Reviewers can correlate deployment timestamps with Master logs to verify that no restart skipped the readiness gate.
Nginx Upstream Check Logs — When nginx_upstream_check_module is enabled, Nginx logs upstream peer additions and removals. Reviewers can use these logs to verify independently that unhealthy workers were removed from rotation rather than continuing to receive traffic.
Metrics Export — hotel_be_worker_exit_total and hotel_be_worker_restart_total counters are tagged by exit reason and restart action. Reviewers with metric access can validate that unexpected exits are rare and that restart attempts succeed.
CI/CD Artifact Retention — GitHub Actions retains golangci-lint reports, file-size check results, coverage artifacts (coverage.out, incremental_coverage.out), and E2E gate logs. Each production deployment is traceable to a specific commit that passed all gates.
Source Availability — The Master/Worker protocol, deployment scripts, and CI workflow definitions are stored in the repository. Reviewers can inspect exact timeout values, retry logic, and signal handling behavior.
Supervisor Integration — On production hosts, the worker process runs under supervisor control with configured stdout and stderr log paths. Process uptime and restart history are observable via standard supervisor tooling.
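The exit-reason classification that appears in the Master logs and as a tag on the worker exit counter can be sketched with the standard library. The four categories come from this paper; the label strings and the `classifyExit` function are hypothetical, and the demo assumes a POSIX `sh` is available.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// Label values are illustrative; the real metric tags are not shown in this paper.
const (
	exitGraceful   = "graceful"
	exitUnexpected = "unexpected_exit_code"
	exitSignal     = "signal_kill"
	exitWaitError  = "wait_error"
)

// errNoChild is a placeholder for a failure of the wait call itself.
var errNoChild = errors.New("wait: no child processes")

// classifyExit maps the error returned by (*exec.Cmd).Wait or Run to one of
// the four exit reasons.
func classifyExit(err error) string {
	if err == nil {
		return exitGraceful
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// ExitCode reports -1 when the process was terminated by a signal.
		if exitErr.ExitCode() == -1 {
			return exitSignal
		}
		return exitUnexpected
	}
	return exitWaitError
}

func main() {
	// Each script stands in for a worker that exits in a different way.
	for _, script := range []string{"exit 0", "exit 3", "kill -9 $$"} {
		err := exec.Command("sh", "-c", script).Run()
		fmt.Printf("%-14q -> %s\n", script, classifyExit(err))
	}
	fmt.Println("wait failure   ->", classifyExit(errNoChild))
}
```

Keeping the classification in one function means the log line and the metric tag cannot drift apart, which matters when reviewers correlate the two.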

Authoritative Source References

Source	Original Excerpt	HotelByte Control Mapping
Google SRE Book — Chapter 16: Disaster Planning	“The goal is to reduce the frequency of unplanned downtime, and when it does occur, reduce the time to recover.”	The Master/Worker model reduces unplanned downtime through 10-second health checks and automatic crash recovery. Graceful restart reduces recovery time to seconds by overlapping old and new workers.
AWS Well-Architected Framework — Reliability Pillar (REL08-BP03)	“Deploy changes to production using immutable infrastructure or phased rollout methods (canary, linear, all-at-once).”	HotelByte uses phased rollout via gray deployments with configurable traffic ratios (1–100%), followed by full promotion only after health validation.
AWS Well-Architected Framework — Operational Excellence Pillar (OPS05-BP05)	“Make small, reversible changes that can be rolled back quickly without affecting customers.”	The 400-line change limit enforces small diffs; pre-deployment backups and auto-rollback enable sub-minute reversal.
Google SRE Book — Chapter 8: Release Engineering	“Releases should be staged, and each stage should validate the release before proceeding to the next stage.”	`deploy-gate-check` and `deploy-prod-gated` implement staged validation: lint → unit tests → E2E gate → UAT canary → production.
NIST SP 800-53 Rev. 5 CP-10 — System Recovery and Reconstitution	“Provide for the recovery and reconstitution of the information system to a known state.”	Pre-deployment backups, timestamped artifacts, and automatic rollback ensure the platform can be reconstituted to a known-good state after a failed deployment.
The Twelve-Factor App — XI. Logs	“A twelve-factor app never concerns itself with routing or storage of its output stream.”	Worker processes write logs to `stdout`/`stderr`, which the Master forwards transparently. Supervisor captures and persists these streams, keeping the application runtime decoupled from log infrastructure.
RFC 7231, HTTP/1.1: Semantics and Content, Section 6.6.4 (503 Service Unavailable)	“The 503 (Service Unavailable) status code indicates that the server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay.”	The `/ready` endpoint returns 503 during startup and shutdown, signaling Nginx and load balancers to route traffic away from the worker without dropping existing connections.

Technical Whitepaper Governance Reading

Read Zero-Downtime Runtime & Deployment through the technical whitepaper governance loop: intent, evidence, bounded execution, verification, and durable governance.

Plane	What to inspect in this paper
Intent	Which operational or integration risk the design removes.
Evidence	Which logs, metrics, records, traces, tests, or replay artifacts prove the behavior.
Execution boundary	Which layer owns the decision and which layer only adapts or transports data.
Verification	Which failure modes are tested beyond the happy path.
Governance memory	Which rules, dashboards, audit trails, or test cases make the lesson reusable.

Conclusion

Zero-Downtime Runtime & Deployment matters because it turns a fragile implementation concern into a governed platform capability. The durable value is not that the component exists, but that its boundaries, evidence, failure semantics, and verification path can be reviewed after the fact.

Zero downtime is a runtime discipline: graceful shutdown, connection draining, rollout evidence, and rollback readiness.
