Last pushed: 10 months ago
Short Description
Das Boot!
Full Description

Run Book / System Operation Manual

Service or system overview

Service or system name: Shoelace API

Business overview

This service provides access to the configuration management of all applications deployed to live and development. Without this application, bootstrap and the pipelines cannot access the etcd store to fetch and deposit configuration changes.

Technical overview

This is a Node.js REST API that uses the etcd Node client library to access the store.

Service Level Agreements (SLAs)

99.9% service availability outside of the 18:00-09:00 maintenance window
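
As a rough worked example of what this target allows, the sketch below converts 99.9% availability into a downtime budget. It assumes the SLA is measured only over the 09:00-18:00 period outside the maintenance window (9 hours per day) and over a 30-day month; neither assumption is stated in this run book.

```python
def downtime_budget_minutes(availability: float, service_hours_per_day: float, days: int) -> float:
    """Minutes of permitted downtime over the period for a given availability target."""
    in_scope_minutes = service_hours_per_day * 60 * days
    return in_scope_minutes * (1 - availability)


# Assumption: the SLA covers 09:00-18:00 daily (9 hours) over a 30-day month.
budget = downtime_budget_minutes(0.999, service_hours_per_day=9, days=30)
print(f"{budget:.1f} minutes of downtime allowed per 30-day month")  # 16.2 minutes
```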

Service owner

The Malvern IO team runs and develops this service with the help of the Dev Platform team.

Contributing applications, daemons, services, middleware

nodejs + etcd client library and etcd store.

System characteristics

Hours of operation

The system only needs to operate during the creation of new virtual machine templates and the provisioning of those templates. Ideally, the system should be operational outside of the 18:00-09:00 maintenance window.

Hours of operation - core features

outside of the 18:00-09:00 maintenance window

Hours of operation - secondary features

outside of the 18:00-09:00 maintenance window

Data and processing flows

Data flows:

  • Virtual Machine Template creation.
    • During virtual machine template creation, bootstrap calls the Shoelace API to fetch system configuration for the appliance.
  • Virtual Machine creation from templates.
    • During virtual machine creation from a template, bootstrap calls the Shoelace API to fetch application configuration for the appliance.
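
The two flows above could look roughly like the following from the bootstrap side. This is only a sketch: the base URL, the `/config/<kind>/<appliance>` route layout, and the JSON response are all hypothetical, since the real Shoelace API routes are not documented in this run book.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: hypothetical base URL and route layout, for illustration only.
SHOELACE_BASE_URL = "http://shoelace.example.internal"


def config_url(kind: str, appliance: str, base_url: str = SHOELACE_BASE_URL) -> str:
    """Build the (hypothetical) URL for 'system' or 'application' configuration."""
    if kind not in ("system", "application"):
        raise ValueError(f"unknown configuration kind: {kind!r}")
    return f"{base_url}/config/{kind}/{quote(appliance, safe='')}"


def fetch_config(kind: str, appliance: str, opener=urllib.request.urlopen) -> dict:
    """Fetch configuration as JSON; 'opener' is injectable so it can be faked in tests."""
    with opener(config_url(kind, appliance)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Template creation would call `fetch_config("system", ...)` and provisioning from a template would call `fetch_config("application", ...)`.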

Infrastructure and network design

This application is hosted on Rancher in the management cluster. The app is essentially a Docker application, with at least three containers running at any one time.

Resilience, Fault Tolerance (FT) and High Availability (HA)

Currently the application load balancer in Rancher runs on only one host; there is a ticket to sort this out:

The Docker strategy is to run multiple hosts with multiple load balancers, each running multiple containers below them, so that if any container or host dies it can be removed and replaced without taking the system down.

Throttling and partial shutdown

How can the system be throttled or partially shut down e.g. to avoid flooding other dependent systems? Can the throughput be limited to (say) 100 requests per second? etc. What kind of connection back-off schemes are in place?
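
One common way to cap throughput at a figure such as 100 requests per second is a token bucket. The sketch below is illustrative only and is not taken from the Shoelace API code; the class name and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A request handler would call `allow()` on a shared `TokenBucket(rate=100, capacity=100)` and reject or queue the request when it returns `False`.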

Throttling and partial shutdown - external requests

(e.g. Commercial API gateway allows throttling control)

Throttling and partial shutdown - internal components

(e.g. Exponential backoff on all HTTP-based services + /health healthcheck endpoints on all services)
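
A minimal sketch of exponential backoff with jitter, assuming a caller that retries any failing operation. The function names, default delays, and the choice of "full jitter" are assumptions for illustration; a real client would catch only the errors it knows to be transient.

```python
import random
import time


def backoff_delays(count: int, base: float = 0.5, cap: float = 30.0):
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(count):
        yield min(cap, base * (2 ** attempt))


def call_with_backoff(operation, attempts: int = 5, sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying on exception with a jittered, growing delay."""
    delays = list(backoff_delays(attempts - 1))
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:  # illustrative; catch narrower errors in practice
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random fraction of the delay so that many
            # clients do not retry in lockstep.
            sleep(delays[attempt] * rng())
```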

Expected traffic and load

Details of the expected throughput/traffic: call volumes, peak periods, quiet periods. What factors drive the load (bookings, page views, number of items in Basket, etc.)?

(e.g. Max: 1000 requests per second with 400 concurrent users - Friday @ 16:00 to Sunday @ 18:00, driven by likelihood of barbecue activity in the neighborhood)

Hot or peak periods

(e.g. System runs hot (89% cpu, with only 8% disk space) between 18:00 and 23:00)

Warm periods

(e.g. Our affiliate pushes rates every 4hrs, this adds 10% load.)

Cool or quiet periods

(e.g. 3am - 6am)

Environmental differences

What are the main differences between Production/Live and other environments? What kinds of things might therefore not be tested in upstream environments?

(e.g. Self-signed HTTPS certificates in Pre-Production - certificate expiry may not be detected properly in Production)
(e.g. Single database in Pre-Production - clustered database with separate read/write instances in Production)


Tools to help ops

What tools are available to help operate the system?

(e.g. Use the script to safely cleardown the processing queue nightly)

Required resources

What compute, storage, database, metrics, logging, and scaling resources are needed? What are the minimum and expected maximum sizes (in CPU cores, RAM, GB disk space, GBit/sec, etc.)?

Required resources - compute

(e.g. Min: 4 VMs with 2 vCPU each. Max: around 40 VMs)

Required resources - storage

(e.g. Min: 10GB Azure blob storage. Max: around 500GB Azure blob storage)

Required resources - database

(e.g. Min: 500GB Standard Tier RDS. Max: around 2TB Standard Tier RDS)

Required resources - metrics

(e.g. Min: 100 metrics per node per minute. Max: around 6000 metrics per node per minute)

Required resources - logging

(e.g. Min: 60 log lines per node per minute (100KB). Max: around 6000 log lines per node per minute (1MB))

Required resources - other

(e.g. Min: 10 encryption requests per node per minute. Max: around 100 encryption requests per node per minute)

Security and access control

(e.g. Uses service account tied to AD)

Password and PII security

What kind of security is in place for passwords and Personally Identifiable Information (PII)? Are the passwords hashed with a strong hash function and salted?

(e.g. Passwords are hashed with a 10-character salt and SHA-256)
(e.g. WARNING PII passed in plain text over HTTP)
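
To make the salted-hash example concrete, here is a hedged sketch using only the Python standard library. Note that it deliberately swaps in PBKDF2-HMAC-SHA256 (a key-stretching construction) rather than a single salted hash, because one fast hash round is generally considered too weak for password storage. The iteration count and the 16-byte salt are assumptions to tune against your own policy.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumption: illustrative value, tune to hardware and policy


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    salt = secrets.token_bytes(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest and compare in constant time."""
    return hmac.compare_digest(hash_password(password, salt)[1], expected)
```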

Ongoing security checks

How will the system be monitored for security issues?

(e.g. External PCI scans for reported CVE issues and reports via the ABC dashboard)

System configuration

Configuration management

How is configuration managed for the system?

(e.g. CloudInit bootstraps the installation of Puppet - Puppet then drives all system and application level configuration except for the XYZ service which is configured via App.config files in Subversion)

Secrets management

How are configuration secrets managed?

(e.g. Secrets are managed with Hashicorp Vault with 3 shards for the master key)

System backup and restore

Backup requirements

Which parts of the system need to be backed up?

(e.g. Only the CoreTransactions database in PostgreSQL and the Puppet master database need to be backed up)

Backup procedures

How does backup happen? Is service affected? Should the system be [partially] shut down first?

(e.g. Backup happens from the read replica - live service is not affected)

Restore procedures

How does restore happen? Is service affected? Should the system be [partially] shut down first?

(e.g. The Booking service must be switched off before Restore happens otherwise transactions will be lost)

Monitoring and alerting

Log aggregation solution

What log aggregation & search solution will be used?

(e.g. The system will use the existing in-house ELK cluster. 2000-6000 messages per minute expected at normal load levels)
(e.g. The system also logs to disk, logrotate set for 25hrs)

Log message format

What kind of log message format will be used? Structured logging with JSON? log4j style single-line output?

(e.g. Log messages will use log4j compatible single-line format with wrapped stack traces)
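
The two options mentioned above, structured JSON and log4j-style single lines, can be compared with a short sketch based on Python's standard `logging` module. The JSON field names and the exact single-line layout are assumptions chosen for illustration, not a format this service is known to use.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (structured logging)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


# log4j-style single-line alternative using the standard formatter.
SINGLE_LINE = logging.Formatter("%(asctime)s %(levelname)-5s [%(name)s] %(message)s")
```

Either formatter can be attached to a handler with `handler.setFormatter(...)`; the JSON form is easier for a log aggregator to index by field.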

Events and error messages

What significant events, state transitions and error events may be logged?

(e.g. IDs 1000-1999: Database events; IDs 2000-2999: message bus events; IDs 3000-3999: user-initiated action events; ...)
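
An ID-range scheme like the one in the example can be encoded as a small lookup table. The sketch below covers only the three ranges named above; the category labels and the `"unknown"` fallback are assumptions.

```python
# Ranges taken from the example above; anything else is treated as unknown.
EVENT_RANGES = [
    (1000, 1999, "database"),
    (2000, 2999, "message bus"),
    (3000, 3999, "user-initiated action"),
]


def event_category(event_id: int) -> str:
    """Return the category whose inclusive ID range contains event_id."""
    for low, high, name in EVENT_RANGES:
        if low <= event_id <= high:
            return name
    return "unknown"
```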


System metrics

What significant metrics will be generated?

(e.g. Usual VM stats (CPU, disk, threads, etc.) + around 200 application technical metrics + around 400 user-level metrics)

Health checks

How is the health of dependencies (components and systems) assessed? How does the system report its own health?

Health of dependencies

(e.g. Use /health HTTP endpoint for internal components that expose it. Other systems and external endpoints: typically HTTP 200 but some synthetic checks for some services)

Health of service

(e.g. Provide /health HTTP endpoint: 200 --> basic health, 500 --> bad configuration + /health/deps for checking dependencies)
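
A minimal sketch of the endpoint layout described in the example, using Python's standard `http.server`. The two check functions are hypothetical placeholders, and returning 503 when a dependency is down is an assumption; the example above only specifies 200 for basic health and 500 for bad configuration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_configuration() -> bool:
    """Hypothetical check; a real service would validate its loaded settings."""
    return True


def check_dependencies() -> dict:
    """Hypothetical dependency probes, e.g. whether the etcd store is reachable."""
    return {"etcd": True}


def health_response(path: str, config_ok=check_configuration, deps=check_dependencies):
    """Map a request path to (status code, JSON-serializable body)."""
    if path == "/health":
        ok = config_ok()
        return (200 if ok else 500), {"status": "ok" if ok else "bad configuration"}
    if path == "/health/deps":
        results = deps()
        # Assumption: 503 when any dependency is unavailable.
        return (200 if all(results.values()) else 503), {"dependencies": results}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Run the sketch locally; the port is arbitrary."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Keeping the status logic in `health_response` separate from the HTTP handler makes it testable without opening a socket.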

Operational tasks


Deployment

How is the software deployed? How does roll-back happen?

(e.g. We use GoCD to coordinate deployments, triggering a Chef run pulling RPMs from the internal yum repo)

Batch processing

What kind of batch processing takes place?

(e.g. Files are pushed via SFTP to the media server. The system processes up to 100 of these per hour on a cron schedule)

Power procedures

What needs to happen when machines are power-cycled?

(e.g. WARNING: we have not investigated this scenario yet! )
(e.g. Ensure service xyz came up cleanly and is talking to the web via port 8089)

Routine and sanity checks

What kind of checks need to happen on a regular basis?

(e.g. All /health endpoints should be checked every 60secs plus the synthetic transaction checks run every 5 mins via Pingdom)
(e.g. See dashboard xxx in Kibana - Red == very bad)


Troubleshooting

How should troubleshooting happen? What tools are available?

(e.g. Use a combination of the /health endpoint checks and the abc-*.sh scripts for diagnosing typical problems)
(e.g. Check service abc is talking to xyz and getting a 200 response)

Maintenance tasks


Patching

How should patches be deployed and tested?

Normal patch cycle

(e.g. Use the standard OS patch test cycle together with deployment via GoCD)

Zero-day vulnerabilities

(e.g. Use the early-warning notifications from UpGuard plus deployment via GoCD)
(e.g. Speak to IO team!!)

Daylight-saving time changes

Is the software affected by daylight-saving time changes (both client and server)?

(e.g. Server clocks all set to UTC+0. All date/time data converted to UTC with offset before processing)
(e.g. WARNING UTC confuses us terribly, we don't know what will happen!)
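
The "convert to UTC with offset before processing" approach in the first example can be sketched as follows. The helper name and the decision to reject naive timestamps are assumptions; the UTC+1 zone is used purely as sample input.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local: datetime) -> datetime:
    """Convert an offset-aware datetime to UTC; reject naive values."""
    if local.tzinfo is None or local.utcoffset() is None:
        raise ValueError("naive datetime: an offset is required before converting")
    return local.astimezone(timezone.utc)


# Example: 01:30 at UTC+1 (e.g. British Summer Time) is 00:30 UTC.
bst = timezone(timedelta(hours=1))
stamp = datetime(2024, 6, 1, 1, 30, tzinfo=bst)
print(to_utc(stamp).isoformat())  # 2024-06-01T00:30:00+00:00
```

Rejecting naive timestamps up front avoids silently interpreting them in the server's local zone, which is where daylight-saving surprises usually come from.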

Data cleardown

Which data needs to be cleared down? How often? Which tools or scripts control cleardown?

(e.g. the script abc-cleardown.ps1 is run nightly via scheduled task to clear down the document cache)
(e.g. You can safely clear /var/logs/myapp/* )
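
A scheduled cleardown of this kind might look like the sketch below, which removes files older than a given age from one directory. The function name, the age-based policy, and the non-recursive behaviour are all assumptions; it is not the script referred to in the examples.

```python
import time
from pathlib import Path
from typing import List, Optional


def clear_down(directory: str, max_age_days: float, now: Optional[float] = None) -> List[str]:
    """Delete regular files older than max_age_days; return the names removed."""
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    removed = []
    for path in sorted(Path(directory).iterdir()):
        # Only top-level regular files are considered; subdirectories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```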

Log rotation

Is log rotation needed? How is it controlled?

(e.g. The Windows Event Log ABC Service is set to a maximum size of 512MB)

Failover and Recovery procedures

What needs to happen when parts of the system are failed over to standby systems? What needs to happen during recovery?


Failover

How to fail over to another secondary node/site/zone if the primary fails.


Recovery

How to recover the service from a failed state.

(e.g. Ingest last known good data file, restart service, run health checks.)

Troubleshooting Failover and Recovery

What tools or scripts are available to troubleshoot failover and recovery operations?

(e.g. Start with running SELECT state_desc FROM sys.database_mirroring_endpoints on the PRIMARY node and then use the scripts in the db-failover Git repo)
