Observability Strategy

v2.2020-07-12
By following this strategy, we can increase the observability of each layer of our systems and make them easier to manage.
Monitoring Strategy
Layers: Metal | Server Side Code (APM) | API | Website | Client Side Code (APM) | Security
Covers
  • Metal: Is the cloud infrastructure healthy, performing and efficient?
  • Server Side Code (APM): Application Performance Monitoring (APM). Server-side code instrumentation for performance and errors.
  • API: APIs and Blackbox. Are our underlying APIs up, performing well (globally) and returning the right data?
  • Website: Are web pages up, performing well (globally) and returning the right content?
  • Client Side Code (APM): APM for client-side code. Instrumentation for performance and errors.
  • Security: Has our code or infrastructure been compromised, or does it have vulnerabilities?
Examples
Metal:
  • Databases
  • Disks
  • Compute
  • Lambda Functions
  • Networks (VPCs)
Server Side Code (APM):
  • DB Queries
  • API Queries
  • 3rd party API invocations
  • Custom instrumentation markers
  • Errors
  • Function invocation rate/frequency
  • Lambda Invocations
API:
  • Uptime
  • (Global) Latency
  • Accuracy
  • 3rd Party Contract/SLA Monitoring
Website:
  • Uptime
  • Latency
  • Accuracy
  • Synthetic user monitoring
  • Real user monitoring
  • Transaction monitoring
Client Side Code (APM):
  • Web page errors
  • Web page speed
  • Mobile errors
  • Mobile speed
Security:
  • Unauthorised Access
  • Intrusion Detection
  • Compromised "Supply Chain" (libraries)
  • DDoS
Example Tools
Green -> yellow -> teal: Current implementation level
  • AWS CloudWatch
  • CloudHealth
  • NewRelic
  • NewRelic
  • Logz.io
  • AWS X-Ray
  • Dashbird
  • BlackBox
  • Runscope
  • Pingdom
  • Catchpoint
  • StatusCake
  • NewRelic
  • Logz.io
  • NewRelic
  • NewRelic
  • Rollbar
  • Mobile - Firebase Monitoring
  • Mobile - Crittercism
  • Mobile - Crashlytics
  • Incapsula
  • Snyk
  • AWS GuardDuty
  • AWS Macie
  • AWS Security Hub
  • Logz.io
Responsible and Accountable roles/functions (RACI)
  • Devs
  • Devs
  • Devs
  • QA
  • Service Delivery
  • QA
  • Service Delivery
  • Product Owners
  • Devs
  • QA
  • Service Delivery
  • Product Owners
  • Devs
  • QA
  • Service Delivery
Current overall maturity
  • Metal: low to medium
  • Server Side Code (APM): low to medium
  • API: very low
  • Website: medium to high
  • Client Side Code (APM): low to medium
  • Security: very low
Maturity criteria
What does good look like?
Metal:
  1. Can you pick up infra issues ahead of time?
  2. Do you have detailed load stats on the underlying infra?
  3. Do you have enough information to make good infra rightsizing decisions?
  4. Can you spot underlying infra issues?
  5. Can you easily visualise all your data?
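As one illustration of picking up infra issues ahead of time, the sketch below fits a straight line to daily disk-usage readings and estimates how many days remain before the disk is full. It is a minimal example under stated assumptions: the function name `days_until_full` and the one-reading-per-day input are hypothetical, and real usage rarely grows perfectly linearly.

```python
from typing import Optional, Sequence


def days_until_full(usage_pct: Sequence[float]) -> Optional[float]:
    """Estimate days until a disk reaches 100% usage.

    usage_pct holds one reading per day (percent used), oldest first.
    Returns None when there are too few readings or usage is not growing.
    """
    n = len(usage_pct)
    if n < 2:
        return None

    # Ordinary least-squares slope of usage against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(usage_pct) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage_pct))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator  # percentage points per day

    if slope <= 0:
        return None  # flat or shrinking usage: no projected fill date

    # Extrapolate from the most recent reading.
    return max(0.0, (100.0 - usage_pct[-1]) / slope)
```

An alert could fire when the estimate drops below a chosen lead time, for example 14 days, so the issue is raised before the disk actually fills.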
Server Side Code (APM):
  1. Can you pick up code issues ahead of time?
  2. Do you have detailed stats on load and app load profiles?
  3. Do you have custom StatsD-type metrics to show behaviours, e.g. Total Articles served today?
  4. Do you have detailed stats on application behaviour under load?
  5. Do you have a comprehensive view of 3rd party integrations?
  6. Do you have a handle on how each deployment affects application performance?
  7. Can you detect runtime errors very quickly?
  8. Do you have detailed info that enables you to make the right optimisations?
  9. Can you easily visualise all your data?
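The custom StatsD-type metric mentioned above (for example, total articles served today) can be sketched as a counter sent over UDP in the plain StatsD line format `<name>:<value>|c`. This is a minimal sketch, not a recommendation of a particular client: the metric name `articles.served`, the helper names, and the agent address `127.0.0.1:8125` are assumptions for illustration.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a StatsD counter datagram: '<name>:<value>|c'."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one counter increment to a StatsD-compatible agent.

    UDP is fire-and-forget, so a missing agent does not slow the caller.
    The host and port are assumed defaults, not values from this strategy.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value), (host, port))
```

Calling `send_counter("articles.served")` each time an article is rendered would let the metrics backend sum the increments into a daily total.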
API:
  1. Do you have a comprehensive view of 3rd party integrations?
  2. Do you have a handle on how each deployment affects application performance?
  3. Can you detect API errors before they ripple too far up the stack?
  4. Can you detect schema changes/breaks early (contract monitoring)?
  5. Do you have detailed info on global API performance?
  6. Can you easily visualise all your data?
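Contract monitoring, as referred to above, can be illustrated with a small check that compares a decoded JSON response against the fields and types a consumer expects. The function name `contract_violations` and the example fields are hypothetical; a production setup would more likely use a schema language and one of the tools listed earlier.

```python
from typing import Any, Dict, List


def contract_violations(payload: Dict[str, Any],
                        expected: Dict[str, type]) -> List[str]:
    """Return a list of human-readable contract problems (empty if none).

    expected maps each required field name to the Python type it should
    have after JSON decoding. Extra fields in the payload are ignored.
    """
    problems = []
    for field, expected_type in expected.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(
                f"wrong type for {field}: "
                f"expected {expected_type.__name__}, got {actual}"
            )
    return problems


# Hypothetical contract for an article endpoint.
ARTICLE_CONTRACT = {"id": int, "title": str}
```

Running such a check on a schedule against each upstream API gives an early signal when a provider renames or retypes a field, before the break ripples up the stack.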
Website:
  1. Can you detect, monitor and audit website uptime?
  2. Do you have detailed global data on website performance?
  3. Can you ensure that website content is consistently accurate?
  4. Can you easily visualise all your data?
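The three website questions above (uptime, performance, content accuracy) map onto a simple synthetic probe. The sketch below uses only the Python standard library; the names `ProbeResult`, `evaluate` and `probe`, the 2000 ms threshold, and the "page must contain this text" accuracy rule are all assumptions chosen for illustration. Global coverage would require running it from several regions.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    up: bool        # did the page answer with a non-error status?
    fast: bool      # did it answer within the latency budget?
    accurate: bool  # did the body contain the expected text?


def evaluate(status: int, elapsed_ms: float, body: str,
             must_contain: str, max_ms: float = 2000.0) -> ProbeResult:
    """Turn raw measurements into the three pass/fail signals."""
    up = 200 <= status < 400
    return ProbeResult(up=up,
                       fast=elapsed_ms <= max_ms,
                       accurate=up and must_contain in body)


def probe(url: str, must_contain: str,
          max_ms: float = 2000.0, timeout: float = 10.0) -> ProbeResult:
    """Fetch url once and evaluate it; network failures count as down."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        status, body = err.code, ""  # server answered with 4xx/5xx
    except OSError:
        status, body = 0, ""         # DNS failure, refused, timeout, etc.
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return evaluate(status, elapsed_ms, body, must_contain, max_ms)
```

Keeping `evaluate` separate from the network call makes the pass/fail rules easy to unit test and to reuse with measurements collected by another tool.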
Client Side Code (APM):
  1. Can you pick up code issues ahead of time?
  2. Do you have detailed stats on load and app load profiles?
  3. Do you have detailed stats on application behaviour under load?
  4. Do you have a comprehensive view of 3rd party integrations?
  5. Do you have a handle on how each deployment affects application performance?
  6. Can you detect runtime errors very quickly?
  7. Do you have detailed info that enables you to make the right optimisations?
  8. Do you have data on user behaviours?
  9. Do you have data on the platforms your users are using?
  10. Can you easily visualise all your data?
Security:
  1. Can you pick up security issues ahead of time?
  2. Do you get regular alerts and remedies for new vulnerabilities?
  3. Do you get heuristic detection of suspicious behaviour on your infra and apps?
  4. Do you have constant data on the current threat/exposure level?
  5. Do you get best-practice recommendations automatically?
  6. Can you easily visualise all your data?