# Create a json export that can be imported into other systems
bw export --format json --output myorg-bitwarden-export.json --organizationid eba3880c-370e-48f9-880c-90bb67f57b1b
# Create a JSON export that contains additional information about attachments
bw list items --organizationid eba3880c-370e-48f9-880c-90bb67f57b1b > myorg-bitwarden-itemlist.json
# Export those attachments into folders named after the BW item IDs
bash <(jq -r '.[] | select(.attachments != null) | . as $parent | .attachments[] | "bw get attachment \(.id) --itemid \($parent.id) --output \"./attachments/\($parent.id)/\(.fileName)\""' myorg-bitwarden-itemlist.json)
There also seems to be a community tool called PortWarden
that supports backing up and restoring BW items:
https://github.com/vwxyzjn/portwarden
1 - avg(rate(node_cpu{job="default/node-exporter",mode="idle"}[1m]))
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))
sum(node_load1{job="default/node-exporter"})
/
sum(node:node_num_cpu:sum)
With the following recording rule:
- expr: |-
    count by (instance) (sum by (instance, cpu) (
      node_cpu_seconds_total{job="node-exporter"}
    ))
  record: node:node_num_cpu:sum
1 - sum(
node_memory_MemFree_bytes{job="…"} +
node_memory_Cached_bytes{job="…"} +
node_memory_Buffers_bytes{job="…"}
)
/
sum(node_memory_MemTotal_bytes{job="…"})
1e3 = 1000 (dividing by 1e3 is sometimes used to convert milliseconds to seconds).
1e3 * sum(
  rate(node_vmstat_pgpgin{job="…"}[1m]) +
  rate(node_vmstat_pgpgout{job="…"}[1m])
)
sum(rate(request_duration_seconds_count{job="…"}[1m]))
sum(rate(request_duration_seconds_count{job="…", status_code!~"2.."}[1m]))
histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{job="…"}[1m])) by (le))
count(device_boot_time * on (instance) group_left(region, firmware) device_info{region="$region", firmware=~"$firmware"})
Here, device_info is the dummy gauge, always set to one. You can see that we are selecting this gauge based on the Grafana template values. Its region and firmware labels are pulled in via the group_left join onto other metrics, joining on the instance label. The metric is device_boot_time, but in this case its value is not relevant: all we are doing is using count to count all devices that have a boot time greater than 0 (which is all devices). This gives a count of all of the devices that match the selected region and firmware.
In order to get the All option to work for firmware, a few things are needed (see the variable sketch below):
- the template variable needs a custom "All" value of .*
- the query has to use a regex match (firmware=~"$firmware") rather than an exact match (=)
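A sketch of what the matching template variable could look like (the variable name and the label_values query are illustrative, not taken from the original dashboard):
Name: firmware
Type: Query
Query: label_values(device_info, firmware)
Include All option: enabled
Custom all value: .*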
topk(5, time() - device_boot_time * on (instance) group_left(region, firmware) device_info{region="$region", firmware=~"$firmware"})
topk(10, changes(device_boot_time[1h]))
The changes function returns the number of times a series' value changed within the given range, so it can be used to track the number of times a device reboots in an hour. For this demo, our simulated devices reboot 8 times per hour on average. If you use the dashboard selectors to choose Massachusetts as the region, you will see that one device is crashing much more frequently, and the dashboard uses conditional formatting to show this as red.
alerts:
  groups:
    - name: device_alerts
      rules:
        - alert: LidLeftOpen
          expr: (time() - (lid_open_start_time * lid_open_status)) > 900 and (time() - (lid_open_start_time * lid_open_status)) < 9000
          for: 30s
          labels:
            severity: page
            system: mechanical
          annotations:
            summary: "Lid open on {{ $labels.instance }}"
            description: "Lid has remained open for more than 15 minutes on {{ $labels.instance }}"
        - alert: DeviceRebooting
          expr: changes(device_boot_time[1h]) > 20
          for: 2m
          labels:
            severity: debug
            system: software
          annotations:
            summary: "{{ $labels.instance }} rebooting"
            description: "{{ $labels.instance }} has been rebooting more than 20 times an hour"
The metric value should be the number of seconds since the epoch. The following will tell you how many hours have passed since the workflow started:
(time() - argo_workflows_custom_start_time_gauge_workflow) / 3600
The general rule for choosing the range is that it should be at least 4x the scrape interval. This is to allow for various races, and to be resilient to a failed scrape.
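For example (my own numbers, not from the linked article): with a 15s scrape interval, the smallest range that satisfies the 4x rule is 1m, e.g. rate(http_requests_total{job="…"}[1m]); with a 30s scrape interval you would use at least a 2m range.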
Source: https://www.robustperception.io/what-range-should-i-use-with-rate
Your services will have a rough tree structure; have a dashboard per service and walk the tree from the top when you have a problem. Similarly, for each service, have dashboards per subsystem. Rule of thumb: limit yourself to 5 graphs per dashboard, and 5 lines per graph.
Query: rate(foo_metric_bucket[10m])
Legend format: {{le}}
Format as: Heatmap
Name: big_routine_job
Type: Query
Refresh: On Time Range Change
Query: query_result(topk(3, avg_over_time(go_goroutines[$__range])))
Regex: /job="([^"]+)"/
sum(rate(http_requests_total{job="$big_routine_job"}[10m])) by (handler)
rate(requests[5m])
sum(rate(requests[5m])) by (service_name)
sum(rate(requests{service_name="catalogue"}[5m])) by (instance)
Derive average request duration over a rolling 5 minute period
rate(request_duration_sum[5m])
/
rate(request_duration_count[5m])
topk(5,
sum by (image) (
rate(container_cpu_usage_seconds_total{
id=~"/system.slice/docker.*"}[5m]
)
)
)
- alert: "Node CPU Usage"
  expr: (100 - (avg by (instance) (irate(node_cpu{app="prometheus-node-exporter", mode="idle"}[5m])) * 100)) > 70
  for: 30s
  annotations:
    miqTarget: "ExtManagementSystem"
    url: "https://www.example.com/high_node_cpu_fixing_instructions"
    description: "{{ $labels.instance }}: CPU usage is above 70% (current value is: {{ $value }})"
  labels:
    severity: "warning"
The nodes in your cluster have resources. The most important resources your nodes provide in a Kubernetes cluster are CPU, Memory, Network and Disk. Let’s apply the USE method to all of these.
To calculate the amount of CPU utilization by host in your Kubernetes cluster we want to sum all the modes except for idle, iowait, guest, and guest_nice. The PromQL looks like this:
sum(rate(
  node_cpu{mode!="idle",
           mode!="iowait",
           mode!~"^(?:guest.*)$"
  }[5m])) BY (instance)
A metric that the node-exporter gives us for saturation is the Unix load average. Loosely, the load average is the number of processes running plus those waiting to run.
node_load1, node_load5 and node_load15 represent the 1, 5 and 15 minute load averages. This metric is a gauge and is already averaged for you. As a standalone metric it is somewhat useless without knowing how many CPUs your node has. Is a load average of 10 good or bad? It depends. If you divide the load average by the number of CPUs on the node, then you get an approximation of the CPU saturation of your system.
node_exporter does not expose a count of node CPUs directly, but if you count just one of the above CPU modes, say “system”, you can get a CPU count by node:
count(node_cpu{mode="system"}) by (node)
Now you can normalize the node_load1 metric by the number of CPUs on the node expressed as a percentage:
sum(node_load1) by (node) / count(node_cpu{mode="system"}) by (node) * 100
The node_exporter does not reveal anything about CPU errors.
It would seem that memory utilization and saturation are somewhat easier to reason about, as the amount of physical memory on a node is known. Again, nothing is easy! node_exporter gives us 43 node_memory_* metrics to work with!
The amount of available memory on a Linux system is not just the reported “free” memory metric. Unix systems rely heavily on memory that is not in use by applications to share code (buffers) and to cache disk pages (cached). So one measure of available memory is:
sum(node_memory_MemFree + node_memory_Cached + node_memory_Buffers)
Newer Linux kernels (after 3.14) expose a better free memory metric, node_memory_MemAvailable.
Divide that by the total memory available on the node and you get a percentage of available memory, then subtract from 1 to get a measure of node memory utilization:
1 - sum(node_memory_MemAvailable) by (node)
/ sum(node_memory_MemTotal) by (node)
We’re running a Java application against AWS RDS MySQL version 5.7 with a connection string that looks like this:
jdbc:mysql://${db.endpoint}/${db.schema}?useConfigs=maxPerformance&characterEncoding=utf8&useSSL=false
As you can see all database queries will be sent to the RDS instance in plaintext.
Goals
We want:
- all traffic between the application and the RDS instance to be encrypted in transit
- the application to validate the RDS server identity against the RDS CA chain
We do not require clients to authenticate themselves via client certificates.
MySQL uses yaSSL for secure connections in the following versions:
MySQL uses OpenSSL for secure connections in the following versions:
Amazon RDS for MySQL supports Transport Layer Security (TLS) versions 1.0, 1.1, and 1.2. The following table shows the TLS support for MySQL versions.
MySQL version | TLS 1.0 | TLS 1.1 | TLS 1.2 |
---|---|---|---|
MySQL 8.0 | Supported | Supported | Supported |
MySQL 5.7 | Supported | Supported | Supported for MySQL 5.7.21 and later |
MySQL 5.6 | Supported | Supported for MySQL 5.6.46 and later | Supported for MySQL 5.6.46 and later |
To validate the RDS server identity we need the RDS CA certificate chain, which can be downloaded from truststore.pki.rds.amazonaws.com (see the wget command below).
You can then test the CA chain with the mysql client:
$ wget https://truststore.pki.rds.amazonaws.com/eu-central-1/eu-central-1-bundle.pem
$ /usr/bin/mysql -P ${db_port} -h ${db_endpoint} \
-u ${db_username} -p${db_password} \
--ssl-ca=eu-central-1-bundle.pem --ssl-verify-server-cert contosodb
# newer mysql clients use another parameter instead:
$ /usr/bin/mysql -P ${db_port} -h ${db_endpoint} \
-u ${db_username} -p${db_password} \
--ssl-ca=eu-central-1-bundle.pem --ssl-mode=VERIFY_IDENTITY contosodb
You should always check certificate expiration dates:
$ openssl crl2pkcs7 -nocrl -certfile eu-central-1-bundle.pem \
| openssl pkcs7 -print_certs -noout -text | rg -B3 'Subject:'
Validity
Not Before: Aug 22 17:08:50 2019 GMT
Not After : Aug 22 17:08:50 2024 GMT
Subject: C=US, L=Seattle, ST=Washington, O=Amazon Web Services, Inc., OU=Amazon RDS, CN=Amazon RDS Root 2019 CA
--
Validity
Not Before: Sep 11 19:36:20 2019 GMT
Not After : Aug 22 17:08:50 2024 GMT
Subject: C=US, ST=Washington, L=Seattle, O=Amazon Web Services, Inc., OU=Amazon RDS, CN=Amazon RDS eu-central-1 2019 CA
--
Validity
Not Before: May 21 22:33:24 2021 GMT
Not After : May 21 23:33:24 2121 GMT
Subject: C=US, O=Amazon Web Services, Inc., OU=Amazon RDS, ST=WA, CN=Amazon RDS eu-central-1 Root CA ECC384 G1, L=Seattle
--
Validity
Not Before: May 21 22:23:47 2021 GMT
Not After : May 21 23:23:47 2061 GMT
Subject: C=US, O=Amazon Web Services, Inc., OU=Amazon RDS, ST=WA, CN=Amazon RDS eu-central-1 Root CA RSA2048 G1, L=Seattle
--
Validity
Not Before: May 21 22:28:26 2021 GMT
Not After : May 21 23:28:26 2121 GMT
Subject: C=US, O=Amazon Web Services, Inc., OU=Amazon RDS, ST=WA, CN=Amazon RDS eu-central-1 Root CA RSA4096 G1, L=Seattle
I’ve split up the CA bundle into separate files with their names based on a slugified version of their common name (CN=...):
Amazon-RDS-Root-2019-CA.pem
Amazon-RDS-eu-central-1-2019-CA.pem
Amazon-RDS-eu-central-1-Root-CA-ECC384-G1.pem
Amazon-RDS-eu-central-1-Root-CA-RSA2048-G1.pem
Amazon-RDS-eu-central-1-Root-CA-RSA4096-G1.pem
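The splitting does not have to be done by hand; a small shell loop along these lines can do it (this is my own sketch - the awk-based split and the tr-based slugify are assumptions, not part of the original post):
# split the regional bundle into one file per certificate
awk '/-----BEGIN CERTIFICATE-----/ {n++} {print > ("rds-ca-" n ".pem")}' eu-central-1-bundle.pem
# rename each part after a slugified version of its commonName
for f in rds-ca-*.pem; do
  cn=$(openssl x509 -in "$f" -noout -subject -nameopt multiline | awk -F'= ' '/commonName/ {print $2}')
  mv "$f" "$(echo "$cn" | tr ' ' '-').pem"
done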
Next we need to import them into a Java truststore that can be used by our JDBC connection:
Note: You should be using the newer pkcs12 (aka pfx) format instead of the older jks format for truststores.
#!/bin/bash
TRUSTSTORE_NAME=rds-truststore.pkcs12
TRUSTSTORE_PASSWORD=supersecretpassword
# Add certs to truststore
for CA in *.pem; do
echo "Store: [${TRUSTSTORE_NAME}] - Importing [${CA}]"
keytool -import -noprompt -trustcacerts -file "${CA}" -alias "${CA%%.*}" -storetype PKCS12 -keystore "${TRUSTSTORE_NAME}" -storepass "${TRUSTSTORE_PASSWORD}"
done
This will import all available *.pem files into a newly created truststore called rds-truststore.pkcs12.
Place the truststore in your application’s environment - I placed it in /srv/aws-rds/truststore.pkcs12. Update the connection string and restart/redeploy the application:
# https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-connp-props-security.html
# newer v8 format
#db.url=jdbc:mysql://${db.endpoint}/${db.schema}?useConfigs=maxPerformance&characterEncoding=utf8&sslMode=VERIFY_CA&trustCertificateKeyStoreType=PKCS12&trustCertificateKeyStoreUrl=file:///srv/aws-rds/truststore.pkcs12&trustCertificateKeyStorePassword=supersecretpassword
# older v5.1 format
db.url=jdbc:mysql://${db.endpoint}/${db.schema}?useConfigs=maxPerformance&characterEncoding=utf8&useSSL=true&verifyServerCertificate=true&trustCertificateKeyStoreType=PKCS12&trustCertificateKeyStoreUrl=file:///srv/aws-rds/truststore.pkcs12&trustCertificateKeyStorePassword=supersecretpassword
Everything should be starting up smoothly. You should see established DB connections in your RDS metrics. To verify that TLS is being used run the following query on the database server:
SELECT sbt.variable_value AS tls_version, t2.variable_value AS cipher,
processlist_user AS user, processlist_host AS host
FROM performance_schema.status_by_thread AS sbt
JOIN performance_schema.threads AS t ON t.thread_id = sbt.thread_id
JOIN performance_schema.status_by_thread AS t2 ON t2.thread_id = t.thread_id
WHERE sbt.variable_name = 'Ssl_version' AND t2.variable_name = 'Ssl_cipher'
ORDER BY tls_version;
Empty tls_version and cipher columns mean that no transport encryption is used:
+-------------+-----------------------------+--------------------------------+---------------+
| tls_version | cipher | user | host |
+-------------+-----------------------------+--------------------------------+---------------+
| | | ... | ... |
| | | foobar-dev-01-user-backup | 10.200.96.121 |
| | | foobar-dev-03-user | 10.200.96.34 |
| | | foobar-int-01-user | 10.200.98.126 |
| TLSv1.2 | ECDHE-RSA-AES256-GCM-SHA384 | myapp-laboratory-user | 10.200.96.116 |
| TLSv1.2 | ECDHE-RSA-AES256-GCM-SHA384 | myapp-laboratory-user | 10.200.96.116 |
| TLSv1.2 | ECDHE-RSA-AES256-GCM-SHA384 | myapp-laboratory-user | 10.200.96.116 |
+-------------+-----------------------------+--------------------------------+---------------+
If you are not familiar with these terms, I would strongly recommend reading the article from Google’s SRE book on Service Level Objectives first.
In summary:
From the above, it’s clear that we must have service metrics to tell us when the service is considered (un)available. There are several approaches for this:
Methodology | RED | USE |
---|---|---|
Meaning | Rate, Errors, Duration | Utilization, Saturation, Errors |
Used for | request driven things like endpoints | resources like queues, caches, CPUs, disks |
Services tend to fall into a few broad categories in terms of the SLIs they find relevant:
User-facing serving systems, such as frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?
Storage systems often emphasize latency, availability, and durability. In other words: How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it? See Data Integrity: What You Read Is What You Wrote for an extended discussion of these issues.
Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency. In other words: How much data is being processed? How long does it take the data to progress from ingestion to completion? (Some pipelines may also have targets for latency on individual processing stages.)
All systems should care about correctness: was the right answer returned, the right data retrieved, the right analysis done? Correctness is important to track as an indicator of system health, even though it’s often a property of the data in the system rather than the infrastructure per se, and so usually not an SRE responsibility to meet.
For most services, the most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime. Unplanned downtime is captured by the desired level of service availability, usually expressed in terms of the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability. For serving systems, this metric is traditionally calculated based on the proportion of system uptime:
\[availability = {uptime \over (uptime + downtime)}\]

Using this formula over the period of a year, we can calculate the acceptable number of minutes of downtime to reach a given number of nines of availability.

For distributed systems a time-based metric for availability is usually not meaningful because we are looking across globally distributed services. Sufficient fault isolation makes it very likely that we are serving at least a subset of traffic for a given service somewhere in the world at any given time (i.e., we are at least partially “up” at all times). Therefore, instead of using metrics around uptime, we define availability in terms of the request success rate.
Aggregate availability shows how this yield-based metric is calculated over a rolling window (i.e., proportion of successful requests over a one-day window).
\[availability = {\text{successful requests} \over \text{total requests}}\]
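To make both definitions concrete (my own back-of-the-envelope arithmetic, not from the source): a 99.9% time-based target allows \((1 - 0.999) \times 365 \times 24 \times 60 \approx 526\) minutes, roughly 8.8 hours, of downtime per year, while a 99.9% aggregate target over 1,000,000 requests allows roughly 1,000 failed requests.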
Comparison | Time-based Availability | Aggregate Availability |
---|---|---|
System Type | Consumer | Infrastructure |
Used for | Monoliths, user facing systems | distributed services, batch, pipeline, storage, caching layers and transactional systems |
Measured as | proportion of system uptime | request success rate |
Question to ask yourself | How much time was the system unavailable? | What's the amount of errors across all service instances? Helps with analyzing system performance by revealing bottlenecks or errors |
Methodology to use | RED | USE |
Most often, we set quarterly availability targets for a service and track our performance against those targets on a weekly, or even daily, basis. This strategy lets us manage the service to a high-level availability objective by looking for, tracking down, and fixing meaningful deviations as they inevitably arise. See Service Level Objectives for more details.
Now that we have an idea what SLIs/SLOs/SLAs are, it’s time to put them into a simple written form:
SLI expression: [Metric Identifier][Operator][Metric Value]
SLO expression: [Success Objective][SLI][Period]
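A concrete pair following those templates (the values are made up for illustration):
SLI expression: p99_request_latency_seconds <= 0.3
SLO expression: 99.9% of requests meet the latency SLI over a rolling 30-day window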
Data Tier (a message bus system)
Tier | Capability | SLI | SLO |
---|---|---|---|
Ingest/Routing | Ingest | Percent of well-formed payloads accepted | 99.9% |
Ingest/Routing | Routing | Time to deliver message to correct destination | 99.5% of messages in under 5 seconds |
Horizontally Scaled Data Tier | Query Data | Latency | |
Horizontally Scaled Data Tier | Query Data | Correctness/Error rate | |
Note: SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial—a rebate or a penalty—but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask “what happens if the SLOs aren’t met?”: if there is no explicit consequence, then you are almost certainly looking at an SLO.
I recently ran into a problem with the Bitwarden CLI. We use it in our company but another company that I am doing contract work for also stores secrets in their own Bitwarden instance.
Now to my dismay I learned that the bw CLI can only be connected to a single server, and to switch servers you have to actively run bw logout.
Constantly running…
bw logout
bw config server https://vault.somedomain.tld
bw login
…is way too cumbersome, so I wrote a small helper script to simplify the process.
Note: this requires fzf and jq.
In your Bitwarden CLI data dir you will find a data.json. That file represents the vault that is encrypted with your master password.
Steps:
- log in to each server once (bw config server …, then bw login)
- move the resulting data.json into a per-account subdirectory of the Bitwarden CLI data dir
- create a config.json that lists the accounts (see below)
- put the bwctx script somewhere in your PATH
Now every time you run bwctx you can pick which company’s vault you want to work with. Run bwctx list to see which vault is currently active.
Here’s an example
$ bw config server https://vault.example.org/
$ bw login
# now connected to https://vault.example.org/ with
# an encrypted data.json in the BW data dir
$ cd "$HOME/Library/Application Support/Bitwarden CLI"
$ mkdir eo
$ mv data.json ./eo/
$ bw config server https://vault.contoso.com/
$ bw login --sso # this server uses SSO so use OAUTH2 for the login
$ mkdir cc
$ mv data.json ./cc/
# create config.json and add bwctx to your PATH
$ bwctx
switching to: eo
$ bwctx list
current config: ./eo/data.json
Store it somewhere in your PATH, typically as /usr/local/bin/bwctx:
#!/bin/bash
# Linux: "$HOME/.config/Bitwarden CLI"
VAULT_PATH="$HOME/Library/Application Support/Bitwarden CLI"
CONF_PATH="${VAULT_PATH}/config.json"

cd "${VAULT_PATH}"

# any argument (e.g. "list") just prints the currently active vault
if [ "$#" -ge "1" ]; then
  echo "current config: $(readlink data.json)"
  exit 0
fi

# pick an account from config.json via fzf and point data.json at it
ACCOUNT=$(jq -r '.[].account' "${CONF_PATH}" | fzf)
if [[ ! -z "${ACCOUNT}" ]]; then
  echo "switching to: ${ACCOUNT}"
  ln -sf "./${ACCOUNT}/data.json" data.json
fi
[
{"server": "https://vault.contoso.com/", "account": "cc"},
{"server": "https://vault.example.org/", "account": "eo"}
]
Library/Application Support/Bitwarden CLI
▶ tree
.
├── config.json
├── data.json -> ./eo/data.json
├── eo
│ └── data.json
└── cc
└── data.json
# expand ${variable} before writing
tee /path/to/file <<EOF
${variable}
EOF

# quote the delimiter to write ${variable} literally (no expansion)
tee /path/to/file <<'EOF'
${variable}
EOF

# <<- strips leading tab characters from the here-doc body
tee /path/to/file <<-EOF
${variable}
EOF

# -a appends to the file instead of overwriting it
tee -a /path/to/file <<EOF
${variable}
EOF

# redirect stdout to keep tee from echoing the content
tee /path/to/file <<EOF >/dev/null
${variable}
EOF

# works with sudo as well
sudo -u USER tee /path/to/file <<EOF
${variable}
EOF
You can use kubectl wait to wait on an arbitrary JSON path.
Note: jsonpath expressions that lead to a nested object or list are not supported
# --for=jsonpath='{}'=value
# Wait for the pod "busybox1" to reach the status phase "Running".
kubectl wait --for=jsonpath='{.status.phase}'=Running pod/busybox1
kubectl wait --for=jsonpath='{.status.containerStatuses[0].ready}'=true pod/busybox1
kubectl wait --for=jsonpath='{.spec.containers[0].ports[0].containerPort}'=80 pod/busybox1
kubectl wait --for=jsonpath='{.spec.nodeName}'=knode0 pod/busybox1
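In scripts it can also help to bound the wait; --timeout is a standard kubectl wait flag, though it is not part of the original examples:
kubectl wait --for=jsonpath='{.status.phase}'=Running --timeout=120s pod/busybox1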
In the Firefox DevTools console you can take higher-resolution screenshots with :screenshot --dpr [multiplier], e.g. :screenshot --dpr 2
A trailing slash on the source avoids creating an additional directory level at the destination.
rsync -avz src/ dest # content of ./src/ transferred to ./dest/
rsync -avz src dest # content of ./src/ transferred to ./dest/src/
Some people have mentioned that the --partial flag works for resuming; it needs to be mentioned, however, that it only resumes when the --append or --append-verify flag is used on the subsequent run.
--partial creates a hidden file for any file that has not finished syncing, and that hidden file is kept when you interrupt the transfer. rsync completes the file when you rerun it with --append; without --append, the incomplete hidden file is kept but remains incomplete.
Conclusion: you can just interrupt rsync --partial with Ctrl+C if you use rsync --append when resuming.
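A quick sketch of that workflow (host and file names are made up):
# start the transfer, keeping partial data if it is interrupted
rsync -avz --partial big-backup.tar user@server:/backups/
# ... Ctrl+C somewhere mid-transfer ...
# resume by appending to the data that already arrived
rsync -avz --partial --append big-backup.tar user@server:/backups/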
- List files without copying them:
rsync --dry-run src dest
- Show progress per file:
rsync --progress src dest
- Show global progress:
rsync --info=progress2 src dest
Settings in $HOME/.ssh/config are also respected by rsync, making commonly accessed systems far easier to use.
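For example, with an entry like this (host alias, user and key are illustrative), you can drop most of the -e options shown below:
# $HOME/.ssh/config
Host web01
    HostName web01.example.org
    User deploy
    Port 2222
    IdentityFile ~/.ssh/id_rsa_for_rsync

# rsync picks up the user, port and key from the alias
rsync -avz ./site/ web01:/var/www/site/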
rsync -avz -e "ssh -l <user>" <src> <user>@<server>:<target>
Non standard SSH port
rsync -e "ssh -p 2222" ...
Use specific ssh key
rsync -e "ssh -i $HOME/.ssh/id_rsa_for_rsync" ...
Tunnel through a jump host with key agent forwarding:
rsync -e "ssh -A -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ProxyCommand=\"ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -W %h:%p ${SSH_USER}@${BASTION_HOST}\"" ./deployment/ ${SSH_USER}@${TARGET_HOST}:/var/www/${ENV}/
Run with elevated permissions:
rsync --rsync-path 'sudo rsync' ...
Copy remote->local, keep attributes, use compression, be verbose and show human readable units:
rsync -avzh -e "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" \
--progress 'preview-host:/var/www/html/mb/app/*' /tmp/
Copy local->remote, keep attributes, be verbose, use root permissions on the target, delete files that are not in the source path and use checksums to compare files:
rsync --rsync-path 'sudo rsync' -avP \
-og --chown=root:root \
--checksum --delete \
./deployment/ ubuntu@${TARGET_HOST}:/opt/app/${ENV}/
--exclude=important_file.txt - Can be used to omit files or directories from being synced.
--exclude=backups/ --include=backups/most_recent - Inside the exclusion we can explicitly include certain files, folders or patterns that fall inside the broader exclude.
In the following example, we are excluding the node_modules and tmp directories which are located inside the src_directory:
rsync -a --exclude=node_modules --exclude=.DS_Store --exclude=tmp /src_directory/ /dst_directory/
The second option is to use the --exclude-from
argument and specify the files and directories you want to exclude in a file.
rsync -a --exclude-from='/exclude-file.txt' /src_directory/ /dst_directory/
Contents of /exclude-file.txt:
file1.txt
.DS_Store
node_modules
tmp
Exclude filters can be written in a condensed form:
# Before:
rsync -a --exclude 'file1.txt' --exclude 'dir1/*' --exclude 'dir2' src_directory/ dst_directory/
# After:
rsync -a --exclude={'file1.txt','dir1/*','dir2'} src_directory/ dst_directory/
It is a little trickier to exclude all other files and directories except those that match a certain pattern. Let’s say you want to exclude everything except files ending with .jpg.
One option is to use the following command:
rsync -a -m \
--include='*.jpg' \
--include='*/' \
--exclude='*' \
src_directory/ dst_directory/
When using multiple include/exclude options, the first matching rule applies.
--include='*.jpg' - First we are including all .jpg files.
--include='*/' - Then we are including all directories inside the src_directory directory. Without this, rsync will only copy *.jpg files in the top-level directory.
-m - Removes empty directories.
# Automatically delete source files after successful transfer
rsync --remove-source-files -zvh backup.tar /tmp/backups/
Since --remove-source-files does not remove directories, issue the following commands to move files over ssh:
rsync -avh --progress --remove-source-files /source/* user@server:/target \
&& find /source -type d -empty -delete
# Redirect output and errors to separate log files
rsync ... > /tmp/rsyncbackup.log 2> /tmp/rsyncbackup.errors.log
-o, --owner preserve owner (super-user only)
-g, --group preserve group
--devices preserve device files (super-user only)
--specials preserve special files
-D same as --devices --specials
-t, --times preserve modification times
-p, --perms preserve file/directory permissions
-l, --links copy symlinks as symlinks
-u, --update skip files that are newer on the receiver
-C, --cvs-exclude auto-ignore files in the same way CVS does
--progress show progress during transfer
--stats give some file-transfer stats
--list-only list the files instead of copying them
--bwlimit=KBPS limit I/O bandwidth; KBytes per second
-a, --archive - Archive mode, equivalent to -rlptgoD. This option tells rsync to sync directories recursively, transfer special and block devices, and preserve symbolic links, modification times, group, ownership, and permissions. Sometimes you will have to supplement the -a parameter with:
-X - Preserve extended attributes, e.g. SELinux contexts may be stored as such attributes on distributions like CentOS/RedHat where these are used by default.
-A - Preserve ACLs (Access Control Lists).
-z, --compress - This option forces rsync to compress the data as it is sent to the destination machine. Use this option only if the connection to the remote machine is slow.
-P - Equivalent to --partial --progress. When this option is used, rsync will show a progress bar during transfer and keep the partially transferred files. It is useful when transferring large files over slow or unstable network connections. Without -P or --partial, if the connection drops during a transfer, the partial file is deleted and you will have to restart from scratch.
--delete - Delete files in the destination that don’t exist anymore in the source location. Used when you want to keep an exact replica of the source files/directories. Without this option, files that have been deleted in the source won’t be deleted on the destination, which is preferable for most backup schemes. Keep in mind that the --delete parameter exposes you to the risk of losing the entire backup if used inappropriately (e.g., if you use the wrong source directory or an empty one). An option like --max-delete=3, so that rsync never deletes more than 3 files, can reduce the amount of data you might lose. The number can be adjusted according to your use case.
-q, --quiet - Use this option if you want to suppress non-error messages.
-e - This option allows you to choose a different remote shell. By default, rsync is configured to use ssh.
-v - Verbose mode prints more statistics: which files are currently copied/transferred and a summary of bytes transferred and speedup ratio.
-r - Copy every object contained in directories and subdirectories. Without this option, directories are skipped and only files are copied. E.g., rsync -v root@example.com:/etc/* /tmp would only copy files from /etc/. When you are copying/transferring a single directory, you have to use this option or the -a parameter, otherwise nothing happens; the directory is simply skipped.
-h - Show “human readable” numbers: instead of statistics being shown in bytes, they will be displayed in megabytes, kilobytes, etc., because 9.82M is easier to read than 9,821,016.

When files are getting significantly bigger on the other side, Thin Provisioning (TP) is probably enabled on the source system - a method of optimizing the efficiency of available space in Storage Area Networks (SAN) or Network Attached Storage (NAS).
E.g.: The source file was only 10GB because of TP being enabled, and when transferred over using rsync without any additional configuration, the target destination was receiving the full 100GB of size. rsync could not do the magic automatically, it had to be configured.
The flag that does this work is -S or --sparse, and it tells rsync to handle sparse files efficiently. And it will do what it says! It will only send the sparse data, so source and destination will have a 10GB file.
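A sketch of the corresponding command (the image path is made up):
rsync -avz --sparse /var/lib/libvirt/images/vm-disk.img user@backup-host:/backups/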
There appear to be font rendering issues in electron apps on some machines. This only affects electron apps that were installed as snap packages and will mess up the user interface partially or completely. Often people will recommend deleting that app’s font cache but that does not really solve the problem.
How much you’re affected really depends on the system’s font library, so install the following fonts to resolve the problem:
noto-fonts-emoji
or bdf-unifont
ttf-dejavu
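On Arch Linux, for example, those packages can be installed like this (package names differ on other distributions):
sudo pacman -S noto-fonts-emoji ttf-dejavu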