Cheat Sheet - PowerShell Regex

Intro

The following characters are reserved: []().\^$|?*+{}. You’ll need to escape these characters in your patterns to match them in your input strings.

There’s a static method of the regex class that can escape text for you.

PS> [regex]::escape('3.\d{2,}')
3\.\\d\{2,}
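
A minimal sketch of where this helps: building a pattern from text that should be matched literally. The file name below is a hypothetical example, not something from a real system.

```powershell
# Hypothetical input containing several reserved characters: ( ) and .
$fileName = 'report (final).v2.txt'

# Escape the literal text, then anchor it so only the exact name matches.
$pattern = '^' + [regex]::Escape($fileName) + '$'

$exact = 'report (final).v2.txt' -match $pattern   # same text
$other = 'report final-v2-txt' -match $pattern     # different text
```

Without the escape, the parentheses would be read as a capture group and the dots as wildcards, so the unescaped pattern would not match the literal name.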

Named Capture Groups

PS> $string = 'The last logged on user was CONTOSO\jsmith'
PS> $string -match 'was (?<domain>.+)\\(?<user>.+)'
True

PS> $Matches

Name                           Value
----                           -----
domain                         CONTOSO
user                           jsmith
0                              was CONTOSO\jsmith

PS> $Matches.domain
CONTOSO

PS> $Matches.user
jsmith

Because $Matches is a [Hashtable], it can be cast directly to a [PSCustomObject]:

$Keys = Get-ChildItem -Path $HOME/.ssh/ -Filter *.pub
$Regex = '^(?<BitLength>\d+) (?<HashType>[^:]+):(?<Fingerprint>[^\s]+) (?<Comment>.+) \((?<KeyType>\w+)\)$'

foreach ($KeyFile in $Keys) {
    $KeyMetadata = ssh-keygen -lf $KeyFile.FullName
    if ($KeyMetadata -match $Regex) {
        $Matches.Remove(0)
        [PSCustomObject]$Matches
    }
}
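
To try the conversion without ssh-keygen on hand, here is a minimal sketch that runs the same pattern against a single sample line. The sample line is an assumption, reconstructed from the output shown later in this post.

```powershell
# Assumed sample of what `ssh-keygen -lf` prints for one key.
$Line = '4096 SHA256:2JGnPl42MSbvEwomltiTqyIrWV8VeNVY2guShUbmv4E SSH Key for corporate git access (RSA)'
$Regex = '^(?<BitLength>\d+) (?<HashType>[^:]+):(?<Fingerprint>[^\s]+) (?<Comment>.+) \((?<KeyType>\w+)\)$'

if ($Line -match $Regex) {
    # Key 0 holds the whole match; drop it so only the named groups remain.
    $Matches.Remove(0)
    $Key = [PSCustomObject]$Matches
}

$Key.KeyType   # RSA
$Key.Comment   # SSH Key for corporate git access
```

All captured values are strings at this point, so BitLength is '4096' as text until something casts it.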

If you need the properties in a specific order, this won’t work, because a [Hashtable] does not preserve key order. You can use a class instead:

class KeyItem {
    [string] $Fingerprint
    [string] $KeyType
    [int] $BitLength
    [string] $HashType
    [bool] $IsLoaded
    [string] $Comment
    [System.IO.FileInfo] $File

    [string] GetID() {
        return "{0}:{1}" -f $this.HashType, $this.Fingerprint
    }
}

$LoadedKeyIDs = ssh-add -l | awk '{print $2}'
$Keys = Get-ChildItem -Path $HOME/.ssh/ -Filter *.pub
$Regex = '^(?<BitLength>\d+) (?<HashType>[^:]+):(?<Fingerprint>[^\s]+) (?<Comment>.+) \((?<KeyType>\w+)\)$'

foreach ($KeyFile in $Keys) {
    $KeyMetadata = ssh-keygen -lf $KeyFile.FullName
    if ($KeyMetadata -match $Regex) {
        $Matches.Remove(0)
        $Result = [KeyItem]$Matches
        $Result.File = $KeyFile
        $Result.IsLoaded = $LoadedKeyIDs -contains $Result.GetID()
        $Result
    }
}

<#
Output:

[...]

Fingerprint : 2JGnPl42MSbvEwomltiTqyIrWV8VeNVY2guShUbmv4E
KeyType     : RSA
BitLength   : 4096
HashType    : SHA256
IsLoaded    : False
Comment     : SSH Key for corporate git access
File        : /Users/megamorf/.ssh/id_rsa_megamorf_corp_git_2020-07-17.pub

#>

Substitutions

The substitution is done by using the $ character before the group identifier.

Two ways to reference capturing groups are by Number and by Name.

  • By Number - Capturing Groups are numbered from left to right.

    PS> 'John D. Smith' -replace '(\w+) (\w+)\. (\w+)', '$1.$2.$3@contoso.com'
    John.D.Smith@contoso.com
    
  • By Name - Capturing Groups can also be referenced by name.

    PS> 'CONTOSO\Administrator' -replace '\w+\\(?<user>\w+)', 'FABRIKAM\${user}'
    FABRIKAM\Administrator
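
One case where the braces matter for numbered groups too: when the group reference is followed directly by a digit. A minimal sketch, using a made-up input string:

```powershell
# '$11' is read as a reference to group 11. No such group exists here,
# so the replacement does not expand to group 1 followed by a 1.
$ambiguous = 'abc' -replace '(b)', '$11'

# '${1}1' is unambiguous: group 1, then a literal 1.
$explicit = 'abc' -replace '(b)', '${1}1'   # ab1c
```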
    

The $& expression represents all the text matched.

PS> 'Gobble' -replace 'Gobble', '$& $&'
Gobble Gobble

WARNING
Since the $ character is also used for string expansion, use single-quoted (literal) strings for the substitution, or escape the $ character with a backtick when using double quotes.

PS> 'Hello World' -replace '(\w+) \w+', '$1 Universe'
Hello Universe

PS> "Hello World" -replace "(\w+) \w+", "`$1 Universe"
Hello Universe

Additionally, to get a literal $ in the replacement text, write $$ instead of using the normal escape characters. When using double quotes, still escape every $ with a backtick to avoid incorrect substitution.

PS> '5.72' -replace '(.+)', '$$$1'
$5.72

PS> "5.72" -replace "(.+)", "`$`$`$1"
$5.72

Unicode Code Point ranges

PS> $s = '肖申克的救赎The '
PS> $regex = '[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]'
PS> $s -match $regex
True

PS> Write-Host ("Updated string: [{0}]" -f ($s -replace $regex))
Updated string: [The ]

Explanation:

The ranges of Unicode characters which are routinely used for Chinese and Japanese text are:

  • U+3040 - U+30FF: hiragana and katakana (Japanese only)
  • U+3400 - U+4DBF: CJK unified ideographs extension A (Chinese, Japanese, and Korean)
  • U+4E00 - U+9FFF: CJK unified ideographs (Chinese, Japanese, and Korean)
  • U+F900 - U+FAFF: CJK compatibility ideographs (Chinese, Japanese, and Korean)
  • U+FF66 - U+FF9F: half-width katakana (Japanese only)

As a regular expression character class, this is expressed as:

[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]

This does not include every character which will appear in Chinese and Japanese text, but any significant piece of typical Chinese or Japanese text will be mostly made up of characters from these ranges.

Note that this regular expression will also match on Korean text that contains hanja. This is an unavoidable result of Han unification.

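Unicode-aware regex engines let you match by code-point range or by (1) script, (2) block, or (3) category. The .NET engine that PowerShell uses supports blocks and categories, but not scripts.

Blocks are sequential ranges of code points. Many engines write them as \p{In...}; .NET writes them as \p{Is...} with no underscores:

  • U+3400 - U+4DBF is \p{InCJK_Unified_Ideographs_Extension_A} (.NET: \p{IsCJKUnifiedIdeographsExtensionA})
  • U+4E00 - U+9FFF is \p{InCJK_Unified_Ideographs} (.NET: \p{IsCJKUnifiedIdeographs})

Quote: "Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han, and Latin scripts that Japanese documents are usually composed of."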
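
A minimal sketch of matching by block and by category in PowerShell, reusing the sample string from the code-point example. It assumes the .NET spelling of the block name, \p{IsCJKUnifiedIdeographs}; the second pattern uses the general category Lo ("Letter, other"), which is broader than the block and would also remove letters from other caseless scripts.

```powershell
$s = '肖申克的救赎The '

# By block: every character in U+4E00 - U+9FFF.
$byBlock = $s -replace '\p{IsCJKUnifiedIdeographs}'

# By category: Lo covers ideographs but not cased Latin letters such as T, h, e.
$byCategory = $s -replace '\p{Lo}'

"[{0}] [{1}]" -f $byBlock, $byCategory   # [The ] [The ]
```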

Regex Options

There are overloads of the static [Regex]::Match() method that let you supply the desired [RegexOptions] programmatically.

The available options (list them with [System.Text.RegularExpressions.RegexOptions] | Get-Member -Static -MemberType Property) are:

  • Compiled
  • CultureInvariant
  • ECMAScript
  • ExplicitCapture
  • IgnoreCase
  • IgnorePatternWhitespace
  • Multiline
  • None
  • RightToLeft
  • Singleline

# You can combine several options with a bitwise OR:
$options = [Text.RegularExpressions.RegexOptions]::IgnoreCase -bor [Text.RegularExpressions.RegexOptions]::CultureInvariant
# or by letting the cast parse a comma-separated string:
$options = [Text.RegularExpressions.RegexOptions]'IgnoreCase, CultureInvariant'

# $text holds the string to search. Avoid calling it $input, which is an automatic variable in PowerShell.
$match = [regex]::Match($text, $regex, $options)
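
A minimal, self-contained sketch of the options in action. The sample text and pattern are made up for illustration. It also shows the inline form (?im), which sets the same two options from inside the pattern and therefore works with the -match operator as well.

```powershell
# Two-line sample text; `n is a newline inside a double-quoted string.
$text = "first line`nSecond Line"

# Multiline lets ^ match at the start of every line, not only at the start of the string.
$options = [Text.RegularExpressions.RegexOptions]'IgnoreCase, Multiline'
$match = [regex]::Match($text, '^second', $options)

$match.Success   # True
$match.Value     # Second

# The same options written inline, usable with -match:
$inline = $text -match '(?im)^second'
```

Note that -match is already case-insensitive by default, so the i in (?im) is redundant there; [regex]::Match() is case-sensitive unless IgnoreCase is set.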
