Intro
The following characters are reserved: []().\^$|?*+{}
You’ll need to escape these characters in your patterns to match them in your input strings.
There’s a static method of the regex class that can escape text for you.
PS> [regex]::escape('3.\d{2,}')
3\.\\d\{2,}
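A minimal sketch of where escaping matters in practice: matching a piece of text literally when it happens to contain reserved characters. The variable names and sample strings are made up for illustration.

```powershell
# Hypothetical input: a literal string that contains regex metacharacters.
$literal = 'file(1).txt'
$line    = 'Copied file(1).txt to backup'

# Unescaped, '(1)' is a capture group and '.' matches any character,
# so the pattern looks for 'file1', any character, then 'txt' and fails here.
$unescaped = $line -match $literal

# Escaped, every metacharacter is matched literally.
$escaped = $line -match ([regex]::Escape($literal))

$unescaped   # False
$escaped     # True
```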
Ref:
- https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
- https://docs.microsoft.com/en-us/dotnet/standard/base-types/grouping-constructs-in-regular-expressions
Named Capture Groups
PS> $string = 'The last logged on user was CONTOSO\jsmith'
PS> $string -match 'was (?<domain>.+)\\(?<user>.+)'
True
PS> $Matches
Name Value
---- -----
domain CONTOSO
user jsmith
0 was CONTOSO\jsmith
PS> $Matches.domain
CONTOSO
PS> $Matches.user
jsmith
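The -match operator stops at the first match. When a string may contain several matches, the static [regex]::Matches() method returns all of them, and each match exposes the named groups through its Groups property. A small sketch with a made-up sample string:

```powershell
# Hypothetical sample text with more than one DOMAIN\user pair.
$text    = 'Sessions: CONTOSO\jsmith, FABRIKAM\adoe'
$pattern = '(?<domain>\w+)\\(?<user>\w+)'

# Build one object per match from the named groups.
$accounts = foreach ($m in [regex]::Matches($text, $pattern)) {
    [PSCustomObject]@{
        Domain = $m.Groups['domain'].Value
        User   = $m.Groups['user'].Value
    }
}

$accounts.Count     # 2
$accounts[1].User   # adoe
```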
Because $Matches is of type [Hashtable], we can convert it directly to a [PSCustomObject]:
$Keys = Get-ChildItem -Path $HOME/.ssh/ -Filter *.pub
$Regex = '^(?<BitLength>\d+) (?<HashType>[^:]+):(?<Fingerprint>[^\s]+) (?<Comment>.+) \((?<KeyType>\w+)\)$'
foreach ($KeyFile in $Keys) {
    $KeyMetadata = ssh-keygen -lf $KeyFile.FullName
    if ($KeyMetadata -match $Regex) {
        $Matches.Remove(0)
        [PSCustomObject]$Matches
    }
}
A [Hashtable] does not preserve key order, so this won’t work if you need the properties in a specific order. You can use a class for that instead:
class KeyItem {
    [string] $Fingerprint
    [string] $KeyType
    [int] $BitLength
    [string] $HashType
    [bool] $IsLoaded
    [string] $Comment
    [System.IO.FileInfo] $File
    [string] GetID() {
        return "{0}:{1}" -f $this.HashType, $this.Fingerprint
    }
}
$LoadedKeyIDs = ssh-add -l | awk '{print $2}'
$Keys = Get-ChildItem -Path $HOME/.ssh/ -Filter *.pub
$Regex = '^(?<BitLength>\d+) (?<HashType>[^:]+):(?<Fingerprint>[^\s]+) (?<Comment>.+) \((?<KeyType>\w+)\)$'
foreach ($KeyFile in $Keys) {
    $KeyMetadata = ssh-keygen -lf $KeyFile.FullName
    if ($KeyMetadata -match $Regex) {
        $Matches.Remove(0)
        $Result = [KeyItem]$Matches
        $Result.File = $KeyFile
        $Result.IsLoaded = $LoadedKeyIDs -contains $Result.GetID()
        $Result
    }
}
<#
Output:
[...]
Fingerprint : 2JGnPl42MSbvEwomltiTqyIrWV8VeNVY2guShUbmv4E
KeyType : RSA
BitLength : 4096
HashType : SHA256
IsLoaded : False
Comment : SSH Key for corporate git access
File : /Users/megamorf/.ssh/id_rsa_megamorf_corp_git_2020-07-17.pub
#>
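If a class feels like too much, a lighter way to get a fixed property order is to pipe the converted object through Select-Object. This is only a sketch: the sample line is reconstructed from the output shown above rather than taken from a real ssh-keygen run, and $line and $key are illustrative names.

```powershell
# Sample line in the shape ssh-keygen -lf prints (reconstructed, not real output).
$line  = '4096 SHA256:2JGnPl42MSbvEwomltiTqyIrWV8VeNVY2guShUbmv4E SSH Key for corporate git access (RSA)'
$regex = '^(?<BitLength>\d+) (?<HashType>[^:]+):(?<Fingerprint>[^\s]+) (?<Comment>.+) \((?<KeyType>\w+)\)$'

if ($line -match $regex) {
    $Matches.Remove(0)
    # Select-Object emits the properties in the order they are listed.
    $key = [PSCustomObject]$Matches |
        Select-Object -Property Fingerprint, KeyType, BitLength, HashType, Comment
}

$key.PSObject.Properties.Name -join ','
# Fingerprint,KeyType,BitLength,HashType,Comment
```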
Substitutions
The substitution is done by using the $ character before the group identifier.
Two ways to reference capturing groups are by Number and by Name.
- By Number - Capturing Groups are numbered from left to right.
PS> 'John D. Smith' -replace '(\w+) (\w+)\. (\w+)', '$1.$2.$3@contoso.com'
John.D.Smith@contoso.com
- By Name - Capturing Groups can also be referenced by name.
PS> 'CONTOSO\Administrator' -replace '\w+\\(?<user>\w+)', 'FABRIKAM\${user}'
FABRIKAM\Administrator
The $& expression represents all the text matched.
PS> 'Gobble' -replace 'Gobble', '$& $&'
Gobble Gobble
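Group references can appear in any order in the replacement, which makes reordering text straightforward. A short sketch with a made-up name, once by number and once by name:

```powershell
# Swap 'Last, First' into 'First Last' using numbered groups...
$byNumber = 'Smith, John' -replace '(\w+), (\w+)', '$2 $1'

# ...and the same swap using named groups.
$byName = 'Smith, John' -replace '(?<last>\w+), (?<first>\w+)', '${first} ${last}'

$byNumber   # John Smith
$byName     # John Smith
```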
WARNING
Since the $ character is used in string expansion, you’ll need to use literal strings with substitution, or escape the $ character when using double quotes.
PS> 'Hello World' -replace '(\w+) \w+', '$1 Universe'
Hello Universe
PS> "Hello World" -replace "(\w+) \w+", "`$1 Universe"
Hello Universe
Additionally, if you want to have the $ as a literal character, use $$ instead of the normal escape characters. When using double quotes, still escape all instances of $ to avoid incorrect substitution.
PS> '5.72' -replace '(.+)', '$$$1'
$5.72
PS> "5.72" -replace "(.+)", "`$`$`$1"
$5.72
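Double quotes become useful when the replacement has to mix a PowerShell variable with a group reference. In this sketch, $newDomain is a hypothetical variable; it is expanded by PowerShell, while the backtick keeps ${user} intact for the regex engine.

```powershell
$newDomain = 'FABRIKAM'   # hypothetical value supplied at runtime

# After string expansion the replacement text is: FABRIKAM\${user}
$result = 'CONTOSO\Administrator' -replace '\w+\\(?<user>\w+)', "$newDomain\`${user}"

$result   # FABRIKAM\Administrator
```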
Unicode Code Point ranges
$s = '肖申克的救赎The '
$regex = '[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]'
PS> $s -match $regex
True
PS> Write-Host ("Updated string: [{0}]" -f ($s -replace $regex))
Updated string: [The ]
Explanation:
The ranges of Unicode characters which are routinely used for Chinese and Japanese text are:
- U+3040 - U+30FF: hiragana and katakana (Japanese only)
- U+3400 - U+4DBF: CJK unified ideographs extension A (Chinese, Japanese, and Korean)
- U+4E00 - U+9FFF: CJK unified ideographs (Chinese, Japanese, and Korean)
- U+F900 - U+FAFF: CJK compatibility ideographs (Chinese, Japanese, and Korean)
- U+FF66 - U+FF9F: half-width katakana (Japanese only)
As a regular expression, this would be expressed as:
/[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]/
This does not include every character which will appear in Chinese and Japanese text, but any significant piece of typical Chinese or Japanese text will be mostly made up of characters from these ranges.
Note that this regular expression will also match on Korean text that contains hanja. This is an unavoidable result of Han unification.
Unicode-aware regular expressions let you match by code-point range or by script, block, or category.
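Blocks are sequential:
- U+3400 - U+4DBF is \p{InCJK_Unified_Ideographs_Extension_A}
- U+4E00 - U+9FFF is \p{InCJK_Unified_Ideographs}
The In… spelling is the one used by engines such as Java and Perl. .NET, and therefore PowerShell, names the same blocks \p{IsCJKUnifiedIdeographsExtensionA} and \p{IsCJKUnifiedIdeographs}.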
Quote (from one of the references below): "Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han, and Latin scripts that Japanese documents are usually composed of."
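A small sketch of block and category classes as .NET spells them, namely \p{Is<BlockName>} for a named block and \p{<Category>} for a general category. The sample strings are illustrative; the first reuses the movie title from the code-point example.

```powershell
$s = '肖申克的救赎The '

# Named block covering U+4E00 - U+9FFF (CJK unified ideographs).
$withoutCjk = $s -replace '\p{IsCJKUnifiedIdeographs}'

# General category Nd is "decimal digit number".
$withoutDigits = 'Build 2020' -replace '\p{Nd}'

$withoutCjk      # 'The '
$withoutDigits   # 'Build '
```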
Here are some refs:
- https://www.fileformat.info/info/unicode/block/index.htm
- https://www.fileformat.info/info/unicode/category/index.htm
- https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#supported-named-blocks
- https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#supported-unicode-general-categories
Regex Options
There are overloads of the static [Regex]::Match() method that let you pass the desired [RegexOptions] programmatically.
The available options are (list them with [System.Text.RegularExpressions.RegexOptions] | Get-Member -Static -MemberType Property):
- Compiled
- CultureInvariant
- ECMAScript
- ExplicitCapture
- IgnoreCase
- IgnorePatternWhitespace
- Multiline
- None
- RightToLeft
- Singleline
# You can combine several options by doing a bitwise or:
$options = [Text.RegularExpressions.RegexOptions]::IgnoreCase -bor [Text.RegularExpressions.RegexOptions]::CultureInvariant
# or by letting casting do the magic:
$options = [Text.RegularExpressions.RegexOptions]'IgnoreCase, CultureInvariant'
# $s holds the text to search; avoid naming it $input, which is an automatic variable in PowerShell
$match = [regex]::Match($s, $regex, $options)
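To see what the options change, here is a sketch that runs the same pattern with and without them. The two-line sample text is made up. Note that, unlike the -match operator, the static [regex] methods are case-sensitive unless IgnoreCase is passed.

```powershell
$text    = "first line`nSecond Line"   # hypothetical two-line input
$options = [Text.RegularExpressions.RegexOptions]'IgnoreCase, Multiline'

# Multiline lets ^ and $ anchor at every line; IgnoreCase matches 'Line' as well as 'line'.
$found = [regex]::Matches($text, '^\w+ line$', $options)

# Without options, ^ and $ only anchor at the start and end of the whole string.
$none = [regex]::Matches($text, '^\w+ line$')

$found.Count   # 2
$none.Count    # 0
```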
Ref:
- https://stackoverflow.com/a/52336328/3151055
- https://www.reddit.com/r/PowerShell/comments/gv5daq/removing_nonchinese_characters_from_a_txt_file/
- https://stackoverflow.com/questions/9576384/use-regular-expression-to-match-any-chinese-character-in-utf-8-encoding
- https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions
- https://devblogs.microsoft.com/powershell/parsing-text-with-powershell-1-3/
- https://jdhitsolutions.com/blog/powershell/6791/capturing-names-with-powershell-and-regular-expressions/
- https://www.reddit.com/r/PowerShell/comments/gz6h2k/on_filtering_large_text_files/