Utilizing Zero-Width Assertions with grep

Linux
Published

February 1, 2024

Recently, I needed to extract a set of port numbers from a .json file in order to connect locally to a remotely running IPython kernel instance, and learned that it is possible to use zero-width assertions with grep. Zero-width assertions are regular expression constructs that match at a position in the text without consuming any characters, so they can be used to anchor a regular expression within the target text.
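As a small illustration of "matching without consuming", the sketch below uses a lookahead, another zero-width assertion. It assumes GNU grep built with PCRE support (the -P flag), and the sample string is made up for the example.

```shell
#!/bin/bash
# Hypothetical sample text: two names end in "_port", one does not.
sample='shell_port stdin_port transport'

# (?=_port) asserts that "_port" follows the match but does not consume it,
# so only the part before the underscore is printed by -o.
names=$(printf '%s\n' "$sample" | grep -Po '[a-z]+(?=_port)')

printf '%s\n' "$names"
```

Only the names that are followed by `_port` are printed, and the `_port` suffix itself is left out of each match.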

The IPython kernel file structure is consistent across invocations, but each new file contains different random port numbers, which need to be forwarded between the local machine and remotehost. These ports facilitate the connection to the instance of the IPython kernel running on remotehost. What follows is the content of a typical kernel file with randomly generated ports (the filename is also random, but follows the format kernel-####.json):

{
  "iopub_port": 39736,
  "control_port": 59725,
  "transport": "tcp",
  "shell_port": 51963,
  "key": "1fcd997c-ef64-4322-8762-c034af6095e1",
  "stdin_port": 59714,
  "signature_scheme": "hmac-sha256",
  "hb_port": 41128,
  "ip": "127.0.0.1"
}

Our goal is to extract and forward the randomly generated port numbers using ssh. One approach is provided in the IPython Cookbook recipe, which extracts the ports and forwards them iteratively:

#!/bin/bash

# Assume kernel connection details reside in `kernel-2323.json`

for port in $(grep '_port' kernel-2323.json | grep -o '[0-9]\+'); do
    ssh remotehost.com -f -N -L "${port}:127.0.0.1:${port}"
done

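Where -f sends ssh to the background, -N tells ssh not to execute a remote command (useful when only forwarding ports), and -L sets up local port forwarding in the form port:host:hostport.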

This method works without issue. However, the pattern extracts port numbers without considering the port’s associated kernel component. For example, if we needed to know which port corresponds to shell_port, this solution falls short.
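To see the limitation concretely, the sketch below runs the recipe's two-stage filter against a copy of the sample kernel file and prints the result. The temporary file is only a stand-in for kernel-2323.json, and no ssh connection is made.

```shell
#!/bin/bash
# Write the sample kernel file (values copied from the example above)
# to a temporary location so the pipeline has something to read.
kernel_file=$(mktemp)
cat > "$kernel_file" <<'EOF'
{
  "iopub_port": 39736,
  "control_port": 59725,
  "transport": "tcp",
  "shell_port": 51963,
  "key": "1fcd997c-ef64-4322-8762-c034af6095e1",
  "stdin_port": 59714,
  "signature_scheme": "hmac-sha256",
  "hb_port": 41128,
  "ip": "127.0.0.1"
}
EOF

# Same filter as the recipe: keep the lines containing "_port",
# then print only the runs of digits found on those lines.
ports=$(grep '_port' "$kernel_file" | grep -o '[0-9]\+')
printf '%s\n' "$ports"

rm -f "$kernel_file"
```

The output is a bare list of five numbers in file order, with nothing to indicate which one belongs to shell_port.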

An alternative approach uses \K, a Perl-compatible regular expression escape that grep accepts when invoked with the -P flag (which tells grep to interpret the pattern as a Perl regular expression). \K isn’t listed in grep’s help menu or man page, but it is nonetheless valid syntactically. If \K is included within the regular expression, only the text matched after \K is returned, and only if what precedes \K also matches. The effect resembles a positive lookbehind assertion, although \K is not strictly a lookbehind: it discards everything matched so far from the reported match, and it does not impose the fixed-length restriction that PCRE has traditionally placed on lookbehind patterns.
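A minimal sketch comparing \K with an explicit lookbehind on a single line is shown below. It assumes GNU grep built with PCRE support. The lookbehind variant hard-codes exactly one space after the colon so that its length is fixed, whereas the \K variant can use \s+.

```shell
#!/bin/bash
# One line in the style of the kernel file.
line='"shell_port": 51963,'

# \K: match the key and any amount of whitespace, then report only the digits.
with_k=$(printf '%s\n' "$line" | grep -Po '"shell_port":\s+\K[0-9]+')

# Lookbehind: the asserted prefix must have a fixed length (one space here).
with_lookbehind=$(printf '%s\n' "$line" | grep -Po '(?<="shell_port": )[0-9]+')

printf 'with \\K:         %s\n' "$with_k"
printf 'with lookbehind: %s\n' "$with_lookbehind"
```

Both commands report the same port number for this input. The \K form is the more forgiving of the two if the amount of whitespace after the colon varies.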

The next example parses kernel-2323.json as before, but this time retains the component-to-port mapping. After extracting the kernel component names and ports, we save them to an associative array:

#!/bin/bash

KERNEL_FILENAME="kernel-2323.json"

declare -a portsArr=('hb_port' 'iopub_port' 'control_port' 'shell_port' 'stdin_port');
declare -A kernelDict  # Associative array to hold component-port mapping

for portname in "${portsArr[@]}"
do
    PATTERN="[[:space:]]+\"${portname}\":[[:space:]]+\K[0-9]{2,5}"
    # Quote the pattern so the shell does not glob-expand the bracket expressions.
    PORTNBR=$(grep -Po "${PATTERN}" "${KERNEL_FILENAME}")
    echo "Now forwarding ${portname}..."
    ssh remotehost.com -f -N -L "${PORTNBR}:127.0.0.1:${PORTNBR}"
    # Add component-port mapping to kernelDict.
    kernelDict["${portname}"]="${PORTNBR}"
done

Notice the placement of \K: At each iteration, the pattern specifies that a matching string will contain leading whitespace, the quoted port name, a colon, and one or more whitespace characters, followed by 2-5 digits. Since \K directly precedes “[0-9]{2,5}”, successful matches will only return that portion of a matching string.
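The effect of \K on this particular pattern can be checked in isolation. The sketch below applies the pattern to one line of the sample kernel file, once with \K and once without it (GNU grep with PCRE support is assumed).

```shell
#!/bin/bash
# One line from the sample kernel file shown earlier.
line='  "hb_port": 41128,'
portname="hb_port"

# The pattern from the script, with and without \K.
with_k="[[:space:]]+\"${portname}\":[[:space:]]+\K[0-9]{2,5}"
without_k="[[:space:]]+\"${portname}\":[[:space:]]+[0-9]{2,5}"

port_only=$(printf '%s\n' "$line" | grep -Po "$with_k")
whole_match=$(printf '%s\n' "$line" | grep -Po "$without_k")

# Brackets make any leading whitespace in the match visible.
printf 'with K:    [%s]\n' "$port_only"
printf 'without K: [%s]\n' "$whole_match"
```

Without \K the reported match includes the indentation, the quoted key, and the colon. With \K only the digits remain.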

Our implementation works as expected but is inefficient: For each port number extracted and forwarded, the kernel file is reopened and reread. For this example it’s not much of a problem, but for larger files, this approach could result in noticeable performance degradation. A more efficient solution reads the kernel file once, stores its contents in a variable, and matches the regular expression pattern against this variable at each iteration. The change in logic is subtle: the only difference is reading the file into the variable identified as KERNEL_CONTENTS at the start of the script, and the inclusion of <<< after the grep command:

#!/bin/bash

KERNEL_FILENAME="kernel-2323.json"
KERNEL_CONTENTS="$(cat "${KERNEL_FILENAME}")"

declare -a portsArr=('hb_port' 'iopub_port' 'control_port' 'shell_port' 'stdin_port');
declare -A kernelDict  # Associative array to hold component-port mapping

for portname in "${portsArr[@]}"
do
    PATTERN="[[:space:]]+\"${portname}\":[[:space:]]+\K[0-9]{2,5}"
    # Quote the pattern so the shell does not glob-expand the bracket expressions.
    PORTNBR=$(grep -Po "${PATTERN}" <<< "${KERNEL_CONTENTS}")
    echo "Now forwarding ${portname}..."
    ssh remotehost.com -f -N -L "${PORTNBR}:127.0.0.1:${PORTNBR}"
    # Add component-port mapping to kernelDict.
    kernelDict["${portname}"]="${PORTNBR}"
done

The <<< syntax denotes a here string, a form of input redirection that passes the contents of a variable to a command on its standard input, as if they had been read from a file. See the “Here Strings” section of the Bash Reference Manual for more information.
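Taking the idea one step further, the per-port loop could be collapsed into a single grep call fed by a here string. The sketch below is one possible variation, not part of the original scripts: it does not use \K, the kernel contents are an abbreviated inline copy of the sample file, and the ssh call is left out. It assumes bash 4 or later (for associative arrays) and GNU grep with PCRE support.

```shell
#!/bin/bash
# Abbreviated copy of the sample kernel file, held in a variable.
KERNEL_CONTENTS='{
  "iopub_port": 39736,
  "transport": "tcp",
  "shell_port": 51963,
  "hb_port": 41128
}'

declare -A kernelDict  # Associative array to hold component-port mapping

# A single grep call prints every `"name_port": number` pair; the loop
# then splits each pair into its component name and port.
while read -r name port; do
    name=${name//[\":]/}   # strip the quotes and the colon around the name
    kernelDict["$name"]="$port"
done < <(grep -Po '"[a-z]+_port":\s+[0-9]+' <<< "${KERNEL_CONTENTS}")

echo "shell_port -> ${kernelDict[shell_port]}"
```

The file contents are scanned once regardless of how many ports there are, and any component's port can afterwards be looked up by name in kernelDict.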