| \input texinfo @c -*-texinfo-*- |
| @c |
| @c -- Stuff that needs adding: ---------------------------------------------- |
| @c (document the `;' command-separator) |
| @c -------------------------------------------------------------------------- |
| @c Check for consistency: regexps in @code, text that they match in @samp. |
| @c |
| @c Tips: |
| @c @command for command |
| @c @samp for command fragments: @samp{cat -s} |
| @c @code for sed commands and flags |
| @c Use ``quote'' not `quote' or "quote". |
| @c |
| @c %**start of header |
| @setfilename sed.info |
| @settitle sed, a stream editor |
| @c %**end of header |
| |
| @c @smallbook |
| |
| @include version.texi |
| |
| @c Combine indices. |
| @syncodeindex ky cp |
| @syncodeindex pg cp |
| @syncodeindex tp cp |
| |
| @defcodeindex op |
| @syncodeindex op fn |
| |
| @include config.texi |
| |
| @copying |
| This file documents version @value{VERSION} of |
| @value{SSED}, a stream editor. |
| |
| Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free |
| Software Foundation, Inc. |
| |
| This document is released under the terms of the @acronym{GNU} Free |
| Documentation License as published by the Free Software Foundation; |
| either version 1.1, or (at your option) any later version. |
| |
| You should have received a copy of the @acronym{GNU} Free Documentation |
| License along with @value{SSED}; see the file @file{COPYING.DOC}. |
| If not, write to the Free Software Foundation, 51 Franklin Street, |
| Fifth Floor, Boston, MA 02110-1301, USA. |
| |
| There are no Cover Texts and no Invariant Sections; this text, along |
| with its equivalent in the printed manual, constitutes the Title Page. |
| @end copying |
| |
| @setchapternewpage off |
| |
| @titlepage |
| @title @command{sed}, a stream editor |
| @subtitle version @value{VERSION}, @value{UPDATED} |
| @author by Ken Pizzini, Paolo Bonzini |
| |
| @page |
| @vskip 0pt plus 1filll |
| Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc. |
| |
| @insertcopying |
| |
| Published by the Free Software Foundation, @* |
| 51 Franklin Street, Fifth Floor @* |
| Boston, MA 02110-1301, USA |
| @end titlepage |
| |
| |
| @node Top |
| @top |
| |
| @ifnottex |
| @insertcopying |
| @end ifnottex |
| |
| @menu |
| * Introduction:: Introduction |
| * Invoking sed:: Invocation |
| * sed Programs:: @command{sed} programs |
| * Examples:: Some sample scripts |
| * Limitations:: Limitations and (non-)limitations of @value{SSED} |
| * Other Resources:: Other resources for learning about @command{sed} |
| * Reporting Bugs:: Reporting bugs |
| |
| * Extended regexps:: @command{egrep}-style regular expressions |
| @ifset PERL |
| * Perl regexps:: Perl-style regular expressions |
| @end ifset |
| |
| * Concept Index:: A menu with all the topics in this manual. |
| * Command and Option Index:: A menu with all @command{sed} commands and |
| command-line options. |
| |
| @detailmenu |
| --- The detailed node listing --- |
| |
| sed Programs: |
| * Execution Cycle:: How @command{sed} works |
| * Addresses:: Selecting lines with @command{sed} |
| * Regular Expressions:: Overview of regular expression syntax |
| * Common Commands:: Often used commands |
| * The "s" Command:: @command{sed}'s Swiss Army Knife |
| * Other Commands:: Less frequently used commands |
| * Programming Commands:: Commands for @command{sed} gurus |
| * Extended Commands:: Commands specific to @value{SSED} |
| * Escapes:: Specifying special characters |
| |
| Examples: |
| * Centering lines:: |
| * Increment a number:: |
| * Rename files to lower case:: |
| * Print bash environment:: |
| * Reverse chars of lines:: |
| * tac:: Reverse lines of files |
| * cat -n:: Numbering lines |
| * cat -b:: Numbering non-blank lines |
| * wc -c:: Counting chars |
| * wc -w:: Counting words |
| * wc -l:: Counting lines |
| * head:: Printing the first lines |
| * tail:: Printing the last lines |
| * uniq:: Make duplicate lines unique |
| * uniq -d:: Print duplicated lines of input |
| * uniq -u:: Remove all duplicated lines |
| * cat -s:: Squeezing blank lines |
| |
| @ifset PERL |
| Perl regexps:: Perl-style regular expressions |
| * Backslash:: Introduces special sequences |
| * Circumflex/dollar sign/period:: Behave specially with regard to new lines |
| * Square brackets:: Are a bit different in strange cases |
| * Options setting:: Toggle modifiers in the middle of a regexp |
| * Non-capturing subpatterns:: Are not counted when backreferencing |
| * Repetition:: Allows for non-greedy matching |
| * Backreferences:: Allows for more than 10 back references |
| * Assertions:: Allows for complex look ahead matches |
| * Non-backtracking subpatterns:: Often gives more performance |
| * Conditional subpatterns:: Allows if/then/else branches |
| * Recursive patterns:: For example to match parentheses |
| * Comments:: Because things can get complex... |
| @end ifset |
| |
| @end detailmenu |
| @end menu |
| |
| |
| @node Introduction |
| @chapter Introduction |
| |
| @cindex Stream editor |
| @command{sed} is a stream editor. |
| A stream editor is used to perform basic text |
| transformations on an input stream |
| (a file or input from a pipeline). |
| While in some ways similar to an editor which |
| permits scripted edits (such as @command{ed}), |
| @command{sed} works by making only one pass over the |
| input(s), and is consequently more efficient. |
| But it is @command{sed}'s ability to filter text in a pipeline |
| which particularly distinguishes it from other types of |
| editors. |
| |
| |
| @node Invoking sed |
| @chapter Invocation |
| |
| Normally @command{sed} is invoked like this: |
| |
| @example |
| sed SCRIPT INPUTFILE... |
| @end example |
| |
| The full format for invoking @command{sed} is: |
| |
| @example |
| sed OPTIONS... [SCRIPT] [INPUTFILE...] |
| @end example |
| |
| If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-}, |
| @command{sed} filters the contents of the standard input. The @var{script} |
| is actually the first non-option parameter, which @command{sed} specially |
| considers a script and not an input file if (and only if) none of the |
| other @var{options} specifies a script to be executed, that is if neither |
| of the @option{-e} and @option{-f} options is specified. |
| |
| @command{sed} may be invoked with the following command-line options: |
| |
| @table @code |
| @item --version |
| @opindex --version |
| @cindex Version, printing |
| Print out the version of @command{sed} that is being run and a copyright notice, |
| then exit. |
| |
| @item --help |
| @opindex --help |
| @cindex Usage summary, printing |
| Print a usage message briefly summarizing these command-line options |
| and the bug-reporting address, |
| then exit. |
| |
| @item -n |
| @itemx --quiet |
| @itemx --silent |
| @opindex -n |
| @opindex --quiet |
| @opindex --silent |
| @cindex Disabling autoprint, from command line |
| By default, @command{sed} prints out the pattern space |
| at the end of each cycle through the script (@pxref{Execution Cycle, , |
| How @code{sed} works}). |
| These options disable this automatic printing, |
| and @command{sed} only produces output when explicitly told to |
| via the @code{p} command. |
| |
| @item -e @var{script} |
| @itemx --expression=@var{script} |
| @opindex -e |
| @opindex --expression |
| @cindex Script, from command line |
| Add the commands in @var{script} to the set of commands to be |
| run while processing the input. |
| |
| @item -f @var{script-file} |
| @itemx --file=@var{script-file} |
| @opindex -f |
| @opindex --file |
| @cindex Script, from a file |
| Add the commands contained in the file @var{script-file} |
| to the set of commands to be run while processing the input. |
| |
| @item -i[@var{SUFFIX}] |
| @itemx --in-place[=@var{SUFFIX}] |
| @opindex -i |
| @opindex --in-place |
| @cindex In-place editing, activating |
| @cindex @value{SSEDEXT}, in-place editing |
| This option specifies that files are to be edited in-place. |
| @value{SSED} does this by creating a temporary file and |
| sending output to this file rather than to the standard |
| output.@footnote{This applies to commands such as @code{=}, |
| @code{a}, @code{c}, @code{i}, @code{l}, @code{p}. You can |
| still write to the standard output by using the @code{w} |
| @cindex @value{SSEDEXT}, @file{/dev/stdout} file |
| or @code{W} commands together with the @file{/dev/stdout} |
| special file}. |
| |
| This option implies @option{-s}. |
| |
| When the end of the file is reached, the temporary file is |
| renamed to the output file's original name. The extension, |
| if supplied, is used to modify the name of the old file |
| before renaming the temporary file, thereby making a backup |
| copy@footnote{Note that @value{SSED} creates the backup |
| file whether or not any output is actually changed.}. |
| |
| @cindex In-place editing, Perl-style backup file names |
| This rule is followed: if the extension doesn't contain a @code{*}, |
| then it is appended to the end of the current filename as a |
| suffix; if the extension does contain one or more @code{*} |
| characters, then @emph{each} asterisk is replaced with the |
| current filename. This allows you to add a prefix to the |
| backup file, instead of (or in addition to) a suffix, or |
| even to place backup copies of the original files into another |
| directory (provided the directory already exists). |
| |
| If no extension is supplied, the original file is |
| overwritten without making a backup. |
| |
| @item -l @var{N} |
| @itemx --line-length=@var{N} |
| @opindex -l |
| @opindex --line-length |
| @cindex Line length, setting |
| Specify the default line-wrap length for the @code{l} command. |
| A length of 0 (zero) means to never wrap long lines. If |
| not specified, it is taken to be 70. |
| |
| @item --posix |
| @cindex @value{SSEDEXT}, disabling |
| @value{SSED} includes several extensions to @acronym{POSIX} |
| sed. In order to simplify writing portable scripts, this |
| option disables all the extensions that this manual documents, |
| including additional commands. |
| @cindex @code{POSIXLY_CORRECT} behavior, enabling |
| Most of the extensions accept @command{sed} programs that |
| are outside the syntax mandated by @acronym{POSIX}, but some |
| of them (such as the behavior of the @code{N} command |
| described in @ref{Reporting Bugs}) actually violate the |
| standard. If you want to disable only the latter kind of |
| extension, you can set the @code{POSIXLY_CORRECT} variable |
| to a non-empty value. |
| |
| @item -b |
| @itemx --binary |
| @opindex -b |
| @opindex --binary |
| This option is available on every platform, but is only effective where the |
| operating system makes a distinction between text files and binary files. |
| When such a distinction is made---as is the case for MS-DOS, Windows, |
| Cygwin---text files are composed of lines separated by a carriage return |
| @emph{and} a line feed character, and @command{sed} does not see the |
| ending CR. When this option is specified, @command{sed} will open |
| input files in binary mode, thus not requesting this special processing |
| and considering lines to end at a line feed. |
| |
| @item --follow-symlinks |
| @opindex --follow-symlinks |
| This option is available only on platforms that support |
| symbolic links and has an effect only if option @option{-i} |
| is specified. In this case, if the file that is specified |
| on the command line is a symbolic link, @command{sed} will |
| follow the link and edit the ultimate destination of the |
| link. The default behavior is to break the symbolic link, |
| so that the link destination will not be modified. |
| |
| @item -r |
| @itemx --regexp-extended |
| @opindex -r |
| @opindex --regexp-extended |
| @cindex Extended regular expressions, choosing |
| @cindex @acronym{GNU} extensions, extended regular expressions |
| Use extended regular expressions rather than basic |
| regular expressions. Extended regexps are those that |
| @command{egrep} accepts; they can be clearer because they |
| usually have fewer backslashes, but are a @acronym{GNU} extension |
| and hence scripts that use them are not portable. |
| @xref{Extended regexps, , Extended regular expressions}. |
| |
| @ifset PERL |
| @item -R |
| @itemx --regexp-perl |
| @opindex -R |
| @opindex --regexp-perl |
| @cindex Perl-style regular expressions, choosing |
| @cindex @value{SSEDEXT}, Perl-style regular expressions |
| Use Perl-style regular expressions rather than basic |
| regular expressions. Perl-style regexps are extremely |
| powerful but are a @value{SSED} extension and hence scripts that |
| use them are not portable. @xref{Perl regexps, , |
| Perl-style regular expressions}. |
| @end ifset |
| |
| @item -s |
| @itemx --separate |
| @cindex Working on separate files |
| By default, @command{sed} will consider the files specified on the |
| command line as a single continuous long stream. This @value{SSED} |
| extension allows the user to consider them as separate files: |
| range addresses (such as @samp{/abc/,/def/}) are not allowed |
| to span several files, line numbers are relative to the start |
| of each file, @code{$} refers to the last line of each file, |
| and files invoked from the @code{R} commands are rewound at the |
| start of each file. |
| |
| @item -u |
| @itemx --unbuffered |
| @opindex -u |
| @opindex --unbuffered |
| @cindex Unbuffered I/O, choosing |
| Buffer both input and output as minimally as practical. |
| (This is particularly useful if the input is coming from |
| the likes of @samp{tail -f}, and you wish to see the transformed |
| output as soon as possible.) |
| |
| @end table |
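| |
| As an illustration of the @option{-i} option described above, here is |
| a minimal sketch that edits a scratch file in place and keeps a backup. |
| The directory created by @command{mktemp}, the file name |
| @file{demo.txt}, and the @samp{.bak} suffix are all assumptions made |
| for this example. |

```shell
# Work in a scratch directory so that no real file is touched.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/demo.txt"

# Edit in place; the original content is kept as demo.txt.bak.
sed -i.bak 's/hello/world/' "$dir/demo.txt"

edited=$(cat "$dir/demo.txt")
backup=$(cat "$dir/demo.txt.bak")
printf '%s\n' "$edited" "$backup"

rm -r "$dir"
```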
| |
| If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file} |
| options are given on the command-line, |
| then the first non-option argument on the command line is |
| taken to be the @var{script} to be executed. |
| |
| @cindex Files to be processed as input |
| If any command-line parameters remain after processing the above, |
| these parameters are interpreted as the names of input files to |
| be processed. |
| @cindex Standard input, processing as input |
| A file name of @samp{-} refers to the standard input stream. |
| The standard input will be processed if no file names are specified. |
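| |
| The two ways of supplying a script can be compared with a short |
| sketch. The sample input lines are invented for illustration; both |
| commands read the standard input because no file names are given. |

```shell
# Script given as the first non-option argument.
first=$(printf 'one\ntwo\n' | sed 's/one/1/')

# Two -e scripts are combined in order; -n disables automatic printing,
# so only the explicit p command produces output.
second=$(printf 'one\ntwo\n' | sed -n -e 's/two/2/' -e '/2/p')

printf '%s\n' "$first" "$second"
```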
| |
| |
| @node sed Programs |
| @chapter @command{sed} Programs |
| |
| @cindex @command{sed} program structure |
| @cindex Script structure |
| A @command{sed} program consists of one or more @command{sed} commands, |
| passed in by one or more of the |
| @option{-e}, @option{-f}, @option{--expression}, and @option{--file} |
| options, or the first non-option argument if zero of these |
| options are used. |
| This document will refer to ``the'' @command{sed} script; |
| this is understood to mean the in-order catenation |
| of all of the @var{script}s and @var{script-file}s passed in. |
| |
| Each @code{sed} command consists of an optional address or |
| address range, followed by a one-character command name |
| and any additional command-specific code. |
| |
| @menu |
| * Execution Cycle:: How @command{sed} works |
| * Addresses:: Selecting lines with @command{sed} |
| * Regular Expressions:: Overview of regular expression syntax |
| * Common Commands:: Often used commands |
| * The "s" Command:: @command{sed}'s Swiss Army Knife |
| * Other Commands:: Less frequently used commands |
| * Programming Commands:: Commands for @command{sed} gurus |
| * Extended Commands:: Commands specific to @value{SSED} |
| * Escapes:: Specifying special characters |
| @end menu |
| |
| |
| @node Execution Cycle |
| @section How @command{sed} Works |
| |
| @cindex Buffer spaces, pattern and hold |
| @cindex Spaces, pattern and hold |
| @cindex Pattern space, definition |
| @cindex Hold space, definition |
| @command{sed} maintains two data buffers: the active @emph{pattern} space, |
| and the auxiliary @emph{hold} space. Both are initially empty. |
| |
| @command{sed} operates by performing the following cycle on each |
| line of input: first, @command{sed} reads one line from the input |
| stream, removes any trailing newline, and places it in the pattern space. |
| Then commands are executed; each command can have an address associated |
| with it: addresses are a kind of condition code, and a command is only |
| executed if the condition is verified before the command is to be |
| executed. |
| |
| When the end of the script is reached, unless the @option{-n} option |
| is in use, the contents of pattern space are printed out to the output |
| stream, adding back the trailing newline if it was removed.@footnote{Actually, |
| if @command{sed} prints a line without the terminating newline, it will |
| nevertheless print the missing newline as soon as more text is sent to |
| the same output stream, which gives the ``least expected surprise'' |
| even though it does not make commands like @samp{sed -n p} exactly |
| identical to @command{cat}.} Then the next cycle starts for the next |
| input line. |
| |
| Unless special commands (like @samp{D}) are used, the pattern space is |
| deleted between two cycles. The hold space, on the other hand, keeps |
| its data between cycles (see commands @samp{h}, @samp{H}, @samp{x}, |
| @samp{g}, @samp{G} to move data between both buffers). |
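| |
| A small sketch may help to see how the hold space carries data from |
| one cycle to the next. It reverses the order of three invented input |
| lines; the commands are passed as separate @option{-e} scripts. |

```shell
# 1!G : on every line except the first, append the hold space
#       to the pattern space
# h   : copy the pattern space into the hold space
# $p  : on the last line, print what has been accumulated
reversed=$(printf 'a\nb\nc\n' | sed -n -e '1!G' -e 'h' -e '$p')
printf '%s\n' "$reversed"
```

| Because @option{-n} is given, nothing is printed until the last cycle, |
| when the pattern space holds all the lines in reverse order. |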
| |
| |
| @node Addresses |
| @section Selecting lines with @command{sed} |
| @cindex Addresses, in @command{sed} scripts |
| @cindex Line selection |
| @cindex Selecting lines to process |
| |
| Addresses in a @command{sed} script can be in any of the following forms: |
| @table @code |
| @item @var{number} |
| @cindex Address, numeric |
| @cindex Line, selecting by number |
| Specifying a line number will match only that line in the input. |
| (Note that @command{sed} counts lines continuously across all input files |
| unless @option{-i} or @option{-s} options are specified.) |
| |
| @item @var{first}~@var{step} |
| @cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses |
| This @acronym{GNU} extension matches every @var{step}th line |
| starting with line @var{first}. |
| In particular, lines will be selected when there exists |
| a non-negative @var{n} such that the current line-number equals |
| @var{first} + (@var{n} * @var{step}). |
| Thus, to select the odd-numbered lines, |
| one would use @code{1~2}; |
| to pick every third line starting with the second, @samp{2~3} would be used; |
| to pick every fifth line starting with the tenth, use @samp{10~5}; |
| and @samp{50~0} is just an obscure way of saying @code{50}. |
| |
| @item $ |
| @cindex Address, last line |
| @cindex Last line, selecting |
| @cindex Line, selecting last |
| This address matches the last line of the last file of input, or |
| the last line of each file when the @option{-i} or @option{-s} options |
| are specified. |
| |
| @item /@var{regexp}/ |
| @cindex Address, as a regular expression |
| @cindex Line, selecting by regular expression match |
| This will select any line which matches the regular expression @var{regexp}. |
| If @var{regexp} itself includes any @code{/} characters, |
| each must be escaped by a backslash (@code{\}). |
| |
| @cindex empty regular expression |
| @cindex @value{SSEDEXT}, modifiers and the empty regular expression |
| The empty regular expression @samp{//} repeats the last regular |
| expression match (the same holds if the empty regular expression is |
| passed to the @code{s} command). Note that modifiers to regular expressions |
| are evaluated when the regular expression is compiled, thus it is invalid to |
| specify them together with the empty regular expression. |
| |
| @item \%@var{regexp}% |
| (The @code{%} may be replaced by any other single character.) |
| |
| @cindex Slash character, in regular expressions |
| This also matches the regular expression @var{regexp}, |
| but allows one to use a different delimiter than @code{/}. |
| This is particularly useful if the @var{regexp} itself contains |
| a lot of slashes, since it avoids the tedious escaping of every @code{/}. |
| If @var{regexp} itself includes any delimiter characters, |
| each must be escaped by a backslash (@code{\}). |
| |
| @item /@var{regexp}/I |
| @itemx \%@var{regexp}%I |
| @cindex @acronym{GNU} extensions, @code{I} modifier |
| @ifset PERL |
| @cindex Perl-style regular expressions, case-insensitive |
| @end ifset |
| The @code{I} modifier to regular-expression matching is a @acronym{GNU} |
| extension which causes the @var{regexp} to be matched in |
| a case-insensitive manner. |
| |
| @item /@var{regexp}/M |
| @itemx \%@var{regexp}%M |
| @cindex @value{SSEDEXT}, @code{M} modifier |
| @ifset PERL |
| @cindex Perl-style regular expressions, multiline |
| @end ifset |
| The @code{M} modifier to regular-expression matching is a @value{SSED} |
| extension which causes @code{^} and @code{$} to match respectively |
| (in addition to the normal behavior) the empty string after a newline, |
| and the empty string before a newline. There are special character |
| sequences |
| @ifset PERL |
| (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'} |
| in basic or extended regular expression modes) |
| @end ifset |
| @ifclear PERL |
| (@code{\`} and @code{\'}) |
| @end ifclear |
| which always match the beginning or the end of the buffer. |
| @code{M} stands for @cite{multi-line}. |
| |
| @ifset PERL |
| @item /@var{regexp}/S |
| @itemx \%@var{regexp}%S |
| @cindex @value{SSEDEXT}, @code{S} modifier |
| @cindex Perl-style regular expressions, single line |
| The @code{S} modifier to regular-expression matching is only valid |
| in Perl mode and specifies that the dot character (@code{.}) will |
| match the newline character too. @code{S} stands for @cite{single-line}. |
| @end ifset |
| |
| @ifset PERL |
| @item /@var{regexp}/X |
| @itemx \%@var{regexp}%X |
| @cindex @value{SSEDEXT}, @code{X} modifier |
| @cindex Perl-style regular expressions, extended |
| The @code{X} modifier to regular-expression matching is also |
| valid in Perl mode only. If it is used, whitespace in the |
| pattern (other than in a character class) and |
| characters between a @kbd{#} outside a character class and the |
| next newline character are ignored. An escaping backslash |
| can be used to include a whitespace or @kbd{#} character as part |
| of the pattern. |
| @end ifset |
| @end table |
| |
| If no addresses are given, then all lines are matched; |
| if one address is given, then only lines matching that |
| address are matched. |
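| |
| The single-address forms above can be tried with a short sketch. The |
| input words and paths are invented for illustration, and the |
| @samp{1~2} form assumes a @command{sed} that supports this |
| @acronym{GNU} extension. |

```shell
input=$(printf '%s\n' alpha beta gamma delta)

by_number=$(printf '%s\n' "$input" | sed -n '2p')     # second line
last_line=$(printf '%s\n' "$input" | sed -n '$p')     # last line
by_regexp=$(printf '%s\n' "$input" | sed -n '/mm/p')  # lines matching mm
odd_lines=$(printf '%s\n' "$input" | sed -n '1~2p')   # GNU: lines 1, 3, ...

# A different delimiter avoids escaping the slashes in the regexp.
custom=$(printf '%s\n' /usr/bin /etc | sed -n '\%^/usr%p')

printf '%s\n' "$by_number" "$last_line" "$by_regexp" "$odd_lines" "$custom"
```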
| |
| @cindex Range of lines |
| @cindex Several lines, selecting |
| An address range can be specified by specifying two addresses |
| separated by a comma (@code{,}). An address range matches lines |
| starting from where the first address matches, and continues |
| until the second address matches (inclusively). |
| |
| If the second address is a @var{regexp}, then checking for the |
| ending match will start with the line @emph{following} the |
| line which matched the first address: a range will always |
| span at least two lines (except of course if the input stream |
| ends). |
| |
| If the second address is a @var{number} less than (or equal to) |
| the line matching the first address, then only the one line is |
| matched. |
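| |
| The three range rules just described can be checked with a minimal |
| sketch; the input lines are invented for illustration. |

```shell
nums=$(printf '%s\n' 1 2 3 4 5)

# Numeric range: lines 2 through 4, inclusive.
closed=$(printf '%s\n' "$nums" | sed -n '2,4p')

# Regexp end address: the search for the end starts on the line after
# the one that opened the range, so the range spans lines 1 to 3.
regexp_end=$(printf '%s\n' a b a c | sed -n '/a/,/a/p')

# Second address not greater than the first: only line 4 is matched.
one_line=$(printf '%s\n' "$nums" | sed -n '4,2p')

printf '%s\n' "$closed" "$regexp_end" "$one_line"
```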
| |
| @cindex Special addressing forms |
| @cindex Range with start address of zero |
| @cindex Zero, as range start address |
| @cindex @var{addr1},+N |
| @cindex @var{addr1},~N |
| @cindex @acronym{GNU} extensions, special two-address forms |
| @cindex @acronym{GNU} extensions, @code{0} address |
| @cindex @acronym{GNU} extensions, 0,@var{addr2} addressing |
| @cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing |
| @cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing |
| @value{SSED} also supports some special two-address forms; all these |
| are @acronym{GNU} extensions: |
| @table @code |
| @item 0,/@var{regexp}/ |
| A line number of @code{0} can be used in an address specification like |
| @code{0,/@var{regexp}/} so that @command{sed} will try to match |
| @var{regexp} in the first input line too. In other words, |
| @code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/}, |
| except that if @var{addr2} matches the very first line of input the |
| @code{0,/@var{regexp}/} form will consider it to end the range, whereas |
| the @code{1,/@var{regexp}/} form will match the beginning of its range and |
| hence make the range span up to the @emph{second} occurrence of the |
| regular expression. |
| |
| Note that this is the only place where the @code{0} address makes |
| sense; there is no 0-th line and commands which are given the @code{0} |
| address in any other way will give an error. |
| |
| @item @var{addr1},+@var{N} |
| Matches @var{addr1} and the @var{N} lines following @var{addr1}. |
| |
| @item @var{addr1},~@var{N} |
| Matches @var{addr1} and the lines following @var{addr1} |
| until the next line whose input line number is a multiple of @var{N}. |
| @end table |
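| |
| The special two-address forms can be compared side by side. This is |
| only a sketch with invented input, and it assumes a @command{sed} |
| that implements these @acronym{GNU} extensions. |

```shell
lines=$(printf '%s\n' x a x b)
nums=$(printf '%s\n' 1 2 3 4 5 6)

# 0,/x/ lets the regexp end the range on the very first line.
from_zero=$(printf '%s\n' "$lines" | sed -n '0,/x/p')

# 1,/x/ starts on line 1 and looks for the end from line 2 onwards.
from_one=$(printf '%s\n' "$lines" | sed -n '1,/x/p')

# Line 2 and the two lines that follow it.
plus_two=$(printf '%s\n' "$nums" | sed -n '2,+2p')

# Line 2 up to the next line whose number is a multiple of 3.
to_multiple=$(printf '%s\n' "$nums" | sed -n '2,~3p')

printf '%s\n' "$from_zero" "$from_one" "$plus_two" "$to_multiple"
```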
| |
| @cindex Excluding lines |
| @cindex Selecting non-matching lines |
| Appending the @code{!} character to the end of an address |
| specification negates the sense of the match. |
| That is, if the @code{!} character follows an address range, |
| then only lines which do @emph{not} match the address range |
| will be selected. |
| This also works for singleton addresses, |
| and, perhaps perversely, for the null address. |
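| |
| As a brief sketch of negation, the following commands print the lines |
| that a range and a single address do @emph{not} select; the numbered |
| input is invented for illustration. |

```shell
nums=$(printf '%s\n' 1 2 3 4 5)

# Everything outside lines 2 through 4.
outside_range=$(printf '%s\n' "$nums" | sed -n '2,4!p')

# Everything except line 3.
not_third=$(printf '%s\n' "$nums" | sed -n '3!p')

printf '%s\n' "$outside_range" "$not_third"
```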
| |
| |
| @node Regular Expressions |
| @section Overview of Regular Expression Syntax |
| |
| To use @command{sed} effectively, you need to understand regular |
| expressions (@dfn{regexp} for short). A regular expression |
| is a pattern that is matched against a |
| subject string from left to right. Most characters are |
| @dfn{ordinary}: they stand for |
| themselves in a pattern, and match the corresponding characters |
| in the subject. As a trivial example, the pattern |
| |
| @example |
| The quick brown fox |
| @end example |
| |
| @noindent |
| matches a portion of a subject string that is identical to |
| itself. The power of regular expressions comes from the |
| ability to include alternatives and repetitions in the pattern. |
| These are encoded in the pattern by the use of @dfn{special characters}, |
| which do not stand for themselves but instead |
| are interpreted in some special way. Here is a brief description |
| of regular expression syntax as used in @command{sed}. |
| |
| @table @code |
| @item @var{char} |
| A single ordinary character matches itself. |
| |
| @item * |
| @cindex @acronym{GNU} extensions, to basic regular expressions |
| Matches a sequence of zero or more instances of matches for the |
| preceding regular expression, which must be an ordinary character, a |
| special character preceded by @code{\}, a @code{.}, a grouped regexp |
| (see below), or a bracket expression. As a @acronym{GNU} extension, a |
| postfixed regular expression can also be followed by @code{*}; for |
| example, @code{a**} is equivalent to @code{a*}. @acronym{POSIX} |
| 1003.1-2001 says that @code{*} stands for itself when it appears at |
| the start of a regular expression or subexpression, but many |
| non-@acronym{GNU} implementations do not support this and portable |
| scripts should instead use @code{\*} in these contexts. |
| |
| @item \+ |
| @cindex @acronym{GNU} extensions, to basic regular expressions |
| As @code{*}, but matches one or more. It is a @acronym{GNU} extension. |
| |
| @item \? |
| @cindex @acronym{GNU} extensions, to basic regular expressions |
| As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension. |
| |
| @item \@{@var{i}\@} |
| As @code{*}, but matches exactly @var{i} sequences (@var{i} is a |
| decimal integer; for portability, keep it between 0 and 255 |
| inclusive). |
| |
| @item \@{@var{i},@var{j}\@} |
| Matches between @var{i} and @var{j}, inclusive, sequences. |
| |
| @item \@{@var{i},\@} |
| Matches more than or equal to @var{i} sequences. |
| |
| @item \(@var{regexp}\) |
| Groups the inner @var{regexp} as a whole; this is used to: |
| |
| @itemize @bullet |
| @item |
| @cindex @acronym{GNU} extensions, to basic regular expressions |
| Apply postfix operators, like @code{\(abcd\)*}: |
| this will search for zero or more whole sequences |
| of @samp{abcd}, while @code{abcd*} would search |
| for @samp{abc} followed by zero or more occurrences |
| of @samp{d}. Note that support for @code{\(abcd\)*} is |
| required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU} |
| implementations do not support it and hence it is not universally |
| portable. |
| |
| @item |
| Use back references (see below). |
| @end itemize |
| |
| @item . |
| Matches any character, including newline. |
| |
| @item ^ |
| Matches the null string at beginning of the pattern space, i.e. what |
| appears after the circumflex must appear at the beginning of the |
| pattern space. |
| |
| In most scripts, pattern space is initialized to the content of each |
| line (@pxref{Execution Cycle, , How @code{sed} works}). So, it is a |
| useful simplification to think of @code{^#include} as matching only |
| lines where @samp{#include} is the first thing on line---if there are |
| spaces before, for example, the match fails. This simplification is |
| valid as long as the original content of pattern space is not modified, |
| for example with an @code{s} command. |
| |
| @code{^} acts as a special character only at the beginning of the |
| regular expression or subexpression (that is, after @code{\(} or |
| @code{\|}). Portable scripts should avoid @code{^} at the beginning of |
| a subexpression, though, as @acronym{POSIX} allows implementations that |
| treat @code{^} as an ordinary character in that context. |
| |
| @item $ |
| It is the same as @code{^}, but refers to end of pattern space. |
| @code{$} also acts as a special character only at the end |
| of the regular expression or subexpression (that is, before @code{\)} |
| or @code{\|}), and its use at the end of a subexpression is not |
| portable. |
| |
| |
| @item [@var{list}] |
| @itemx [^@var{list}] |
| Matches any single character in @var{list}: for example, |
| @code{[aeiou]} matches all vowels. A list may include |
| sequences like @code{@var{char1}-@var{char2}}, which |
| matches any character between (inclusive) @var{char1} |
| and @var{char2}. |
| |
| A leading @code{^} reverses the meaning of @var{list}, so that |
| it matches any single character @emph{not} in @var{list}. To include |
| @code{]} in the list, make it the first character (after |
| the @code{^} if needed); to include @code{-} in the list, |
| make it the first or last; to include @code{^}, put |
| it after the first character. |
| |
| @cindex @code{POSIXLY_CORRECT} behavior, bracket expressions |
| The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\} |
| are normally not special within @var{list}. For example, @code{[\*]} |
| matches either @samp{\} or @samp{*}, because the @code{\} is not |
| special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and |
| @code{[:space:]} are special within @var{list} and represent collating |
| symbols, equivalence classes, and character classes, respectively, and |
| @code{[} is therefore special within @var{list} when it is followed by |
| @code{.}, @code{=}, or @code{:}. Also, when not in |
| @env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and |
| @code{\t} are recognized within @var{list}. @xref{Escapes}. |
| |
| @item @var{regexp1}\|@var{regexp2} |
| @cindex @acronym{GNU} extensions, to basic regular expressions |
| Matches either @var{regexp1} or @var{regexp2}. Use |
| parentheses to use complex alternative regular expressions. |
| The matching process tries each alternative in turn, from |
| left to right, and the first one that succeeds is used. |
| It is a @acronym{GNU} extension. |
| |
| @item @var{regexp1}@var{regexp2} |
| Matches the concatenation of @var{regexp1} and @var{regexp2}. |
| Concatenation binds more tightly than @code{\|}, @code{^}, and |
| @code{$}, but less tightly than the other regular expression |
| operators. |
| |
| @item \@var{digit} |
| Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized |
| subexpression in the regular expression. This is called a @dfn{back |
| reference}. Subexpressions are implicitly numbered by counting |
| occurrences of @code{\(} left-to-right. |
| |
| @item \n |
| Matches the newline character. |
| |
| @item \@var{char} |
| Matches @var{char}, where @var{char} is one of @code{$}, |
| @code{*}, @code{.}, @code{[}, @code{\}, or @code{^}. |
| Note that the only C-like |
| backslash sequences that you can portably assume to be |
| interpreted are @code{\n} and @code{\\}; in particular |
| @code{\t} is not portable, and matches a @samp{t} under most |
| implementations of @command{sed}, rather than a tab character. |
| |
| @end table |
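
The following is a minimal sketch of the bracket-expression and
alternation rules described above. It assumes @value{SSED} run from a
POSIX shell with @env{POSIXLY_CORRECT} unset; the variable names are
only illustrative.

```shell
# Inside a bracket expression, \ and * are ordinary characters,
# so [\*] matches either a backslash or an asterisk.
brackets=$(printf '%s\n' 'a*b\c' | sed 's/[\*]/X/g')
echo "$brackets"    # aXbXc

# To include ] in a list, make it the first character.
closing=$(printf '%s\n' 'x]y[z' | sed 's/[][]/_/g')
echo "$closing"     # x_y_z

# \| (a GNU extension) matches either alternative.
alt=$(printf '%s\n' 'cat dog bird' | sed 's/cat\|dog/pet/g')
echo "$alt"         # pet pet bird
```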
| |
| @cindex Greedy regular expression matching |
| Note that the regular expression matcher is greedy, i.e., matches |
| are attempted from left to right and, if two or more matches are |
| possible starting at the same character, it selects the longest. |
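
A short sketch of what ``leftmost, then longest'' means in practice;
the sample strings and variable names are made up for illustration.

```shell
# Longest: .* runs to the last X, not the first one.
greedy=$(printf '%s\n' 'aXbXc' | sed 's/a.*X/[&]/')
echo "$greedy"      # [aXbX]c

# Leftmost: the earlier match "ab" wins over the longer "aab" later on.
leftmost=$(printf '%s\n' 'xxabxxaab' | sed 's/a*b/[&]/')
echo "$leftmost"    # xx[ab]xxaab

# A negated list is one common way to stop at the first X.
limited=$(printf '%s\n' 'aXbXc' | sed 's/a[^X]*X/[&]/')
echo "$limited"     # [aX]bXc
```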
| |
| @noindent |
| Examples: |
| @table @samp |
| @item abcdef |
| Matches @samp{abcdef}. |
| |
| @item a*b |
| Matches zero or more @samp{a}s followed by a single |
| @samp{b}. For example, @samp{b} or @samp{aaaaab}. |
| |
| @item a\?b |
| Matches @samp{b} or @samp{ab}. |
| |
| @item a\+b\+ |
| Matches one or more @samp{a}s followed by one or more |
| @samp{b}s: @samp{ab} is the shortest possible match, but |
| other examples are @samp{aaaab} or @samp{abbbbb} or |
| @samp{aaaaaabbbbbbb}. |
| |
| @item .* |
| @itemx .\+ |
| These two both match all the characters in a string; |
| however, the first matches every string (including the empty |
| string), while the second matches only strings containing |
| at least one character. |
| |
| @item ^main.*(.*) |
| This matches a string starting with @samp{main}, |
| followed by an opening and closing |
| parenthesis. The @samp{n}, @samp{(} and @samp{)} need not |
| be adjacent. |
| |
| @item ^# |
| This matches a string beginning with @samp{#}. |
| |
| @item \\$ |
| This matches a string ending with a single backslash. The |
| regexp contains two backslashes for escaping. |
| |
| @item \$ |
| Instead, this matches a string consisting of a single dollar sign, |
| because it is escaped. |
| |
| @item [a-zA-Z0-9] |
| In the C locale, this matches any @acronym{ASCII} letter or digit. |
| |
| @item [^ @kbd{tab}]\+ |
| (Here @kbd{tab} stands for a single tab character.) |
| This matches a string of one or more |
| characters, none of which is a space or a tab. |
| Usually this means a word. |
| |
| @item ^\(.*\)\n\1$ |
| This matches a string consisting of two equal substrings separated by |
| a newline. |
| |
| @item .\@{9\@}A$ |
| This matches nine characters followed by an @samp{A} at the end of a string. |
| |
| @item ^.\@{15\@}A |
| This matches the start of a string that contains 16 characters, |
| the last of which is an @samp{A}. |
| |
| @end table |
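
Some of the expressions above can be tried as addresses, as in the
sketch below. The input lines are invented for the example, and the
@code{N} command (described later) is used only to get a newline into
the pattern space.

```shell
# Back reference: lines made of the same two characters repeated.
doubled=$(printf '%s\n' 'abab' 'abcd' | sed -n '/^\(..\)\1$/p')
echo "$doubled"     # abab

# Interval: nine characters followed by A at the end of the line.
interval=$(printf '%s\n' '123456789A' '12345A' | sed -n '/.\{9\}A$/p')
echo "$interval"    # 123456789A

# Two equal substrings separated by a newline.
pair=$(printf '%s\n' 'same' 'same' | sed -n 'N;/^\(.*\)\n\1$/p')
echo "$pair"        # prints "same" twice, on two lines
```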
| |
| |
| |
| @node Common Commands |
| @section Often-Used Commands |
| |
| If you use @command{sed} at all, you will quite likely want to know |
| these commands. |
| |
| @table @code |
| @item # |
| [No addresses allowed.] |
| |
| @findex # (comments) |
| @cindex Comments, in scripts |
| The @code{#} character begins a comment; |
| the comment continues until the next newline. |
| |
| @cindex Portability, comments |
| If you are concerned about portability, be aware that |
| some implementations of @command{sed} (which are not @sc{posix} |
| conformant) may only support a single one-line comment, |
| and then only when the very first character of the script is a @code{#}. |
| |
| @findex -n, forcing from within a script |
| @cindex Caveat --- #n on first line |
| Warning: if the first two characters of the @command{sed} script |
| are @code{#n}, then the @option{-n} (no-autoprint) option is forced. |
| If you want to put a comment in the first line of your script |
| and that comment begins with the letter @samp{n} |
| and you do not want this behavior, |
| then be sure to either use a capital @samp{N}, |
| or place at least one space before the @samp{n}. |
| |
| @item q [@var{exit-code}] |
| This command only accepts a single address. |
| |
| @findex q (quit) command |
| @cindex @value{SSEDEXT}, returning an exit code |
| @cindex Quitting |
| Exit @command{sed} without processing any more commands or input. |
| Note that the current pattern space is printed if auto-print is |
| not disabled with the @option{-n} option. The ability to return |
| an exit code from the @command{sed} script is a @value{SSED} extension. |
| |
| @item d |
| @findex d (delete) command |
| @cindex Text, deleting |
| Delete the pattern space; |
| immediately start next cycle. |
| |
| @item p |
| @findex p (print) command |
| @cindex Text, printing |
| Print out the pattern space (to the standard output). |
| This command is usually only used in conjunction with the @option{-n} |
| command-line option. |
| |
| @item n |
| @findex n (next-line) command |
| @cindex Next input line, replace pattern space with |
| @cindex Read next input line |
| If auto-print is not disabled, print the pattern space, |
| then, regardless, replace the pattern space with the next line of input. |
| If there is no more input then @command{sed} exits without processing |
| any more commands. |
| |
| @item @{ @var{commands} @} |
| @findex @{@} command grouping |
| @cindex Grouping commands |
| @cindex Command groups |
| A group of commands may be enclosed between |
| @code{@{} and @code{@}} characters. |
| This is particularly useful when you want a group of commands |
| to be triggered by a single address (or address-range) match. |
| |
| @end table |
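
The commands in this section can be sketched on a few lines of sample
input as follows. The input and the variable names are illustrative
only; nothing here relies on @value{SSED} extensions.

```shell
# p together with -n: print only the second line.
only2=$(printf '%s\n' one two three | sed -n '2p')

# d: delete every line that matches.
without=$(printf '%s\n' one two three | sed '/two/d')

# q: quit after the second line (which is still auto-printed).
head2=$(printf '%s\n' one two three | sed '2q')

# n then d: print a line, fetch the next one and delete it.
odd=$(printf '%s\n' 1 2 3 4 5 | sed 'n;d')

# Grouping: run two commands for a single address.
twice=$(printf '%s\n' one two three | sed -n '/two/{
p
p
}')

printf '%s\n' "$only2" "$without" "$head2" "$odd" "$twice"
```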
| |
| @node The "s" Command |
| @section The @code{s} Command |
| |
| The syntax of the @code{s} (as in substitute) command is |
| @samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/} |
| characters may be uniformly replaced by any other single |
| character within any given @code{s} command. The @code{/} |
| character (or whatever other character is used in its stead) |
| can appear in the @var{regexp} or @var{replacement} |
| only if it is preceded by a @code{\} character. |
| |
| The @code{s} command is probably the most important in @command{sed} |
| and has a lot of different options. Its basic concept is simple: |
| the @code{s} command attempts to match the pattern |
| space against the supplied @var{regexp}; if the match is |
| successful, then that portion of the pattern |
| space which was matched is replaced with @var{replacement}. |
| |
| @cindex Backreferences, in regular expressions |
| @cindex Parenthesized substrings |
| The @var{replacement} can contain @code{\@var{n}} (@var{n} being |
| a number from 1 to 9, inclusive) references, which refer to |
| the portion of the match which is contained between the @var{n}th |
| @code{\(} and its matching @code{\)}. |
| Also, the @var{replacement} can contain unescaped @code{&} |
| characters which reference the whole matched portion |
| of the pattern space. |
| @cindex @value{SSEDEXT}, case modifiers in @code{s} commands |
| Finally, as a @value{SSED} extension, you can include a |
| special sequence made of a backslash and one of the letters |
| @code{L}, @code{l}, @code{U}, @code{u}, or @code{E}. |
| The meaning is as follows: |
| |
| @table @code |
| @item \L |
| Turn the replacement |
| to lowercase until a @code{\U} or @code{\E} is found, |
| |
| @item \l |
| Turn the |
| next character to lowercase, |
| |
| @item \U |
| Turn the replacement to uppercase |
| until a @code{\L} or @code{\E} is found, |
| |
| @item \u |
| Turn the next character |
| to uppercase, |
| |
| @item \E |
| Stop case conversion started by @code{\L} or @code{\U}. |
| @end table |
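
As a brief sketch of the case-conversion sequences (a @value{SSED}
extension, so this assumes @value{SSED}; the sample text is invented):

```shell
# \U ... \E uppercases a stretch; \u uppercases one character.
mixed=$(printf '%s\n' 'hello world' |
  sed 's/\(hello\) \(world\)/\U\1\E \u\2/')
echo "$mixed"       # HELLO World

# \L lowercases the rest of the replacement.
lower=$(printf '%s\n' 'MIXED Case' | sed 's/.*/\L&/')
echo "$lower"       # mixed case
```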
| |
| To include a literal @code{\}, @code{&}, or newline in the final |
| replacement, be sure to precede the desired @code{\}, @code{&}, |
| or newline in the @var{replacement} with a @code{\}. |
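
The replacement rules above might be exercised like this; the inputs
are made up and the variable names are only for the example.

```shell
# \1 and \2 refer to the parenthesized subexpressions.
swapped=$(printf '%s\n' 'key=value' | sed 's/\(.*\)=\(.*\)/\2=\1/')
echo "$swapped"     # value=key

# & is the whole match; \& is a literal ampersand.
amp=$(printf '%s\n' 'Tom' | sed 's/.*/<&> \& co/')
echo "$amp"         # <Tom> & co

# A backslash followed by a newline inserts a newline.
split=$(printf '%s\n' 'a,b' | sed 's/,/\
/')
echo "$split"       # prints "a" and "b" on separate lines

# Another delimiter avoids escaping the slashes in the text.
moved=$(printf '%s\n' '/usr/bin' | sed 's|/usr|/opt|')
echo "$moved"       # /opt/bin
```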
| |
| @findex s command, option flags |
| @cindex Substitution of text, options |
| The @code{s} command can be followed by zero or more of the |
| following @var{flags}: |
| |
| @table @code |
| @item g |
| @cindex Global substitution |
| @cindex Replacing all text matching regexp in a line |
| Apply the replacement to @emph{all} matches to the @var{regexp}, |
| not just the first. |
| |
| @item @var{number} |
| @cindex Replacing only @var{n}th match of regexp in a line |
| Only replace the @var{number}th match of the @var{regexp}. |
| |
| @cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command |
| @cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command |
| Note: the @sc{posix} standard does not specify what should happen |
| when you mix the @code{g} and @var{number} modifiers, |
| and currently there is no widely agreed upon meaning |
| across @command{sed} implementations. |
| For @value{SSED}, the interaction is defined to be: |
| ignore matches before the @var{number}th, |
| and then match and replace all matches from |
| the @var{number}th on. |
| |
| @item p |
| @cindex Text, printing after substitution |
| If the substitution was made, then print the new pattern space. |
| |
| Note: when both the @code{p} and @code{e} options are specified, |
| the relative ordering of the two produces very different results. |
| In general, @code{ep} (evaluate then print) is what you want, |
| but operating the other way round can be useful for debugging. |
| For this reason, the current version of @value{SSED} interprets |
| specially the presence of @code{p} options both before and after |
| @code{e}, printing the pattern space before and after evaluation, |
| while in general flags for the @code{s} command show their |
| effect just once. This behavior, although documented, might |
| change in future versions. |
| |
| @item w @var{file-name} |
| @cindex Text, writing to a file after substitution |
| @cindex @value{SSEDEXT}, @file{/dev/stdout} file |
| @cindex @value{SSEDEXT}, @file{/dev/stderr} file |
| If the substitution was made, then write out the result to the named file. |
| As a @value{SSED} extension, two special values of @var{file-name} are |
| supported: @file{/dev/stderr}, which writes the result to the standard |
| error, and @file{/dev/stdout}, which writes to the standard |
| output.@footnote{This is equivalent to @code{p} unless the @option{-i} |
| option is being used.} |
| |
| @item e |
| @cindex Evaluate Bourne-shell commands, after substitution |
| @cindex Subprocesses |
| @cindex @value{SSEDEXT}, evaluating Bourne-shell commands |
| @cindex @value{SSEDEXT}, subprocesses |
| This command allows one to pipe input from a shell command |
| into pattern space. If a substitution was made, the command |
| that is found in pattern space is executed and pattern space |
| is replaced with its output. A trailing newline is suppressed; |
| results are undefined if the command to be executed contains |
| a @sc{nul} character. This is a @value{SSED} extension. |
| |
| @item I |
| @itemx i |
| @cindex @acronym{GNU} extensions, @code{I} modifier |
| @cindex Case-insensitive matching |
| @ifset PERL |
| @cindex Perl-style regular expressions, case-insensitive |
| @end ifset |
| The @code{I} modifier to regular-expression matching is a @acronym{GNU} |
| extension which makes @command{sed} match @var{regexp} in a |
| case-insensitive manner. |
| |
| @item M |
| @itemx m |
| @cindex @value{SSEDEXT}, @code{M} modifier |
| @ifset PERL |
| @cindex Perl-style regular expressions, multiline |
| @end ifset |
| The @code{M} modifier to regular-expression matching is a @value{SSED} |
| extension which causes @code{^} and @code{$} to match respectively |
| (in addition to the normal behavior) the empty string after a newline, |
| and the empty string before a newline. There are special character |
| sequences |
| @ifset PERL |
| (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'} |
| in basic or extended regular expression modes) |
| @end ifset |
| @ifclear PERL |
| (@code{\`} and @code{\'}) |
| @end ifclear |
| which always match the beginning or the end of the buffer. |
| @code{M} stands for @cite{multi-line}. |
| |
| @ifset PERL |
| @item S |
| @itemx s |
| @cindex @value{SSEDEXT}, @code{S} modifier |
| @cindex Perl-style regular expressions, single line |
| The @code{S} modifier to regular-expression matching is only valid |
| in Perl mode and specifies that the dot character (@code{.}) will |
| match the newline character too. @code{S} stands for @cite{single-line}. |
| @end ifset |
| |
| @ifset PERL |
| @item X |
| @itemx x |
| @cindex @value{SSEDEXT}, @code{X} modifier |
| @cindex Perl-style regular expressions, extended |
| The @code{X} modifier to regular-expression matching is also |
| valid in Perl mode only. If it is used, whitespace in the |
| pattern (other than in a character class) and |
| characters between a @kbd{#} outside a character class and the |
| next newline character are ignored. An escaping backslash |
| can be used to include a whitespace or @kbd{#} character as part |
| of the pattern. |
| @end ifset |
| @end table |
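
A small sketch comparing several of the flags above on the same input.
The combination of @var{number} with @code{g} and the @code{I} flag
assume @value{SSED}; the sample text is arbitrary.

```shell
first=$(printf '%s\n' 'a a a a' | sed 's/a/b/')      # first match only
all=$(printf '%s\n' 'a a a a' | sed 's/a/b/g')       # every match
third=$(printf '%s\n' 'a a a a' | sed 's/a/b/3')     # third match only
from2=$(printf '%s\n' 'a a a a' | sed 's/a/b/2g')    # second match onward
nocase=$(printf '%s\n' 'Hello HELLO' | sed 's/hello/x/Ig')

# p with -n: print only the lines where a substitution was made.
printed=$(printf '%s\n' one two xyz | sed -n 's/o/0/p')

printf '%s\n' "$first" "$all" "$third" "$from2" "$nocase" "$printed"
```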
| |
| |
| @node Other Commands |
| @section Less Frequently-Used Commands |
| |
| Though perhaps less frequently used than those in the previous |
| section, some very small yet useful @command{sed} scripts can be built with |
| these commands. |
| |
| @table @code |
| @item y/@var{source-chars}/@var{dest-chars}/ |
| (The @code{/} characters may be uniformly replaced by |
| any other single character within any given @code{y} command.) |
| |
| @findex y (transliterate) command |
| @cindex Transliteration |
| Transliterate any characters in the pattern space which match |
| any of the @var{source-chars} with the corresponding character |
| in @var{dest-chars}. |
| |
| Instances of the @code{/} (or whatever other character is used in its stead), |
| @code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars} |
| lists, provided that each instance is escaped by a @code{\}. |
| The @var{source-chars} and @var{dest-chars} lists @emph{must} |
| contain the same number of characters (after de-escaping). |
| |
| @item a\ |
| @itemx @var{text} |
| @cindex @value{SSEDEXT}, two addresses supported by most commands |
| As a @acronym{GNU} extension, this command accepts two addresses. |
| |
| @findex a (append text lines) command |
| @cindex Appending text after a line |
| @cindex Text, appending |
| Queue the lines of text which follow this command |
| (each but the last ending with a @code{\}, |
| which are removed from the output) |
| to be output at the end of the current cycle, |
| or when the next input line is read. |
| |
| Escape sequences in @var{text} are processed, so you should |
| use @code{\\} in @var{text} to print a single backslash. |
| |
| As a @acronym{GNU} extension, if between the @code{a} and the newline there is |
| other than a whitespace-@code{\} sequence, then the text of this line, |
| starting at the first non-whitespace character after the @code{a}, |
| is taken as the first line of the @var{text} block. |
| (This enables a simplification in scripting a one-line add.) |
| This extension also works with the @code{i} and @code{c} commands. |
| |
| @item i\ |
| @itemx @var{text} |
| @cindex @value{SSEDEXT}, two addresses supported by most commands |
| As a @acronym{GNU} extension, this command accepts two addresses. |
| |
| @findex i (insert text lines) command |
| @cindex Inserting text before a line |
| @cindex Text, insertion |
| Immediately output the lines of text which follow this command |
| (each but the last ending with a @code{\}, |
| which are removed from the output). |
| |
| @item c\ |
| @itemx @var{text} |
| @findex c (change to text lines) command |
| @cindex Replacing selected lines with other text |
| Delete the lines matching the address or address-range, |
| and output the lines of text which follow this command |
| (each but the last ending with a @code{\}, |
| which are removed from the output) |
| in place of the last line |
| (or in place of each line, if no addresses were specified). |
| A new cycle is started after this command is done, |
| since the pattern space will have been deleted. |
| |
| @item = |
| @cindex @value{SSEDEXT}, two addresses supported by most commands |
| As a @acronym{GNU} extension, this command accepts two addresses. |
| |
| @findex = (print line number) command |
| @cindex Printing line number |
| @cindex Line number, printing |
| Print out the current input line number (with a trailing newline). |
| |
| @item l @var{n} |
| @findex l (list unambiguously) command |
| @cindex List pattern space |
| @cindex Printing text unambiguously |
| @cindex Line length, setting |
| @cindex @value{SSEDEXT}, setting line length |
| Print the pattern space in an unambiguous form: |
| non-printable characters (and the @code{\} character) |
| are printed in C-style escaped form; long lines are split, |
| with a trailing @code{\} character to indicate the split; |
| the end of each line is marked with a @code{$}. |
| |
| @var{n} specifies the desired line-wrap length; |
| a length of 0 (zero) means to never wrap long lines. If omitted, |
| the default as specified on the command line is used. The @var{n} |
| parameter is a @value{SSED} extension. |
| |
| @item r @var{filename} |
| @cindex @value{SSEDEXT}, two addresses supported by most commands |
| As a @acronym{GNU} extension, this command accepts two addresses. |
| |
| @findex r (read file) command |
| @cindex Read text from a file |
| @cindex @value{SSEDEXT}, @file{/dev/stdin} file |
| Queue the contents of @var{filename} to be read and |
| inserted into the output stream at the end of the current cycle, |
| or when the next input line is read. |
| Note that if @var{filename} cannot be read, it is treated as |
| if it were an empty file, without any error indication. |
| |
| As a @value{SSED} extension, the special value @file{/dev/stdin} |
| is supported for the file name, which reads the contents of the |
| standard input. |
| |
| @item w @var{filename} |
| @findex w (write file) command |
| @cindex Write to a file |
| @cindex @value{SSEDEXT}, @file{/dev/stdout} file |
| @cindex @value{SSEDEXT}, @file{/dev/stderr} file |
| Write the pattern space to @var{filename}. |
| As a @value{SSED} extension, two special values of @var{filename} are |
| supported: @file{/dev/stderr}, which writes the result to the standard |
| error, and @file{/dev/stdout}, which writes to the standard |
| output.@footnote{This is equivalent to @code{p} unless the @option{-i} |
| option is being used.} |
| |
| The file will be created (or truncated) before the |
| first input line is read; all @code{w} commands |
| (including instances of the @code{w} flag on successful @code{s} commands) |
| which refer to the same @var{filename} are output without |
| closing and reopening the file. |
| |
| @item D |
| @findex D (delete first line) command |
| @cindex Delete first line from pattern space |
| Delete text in the pattern space up to the first newline. |
| If any text is left, restart cycle with the resultant |
| pattern space (without reading a new line of input), |
| otherwise start a normal new cycle. |
| |
| @item N |
| @findex N (append Next line) command |
| @cindex Next input line, append to pattern space |
| @cindex Append next input line to pattern space |
| Add a newline to the pattern space, |
| then append the next line of input to the pattern space. |
| If there is no more input then @command{sed} exits without processing |
| any more commands. |
| |
| @item P |
| @findex P (print first line) command |
| @cindex Print first line from pattern space |
| Print out the portion of the pattern space up to the first newline. |
| |
| @item h |
| @findex h (hold) command |
| @cindex Copy pattern space into hold space |
| @cindex Replace hold space with copy of pattern space |
| @cindex Hold space, copying pattern space into |
| Replace the contents of the hold space with the contents of the pattern space. |
| |
| @item H |
| @findex H (append Hold) command |
| @cindex Append pattern space to hold space |
| @cindex Hold space, appending from pattern space |
| Append a newline to the contents of the hold space, |
| and then append the contents of the pattern space to that of the hold space. |
| |
| @item g |
| @findex g (get) command |
| @cindex Copy hold space into pattern space |
| @cindex Replace pattern space with copy of hold space |
| @cindex Hold space, copy into pattern space |
| Replace the contents of the pattern space with the contents of the hold space. |
| |
| @item G |
| @findex G (appending Get) command |
| @cindex Append hold space to pattern space |
| @cindex Hold space, appending to pattern space |
| Append a newline to the contents of the pattern space, |
| and then append the contents of the hold space to that of the pattern space. |
| |
| @item x |
| @findex x (eXchange) command |
| @cindex Exchange hold space with pattern space |
| @cindex Hold space, exchange with pattern space |
| Exchange the contents of the hold and pattern spaces. |
| |
| @end table |
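
The sketch below tries a few of these commands on small inputs. The
data and variable names are illustrative; the scripts avoid
@value{SSED}-only syntax where that is practical.

```shell
# y: transliterate lower case to upper case.
upper=$(printf '%s\n' hello |
  sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

# G and h: accumulate lines in reverse order in the hold space.
reversed=$(printf '%s\n' 1 2 3 | sed -n '1!G;h;$p')

# =: print the number of the last line.
count=$(printf '%s\n' a b | sed -n '$=')

# N: join each pair of lines.
joined=$(printf '%s\n' a b c d | sed 'N;s/\n/ /')

# l: show the pattern space unambiguously.
listed=$(printf 'a\tb\n' | sed -n 'l')

# i and a: put text before and after a matching line.
around=$(printf '%s\n' one two three | sed '/two/i\
before
/two/a\
after')

printf '%s\n' "$upper" "$reversed" "$count" "$joined" "$listed" "$around"
```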
| |
| |
| @node Programming Commands |
| @section Commands for @command{sed} gurus |
| |
| In most cases, use of these commands indicates that you are |
| probably better off programming in something like @command{awk} |
| or Perl. But occasionally one is committed to sticking |
| with @command{sed}, and these commands can enable one to write |
| quite convoluted scripts. |
| |
| @cindex Flow of control in scripts |
| @table @code |
| @item : @var{label} |
| [No addresses allowed.] |
| |
| @findex : (label) command |
| @cindex Labels, in scripts |
| Specify the location of @var{label} for branch commands. |
| In all other respects, a no-op. |
| |
| @item b @var{label} |
| @findex b (branch) command |
| @cindex Branch to a label, unconditionally |
| @cindex Goto, in scripts |
| Unconditionally branch to @var{label}. |
| The @var{label} may be omitted, in which case the next cycle is started. |
| |
| @item t @var{label} |
| @findex t (test and branch if successful) command |
| @cindex Branch to a label, if @code{s///} succeeded |
| @cindex Conditional branch |
| Branch to @var{label} only if there has been a successful @code{s}ubstitution |
| since the last input line was read or conditional branch was taken. |
| The @var{label} may be omitted, in which case the next cycle is started. |
| |
| @end table |
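
A brief sketch of branching, assuming nothing beyond the commands
above. The label names and sample input are arbitrary.

```shell
# t: repeat a substitution until it no longer succeeds.
squeezed=$(printf '%s\n' 'a    b' |
  sed -e ':again' -e 's/  / /' -e 't again')
echo "$squeezed"    # a b

# b without a label: skip the rest of the script for comment lines.
skipped=$(printf '%s\n' '# keep x' 'x y' | sed -e '/^#/b' -e 's/x/X/')
echo "$skipped"     # "# keep x" is unchanged, then "X y"

# Join lines that end with a backslash to the line that follows.
joined=$(printf '%s\n' 'one \' 'two' |
  sed -e ':a' -e '/\\$/N' -e 's/\\\n//' -e 'ta')
echo "$joined"      # one two
```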
| |
| @node Extended Commands |
| @section Commands Specific to @value{SSED} |
| |
| These commands are specific to @value{SSED}, so you |
| must use them with care and only when you are sure that |
| hindering portability is not evil. They allow you to check |
| for @value{SSED} extensions or to do tasks that are required |
| quite often, yet are unsupported by standard @command{sed}s. |
| |
| @table @code |
| @item e [@var{command}] |
| @findex e (evaluate) command |
| @cindex Evaluate Bourne-shell commands |
| @cindex Subprocesses |
| @cindex @value{SSEDEXT}, evaluating Bourne-shell commands |
| @cindex @value{SSEDEXT}, subprocesses |
| This command allows one to pipe input from a shell command |
| into pattern space. Without parameters, the @code{e} command |
| executes the command that is found in pattern space and |
| replaces the pattern space with the output; a trailing newline |
| is suppressed. |
| |
| If a parameter is specified, instead, the @code{e} command |
| interprets it as a command and sends its output to the output stream |
| (like @code{r} does). The command can run across multiple |
| lines, all but the last ending with a back-slash. |
| |
| In both cases, the results are undefined if the command to be |
| executed contains a @sc{nul} character. |
| |
| @item L @var{n} |
| @findex L (fLow paragraphs) command |
| @cindex Reformat pattern space |
| @cindex Reformatting paragraphs |
| @cindex @value{SSEDEXT}, reformatting paragraphs |
| @cindex @value{SSEDEXT}, @code{L} command |
| This @value{SSED} extension fills and joins lines in pattern space |
| to produce output lines of (at most) @var{n} characters, like |
| @code{fmt} does; if @var{n} is omitted, the default as specified |
| on the command line is used. This command is considered a failed |
| experiment and, unless there is enough demand (which seems unlikely), |
| will be removed in future versions. |
| |
| @ignore |
| Blank lines, spaces between words, and indentation are |
| preserved in the output; successive input lines with different |
| indentation are not joined; tabs are expanded to 8 columns. |
| |
| If the pattern space contains multiple lines, they are joined, but |
| since the pattern space usually contains a single line, the behavior |
| of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e., |
| it does not join short lines to form longer ones). |
| |
| @var{n} specifies the desired line-wrap length; if omitted, |
| the default as specified on the command line is used. |
| @end ignore |
| |
| @item Q [@var{exit-code}] |
| This command only accepts a single address. |
| |
| @findex Q (silent Quit) command |
| @cindex @value{SSEDEXT}, quitting silently |
| @cindex @value{SSEDEXT}, returning an exit code |
| @cindex Quitting |
| This command is the same as @code{q}, but will not print the |
| contents of pattern space. Like @code{q}, it provides the |
| ability to return an exit code to the caller. |
| |
| This command can be useful because the only alternative ways |
| to accomplish this apparently trivial function are to use |
| the @option{-n} option (which can unnecessarily complicate |
| your script) or resorting to the following snippet, which |
| wastes time by reading the whole file without any visible effect: |
| |
| @example |
| :eat |
| $d @i{@r{Quit silently on the last line}} |
| N @i{@r{Read another line, silently}} |
| g @i{@r{Overwrite pattern space each time to save memory}} |
| b eat |
| @end example |
| |
| @item R @var{filename} |
| @findex R (read line) command |
| @cindex Read text from a file |
| @cindex @value{SSEDEXT}, reading a file a line at a time |
| @cindex @value{SSEDEXT}, @code{R} command |
| @cindex @value{SSEDEXT}, @file{/dev/stdin} file |
| Queue a line of @var{filename} to be read and |
| inserted into the output stream at the end of the current cycle, |
| or when the next input line is read. |
| Note that if @var{filename} cannot be read, or if its end is |
| reached, no line is appended, without any error indication. |
| |
| As with the @code{r} command, the special value @file{/dev/stdin} |
| is supported for the file name, which reads a line from the |
| standard input. |
| |
| @item T @var{label} |
| @findex T (test and branch if failed) command |
| @cindex @value{SSEDEXT}, branch if @code{s///} failed |
| @cindex Branch to a label, if @code{s///} failed |
| @cindex Conditional branch |
| Branch to @var{label} only if there have been no successful |
| @code{s}ubstitutions since the last input line was read or |
| conditional branch was taken. The @var{label} may be omitted, |
| in which case the next cycle is started. |
| |
| @item v @var{version} |
| @findex v (version) command |
| @cindex @value{SSEDEXT}, checking for their presence |
| @cindex Requiring @value{SSED} |
| This command does nothing, but makes @command{sed} fail if |
| @value{SSED} extensions are not supported, simply because other |
| versions of @command{sed} do not implement it. In addition, you |
| can specify the version of @command{sed} that your script |
| requires, such as @code{4.0.5}. The default is @code{4.0} |
| because that is the first version that implemented this command. |
| |
| This command enables all @value{SSEDEXT} even if |
| @env{POSIXLY_CORRECT} is set in the environment. |
| |
| @item W @var{filename} |
| @findex W (write first line) command |
| @cindex Write first line to a file |
| @cindex @value{SSEDEXT}, writing first line to a file |
| Write to the given filename the portion of the pattern space up to |
| the first newline. Everything said under the @code{w} command about |
| file handling holds here too. |
| |
| @item z |
| @findex z (Zap) command |
| @cindex @value{SSEDEXT}, emptying pattern space |
| @cindex Emptying pattern space |
| This command empties the content of pattern space. It is |
| usually the same as @samp{s/.*//}, but is more efficient |
| and works in the presence of invalid multibyte sequences |
| in the input stream. @sc{posix} mandates that such sequences |
| are @emph{not} matched by @samp{.}, so that there is no portable |
| way to clear @command{sed}'s buffers in the middle of the |
| script in most multibyte locales (including UTF-8 locales). |
| @end table |
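
Several of these commands can be sketched as follows. Everything in
this block requires @value{SSED}; the inputs and variable names are
invented for the example.

```shell
# Q: quit without printing the current pattern space.
quiet=$(printf '%s\n' 1 2 3 | sed '2Q')
echo "$quiet"       # 1

# q (and Q) can hand an exit code back to the caller.
status=0
printf '%s\n' 1 2 | sed -n '1q5' || status=$?
echo "$status"      # 5

# T: branch (here, to the end of the script) if no substitution was made.
marked=$(printf '%s\n' ab cd |
  sed -e 's/a/A/' -e 'T' -e 's/$/ (changed)/')
echo "$marked"      # "Ab (changed)" then "cd"

# z: empty the pattern space.
zapped=$(printf '%s\n' abc | sed 'z;s/^$/(empty)/')
echo "$zapped"      # (empty)
```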
| |
| @node Escapes |
| @section @acronym{GNU} Extensions for Escapes in Regular Expressions |
| |
| @cindex @acronym{GNU} extensions, special escapes |
| Until this chapter, we have only encountered escapes of the form |
| @samp{\^}, which tell @command{sed} not to interpret the circumflex |
| as a special character, but rather to take it literally. For |
| example, @samp{\*} matches a single asterisk rather than zero |
| or more backslashes. |
| |
| @cindex @code{POSIXLY_CORRECT} behavior, escapes |
| This chapter introduces another kind of escape@footnote{All |
| the escapes introduced here are @acronym{GNU} |
| extensions, with the exception of @code{\n}. In basic regular |
| expression mode, setting @code{POSIXLY_CORRECT} disables them inside |
| bracket expressions.}---that |
| is, escapes that are applied to a character or sequence of characters |
| that ordinarily are taken literally, and that @command{sed} replaces |
| with a special character. This provides a way |
| of encoding non-printable characters in patterns in a visible manner. |
| There is no restriction on the appearance of non-printing characters |
| in a @command{sed} script but when a script is being prepared in the |
| shell or by text editing, it is usually easier to use one of |
| the following escape sequences than the binary character it |
| represents. |
| |
| The list of these escapes is: |
| |
| @table @code |
| @item \a |
| Produces or matches a @sc{bel} character, that is, an ``alert'' (@sc{ascii} 7). |
| |
| @item \f |
| Produces or matches a form feed (@sc{ascii} 12). |
| |
| @item \n |
| Produces or matches a newline (@sc{ascii} 10). |
| |
| @item \r |
| Produces or matches a carriage return (@sc{ascii} 13). |
| |
| @item \t |
| Produces or matches a horizontal tab (@sc{ascii} 9). |
| |
| @item \v |
| Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11). |
| |
| @item \c@var{x} |
| Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is |
| any character. The precise effect of @samp{\c@var{x}} is as follows: |
| if @var{x} is a lower case letter, it is converted to upper case. |
| Then bit 6 of the character (hex 40) is inverted. Thus @samp{\cz} becomes |
| hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B. |
| |
| @item \d@var{xxx} |
| Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}. |
| |
| @item \o@var{xxx} |
| @ifset PERL |
| @item \@var{xxx} |
| @end ifset |
| Produces or matches a character whose octal @sc{ascii} value is @var{xxx}. |
| @ifset PERL |
| The syntax without the @code{o} is active in Perl mode, while the one |
| with the @code{o} is active in the normal or extended @sc{posix} regular |
| expression modes. |
| @end ifset |
| |
| @item \x@var{xx} |
| Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}. |
| @end table |
| |
| @samp{\b} (backspace) was omitted because of the conflict with |
| the existing ``word boundary'' meaning. |
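
A few of these escapes are tried in the sketch below, which assumes
@value{SSED}. The decimal, octal and hexadecimal forms all name the
letter @samp{A} here, chosen only for illustration.

```shell
# \t in the regexp matches a tab.
comma=$(printf 'a\tb\n' | sed 's/\t/,/')
echo "$comma"       # a,b

# \t in the replacement produces a tab.
tabbed=$(printf '%s\n' 'a b' | sed 's/ /\t/')

# Decimal 065, octal 101 and hex 41 are all the letter A.
dec=$(printf '%s\n' A | sed 's/\d065/B/')
oct=$(printf '%s\n' A | sed 's/\o101/B/')
hex=$(printf '%s\n' A | sed 's/\x41/B/')
echo "$dec $oct $hex"    # B B B
```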
| |
| Other escapes match a particular character class and are valid only in |
| regular expressions: |
| |
| @table @code |
| @item \w |
| Matches any ``word'' character. A ``word'' character is any |
| letter or digit or the underscore character. |
| |
| @item \W |
| Matches any ``non-word'' character. |
| |
| @item \b |
| Matches a word boundary; that is, it matches if the character |
| to the left is a ``word'' character and the character to the |
| right is a ``non-word'' character, or vice-versa. |
| |
| @item \B |
| Matches everywhere but on a word boundary; that is, it matches |
| if the character to the left and the character to the right |
| are either both ``word'' characters or both ``non-word'' |
| characters. |
| |
| @item \` |
| Matches only at the start of pattern space. This is different |
| from @code{^} in multi-line mode. |
| |
| @item \' |
| Matches only at the end of pattern space. This is different |
| from @code{$} in multi-line mode. |
| |
| @ifset PERL |
| @item \G |
| Matches only at the start of pattern space or, when doing a global |
| substitution using the @code{s///g} command and option, at |
| the end-of-match position of the prior match. For example, |
| @samp{s/\Ga/Z/g} will change an initial run of @code{a}s to |
| a run of @code{Z}s. |
| @end ifset |
| @end table |
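
The character-class and anchor escapes can be tried out with a sketch
like the one below. It assumes @acronym{GNU} @command{sed}; the input
strings and variable names are invented for the example. The last two
commands use the @code{M} flag of the @code{s} command to contrast
@code{^} with the start-of-pattern-space anchor.

```shell
# Assumes GNU sed.
# \b anchors the match at word boundaries, so only the whole word changes.
word_out=$(echo 'cat concat cats' | sed 's/\bcat\b/dog/g')
# \W matches any character that is not a letter, digit or underscore.
nonword_out=$(echo 'a-b c_d' | sed 's/\W/+/g')
# In multi-line mode (M flag) ^ matches after every newline,
# while \` still matches only at the start of pattern space.
caret_out=$(printf 'a\nb\nc\n' | sed 'N;N;s/^/X/gM')
start_out=$(printf 'a\nb\nc\n' | sed 'N;N;s/\`/X/gM')
echo "$word_out"
echo "$nonword_out"
echo "$caret_out"
echo "$start_out"
```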
| |
| @node Examples |
| @chapter Some Sample Scripts |
| |
| Here are some @command{sed} scripts to guide you in the art of mastering |
| @command{sed}. |
| |
| @menu |
| Some exotic examples: |
| * Centering lines:: |
| * Increment a number:: |
| * Rename files to lower case:: |
| * Print bash environment:: |
| * Reverse chars of lines:: |
| |
| Emulating standard utilities: |
| * tac:: Reverse lines of files |
| * cat -n:: Numbering lines |
| * cat -b:: Numbering non-blank lines |
| * wc -c:: Counting chars |
| * wc -w:: Counting words |
| * wc -l:: Counting lines |
| * head:: Printing the first lines |
| * tail:: Printing the last lines |
| * uniq:: Make duplicate lines unique |
| * uniq -d:: Print duplicated lines of input |
| * uniq -u:: Remove all duplicated lines |
| * cat -s:: Squeezing blank lines |
| @end menu |
| |
| @node Centering lines |
| @section Centering Lines |
| |
| This script centers all lines of a file to an 80-column width. |
| To change that width, the number in @code{\@{@dots{}\@}} must be |
| replaced, and the number of added spaces must be changed as well. |
| |
| Note how the buffer commands are used to separate parts in |
| the regular expressions to be matched---this is a common |
| technique. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -f |
| |
| # Put 80 spaces in the buffer |
| 1 @{ |
| x |
| s/^$/          / |
| s/^.*$/&&&&&&&&/ |
| x |
| @} |
| |
| # del leading and trailing spaces |
| y/@kbd{tab}/ / |
| s/^ *// |
| s/ *$// |
| |
| # add a newline and 80 spaces to end of line |
| G |
| |
| # keep first 81 chars (80 + a newline) |
| s/^\(.\@{81\@}\).*$/\1/ |
| |
| # \2 matches half of the spaces, which are moved to the beginning |
| s/^\(.*\)\n\(.*\)\2/\2\1/ |
| @end example |
| @c end--------------------------------------------- |
| |
| @node Increment a number |
| @section Increment a Number |
| |
| This script is one of a few that demonstrate how to do arithmetic |
| in @command{sed}. This is indeed possible,@footnote{@command{sed} guru Greg |
| Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator! |
| It is distributed together with sed.} but must be done manually. |
| |
| To increment a number, you add 1 to its last digit, replacing |
| it with the following digit. There is one exception: when the digit |
| is a nine, the preceding digits must also be incremented until you |
| reach one that is not a nine. |
| |
| This solution by Bruno Haible is very clever because |
| it uses a single buffer; if you don't have this limitation, the |
| algorithm used in @ref{cat -n, Numbering lines}, is faster. |
| It works by replacing trailing nines with an underscore, then |
| using multiple @code{s} commands to increment the last digit, |
| and then again substituting underscores with zeros. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -f |
| |
| /[^0-9]/ d |
| |
| # replace all trailing 9s by _ (any other character except digits could |
| # be used) |
| :d |
| s/9\(_*\)$/_\1/ |
| td |
| |
| # incr last digit only. The first line adds a most-significant |
| # digit of 1 if we have to add a digit. |
| # |
| # The @code{tn} commands are not necessary, but make the thing |
| # faster |
| |
| s/^\(_*\)$/1\1/; tn |
| s/8\(_*\)$/9\1/; tn |
| s/7\(_*\)$/8\1/; tn |
| s/6\(_*\)$/7\1/; tn |
| s/5\(_*\)$/6\1/; tn |
| s/4\(_*\)$/5\1/; tn |
| s/3\(_*\)$/4\1/; tn |
| s/2\(_*\)$/3\1/; tn |
| s/1\(_*\)$/2\1/; tn |
| s/0\(_*\)$/1\1/; tn |
| |
| :n |
| y/_/0/ |
| @end example |
| @c end--------------------------------------------- |
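
The three phases of the script can be followed one at a time on a
sample value. The sketch below is only a trace, not a replacement for
the script: the input @samp{1299} is made up, the variable names are
illustrative, and only the single digit rule needed for this input is
used in the second step.

```shell
# Step 1: turn trailing 9s into underscores, one per loop iteration.
step1=$(echo 1299 | sed -e ':d' -e 's/9\(_*\)$/_\1/' -e 'td')
# Step 2: bump the last remaining digit (only the rule for 2 is needed here).
step2=$(echo "$step1" | sed 's/2\(_*\)$/3\1/')
# Step 3: the underscores become zeros.
step3=$(echo "$step2" | sed 'y/_/0/')
echo "$step1 $step2 $step3"
```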
| |
| @node Rename files to lower case |
| @section Rename Files to Lower Case |
| |
| This is a pretty strange use of @command{sed}. We transform text |
| into shell commands, then just feed them to the shell. |
| Don't worry, even worse hacks are done when using @command{sed}; I have |
| seen a script converting the output of @command{date} into a @command{bc} |
| program! |
| |
| The main body of this is the @command{sed} script, which remaps the name |
| from lower to upper (or vice-versa) and even checks |
| whether the remapped name is the same as the original name. |
| Note how the script is parameterized using shell |
| variables and proper quoting. |
| |
| @c start------------------------------------------- |
| @example |
| #! /bin/sh |
| # rename files to lower/upper case... |
| # |
| # usage: |
| # move-to-lower * |
| # move-to-upper * |
| # or |
| # move-to-lower -R . |
| # move-to-upper -R . |
| # |
| |
| help() |
| @{ |
| cat << eof |
| Usage: $0 [-n] [-R] [-h] files... |
| |
| -n do nothing, only see what would be done |
| -R recursive (use find) |
| -h this message |
| files files to remap to lower case |
| |
| Examples: |
| $0 -n * (see if everything is ok, then...) |
| $0 * |
| |
| $0 -R . |
| |
| eof |
| @} |
| |
| apply_cmd='sh' |
| finder='echo "$@@" | tr " " "\n"' |
| files_only= |
| |
| while : |
| do |
| case "$1" in |
| -n) apply_cmd='cat' ;; |
| -R) finder='find "$@@" -type f';; |
| -h) help ; exit 1 ;; |
| *) break ;; |
| esac |
| shift |
| done |
| |
| if [ -z "$1" ]; then |
| echo Usage: $0 [-h] [-n] [-R] files... |
| exit 1 |
| fi |
| |
| LOWER='abcdefghijklmnopqrstuvwxyz' |
| UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ' |
| |
| case `basename $0` in |
| *upper*) TO=$UPPER; FROM=$LOWER ;; |
| *) FROM=$UPPER; TO=$LOWER ;; |
| esac |
| |
| eval $finder | sed -n ' |
| |
| # remove all trailing slashes |
| s/\/*$// |
| |
| # add ./ if there is no path, only a filename |
| /\//! s/^/.\// |
| |
| # save path+filename |
| h |
| |
| # remove path |
| s/.*\/// |
| |
| # do conversion only on filename |
| y/'$FROM'/'$TO'/ |
| |
| # now line contains original path+file, while |
| # hold space contains the new filename |
| x |
| |
| # add converted file name to line, which now contains |
| # path/file-name\nconverted-file-name |
| G |
| |
| # check if converted file name is equal to original file name, |
| # if it is, do not print anything |
| /^.*\/\(.*\)\n\1/b |
| |
| # now, transform path/fromfile\n, into |
| # mv path/fromfile path/tofile and print it |
| s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p |
| |
| ' | $apply_cmd |
| @end example |
| @c end--------------------------------------------- |
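
To see what the embedded @command{sed} script produces for one
pathname, its core commands can be run on their own. This is a reduced
sketch under stated assumptions: the path @file{Docs/README.TXT} is
invented, the check for unchanged names is left out, and the resulting
@command{mv} command is only printed, never executed.

```shell
FROM='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
TO='abcdefghijklmnopqrstuvwxyz'
# Docs/README.TXT is a made-up path; the mv command is built, not run.
# h saves the full path, s strips the directory, y converts the name,
# x and G rebuild "path/name<newline>converted-name" for the final s.
cmd=$(echo 'Docs/README.TXT' | sed -n '
h
s/.*\///
y/'$FROM'/'$TO'/
x
G
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
')
echo "$cmd"
```

Closing the single quotes around the shell variables, as the full
script does, is what lets the shell substitute the two alphabets into
the @code{y} command.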
| |
| @node Print bash environment |
| @section Print @command{bash} Environment |
| |
| This script strips the definition of the shell functions |
| from the output of the @command{set} Bourne-shell command. |
| |
| @c start------------------------------------------- |
| @example |
| #!/bin/sh |
| |
| set | sed -n ' |
| :x |
| |
| @ifinfo |
| # if no occurrence of "=()" print and load next line |
| @end ifinfo |
| @ifnotinfo |
| # if no occurrence of @samp{=()} print and load next line |
| @end ifnotinfo |
| /=()/! @{ p; b; @} |
| / () $/! @{ p; b; @} |
| |
| # possible start of functions section |
| # save the line in case this is a var like FOO="() " |
| h |
| |
| # if the next line has a brace, we quit because |
| # nothing comes after functions |
| n |
| /^@{/ q |
| |
| # print the old line |
| x; p |
| |
| # work on the new line now |
| x; bx |
| ' |
| @end example |
| @c end--------------------------------------------- |
| |
| @node Reverse chars of lines |
| @section Reverse Characters of Lines |
| |
| This script can be used to reverse the position of characters |
| in lines. The technique moves two characters at a time, hence |
| it is faster than more intuitive implementations. |
| |
| Note the @code{tx} command before the definition of the label. |
| This is often needed to reset the flag that is tested by |
| the @code{t} command. |
| |
| Imaginative readers will find uses for this script. An example |
| is reversing the output of @command{banner}.@footnote{This requires |
| another script to pad the output of banner; for example |
| |
| @example |
| #! /bin/sh |
| |
| banner -w $1 $2 $3 $4 | |
| sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' | |
| ~/sedscripts/reverseline.sed |
| @end example |
| } |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -f |
| |
| /../! b |
| |
| # Reverse a line. Begin embedding the line between two newlines |
| s/^.*$/\ |
| &\ |
| / |
| |
| # Move first character to the end. The regexp matches until |
| # there are zero or one characters between the markers |
| tx |
| :x |
| s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/ |
| tx |
| |
| # Remove the newline markers |
| s/\n//g |
| @end example |
| @c end--------------------------------------------- |
| |
| @node tac |
| @section Reverse Lines of Files |
| |
| This one begins a series of totally useless (yet interesting) |
| scripts emulating various Unix commands. This, in particular, |
| is a @command{tac} workalike. |
| |
| Note that on implementations other than @acronym{GNU} @command{sed} |
| @ifset PERL |
| and @value{SSED} |
| @end ifset |
| this script might easily overflow internal buffers. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -nf |
| |
| # reverse all lines of input, i.e. first line becomes last, ... |
| |
| # from the second line, the buffer (which contains all previous lines) |
| # is *appended* to current line, so, the order will be reversed |
| 1! G |
| |
| # on the last line we're done -- print everything |
| $ p |
| |
| # store everything on the buffer again |
| h |
| @end example |
| @c end--------------------------------------------- |
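
The same three commands fit on a single command line, which makes for a
quick check of the technique. The three-line input below is made up for
the example.

```shell
# 1!G appends the saved lines after the current one, h saves the result,
# and $p prints the accumulated text once the last line is reached.
reversed=$(printf 'one\ntwo\nthree\n' | sed -n '1!G;$p;h')
echo "$reversed"
```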
| |
| @node cat -n |
| @section Numbering Lines |
| |
| This script replaces @samp{cat -n}; in fact it formats its output |
| exactly like @acronym{GNU} @command{cat} does. |
| |
| Of course this is completely useless, for two reasons: first, |
| because somebody else did it in C; second, because the following |
| Bourne-shell script could be used for the same purpose and would |
| be much faster: |
| |
| @c start------------------------------------------- |
| @example |
| #! /bin/sh |
| sed -e "=" "$@@" | sed -e ' |
| s/^/      / |
| N |
| s/^ *\(......\)\n/\1  / |
| ' |
| @end example |
| @c end--------------------------------------------- |
| |
| It uses @command{sed} to print the line number, then groups lines two |
| by two using @code{N}. Of course, this script does not teach as much as |
| the one presented below. |
| |
| The algorithm used for incrementing uses both buffers, so the line |
| is printed as soon as possible and then discarded. The number |
| is split so that changing digits go in a buffer and unchanged ones go |
| in the other; the changed digits are modified in a single step |
| (using a @code{y} command). The line number for the next line |
| is then composed and stored in the hold space, to be used in the |
| next iteration. |
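
The split into changing and unchanged digits can be traced outside the
script. The sketch below is only an illustration: it uses shell
variables where the script uses the two buffers, and the sample
number @samp{1299} is made up.

```shell
# Mark the split point: the last non-9 digit and the 9s after it will change.
split=$(echo 1299 | sed 's/.9*$/x&/')
# Changing digits: everything after the x, advanced by one with y.
changed=$(echo "$split" | sed -e 's/^.*x//' -e 'y/0123456789/1234567890/')
# Unchanged digits: everything before the x.
kept=$(echo "$split" | sed 's/x.*$//')
echo "$split -> $kept$changed"
```

Because every trailing 9 wraps to 0 and the digit before them moves up
by one, a single @code{y} command handles the whole carry.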
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -nf |
| |
| # Prime the pump on the first line |
| x |
| /^$/ s/^.*$/1/ |
| |
| # Add the correct line number before the pattern |
| G |
| h |
| |
| # Format it and print it |
| s/^/      / |
| s/^ *\(......\)\n/\1  /p |
| |
| # Get the line number from hold space; add a zero |
| # if we're going to add a digit on the next line |
| g |
| s/\n.*$// |
| /^9*$/ s/^/0/ |
| |
| # separate changing/unchanged digits with an x |
| s/.9*$/x&/ |
| |
| # keep changing digits in hold space |
| h |
| s/^.*x// |
| y/0123456789/1234567890/ |
| x |
| |
| # keep unchanged digits in pattern space |
| s/x.*$// |
| |
| # compose the new number, remove the newline implicitly added by G |
| G |
| s/\n// |
| h |
| @end example |
| @c end--------------------------------------------- |
| |
| @node cat -b |
| @section Numbering Non-blank Lines |
| |
| Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only |
| have to select which lines are to be numbered and which are not. |
| |
| The part that is common to this script and the previous one is |
| not commented to show how important it is to comment @command{sed} |
| scripts properly... |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -nf |
| |
| /^$/ @{ |
| p |
| b |
| @} |
| |
| # Same as cat -n from now |
| x |
| /^$/ s/^.*$/1/ |
| G |
| h |
| s/^/      / |
| s/^ *\(......\)\n/\1  /p |
| x |
| s/\n.*$// |
| /^9*$/ s/^/0/ |
| s/.9*$/x&/ |
| h |
| s/^.*x// |
| y/0123456789/1234567890/ |
| x |
| s/x.*$// |
| G |
| s/\n// |
| h |
| @end example |
| @c end--------------------------------------------- |
| |
| @node wc -c |
| @section Counting Characters |
| |
| This script shows another way to do arithmetic with @command{sed}. |
| In this case we have to add possibly large numbers, so implementing |
| this by successive increments would not be feasible (and possibly |
| even more complicated to contrive than this script). |
| |
| The approach is to map numbers to letters, kind of an abacus |
| implemented with @command{sed}. @samp{a}s are units, @samp{b}s are |
| tens and so on: we simply add the number of characters |
| on the current line as units, and then propagate the carry |
| to tens, hundreds, and so on. |
| |
| As usual, running totals are kept in hold space. |
| |
| On the last line, we convert the abacus form back to decimal. |
| For the sake of variety, this is done with a loop rather than |
| with some 80 @code{s} commands@footnote{Some implementations |
| have a limit of 199 commands per script.}: first we |
| convert units, removing @samp{a}s from the number; then we |
| rotate letters so that tens become @samp{a}s, and so on |
| until no more letters remain. |
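
A small trace may make the abacus idea more concrete. It is a sketch
for a single made-up count of 23 characters: it performs one carry
step and then converts back using only the two digit rules that this
particular value needs. The variable names are illustrative.

```shell
# 23 units, written as 23 a's (printf pads a zero to 23 digits, sed renames them).
units=$(printf '%023d' 0 | sed 's/0/a/g')
# Carry: every ten a's become one b, leaving 2 tens and 3 units.
abacus=$(echo "$units" | sed 's/aaaaaaaaaa/b/g')
# Convert back: units first, then rotate tens into the units position.
decimal=$(echo "$abacus" | sed -e 's/aaa/3/' -e 'y/b/a/' -e 's/aa/2/')
echo "$abacus $decimal"
```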
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -nf |
| |
| # Add n+1 a's to hold space (+1 is for the newline) |
| s/./a/g |
| H |
| x |
| s/\n/a/ |
| |
| # Do the carry. The t's and b's are not necessary, |
| # but they do speed up the thing |
| t a |
| : a; s/aaaaaaaaaa/b/g; t b; b done |
| : b; s/bbbbbbbbbb/c/g; t c; b done |
| : c; s/cccccccccc/d/g; t d; b done |
| : d; s/dddddddddd/e/g; t e; b done |
| : e; s/eeeeeeeeee/f/g; t f; b done |
| : f; s/ffffffffff/g/g; t g; b done |
| : g; s/gggggggggg/h/g; t h; b done |
| : h; s/hhhhhhhhhh//g |
| |
| : done |
| $! @{ |
| h |
| b |
| @} |
| |
| # On the last line, convert back to decimal |
| |
| : loop |
| /a/! s/[b-h]*/&0/ |
| s/aaaaaaaaa/9/ |
| s/aaaaaaaa/8/ |
| s/aaaaaaa/7/ |
| s/aaaaaa/6/ |
| s/aaaaa/5/ |
| s/aaaa/4/ |
| s/aaa/3/ |
| s/aa/2/ |
| s/a/1/ |
| |
| : next |
| y/bcdefgh/abcdefg/ |
| /[a-h]/ b loop |
| p |
| @end example |
| @c end--------------------------------------------- |
| |
| @node wc -w |
| @section Counting Words |
| |
| This script is almost the same as the previous one, once each |
| of the words on the line is converted to a single @samp{a} |
| (in the previous script each letter was changed to an @samp{a}). |
| |
| It is interesting that real @command{wc} programs have optimized |
| loops for @samp{wc -c}, so they are much slower at counting |
| words than at counting characters. This script's bottleneck, |
| instead, is arithmetic, and hence the word-counting one |
| is faster (it has to manage smaller numbers). |
| |
| Again, the common parts are not commented to show the importance |
| of commenting @command{sed} scripts. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -nf |
| |
| # Convert words to a's |
| s/[ @kbd{tab}][ @kbd{tab}]*/ /g |
| s/^/ / |
| s/ [^ ][^ ]*/a /g |
| s/ //g |
| |
| # Append them to hold space |
| H |
| x |
| s/\n// |
| |
| # From here on it is the same as in wc -c. |
| /aaaaaaaaaa/! bx; s/aaaaaaaaaa/b/g |
| /bbbbbbbbbb/! bx; s/bbbbbbbbbb/c/g |
| /cccccccccc/! bx; s/cccccccccc/d/g |
| /dddddddddd/! bx; s/dddddddddd/e/g |
| /eeeeeeeeee/! bx; s/eeeeeeeeee/f/g |
| /ffffffffff/! bx; s/ffffffffff/g/g |
| /gggggggggg/! bx; s/gggggggggg/h/g |
| s/hhhhhhhhhh//g |
| :x |
| $! @{ h; b; @} |
| :y |
| /a/! s/[b-h]*/&0/ |
| s/aaaaaaaaa/9/ |
| s/aaaaaaaa/8/ |
| s/aaaaaaa/7/ |
| s/aaaaaa/6/ |
| s/aaaaa/5/ |
| s/aaaa/4/ |
| s/aaa/3/ |
| s/aa/2/ |
| s/a/1/ |
| y/bcdefgh/abcdefg/ |
| /[a-h]/ by |
| p |
| @end example |
| @c end--------------------------------------------- |
| |
| @node wc -l |
| @section Counting Lines |
| |
| No strange things are done now, because @command{sed} gives us |
| @samp{wc -l} functionality for free!!! Look: |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -nf |
| $= |
| @end example |
| @c end--------------------------------------------- |
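
Run from the command line on a made-up three-line input, the same
command looks like this:

```shell
# $= prints the number of the last input line; -n suppresses the lines themselves.
count=$(printf 'a\nb\nc\n' | sed -n '$=')
echo "$count"
```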
| |
| @node head |
| @section Printing the First Lines |
| |
| This script is probably the simplest useful @command{sed} script. |
| It displays the first 10 lines of input; the number of displayed |
| lines is right before the @code{q} command. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -f |
| 10q |
| @end example |
| @c end--------------------------------------------- |
| |
| @node tail |
| @section Printing the Last Lines |
| |
| Printing the last @var{n} lines rather than the first is more complex |
| but indeed possible. @var{n} is encoded in the second line, before |
| the bang character. |
| |
| This script is similar to the @command{tac} script in that it keeps the |
| final output in the hold space and prints it at the end: |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -nf |
| |
| 1! @{; H; g; @} |
| 1,10 !s/[^\n]*\n// |
| $p |
| h |
| @end example |
| @c end--------------------------------------------- |
| |
| Mainly, the script keeps a window of 10 lines and slides it |
| by adding a line and deleting the oldest (the substitution command |
| on the second line works like a @code{D} command but does not |
| restart the loop). |
| |
| The ``sliding window'' technique is a very powerful way to write |
| efficient and complex @command{sed} scripts, because commands like |
| @code{P} would require a lot of work if implemented manually. |
| |
| To introduce the technique, which is fully demonstrated in the |
| rest of this chapter and is based on the @code{N}, @code{P} |
| and @code{D} commands, here is an implementation of @command{tail} |
| using a simple ``sliding window.'' |
| |
| This looks complicated, but in fact it works the same way as |
| the last script: after we have kicked in the appropriate number |
| of lines, however, we stop using the hold space to keep inter-line |
| state, and instead use @code{N} and @code{D} to slide pattern |
| space by one line: |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -f |
| |
| 1h |
| 2,10 @{; H; g; @} |
| $q |
| 1,9d |
| N |
| D |
| @end example |
| @c end--------------------------------------------- |
| |
| Note how the first, second and fourth lines are inactive after |
| the first ten lines of input. After that, all the script does |
| is: exiting on the last line of input, appending the next input |
| line to pattern space, and removing the first line. |
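
A commonly seen one-line variant of the sliding window is sketched
below. It is not the script above: it never touches the hold space and
instead loops with @code{N} from the start, dropping the oldest line
with @code{D} from input line 11 on. The sketch assumes that the
@command{seq} utility is available to generate 25 numbered test lines.

```shell
# A common one-liner variant (not the script above): keep appending with N,
# and from input line 11 on drop the oldest line with D.
last_ten=$(seq 1 25 | sed -e ':a' -e '$q;N;11,$D;ba')
# Inspect the result: first surviving line and number of lines kept.
first=$(echo "$last_ten" | sed 1q)
count=$(echo "$last_ten" | sed -n '$=')
echo "$first $count"
```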
| |
| @node uniq |
| @section Make Duplicate Lines Unique |
| |
| This is an example of the art of using the @code{N}, @code{P} |
| and @code{D} commands, probably the most difficult to master. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -f |
| h |
| |
| :b |
| # On the last line, print and exit |
| $b |
| N |
| /^\(.*\)\n\1$/ @{ |
| # The two lines are identical. Undo the effect of |
| # the N command. |
| g |
| bb |
| @} |
| |
| # If the @code{N} command had added the last line, print and exit |
| $b |
| |
| # The lines are different; print the first and go |
| # back working on the second. |
| P |
| D |
| @end example |
| @c end--------------------------------------------- |
| |
| As you can see, we maintain a 2-line window using @code{P} and @code{D}. |
| This technique is often used in advanced @command{sed} scripts. |
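
The two-line window also works without the hold space. The following
widely circulated one-liner is a sketch of that variant, shown here on
made-up input; it is not the script above.

```shell
# $!N pulls in the next line; P prints the first line of the window only
# when the two lines differ; D drops it and restarts with what is left.
deduped=$(printf 'a\na\nb\nb\nb\nc\n' | sed '$!N; /^\(.*\)\n\1$/!P; D')
echo "$deduped"
```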
| |
| @node uniq -d |
| @section Print Duplicated Lines of Input |
| |
| This script prints only duplicated lines, like @samp{uniq -d}. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -nf |
| |
| $b |
| N |
| /^\(.*\)\n\1$/ @{ |
| # Print the first of the duplicated lines |
| s/.*\n// |
| p |
| |
| # Loop until we get a different line |
| :b |
| $b |
| N |
| /^\(.*\)\n\1$/ @{ |
| s/.*\n// |
| bb |
| @} |
| @} |
| |
| # The last line cannot be followed by duplicates |
| $b |
| |
| # Found a different one. Leave it alone in the pattern space |
| # and go back to the top, hunting its duplicates |
| D |
| @end example |
| @c end--------------------------------------------- |
| |
| @node uniq -u |
| @section Remove All Duplicated Lines |
| |
| This script prints only unique lines, like @samp{uniq -u}. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -f |
| |
| # Search for a duplicate line --- until that, print what you find. |
| $b |
| N |
| /^\(.*\)\n\1$/ ! @{ |
| P |
| D |
| @} |
| |
| :c |
| # Got two equal lines in pattern space. At the |
| # end of the file we simply exit |
| $d |
| |
| # Else, we keep reading lines with @code{N} until we |
| # find a different one |
| s/.*\n// |
| N |
| /^\(.*\)\n\1$/ @{ |
| bc |
| @} |
| |
| # Remove the last instance of the duplicate line |
| # and go back to the top |
| D |
| @end example |
| @c end--------------------------------------------- |
| |
| @node cat -s |
| @section Squeezing Blank Lines |
| |
| As a final example, here are three scripts, of increasing complexity |
| and speed, that implement the same function as @samp{cat -s}, that is, |
| squeezing blank lines. |
| |
| The first leaves a blank line at the beginning and end if there are |
| some already. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -f |
| |
| # on empty lines, join with next |
| # Note there is a star in the regexp |
| :x |
| /^\n*$/ @{ |
| N |
| bx |
| @} |
| |
| # now, squeeze all '\n', this can be also done by: |
| # s/^\(\n\)*/\1/ |
| s/\n*/\ |
| / |
| @end example |
| @c end--------------------------------------------- |
| |
| This one is a bit more complex and removes all empty lines |
| at the beginning. It does leave a single blank line at the end |
| if one was there. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -f |
| |
| # delete all leading empty lines |
| 1,/^./@{ |
| /./!d |
| @} |
| |
| # on an empty line we remove it and all the following |
| # empty lines, but one |
| :x |
| /./!@{ |
| N |
| s/^\n$// |
| tx |
| @} |
| @end example |
| @c end--------------------------------------------- |
| |
| This removes leading and trailing blank lines. It is also the |
| fastest. Note that loops are completely done with @code{n} and |
| @code{b}, without relying on @command{sed} to restart the |
| script automatically at the end of a line. |
| |
| @c start------------------------------------------- |
| @example |
| #!/usr/bin/sed -nf |
| |
| # delete all (leading) blanks |
| /./!d |
| |
| # get here: so there is a non empty |
| :x |
| # print it |
| p |
| # get next |
| n |
| # got chars? print it again, etc... |
| /./bx |
| |
| # no, don't have chars: got an empty line |
| :z |
| # get next, if last line we finish here so no trailing |
| # empty lines are written |
| n |
| # also empty? then ignore it, and get next... this will |
| # remove ALL empty lines |
| /./!bz |
| |
| # all empty lines were deleted/ignored, but we have a non empty. As |
| # what we want to do is to squeeze, insert a blank line artificially |
| i\ |
| |
| bx |
| @end example |
| @c end--------------------------------------------- |
| |
| @node Limitations |
| @chapter @value{SSED}'s Limitations and Non-limitations |
| |
| @cindex @acronym{GNU} extensions, unlimited line length |
| @cindex Portability, line length limitations |
| For those who want to write portable @command{sed} scripts, |
| be aware that some implementations have been known to |
| limit line lengths (for the pattern and hold spaces) |
| to be no more than 4000 bytes. |
| The @sc{posix} standard specifies that conforming @command{sed} |
| implementations shall support at least 8192 byte line lengths. |
| @value{SSED} has no built-in limit on line length; |
| as long as it can @code{malloc()} more (virtual) memory, |
| you can feed or construct lines as long as you like. |
| |
| However, recursion is used to handle subpatterns and indefinite |
| repetition. This means that the available stack space may limit |
| the size of the buffer that can be processed by certain patterns. |
| |
| @ifset PERL |
| There are some size limitations in the regular expression |
| matcher but it is hoped that they will never in practice |
| be relevant. The maximum length of a compiled pattern |
| is 65539 (sic) bytes. All values in repeating quantifiers |
| must be less than 65536. The maximum nesting depth of |
| all parenthesized subpatterns, including capturing and |
| non-capturing subpatterns@footnote{The |
| distinction is meaningful when referring to Perl-style |
| regular expressions.}, assertions, and other types of |
| subpattern, is 200. |
| |
| Also, @value{SSED} recognizes the @sc{posix} syntax |
| @code{[.@var{ch}.]} and @code{[=@var{ch}=]} |
| where @var{ch} is a ``collating element'', but these |
| are not supported, and an error is given if they are |
| encountered. |
| |
| Here are a few distinctions between the real Perl-style |
| regular expressions and those that @option{-R} recognizes. |
| |
| @enumerate |
| @item |
| Lookahead assertions do not allow repeat quantifiers after them. |
| Perl permits them, but they do not mean what you |
| might think. For example, @samp{(?!a)@{3@}} does not assert that the |
| next three characters are not @samp{a}. It just asserts three times that the |
| next character is not @samp{a} --- a waste of time and nothing else. |
| |
| @item |
| Capturing subpatterns that occur inside negative lookahead |
| assertions are counted, but their entries are counted |
| as empty in the second half of an @code{s} command. |
| Perl sets its numerical variables from any such patterns |
| that are matched before the assertion fails to match |
| something (thereby succeeding), but only if the negative |
| lookahead assertion contains just one branch. |
| |
| @item |
| The following Perl escape sequences are not supported: |
| @samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E}, |
| @samp{\Q}. In fact these are implemented by Perl's general |
| string-handling and are not part of its pattern matching engine. |
| |
| @item |
| The Perl @samp{\G} assertion is not supported as it is not |
| relevant to single pattern matches. |
| |
| @item |
| Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})} |
| and @samp{(?p@{code@})} constructions. However, there is some experimental |
| support for recursive patterns using the non-Perl item @samp{(?R)}. |
| |
| @item |
| There are at the time of writing some oddities in Perl |
| 5.005_02 concerned with the settings of captured strings |
| when part of a pattern is repeated. For example, matching |
| @samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets |
| @samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.} |
| to the value @samp{b}, but matching @samp{aabbaa} |
| against @samp{/^(aa(bb)?)+$/} leaves @samp{$2} |
| unset. However, if the pattern is changed to |
| @samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set. |
| In Perl 5.004 @samp{$2} is set in both cases, and that is also |
| true of @value{SSED}. |
| |
| @item |
| Another as yet unresolved discrepancy is that in Perl |
| 5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches |
| the string @samp{a}, whereas in @value{SSED} it does not. |
| However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched |
| against @samp{a} leaves @samp{$1} unset. |
| @end enumerate |
| @end ifset |
| |
| @node Other Resources |
| @chapter Other Resources for Learning About @command{sed} |
| |
| @cindex Additional reading about @command{sed} |
| In addition to several books that have been written about @command{sed} |
| (either specifically or as chapters in books which discuss |
| shell programming), one can find out more about @command{sed} |
| (including suggestions of a few books) from the FAQ |
| for the @code{sed-users} mailing list, available from: |
| @display |
| @uref{http://sed.sourceforge.net/sedfaq.html} |
| @end display |
| |
| Also of interest are |
| @uref{http://www.student.northpark.edu/pemente/sed/index.htm} |
| and @uref{http://sed.sf.net/grabbag}, |
| which include @command{sed} tutorials and other @command{sed}-related goodies. |
| |
| The @code{sed-users} mailing list itself is maintained by Sven Guckes. |
| To subscribe, visit @uref{http://groups.yahoo.com} and search |
| for the @code{sed-users} mailing list. |
| |
| @node Reporting Bugs |
| @chapter Reporting Bugs |
| |
| @cindex Bugs, reporting |
| Email bug reports to @email{bonzini@@gnu.org}. |
| Be sure to include the word ``sed'' somewhere in the @code{Subject:} field. |
| Also, please include the output of @samp{sed --version} in the body |
| of your report if at all possible. |
| |
| Please do not send a bug report like this: |
| |
| @example |
| @i{@i{@r{while building frobme-1.3.4}}} |
| $ configure |
| @error{} sed: file sedscr line 1: Unknown option to 's' |
| @end example |
| |
| If @value{SSED} doesn't configure your favorite package, take a |
| few extra minutes to identify the specific problem and make a stand-alone |
| test case. Unlike other programs such as C compilers, making such test |
| cases for @command{sed} is quite simple. |
| |
| A stand-alone test case includes all the data necessary to perform the |
| test, and the specific invocation of @command{sed} that causes the problem. |
| The smaller a stand-alone test case is, the better. A test case should |
| not involve something as far removed from @command{sed} as ``try to configure |
| frobme-1.3.4''. Yes, that is in principle enough information to look |
| for the bug, but that is not a very practical prospect. |
| |
| Here are a few commonly reported bugs that are not bugs. |
| |
| @table @asis |
| @item @code{N} command on the last line |
| @cindex Portability, @code{N} command on the last line |
| @cindex Non-bugs, @code{N} command on the last line |
| |
| Most versions of @command{sed} exit without printing anything when |
| the @code{N} command is issued on the last line of a file. |
| @value{SSED} prints pattern space before exiting unless of course |
| the @option{-n} command switch has been specified. This choice is |
| by design. |
| |
| For example, the behavior of |
| @example |
| sed N foo bar |
| @end example |
| @noindent |
| would depend on whether @file{foo} has an even or an odd number of |
| lines@footnote{which is the actual ``bug'' that prompted the |
| change in behavior}. Or, when writing a script to read the |
| next few lines following a pattern match, traditional |
| implementations of @command{sed} would force you to write |
| something like |
| @example |
| /foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @} |
| @end example |
| @noindent |
| instead of just |
| @example |
| /foo/@{ N;N;N;N;N;N;N;N;N; @} |
| @end example |
| |
| @cindex @code{POSIXLY_CORRECT} behavior, @code{N} command |
| In any case, the simplest workaround is to use @code{$d;N} in |
| scripts that rely on the traditional behavior, or to set |
| the @code{POSIXLY_CORRECT} variable to a non-empty value. |
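| |
| As a rough illustration of the difference (a sketch that assumes |
| @acronym{GNU} @command{sed} is the @command{sed} found in your |
| @env{PATH}, and uses a made-up three-line input), compare: |
| |
| @example |
| # GNU behavior: the unpaired last line is still printed |
| printf 'a\nb\nc\n' | sed N |
| # traditional behavior: the unpaired last line is dropped |
| printf 'a\nb\nc\n' | POSIXLY_CORRECT=1 sed N |
| # portable spelling: prints all three lines in either mode |
| printf 'a\nb\nc\n' | sed '$!N' |
| @end example |
| |
| @noindent |
| The first and third commands should print all three lines, while the |
| second should print only @samp{a} and @samp{b}. |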
| |
| @item Regex syntax clashes (problems with backslashes) |
| @cindex @acronym{GNU} extensions, to basic regular expressions |
| @cindex Non-bugs, regex syntax clashes |
| @command{sed} uses the @sc{posix} basic regular expression syntax. According to |
| the standard, the meaning of some escape sequences is undefined in |
| this syntax; notable in the case of @command{sed} are @code{\|}, |
| @code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<}, |
| @code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}. |
| |
| As in all @acronym{GNU} programs that use @sc{posix} basic regular |
| expressions, @command{sed} interprets these escape sequences as special |
| characters. So, @code{x\+} matches one or more occurrences of @samp{x}. |
| @code{abc\|def} matches either @samp{abc} or @samp{def}. |
| |
| This syntax may cause problems when running scripts written for other |
| @command{sed}s. Some @command{sed} programs have been written with the |
| assumption that @code{\|} and @code{\+} match the literal characters |
| @code{|} and @code{+}. Such scripts must be modified by removing the |
| spurious backslashes if they are to be used with modern implementations |
| of @command{sed}, like |
| @ifset PERL |
| @value{SSED} or |
| @end ifset |
| @acronym{GNU} @command{sed}. |
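| |
| A quick way to see the difference (a sketch, assuming @acronym{GNU} |
| @command{sed} and made-up input strings): |
| |
| @example |
| # escaped plus is a repetition operator: prints Z+y |
| echo 'xxx+y' | sed 's/x\+/Z/' |
| # unescaped plus is a literal character: prints xxZy |
| echo 'xxx+y' | sed 's/x+/Z/' |
| # escaped bar is alternation: prints X X |
| echo 'abc def' | sed 's/abc\|def/X/g' |
| @end example |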
| |
| On the other hand, some scripts use @samp{s|abc\|def||g} to remove occurrences |
| of @emph{either} @code{abc} or @code{def}. While this worked until |
| @command{sed} 4.0.x, newer versions interpret this as removing the |
| string @code{abc|def}. This is again undefined behavior according to |
| @acronym{POSIX}, and this interpretation is arguably more robust: older |
| @command{sed}s, for example, required that the regex matcher parsed |
| @code{\/} as @code{/} in the common case of escaping a slash, which is |
| again undefined behavior; the new behavior avoids this, and this is good |
| because the regex matcher is only partially under our control. |
| |
| @cindex @acronym{GNU} extensions, special escapes |
| In addition, this version of @command{sed} supports several escape characters |
| (some of which are multi-character) to insert non-printable characters |
| in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r}, |
| @code{\t}, @code{\v}, @code{\x}). These can cause similar problems |
| with scripts written for other @command{sed}s. |
| |
| @item @option{-i} clobbers read-only files |
| @cindex In-place editing |
| @cindex @value{SSEDEXT}, in-place editing |
| @cindex Non-bugs, in-place editing |
| |
| In short, @samp{sed -i} will let you delete the contents of |
| a read-only file, and in general the @option{-i} option |
| (@pxref{Invoking sed, , Invocation}) lets you clobber |
| protected files. This is not a bug, but rather a consequence |
| of how the Unix filesystem works. |
| |
| The permissions on a file say what can happen to the data |
| in that file, while the permissions on a directory say what can |
| happen to the list of files in that directory. @samp{sed -i} |
| will not ever open for writing a file that is already on disk. |
| Rather, it will work on a temporary file that is finally renamed |
| to the original name: if you rename or delete files, you're actually |
| modifying the contents of the directory, so the operation depends on |
| the permissions of the directory, not of the file. For this same |
| reason, @command{sed} does not let you use @option{-i} on a writeable file |
| in a read-only directory, and will break hard or symbolic links when |
| @option{-i} is used on such a file. |
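| |
| The following shell session sketches this behavior. It assumes |
| @acronym{GNU} @command{sed} and @command{mktemp}; the directory and |
| file names are only illustrative. |
| |
| @example |
| dir=$(mktemp -d) |
| printf 'hello\n' > "$dir/f" |
| chmod 444 "$dir/f"               # the file itself is read-only |
| sed -i 's/hello/bye/' "$dir/f"   # succeeds: the directory is writable |
| cat "$dir/f"                     # prints bye |
| rm -rf "$dir" |
| @end example |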
| |
| @item @code{0a} does not work (gives an error) |
| @cindex @code{0} address |
| @cindex @acronym{GNU} extensions, @code{0} address |
| @cindex Non-bugs, @code{0} address |
| |
| There is no line 0. @code{0} is a special address that is only used to treat |
| addresses like @code{0,/@var{RE}/} as active when the script starts: if |
| you write @code{1,/abc/d} and the first line includes the word @samp{abc}, |
| then that match would be ignored because address ranges must span at least |
| two lines (barring the end of the file); but what you probably wanted is |
| to delete every line up to the first one including @samp{abc}, and this |
| is obtained with @code{0,/abc/d}. |
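| |
| For instance (a sketch, assuming @acronym{GNU} @command{sed} and a |
| made-up four-line input): |
| |
| @example |
| # the match on line 1 is ignored, so lines 1-3 are deleted: prints y |
| printf 'abc\nx\nabc\ny\n' | sed '1,/abc/d' |
| # the range ends at line 1: prints x, abc and y |
| printf 'abc\nx\nabc\ny\n' | sed '0,/abc/d' |
| @end example |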
| |
| @ifclear PERL |
| @item @code{[a-z]} is case insensitive |
| @cindex Non-bugs, localization-related |
| |
| You are encountering problems with locales. @sc{posix} mandates that @code{[a-z]} |
| uses the current locale's collation order -- in C parlance, that means using |
| @code{strcoll(3)} instead of @code{strcmp(3)}. Some locales have a |
| case-insensitive collation order, others don't. |
| |
| Another problem is that @code{[a-z]} tries to use collation symbols. |
| This only happens if you are on the @acronym{GNU} system, using |
| @acronym{GNU} libc's regular expression matcher instead of compiling the |
| one supplied with @acronym{GNU} @command{sed}. In a Danish locale, for example, |
| the regular expression @code{^[a-z]$} matches the string @samp{aa}, |
| because this is a single collating symbol that comes after @samp{a} |
| and before @samp{b}; @samp{ll} behaves similarly in Spanish |
| locales, or @samp{ij} in Dutch locales. |
| |
| To work around these problems, which may cause bugs in shell scripts, set |
| the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. |
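| |
| For example, the following sketch forces the C locale for a single |
| command. It uses @env{LC_ALL}, which is an assumption on top of the |
| advice above: when set, it overrides both @env{LC_COLLATE} and |
| @env{LC_CTYPE}. |
| |
| @example |
| # prints ABC unchanged: in the C locale [a-z] matches only lower case |
| echo 'ABC' | LC_ALL=C sed 's/[a-z]/x/g' |
| @end example |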
| |
| @item @code{s/.*//} does not clear pattern space |
| @cindex Non-bugs, localization-related |
| @cindex @value{SSEDEXT}, emptying pattern space |
| @cindex Emptying pattern space |
| |
| This happens if your input stream includes invalid multibyte |
| sequences. @sc{posix} mandates that such sequences |
| are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear |
| pattern space as you would expect. In fact, there is no way to clear |
| @command{sed}'s buffers in the middle of the script in most multibyte locales |
| (including UTF-8 locales). For this reason, @value{SSED} provides a @code{z} |
| command (for ``zap'') as an extension. |
| |
| To work around these problems, which may cause bugs in shell scripts, set |
| the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. |
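| |
| A minimal sketch of the @code{z} command, assuming a version of |
| @acronym{GNU} @command{sed} that provides it; the input contains a |
| byte (octal 377) that is invalid in UTF-8: |
| |
| @example |
| # z empties pattern space whatever it contains: prints empty |
| printf 'a\377b\n' | sed 'z;s/^$/empty/' |
| @end example |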
| @end ifclear |
| @end table |
| |
| |
| @node Extended regexps |
| @appendix Extended regular expressions |
| @cindex Extended regular expressions, syntax |
| |
| The only difference between basic and extended regular expressions is in |
| the behavior of a few characters: @samp{?}, @samp{+}, parentheses, |
| and braces (@samp{@{@}}). While basic regular expressions require |
| these to be escaped if you want them to behave as special characters, |
| when using extended regular expressions you must escape them if |
| you want them @emph{to match a literal character}. |
| |
| @noindent |
| Examples: |
| @table @code |
| @item abc? |
| becomes @samp{abc\?} when using extended regular expressions. It matches |
| the literal string @samp{abc?}. |
| |
| @item c\+ |
| becomes @samp{c+} when using extended regular expressions. It matches |
| one or more @samp{c}s. |
| |
| @item a\@{3,\@} |
| becomes @samp{a@{3,@}} when using extended regular expressions. It matches |
| three or more @samp{a}s. |
| |
| @item \(abc\)\@{2,3\@} |
| becomes @samp{(abc)@{2,3@}} when using extended regular expressions. It |
| matches either @samp{abcabc} or @samp{abcabcabc}. |
| |
| @item \(abc*\)\1 |
| becomes @samp{(abc*)\1} when using extended regular expressions. |
| Backreferences must still be escaped when using extended regular |
| expressions. |
| @end table |
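| |
| The pairs below sketch the same substitutions in both syntaxes. |
| They assume @acronym{GNU} @command{sed}, whose @option{-r} option |
| selects extended regular expressions; the inputs are made up. Each |
| command prints @samp{X}. |
| |
| @example |
| echo 'ccc' | sed 's/c\+/X/' |
| echo 'ccc' | sed -r 's/c+/X/' |
| echo 'abccabcc' | sed 's/\(abc*\)\1/X/' |
| echo 'abccabcc' | sed -r 's/(abc*)\1/X/' |
| echo 'abc?' | sed 's/abc?/X/' |
| echo 'abc?' | sed -r 's/abc\?/X/' |
| @end example |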
| |
| @ifset PERL |
| @node Perl regexps |
| @appendix Perl-style regular expressions |
| @cindex Perl-style regular expressions, syntax |
| |
| @emph{This part is taken from the @file{pcre.txt} file distributed together |
| with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.} |
| |
| Perl introduced several extensions to regular expressions, some |
| of them incompatible with the syntax of regular expressions |
| accepted by Emacs and other @acronym{GNU} tools (whose matcher was |
| based on the Emacs matcher). @value{SSED} implements |
| both kinds of extensions. |
| |
| @iftex |
| Summarizing, we have: |
| |
| @itemize @bullet |
| @item |
| A backslash can introduce several special sequences |
| |
| @item |
| The circumflex, dollar sign, and period characters behave specially |
| with regard to new lines |
| |
| @item |
| Strange uses of square brackets are parsed differently |
| |
| @item |
| You can toggle modifiers in the middle of a regular expression |
| |
| @item |
| You can specify that a subpattern does not count when numbering backreferences |
| |
| @item |
| @cindex Greedy regular expression matching |
| You can specify greedy or non-greedy matching |
| |
| @item |
| You can have more than ten back references |
| |
| @item |
| You can do complex look aheads and look behinds (in the spirit of |
| @code{\b}, but with subpatterns). |
| |
| @item |
| You can often improve performance by avoiding that @command{sed} wastes |
| time with backtracking |
| |
| @item |
| You can have if/then/else branches |
| |
| @item |
| You can do recursive matches, for example to look for unbalanced parentheses |
| |
| @item |
| You can have comments and non-significant whitespace, because things can |
| get complex... |
| @end itemize |
| |
| Most of these extensions are introduced by the special @code{(?} |
| sequence, which gives special meanings to parenthesized groups. |
| @end iftex |
| @menu |
| Other extensions can be roughly subdivided in two categories. |
| On one hand, Perl introduces several more escaped sequences |
| (that is, sequences introduced by a backslash). On the other |
| hand, it specifies that if a question mark follows an open |
| parenthesis, it should give a special meaning to the parenthesized |
| group. |
| |
| * Backslash:: Introduces special sequences |
| * Circumflex/dollar sign/period:: Behave specially with regard to new lines |
| * Square brackets:: Are a bit different in strange cases |
| * Options setting:: Toggle modifiers in the middle of a regexp |
| * Non-capturing subpatterns:: Are not counted when backreferencing |
| * Repetition:: Allows for non-greedy matching |
| * Backreferences:: Allows for more than 10 back references |
| * Assertions:: Allows for complex look ahead matches |
| * Non-backtracking subpatterns:: Often gives more performance |
| * Conditional subpatterns:: Allows if/then/else branches |
| * Recursive patterns:: For example to match parentheses |
| * Comments:: Because things can get complex... |
| @end menu |
| |
| @node Backslash |
| @appendixsec Backslash |
| @cindex Perl-style regular expressions, escaped sequences |
| |
| There are a few differences in the handling of backslashed |
| sequences in Perl mode. |
| |
| First of all, there are no @code{\o} and @code{\d} sequences. |
| @sc{ascii} values for characters can be specified in octal |
| with a @code{\@var{xxx}} sequence, where @var{xxx} is a |
| sequence of up to three octal digits. If the first digit |
| is a zero, the treatment of the sequence is straightforward; |
| just note that if the character that follows the escaped digit |
| is itself an octal digit, you have to supply three octal digits |
| for @var{xxx}. For example @code{\07} is a @sc{bel} character |
| rather than a @sc{nul} and a literal @code{7} (this sequence is |
| instead represented by @code{\0007}). |
| |
| @cindex Perl-style regular expressions, backreferences |
| The handling of a backslash followed by a digit other than 0 |
| is complicated. Outside a character class, @command{sed} reads it |
| and any following digits as a decimal number. If the number |
| is less than 10, or if there have been at least that many |
| previous capturing left parentheses in the expression, the |
| entire sequence is taken as a back reference. A description |
| of how this works is given later, following the discussion |
| of parenthesized subpatterns. |
| |
| Inside a character class, or if the decimal number is |
| greater than 9 and there have not been that many capturing |
| subpatterns, @command{sed} re-reads up to three octal digits following |
| the backslash, and generates a single byte from the |
| least significant 8 bits of the value. Any subsequent digits |
| stand for themselves. For example: |
| |
| @example |
| \040 @i{@r{is another way of writing a space}} |
| \40 @i{@r{is the same, provided there are fewer than 40}} |
| @i{@r{previous capturing subpatterns}} |
| \7 @i{@r{is always a back reference}} |
| \011 @i{@r{is always a tab}} |
| \11 @i{@r{might be a back reference, or another way of writing a tab}} |
| \0113 @i{@r{is a tab followed by the character @samp{3}}} |
| \113 @i{@r{is the character with octal code 113 (since there}} |
| @i{@r{can be no more than 99 back references)}} |
| \377 @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}} |
| \81 @i{@r{is either a back reference, or a binary zero}} |
| @i{@r{followed by the two characters @samp{81}}} |
| @end example |
| |
| Note that octal values of 100 or greater must not be introduced |
| by a leading zero, because no more than three octal |
| digits are ever read. Note that this applies only to the LHS |
| pattern; it is not yet possible to specify more than 9 backreferences |
| on the RHS of the @code{s} command. |
| |
| All the sequences that define a single byte value can be |
| used both inside and outside character classes. In addition, |
| inside a character class, the sequence @code{\b} is interpreted |
| as the backspace character (hex 08). Outside a character |
| class it has a different meaning (see below). |
| |
| Perl mode also provides additional escapes specifying |
| generic character classes (like @code{\w} and @code{\W} do): |
| |
| @cindex Perl-style regular expressions, character classes |
| @table @samp |
| @item \d |
| Matches any decimal digit |
| |
| @item \D |
| Matches any character that is not a decimal digit |
| @end table |
| |
| In Perl mode, these character type sequences can appear both inside and |
| outside character classes. Instead, in @sc{posix} mode these sequences |
| (as well as @code{\w} and @code{\W}) are treated as two literal characters |
| (a backslash and a letter) inside square brackets. |
| |
| Escaped sequences specifying assertions are also different in |
| Perl mode. An assertion specifies a condition that has to be met |
| at a particular point in a match, without consuming any |
| characters from the subject string. The use of subpatterns |
| for more complicated assertions is described below. The |
| backslashed assertions are |
| |
| @cindex Perl-style regular expressions, assertions |
| @table @samp |
| @item \b |
| Asserts that the point is at a word boundary. |
| A word boundary is a position in the subject string where |
| the current character and the previous character do not both |
| match @code{\w} or @code{\W} (i.e. one matches @code{\w} and |
| the other matches @code{\W}), or the start or end of the string |
| if the first or last character matches @code{\w}, respectively. |
| |
| @item \B |
| Asserts that the point is not at a word boundary. |
| |
| @item \A |
| Asserts the matcher is at the start of pattern space (independent |
| of multiline mode). |
| |
| @item \Z |
| Asserts the matcher is at the end of pattern space, |
| or at a newline before the end of pattern space (independent of |
| multiline mode) |
| |
| @item \z |
| Asserts the matcher is at the end of pattern space (independent |
| of multiline mode) |
| @end table |
| |
| These assertions may not appear in character classes (but |
| note that @code{\b} has a different meaning, namely the |
| backspace character, inside a character class). |
| Note that Perl mode does not directly support assertions |
| for the beginning and the end of a word; the @acronym{GNU} extensions |
| @code{\<} and @code{\>} achieve this purpose in @sc{posix} mode |
| instead. |
| |
| The @code{\A}, @code{\Z}, and @code{\z} assertions differ |
| from the traditional circumflex and dollar sign (described below) |
| in that they only ever match at the very start and end of the |
| subject string, whatever options are set; in particular @code{\A} |
| and @code{\z} are the same as the @acronym{GNU} extensions |
| @code{\`} and @code{\'} that are active in @sc{posix} mode. |
| |
| @node Circumflex/dollar sign/period |
| @appendixsec Circumflex, dollar sign, period |
| @cindex Perl-style regular expressions, newlines |
| |
| Outside a character class, in the default matching mode, the |
| circumflex character is an assertion which is true only if |
| the current matching point is at the start of the subject |
| string. Inside a character class, the circumflex has an entirely |
| different meaning (see below). |
| |
| The circumflex need not be the first character of the pattern if |
| a number of alternatives are involved, but it should be the |
| first thing in each alternative in which it appears if the |
| pattern is ever to match that branch. If all possible alternatives |
| start with a circumflex, that is, if the pattern is |
| constrained to match only at the start of the subject, it is |
| said to be an @dfn{anchored} pattern. (There are also other |
| constructs that can cause a pattern to be anchored.) |
| |
| A dollar sign is an assertion which is true only if the |
| current matching point is at the end of the subject string, |
| or immediately before a newline character that is the last |
| character in the string (by default). A dollar sign need not be the |
| last character of the pattern if a number of alternatives |
| are involved, but it should be the last item in any branch |
| in which it appears. A dollar sign has no special meaning in a |
| character class. |
| |
| @cindex Perl-style regular expressions, multiline |
| The meanings of the circumflex and dollar sign characters are |
| changed if the @code{M} modifier option is used. When this is |
| the case, they match immediately after and immediately |
| before an internal @code{\n} character, respectively, in addition |
| to matching at the start and end of the subject string. For |
| example, the pattern @code{/^abc$/} matches the subject string |
| @samp{def\nabc} in multiline mode, but not otherwise. Consequently, |
| patterns that are anchored in single line mode |
| because all branches start with @code{^} are not anchored in |
| multiline mode. |
| |
| @cindex Perl-style regular expressions, multiline |
| Note that the sequences @code{\A}, @code{\Z}, and @code{\z} |
| can be used to match the start and end of the subject in both |
| modes, and if all branches of a pattern start with @code{\A}, |
| it is always anchored, whether the @code{M} modifier is set or not. |
| |
| @cindex Perl-style regular expressions, single line |
| Outside a character class, a dot in the pattern matches any |
| one character in the subject, including a non-printing character, |
| but not (by default) newline. If the @code{S} modifier is used, |
| dots match newlines as well. Actually, the handling of |
| dot is entirely independent of the handling of circumflex |
| and dollar sign, the only relationship being that they both |
| involve newline characters. Dot has no special meaning in a |
| character class. |
| |
| @node Square brackets |
| @appendixsec Square brackets |
| @cindex Perl-style regular expressions, character classes |
| |
| An opening square bracket introduces a character class, terminated |
| by a closing square bracket. A closing square bracket on its own |
| is not special. If a closing square bracket is required as a |
| member of the class, it should be the first data character in |
| the class (after an initial circumflex, if present) or escaped with a backslash. |
| |
| A character class matches a single character in the subject; |
| the character must be in the set of characters defined by |
| the class, unless the first character in the class is a circumflex, |
| in which case the subject character must not be in |
| the set defined by the class. If a circumflex is actually |
| required as a member of the class, ensure it is not the |
| first character, or escape it with a backslash. |
| |
| For example, the character class @code{[aeiou]} matches any lower |
| case vowel, while @code{[^aeiou]} matches any character that is not |
| a lower case vowel. Note that a circumflex is just a convenient |
| notation for specifying the characters which are in |
| the class by enumerating those that are not. It is not an |
| assertion: it still consumes a character from the subject |
| string, and fails if the current pointer is at the end of |
| the string. |
| |
| @cindex Perl-style regular expressions, case-insensitive |
| When caseless matching is set, any letters in a class |
| represent both their upper case and lower case versions, so |
| for example, a caseless @code{[aeiou]} matches uppercase |
| and lowercase @samp{A}s, and a caseless @code{[^aeiou]} |
| does not match @samp{A}, whereas a case-sensitive version would. |
| |
| @cindex Perl-style regular expressions, single line |
| @cindex Perl-style regular expressions, multiline |
| The newline character is never treated in any special way in |
| character classes, whatever the setting of the @code{S} and |
| @code{M} options (modifiers) is. A class such as @code{[^a]} will |
| always match a newline. |
| |
| The minus (hyphen) character can be used to specify a range |
| of characters in a character class. For example, @code{[d-m]} |
| matches any letter between d and m, inclusive. If a minus |
| character is required in a class, it must be escaped with a |
| backslash or appear in a position where it cannot be interpreted |
| as indicating a range, typically as the first or last |
| character in the class. |
| |
| It is not possible to have the literal character @code{]} as the |
| end character of a range. A pattern such as @code{[W-]46]} is |
| interpreted as a class of two characters (@code{W} and @code{-}) |
| followed by a literal string @code{46]}, so it would match |
| @samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped |
| with a backslash it is interpreted as the end of range, so |
| @code{[W-\]46]} is interpreted as a single class containing a |
| range followed by two separate characters. The octal or |
| hexadecimal representation of @code{]} can also be used to end a range. |
| |
| Ranges operate in @sc{ascii} collating sequence. They can also be |
| used for characters specified numerically, for example |
| @code{[\000-\037]}. If a range that includes letters is used when |
| caseless matching is set, it matches the letters in either |
| case. For example, a caseless @code{[W-c]} is equivalent to |
| @code{[][\^_`wxyzabc]}, matched caselessly, and if character |
| tables for the French locale are in use, @code{[\xc8-\xcb]} |
| matches accented E characters in both cases. |
| |
| Unlike in @sc{posix} mode, the character types @code{\d}, |
| @code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W} |
| may also appear in a character class, and add the characters |
| that they match to the class. For example, @code{[\dABCDEF]} matches any |
| hexadecimal digit. A circumflex can conveniently be used |
| with the upper case character types to specify a more restricted |
| set of characters than the matching lower case type. |
| For example, the class @code{[^\W_]} matches any letter or digit, |
| but not underscore. |
| |
| All non-alphanumeric characters other than @code{\}, @code{-}, |
| @code{^} (at the start) and the terminating @code{]} |
| are non-special in character classes, but it does no harm |
| if they are escaped. |
| |
| Perl 5.6 supports the @sc{posix} notation for character classes, which |
| uses names enclosed by @code{[:} and @code{:]} within the enclosing |
| square brackets, and @value{SSED} supports this notation as well. |
| For example, |
| |
| @example |
| [01[:alpha:]%] |
| @end example |
| |
| @noindent |
| matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}. |
| The supported class names are |
| |
| @table @code |
| @item alnum |
| Matches letters and digits |
| |
| @item alpha |
| Matches letters |
| |
| @item ascii |
| Matches character codes 0 - 127 |
| |
| @item cntrl |
| Matches control characters |
| |
| @item digit |
| Matches decimal digits (same as @code{\d}) |
| |
| @item graph |
| Matches printing characters, excluding space |
| |
| @item lower |
| Matches lower case letters |
| |
| @item print |
| Matches printing characters, including space |
| |
| @item punct |
| Matches printing characters, excluding letters and digits |
| |
| @item space |
| Matches white space (same as @code{\s}) |
| |
| @item upper |
| Matches upper case letters |
| |
| @item word |
| Matches ``word'' characters (same as @code{\w}) |
| |
| @item xdigit |
| Matches hexadecimal digits |
| @end table |
| |
| The names @code{ascii} and @code{word} are extensions valid only in |
| Perl mode. Another Perl extension is negation, which is |
| indicated by a circumflex character after the colon. For example, |
| |
| @example |
| [12[:^digit:]] |
| @end example |
| |
| @noindent |
| matches @samp{1}, @samp{2}, or any non-digit. |
| |
| @node Options setting |
| @appendixsec Options setting |
| @cindex Perl-style regular expressions, toggling options |
| @cindex Perl-style regular expressions, case-insensitive |
| @cindex Perl-style regular expressions, multiline |
| @cindex Perl-style regular expressions, single line |
| @cindex Perl-style regular expressions, extended |
| |
| The settings of the @code{I}, @code{M}, @code{S}, @code{X} |
| modifiers can be changed from within the pattern by |
| a sequence of Perl option letters enclosed between @code{(?} |
| and @code{)}. The option letters must be lowercase. |
| |
| For example, @code{(?im)} sets caseless, multiline matching. It is |
| also possible to unset these options by preceding the letter |
| with a hyphen; you can also have combined settings and unsettings: |
| @code{(?im-sx)} sets caseless and multiline matching, |
| while unsetting single line matching (for dots) and extended |
| whitespace interpretation. If a letter appears both before |
| and after the hyphen, the option is unset. |
| |
| The scope of these option changes depends on where in the |
| pattern the setting occurs. For settings that are outside |
| any subpattern (defined below), the effect is the same as if |
| the options were set or unset at the start of matching. The |
| following patterns all behave in exactly the same way: |
| |
| @example |
| (?i)abc |
| a(?i)bc |
| ab(?i)c |
| abc(?i) |
| @end example |
| |
| @noindent |
| which in turn is the same as specifying the pattern @code{abc} with |
| the @code{I} modifier. In other words, ``top level'' settings |
| apply to the whole pattern (unless there are other |
| changes inside subpatterns). If there is more than one setting |
| of the same option at top level, the rightmost setting |
| is used. |
| |
| If an option change occurs inside a subpattern, the effect |
| is different. This is a change of behaviour in Perl 5.005. |
| An option change inside a subpattern affects only that part |
| of the subpattern @emph{that follows} it, so |
| |
| @example |
| (a(?i)b)c |
| @end example |
| |
| @noindent |
| matches @samp{abc} and @samp{aBc} and no other strings (assuming |
| case-sensitive matching is used). By this means, options can |
| be made to have different settings in different parts of the |
| pattern. Any changes made in one alternative do carry on |
| into subsequent branches within the same subpattern. For |
| example, |
| |
| @example |
| (a(?i)b|c) |
| @end example |
| |
| @noindent |
| matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C}, |
| even though when matching @samp{C} the first branch is |
| abandoned before the option setting. |
| This is because the effects of option settings happen at |
| compile time. There would be some very weird behaviour otherwise. |
| |
| @ignore |
| There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA |
| that can be changed in the same way as the Perl-compatible options by |
| using the characters U and X respectively. The (?X) flag |
| setting is special in that it must always occur earlier in |
| the pattern than any of the additional features it turns on, |
| even when it is at top level. It is best put at the start. |
| @end ignore |
| |
| |
| @node Non-capturing subpatterns |
| @appendixsec Non-capturing subpatterns |
| @cindex Perl-style regular expressions, non-capturing subpatterns |
| |
| Marking part of a pattern as a subpattern does two things. |
| On one hand, it localizes a set of alternatives; on the other |
| hand, it sets up the subpattern as a capturing subpattern (as |
| defined above). The subpattern can be backreferenced and |
| referenced in the right side of @code{s} commands. |
| |
| For example, if the string @samp{the red king} is matched against |
| the pattern |
| |
| @example |
| the ((red|white) (king|queen)) |
| @end example |
| |
| @noindent |
| the captured substrings are @samp{red king}, @samp{red}, |
| and @samp{king}, and are numbered 1, 2, and 3. |
| |
| The fact that plain parentheses fulfil two functions is not |
| always helpful. There are often times when a grouping |
| subpattern is required without a capturing requirement. If an |
| opening parenthesis is followed by @code{?:}, the subpattern does |
| not do any capturing, and is not counted when computing the |
| number of any subsequent capturing subpatterns. For example, |
| if the string @samp{the white queen} is matched against the pattern |
| |
| @example |
| the ((?:red|white) (king|queen)) |
| @end example |
| |
| @noindent |
| the captured substrings are @samp{white queen} and @samp{queen}, |
| and are numbered 1 and 2. The maximum number of captured |
| substrings is 99, while the maximum number of all subpatterns, |
| both capturing and non-capturing, is 200. |
| |
| As a convenient shorthand, if any option settings are |
| required at the start of a non-capturing subpattern, the |
| option letters may appear between the @code{?} and the |
| @code{:}. Thus the two patterns |
| |
| @example |
| (?i:saturday|sunday) |
| (?:(?i)saturday|sunday) |
| @end example |
| |
| @noindent |
| match exactly the same set of strings. Because alternative |
| branches are tried from left to right, and options are not |
| reset until the end of the subpattern is reached, an option |
| setting in one branch does affect subsequent branches, so |
| the above patterns match @samp{SUNDAY} as well as @samp{Saturday}. |
| |
| |
| @node Repetition |
| @appendixsec Repetition |
| @cindex Perl-style regular expressions, repetitions |
| |
| Repetition is specified by quantifiers, which can follow any |
| of the following items: |
| |
| @itemize @bullet |
| @item |
| a single character, possibly escaped |
| |
| @item |
| the @code{.} special character |
| |
| @item |
| a character class |
| |
| @item |
| a back reference (see next section) |
| |
| @item |
| a parenthesized subpattern (unless it is an assertion; @pxref{Assertions}) |
| @end itemize |
| |
| The general repetition quantifier specifies a minimum and |
| maximum number of permitted matches, by giving the two |
| numbers in curly brackets (braces), separated by a comma. |
| The numbers must be less than 65536, and the first must be |
| less than or equal to the second. For example: |
| |
| @example |
| z@{2,4@} |
| @end example |
| |
| @noindent |
| matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own |
| is not a special character. If the second number is omitted, |
| but the comma is present, there is no upper limit; if the |
| second number and the comma are both omitted, the quantifier |
| specifies an exact number of required matches. Thus |
| |
| @example |
| [aeiou]@{3,@} |
| @end example |
| |
| @noindent |
| matches at least 3 successive vowels, but may match many |
| more, while |
| |
| @example |
| \d@{8@} |
| @end example |
| |
| @noindent |
| matches exactly 8 digits. An opening curly bracket that |
| appears in a position where a quantifier is not allowed, or |
| one that does not match the syntax of a quantifier, is taken |
| as a literal character. For example, @samp{@{,6@}} is not a quantifier, |
| but a literal string of four characters.@footnote{It |
| raises an error if @option{-R} is not used.} |
| |
| The quantifier @samp{@{0@}} is permitted, causing the expression to |
| behave as if the previous item and the quantifier were not |
| present. |
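| |
| The quantifier forms described above can be tried out as follows. This is |
| a sketch using Python's @code{re} module as an assumed stand-in for the |
| matcher in @value{SSED}; the sample strings are made up for illustration. |
| The @samp{@{,6@}} case is left out because Python reads it as a quantifier |
| with a minimum of zero, which differs from the behavior described here. |
| |
```python
import re

# z{2,4}: between two and four repetitions.
z_lengths = [n for n in range(1, 7) if re.fullmatch(r"z{2,4}", "z" * n)]

# [aeiou]{3,}: at least three vowels, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# \d{8}: exactly eight digits.
eight_ok = re.fullmatch(r"\d{8}", "20040131") is not None
seven_ok = re.fullmatch(r"\d{8}", "2004013") is not None

# {0}: the quantified item behaves as if it were absent.
zero_ok = re.fullmatch(r"ab{0}c", "ac") is not None

print(z_lengths, vowel_run, eight_ok, seven_ok, zero_ok)
```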
| |
| For convenience (and historical compatibility) the three |
| most common quantifiers have single-character abbreviations: |
| |
| @table @code |
| @item * |
| is equivalent to @code{@{0,@}} |
| |
| @item + |
| is equivalent to @code{@{1,@}} |
| |
| @item ? |
| is equivalent to @code{@{0,1@}} |
| @end table |
| |
| It is possible to construct infinite loops by following a |
| subpattern that can match no characters with a quantifier |
| that has no upper limit, for example: |
| |
| @example |
| (a?)* |
| @end example |
| |
| Earlier versions of Perl used to give an error at |
| compile time for such patterns. However, because there are |
| cases where this can be useful, such patterns are now |
| accepted, but if any repetition of the subpattern does in |
| fact match no characters, the loop is forcibly broken. |
| |
| @cindex Greedy regular expression matching |
| @cindex Perl-style regular expressions, stingy repetitions |
| By default, the quantifiers are @dfn{greedy}, as in @sc{posix} |
| mode, that is, they match as much as possible (up to the maximum |
| number of permitted times), without causing the rest of the |
| pattern to fail. The classic example of where this gives problems |
| is in trying to match comments in C programs. These appear between |
| the sequences @code{/*} and @code{*/} and within the sequence, individual |
| @code{*} and @code{/} characters may appear. An attempt to match C |
| comments by applying the pattern |
| |
| @example |
| /\*.*\*/ |
| @end example |
| |
| @noindent |
| to the string |
| |
| @example |
| /* first comment */ not comment /* second comment */ |
| @end example |
| |
| @noindent |
| fails, because it matches the entire string owing to the |
| greediness of the @code{.*} item. |
| |
| However, if a quantifier is followed by a question mark, it |
| ceases to be greedy, and instead matches the minimum number |
| of times possible, so the pattern @code{/\*.*?\*/} |
| does the right thing with the C comments. The meaning of the |
| various quantifiers is not otherwise changed, just the preferred |
| number of matches. Do not confuse this use of question |
| mark with its use as a quantifier in its own right. |
| Because it has two uses, it can sometimes appear doubled, as in |
| |
| @example |
| \d??\d |
| @end example |
| |
| which matches one digit by preference, but can match two if |
| that is the only way the rest of the pattern matches. |
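| |
| The difference between greedy and non-greedy repetition can be seen in |
| the following sketch. It assumes Python's @code{re} module as a stand-in |
| for the matcher in @value{SSED}, and the variable names are illustrative. |
| |
```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so one match covers the whole string.
greedy = re.findall(r"/\*.*\*/", source)

# Non-greedy: .*? stops at the first "*/", so each comment is found.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one_digit, two_digits)
```
| |
| @code{re.match} is free to stop after the first digit, while |
| @code{re.fullmatch} forces the pattern to account for both. |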
| |
| Note that greediness does not matter when specifying addresses, |
| but can be nevertheless used to improve performance. |
| |
| @ignore |
| If the PCRE_UNGREEDY option is set (an option which is not |
| available in Perl), the quantifiers are not greedy by |
| default, but individual ones can be made greedy by following |
| them with a question mark. In other words, it inverts the |
| default behaviour. |
| @end ignore |
| |
| When a parenthesized subpattern is quantified with a minimum |
| repeat count that is greater than 1 or with a limited maximum, |
| more memory is required for the compiled pattern, in |
| proportion to the size of the minimum or maximum. |
| |
| @cindex Perl-style regular expressions, single line |
| If a pattern starts with @code{.*} or @code{.@{0,@}} and the |
| @code{S} modifier is used, the pattern is implicitly anchored, |
| because whatever follows will be tried against every character |
| position in the subject string, so there is no point in |
| retrying the overall match at any position after the first. |
| PCRE treats such a pattern as though it were preceded by @code{\A}. |
| |
| When a capturing subpattern is repeated, the value captured |
| is the substring that matched the final iteration. For example, |
| after |
| |
| @example |
| (tweedle[dume]@{3@}\s*)+ |
| @end example |
| |
| @noindent |
| has matched @samp{tweedledum tweedledee} the value of the |
| captured substring is @samp{tweedledee}. However, if there are |
| nested capturing subpatterns, the corresponding captured |
| values may have been set in previous iterations. For example, |
| after |
| |
| @example |
| /(a|(b))+/ |
| @end example |
| |
| matches @samp{aba}, the value of the second captured substring is |
| @samp{b}. |
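| |
| Both capture rules can be observed in a short sketch. It assumes Python's |
| @code{re} module as a stand-in for the matcher in @value{SSED}; in these |
| two cases it behaves as described above. |
| |
```python
import re

# A repeated capturing group keeps the text of its final iteration.
twins = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = twins.group(1)

# The nested group 2 took part only in the middle iteration ("b"),
# and that value is still set after the final iteration matched "a".
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)

print(last_iteration, outer, inner)
```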
| |
| @node Backreferences |
| @appendixsec Backreferences |
| @cindex Perl-style regular expressions, backreferences |
| |
| Outside a character class, a backslash followed by a digit |
| greater than 0 (and possibly further digits) is a back |
| reference to a capturing subpattern earlier (i.e. to its |
| left) in the pattern, provided there have been that many |
| previous capturing left parentheses. |
| |
| However, if the decimal number following the backslash is |
| less than 10, it is always taken as a back reference, and |
| causes an error only if there are not that many capturing |
| left parentheses in the entire pattern. In other words, the |
| parentheses that are referenced need not be to the left of |
| the reference for numbers less than 10. @xref{Backslash}, |
| for further details of the handling of digits following a backslash. |
| |
| A back reference matches whatever actually matched the capturing |
| subpattern in the current subject string, rather than |
| anything matching the subpattern itself. So the pattern |
| |
| @example |
| (sens|respons)e and \1ibility |
| @end example |
| |
| @noindent |
| matches @samp{sense and sensibility} and @samp{response and responsibility}, |
| but not @samp{sense and responsibility}. If caseful |
| matching is in force at the time of the back reference, the |
| case of letters is relevant. For example, |
| |
| @example |
| ((?i)blah)\s+\1 |
| @end example |
| |
| @noindent |
| matches @samp{blah blah} and @samp{Blah Blah}, but not |
| @samp{BLAH blah}, even though the original capturing |
| subpattern is matched caselessly. |
| |
| There may be more than one back reference to the same subpattern. |
| Also, if a subpattern has not actually been used in a |
| particular match, any back references to it always fail. For |
| example, the pattern |
| |
| @example |
| (a|(bc))\2 |
| @end example |
| |
| @noindent |
| always fails if it starts to match @samp{a} rather than |
| @samp{bc}. Because there may be up to 99 back references, all |
| digits following the backslash are taken as part of a potential |
| back reference number; this is different from what happens |
| in @sc{posix} mode. If the pattern continues with a digit |
| character, some delimiter must be used to terminate the back |
| reference. If the @code{X} modifier option is set, this can be |
| whitespace. Otherwise an empty comment can be used, or the |
| following character can be expressed in hexadecimal or octal. |
| Note that this applies only to the LHS pattern; it is |
| not yet possible to specify more than 9 backreferences on the |
| RHS of the @code{s} command. |
| |
| A back reference that occurs inside the parentheses to which |
| it refers fails when the subpattern is first used, so, for |
| example, @code{(a\1)} never matches. However, such references |
| can be useful inside repeated subpatterns. For example, the |
| pattern |
| |
| @example |
| (a|b\1)+ |
| @end example |
| |
| @noindent |
| matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa}, |
| etc. At each iteration of the subpattern, the back reference matches |
| the character string corresponding to the previous iteration. In |
| order for this to work, the pattern must be such that the first |
| iteration does not need to match the back reference. This can be |
| done using alternation, as in the example above, or by a |
| quantifier with a minimum of zero. |
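| |
| The main back reference rules of this section can be exercised with the |
| sketch below. It assumes Python's @code{re} module as a stand-in for the |
| matcher in @value{SSED}. Two differences are worth noting: Python spells |
| the scoped caseless option as @code{(?i:...)}, and it rejects a back |
| reference placed inside the group it refers to, so that case is not shown. |
| |
```python
import re

# \1 must repeat the text that group 1 actually matched.
austen = re.compile(r"(sens|respons)e and \1ibility")
same_stem = austen.fullmatch("sense and sensibility") is not None
mixed_stem = austen.fullmatch("sense and responsibility") is not None

# The group is matched caselessly, but the back reference is compared
# with case taken into account.
blah = re.compile(r"((?i:blah))\s+\1")
same_case = blah.fullmatch("Blah Blah") is not None
other_case = blah.fullmatch("BLAH blah") is not None

# Group 2 takes no part when the first alternative matches "a",
# so the back reference \2 fails.
unset = re.compile(r"(a|(bc))\2")
after_a = unset.match("aa") is not None
after_bc = unset.match("bcbc") is not None

print(same_stem, mixed_stem, same_case, other_case, after_a, after_bc)
```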
| |
| @node Assertions |
| @appendixsec Assertions |
| @cindex Perl-style regular expressions, assertions |
| @cindex Perl-style regular expressions, asserting subpatterns |
| |
| An assertion is a test on the characters following or |
| preceding the current matching point that does not actually |
| consume any characters. The simple assertions coded as @code{\b}, |
| @code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$} |
| are described above. More complicated assertions are coded as |
| subpatterns. There are two kinds: those that look ahead of the |
| current position in the subject string, and those that look behind it. |
| |
| @cindex Perl-style regular expressions, lookahead subpatterns |
| An assertion subpattern is matched in the normal way, except |
| that it does not cause the current matching position to be |
| changed. Lookahead assertions start with @code{(?=} for positive |
| assertions and @code{(?!} for negative assertions. For example, |
| |
| @example |
| \w+(?=;) |
| @end example |
| |
| @noindent |
| matches a word followed by a semicolon, but does not include |
| the semicolon in the match, and |
| |
| @example |
| foo(?!bar) |
| @end example |
| |
| @noindent |
| matches any occurrence of @samp{foo} that is not followed by |
| @samp{bar}. |
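| |
| A short sketch of both lookahead forms follows. It assumes Python's |
| @code{re} module as a stand-in for the matcher in @value{SSED}, and the |
| subject strings are made up for illustration. |
| |
```python
import re

# \w+(?=;) : the semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "x = total;").group()

# foo(?!bar) : only the "foo" that is not followed by "bar" matches.
starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

print(word, starts)
```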
| |
| Note that the apparently similar pattern |
| |
| @example |
| (?!foo)bar |
| @end example |
| |
| @noindent |
| @cindex Perl-style regular expressions, lookbehind subpatterns |
| finds any occurrence of @samp{bar} even if it is preceded by |
| @samp{foo}, because the assertion @code{(?!foo)} is always true |
| when the next three characters are @samp{bar}. A lookbehind |
| assertion is needed to achieve this effect. |
| Lookbehind assertions start with @code{(?<=} for positive |
| assertions and @code{(?<!} for negative assertions. So, |
| |
| @example |
| (?<!foo)bar |
| @end example |
| |
| achieves the required effect of finding an occurrence of |
| @samp{bar} that is not preceded by @samp{foo}. The contents of a |
| lookbehind assertion are restricted |
| such that all the strings it matches must have a fixed |
| length. However, if there are several alternatives, they do |
| not all have to have the same fixed length. This is an extension |
| compared with Perl 5.005, which requires all branches to match |
| the same length of string. Thus |
| |
| @example |
| (?<=dogs|cats|) |
| @end example |
| |
| @noindent |
| is permitted, but the apparently equivalent regular expression |
| |
| @example |
| (?<!dogs?|cats?) |
| @end example |
| |
| @noindent |
| causes an error at compile time. Branches that match different |
| length strings are permitted only at the top level of |
| a lookbehind assertion: an assertion such as |
| |
| @example |
| (?<=ab(c|de)) |
| @end example |
| |
| @noindent |
| is not permitted, because its single top-level branch can |
| match two different lengths, but it is acceptable if rewritten |
| to use two top-level branches: |
| |
| @example |
| (?<=abc|abde) |
| @end example |
| |
| All this is required because lookbehind assertions simply |
| move the current position back by the alternative's fixed |
| width and then try to match. If there are |
| insufficient characters before the current position, the |
| match is deemed to fail. Lookbehinds, in conjunction with |
| non-backtracking subpatterns, can be particularly useful for |
| matching at the ends of strings; an example is given at the end |
| of the section on non-backtracking subpatterns. |
| |
| Several assertions (of any sort) may occur in succession. |
| For example, |
| |
| @example |
| (?<=\d@{3@})(?<!999)foo |
| @end example |
| |
| @noindent |
| matches @samp{foo} preceded by three digits that are not @samp{999}. |
| Notice that each of the assertions is applied independently |
| at the same point in the subject string. First there is a |
| check that the previous three characters are all digits, and |
| then there is a check that the same three characters are not |
| @samp{999}. This pattern does not match @samp{foo} preceded by six |
| characters, the first three of which are digits and the last three |
| of which are not @samp{999}. For example, it doesn't match |
| @samp{123abcfoo}. A pattern to do that is |
| |
| @example |
| (?<=\d@{3@}...)(?<!999)foo |
| @end example |
| |
| @noindent |
| This time the first assertion looks at the preceding six |
| characters, checking that the first three are digits, and |
| then the second assertion checks that the preceding three |
| characters are not @samp{999}. Actually, assertions can be |
| nested in any combination, so one can write this as |
| |
| @example |
| (?<=\d@{3@}(?!999)...)foo |
| @end example |
| |
| or |
| |
| @example |
| (?<=\d@{3@}...(?<!999))foo |
| @end example |
| |
| @noindent |
| both of which might be considered more readable. |
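| |
| The lookbehind examples can be checked with the sketch below, which |
| assumes Python's @code{re} module as a stand-in for the matcher in |
| @value{SSED}. Python is stricter than the rules given here: it requires |
| every alternative of a lookbehind to have the same width, so only |
| fixed-width assertions are used. |
| |
```python
import re

# (?!foo)bar looks ahead, so it cannot exclude a preceding "foo";
# (?<!foo)bar looks behind and can.
ahead = re.search(r"(?!foo)bar", "foobar") is not None
behind = re.search(r"(?<!foo)bar", "foobar") is not None

# Both assertions inspect the same three preceding characters.
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
three_results = [three_back.search(s) is not None
                 for s in ("123foo", "999foo", "123abcfoo")]

# The first assertion now inspects six characters, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")
six_results = [six_back.search(s) is not None
               for s in ("123abcfoo", "123999foo", "abcdeffoo")]

print(ahead, behind, three_results, six_results)
```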
| |
| Assertion subpatterns are not capturing subpatterns, and may |
| not be repeated, because it makes no sense to assert the |
| same thing several times. If any kind of assertion contains |
| capturing subpatterns within it, these are counted for the |
| purposes of numbering the capturing subpatterns in the whole |
| pattern. However, substring capturing is carried out only |
| for positive assertions, because it does not make sense for |
| negative assertions. |
| |
| Assertions count towards the maximum of 200 parenthesized |
| subpatterns. |
| |
| @node Non-backtracking subpatterns |
| @appendixsec Non-backtracking subpatterns |
| @cindex Perl-style regular expressions, non-backtracking subpatterns |
| |
| With both maximizing and minimizing repetition, failure of |
| what follows normally causes the repeated item to be evaluated |
| again to see if a different number of repeats allows the |
| rest of the pattern to match. Sometimes it is useful to |
| prevent this, either to change the nature of the match, or |
| to cause it to fail earlier than it otherwise might, when the |
| author of the pattern knows there is no point in carrying |
| on. |
| |
| Consider, for example, the pattern @code{\d+foo} when applied to |
| the subject line |
| |
| @example |
| 123456bar |
| @end example |
| |
| After matching all 6 digits and then failing to match @samp{foo}, |
| the normal action of the matcher is to try again with only 5 |
| digits matching the @code{\d+} item, and then with 4, and so on, |
| before ultimately failing. Non-backtracking subpatterns |
| provide the means for specifying that once a portion of the |
| pattern has matched, it is not to be re-evaluated in this way, |
| so the matcher would give up immediately on failing to match |
| @samp{foo} the first time. The notation is another kind of special |
| parenthesis, starting with @code{(?>} as in this example: |
| |
| @example |
| (?>\d+)foo |
| @end example |
| |
| This kind of parenthesis ``locks up'' the part of the pattern |
| it contains once it has matched, and a failure further into |
| the pattern is prevented from backtracking into it. |
| Backtracking past it to previous items, however, works as |
| normal. |
| |
| Non-backtracking subpatterns are not capturing subpatterns. Simple |
| cases such as the above example can be thought of as a maximizing |
| repeat that must swallow everything it can. So, |
| while both @code{\d+} and @code{\d+?} are prepared to adjust the number of |
| digits they match in order to make the rest of the pattern |
| match, @code{(?>\d+)} can only match an entire sequence of digits. |
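| |
| The ``all or nothing'' behavior can be illustrated with the sketch below. |
| It assumes Python's @code{re} module as a stand-in for the matcher in |
| @value{SSED}. Python accepts @code{(?>...)} directly only from version |
| 3.11, so the sketch uses a common emulation instead: a lookahead captures |
| the longest run of digits, and a back reference then consumes exactly |
| that run. This is a substitute technique, not the notation described here. |
| |
```python
import re

subject = "123456bar"

# Ordinary greedy repeat: \d+ gives back the final "6" so that the
# following \d can match it.
backtracking = re.fullmatch(r"\d+\dbar", subject) is not None

# Emulated non-backtracking group: the lookahead is matched once and is
# never re-entered, so \1 always consumes all six digits and the
# following \d has nothing left to match.
locked = re.fullmatch(r"(?=(\d+))\1\dbar", subject) is not None

print(backtracking, locked)
```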
| |
| This construction can of course contain arbitrarily complicated |
| subpatterns, and it can be nested. |
| |
| @cindex Perl-style regular expressions, lookbehind subpatterns |
| Non-backtracking subpatterns can be used in conjunction with look-behind |
| assertions to specify efficient matching at the end |
| of the subject string. Consider a simple pattern such as |
| |
| @example |
| abcd$ |
| @end example |
| |
| @noindent |
| when applied to a long string which does not match. Because |
| matching proceeds from left to right, @command{sed} will look for |
| each @samp{a} in the subject and then see if what follows matches |
| the rest of the pattern. If the pattern is specified as |
| |
| @example |
| ^.*abcd$ |
| @end example |
| |
| @noindent |
| the initial @code{.*} matches the entire string at first, but when |
| this fails (because there is no following @samp{a}), it backtracks |
| to match all but the last character, then all but the |
| last two characters, and so on. Once again the search for |
| @samp{a} covers the entire string, from right to left, so we are |
| no better off. However, if the pattern is written as |
| |
| @example |
| ^(?>.*)(?<=abcd) |
| @end example |
| |
| @noindent |
| there can be no backtracking for the @code{.*} item; it can match |
| only the entire string. The subsequent lookbehind assertion |
| does a single test on the last four characters. If it fails, |
| the match fails immediately. For long strings, this approach |
| makes a significant difference to the processing time. |
| |
| When a pattern contains an unlimited repeat inside a subpattern |
| that can itself be repeated an unlimited number of |
| times, the use of a once-only subpattern is the only way to |
| avoid some failing matches taking a very long time |
| indeed.@footnote{Actually, the matcher embedded in @value{SSED} |
| tries to do something for this in the simplest cases, |
| like @code{([^b]*b)*}. These cases are actually quite |
| common: they happen for example in a regular expression |
| like @code{\/\*([^*]*\*)*\/} which matches C comments.} |
| |
| The pattern |
| |
| @example |
| (\D+|<\d+>)*[!?] |
| @end example |
| |
| @noindent |
| matches an unlimited number of substrings that either consist |
| of non-digits, or digits enclosed in angle brackets, followed by |
| an exclamation or question mark. When it matches, it runs quickly. |
| However, if it is applied to |
| |
| @example |
| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
| @end example |
| |
| @noindent |
| it takes a long time before reporting failure. This is |
| because the string can be divided between the two repeats in |
| a large number of ways, and all have to be tried.@footnote{The |
| example used @code{[!?]} rather than a single character at the end, |
| because both @value{SSED} and Perl have an optimization that allows |
| for fast failure when a single character is used. They |
| remember the last single character that is required for a |
| match, and fail early if it is not present in the string.} |
| |
| If the pattern is changed to |
| |
| @example |
| ((?>\D+)|<\d+>)*[!?] |
| @end example |
| |
| sequences of non-digits cannot be broken, and failure happens |
| quickly. |
| |
| @node Conditional subpatterns |
| @appendixsec Conditional subpatterns |
| @cindex Perl-style regular expressions, conditional subpatterns |
| |
| It is possible to cause the matching process to obey a subpattern |
| conditionally or to choose between two alternative |
| subpatterns, depending on the result of an assertion, or |
| whether a previous capturing subpattern matched or not. The |
| two possible forms of conditional subpattern are |
| |
| @example |
| (?(@var{condition})@var{yes-pattern}) |
| (?(@var{condition})@var{yes-pattern}|@var{no-pattern}) |
| @end example |
| |
| If the condition is satisfied, the yes-pattern is used; otherwise |
| the no-pattern (if present) is used. If there are more than two |
| alternatives in the subpattern, a compile-time error occurs. |
| |
| There are two kinds of condition. If the text between the |
| parentheses consists of a sequence of digits, the condition |
| is satisfied if the capturing subpattern of that number has |
| previously matched. The number must be greater than zero. |
| Consider the following pattern, which contains non-significant |
| white space to make it more readable (assume the @code{X} modifier) |
| and to divide it into three parts for ease of discussion: |
| |
| @example |
| ( \( )? [^()]+ (?(1) \) ) |
| @end example |
| |
| The first part matches an optional opening parenthesis, and |
| if that character is present, sets it as the first captured |
| substring. The second part matches one or more characters |
| that are not parentheses. The third part is a conditional |
| subpattern that tests whether the first set of parentheses |
| matched or not. If they did, that is, if the subject started |
| with an opening parenthesis, the condition is true, and so |
| the yes-pattern is executed and a closing parenthesis is |
| required. Otherwise, since no-pattern is not present, the |
| subpattern matches nothing. In other words, this pattern |
| matches a sequence of non-parentheses, optionally enclosed |
| in parentheses. |
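| |
| The three-part pattern can be tried with the sketch below, which assumes |
| Python's @code{re} module as a stand-in for the matcher in @value{SSED}. |
| Python supports only the numbered form of the condition, and its |
| @code{re.VERBOSE} flag is used here in place of the @code{X} modifier. |
| |
```python
import re

# White space in the pattern is ignored under re.VERBOSE.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# A closing parenthesis is required exactly when an opening one was matched.
verdicts = {s: optional_parens.fullmatch(s) is not None
            for s in ("abc", "(abc)", "(abc", "abc)")}

print(verdicts)
```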
| |
| @cindex Perl-style regular expressions, lookahead subpatterns |
| If the condition is not a sequence of digits, it must be an |
| assertion. This may be a positive or negative lookahead or |
| lookbehind assertion. Consider this pattern, again containing |
| non-significant white space, and with the two alternatives |
| on the second and third lines: |
| |
| @example |
| (?(?=...[a-z]) |
| \d\d-[a-z]@{3@}-\d\d | |
| \d\d-\d\d-\d\d ) |
| @end example |
| |
| The condition is a positive lookahead assertion that matches |
| a letter that is three characters away from the current point. |
| If a letter is found, the subject is matched against the first |
| alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are |
| letters and @var{dd} are digits); otherwise it is matched against |
| the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}. |
| |
| |
| @node Recursive patterns |
| @appendixsec Recursive patterns |
| @cindex Perl-style regular expressions, recursive patterns |
| @cindex Perl-style regular expressions, recursion |
| |
| Consider the problem of matching a string in parentheses, |
| allowing for unlimited nested parentheses. Without the use |
| of recursion, the best that can be done is to use a pattern |
| that matches up to some fixed depth of nesting. It is not |
| possible to handle an arbitrary nesting depth. Perl 5.6 has |
| provided an experimental facility that allows regular |
| expressions to recurse (amongst other things). It does this |
| by interpolating Perl code in the expression at run time, |
| and the code can refer to the expression itself. A Perl pattern |
| to solve the parentheses problem can be created like |
| this: |
| |
| @example |
| $re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x; |
| @end example |
| |
| The @code{(?p@{...@})} item interpolates Perl code at run time, |
| and in this case refers recursively to the pattern in which it |
| appears. Obviously, @command{sed} cannot support the interpolation of |
| Perl code. Instead, the special item @code{(?R)} is provided for |
| the specific case of recursion. This pattern solves the |
| parentheses problem (assume the @code{X} modifier option is used |
| so that white space is ignored): |
| |
| @example |
| \( ( (?>[^()]+) | (?R) )* \) |
| @end example |
| |
| First it matches an opening parenthesis. Then it matches any |
| number of substrings which can either be a sequence of |
| non-parentheses, or a recursive match of the pattern itself |
| (i.e. a correctly parenthesized substring). Finally there is |
| a closing parenthesis. |
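| |
| Python's @code{re} module has no @code{(?R)} item, so the sketch below |
| does not use a regular expression at all. It is a hand-written recursive |
| function that follows the same three steps as the pattern; the function |
| name and its return convention are assumptions made for this illustration. |
| |
```python
def match_balanced(text, start=0):
    """Return the index just past the balanced group that opens at
    text[start], or -1 if there is no such group."""
    # Step 1: an opening parenthesis is required.
    if start >= len(text) or text[start] != "(":
        return -1
    pos = start + 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            # Step 3: the closing parenthesis ends this level.
            return pos + 1
        if ch == "(":
            # Step 2b: a nested group is matched by recursion.
            pos = match_balanced(text, pos)
            if pos == -1:
                return -1
        else:
            # Step 2a: any other character is a non-parenthesis.
            pos += 1
    return -1  # the text ended before the group was closed


print(match_balanced("(ab(cd)ef)"), match_balanced("(aaa()"))
```
| |
| Because the function never reconsiders a character it has already |
| accepted, it fails at once on an unbalanced subject. |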
| |
| This particular example pattern contains nested unlimited |
| repeats, and so the use of a non-backtracking subpattern for |
| matching strings of non-parentheses is important when applying |
| the pattern to strings that do not match. For example, when |
| it is applied to |
| |
| @example |
| (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| @end example |
| |
| it yields a ``no match'' response quickly. However, if a |
| non-backtracking subpattern is not used, the match runs |
| for a very long time indeed because there are so many different |
| ways the @code{+} and @code{*} repeats can carve up the subject, |
| and all have to be tested before failure can be reported. |
| |
| The values set for any capturing subpatterns are those from |
| the outermost level of the recursion at which the subpattern |
| value is set. If the pattern above is matched against |
| |
| @example |
| (ab(cd)ef) |
| @end example |
| |
| @noindent |
| the value for the capturing parentheses is @samp{ef}, which is |
| the last value taken on at the top level. |
| |
| @node Comments |
| @appendixsec Comments |
| @cindex Perl-style regular expressions, comments |
| |
| The sequence @code{(?#} marks the start of a comment which continues |
| up to the next closing parenthesis. Nested parentheses |
| are not permitted. The characters that make up a comment |
| play no part in the pattern matching at all. |
| |
| @cindex Perl-style regular expressions, extended |
| If the @code{X} modifier option is used, an unescaped @code{#} character |
| outside a character class introduces a comment that continues |
| up to the next newline character in the pattern. |
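| |
| Both comment styles are shown in the sketch below, which assumes Python's |
| @code{re} module as a stand-in for the matcher in @value{SSED}; its |
| @code{re.VERBOSE} flag is used in place of the @code{X} modifier. |
| |
```python
import re

# Inline comment: everything from (?# up to the next ")" is ignored.
inline = re.fullmatch(r"\d+(?# run of digits)foo", "123foo") is not None

# Under re.VERBOSE an unescaped # outside a character class starts a
# comment that runs to the end of the line.
verbose = re.compile(r"""
    \d+     # run of digits
    [#]     # a literal hash, protected by the character class
    foo     # literal text
""", re.VERBOSE)
hash_ok = verbose.fullmatch("123#foo") is not None

print(inline, hash_ok)
```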
| @end ifset |
| |
| |
| @page |
| @node Concept Index |
| @unnumbered Concept Index |
| |
| This is a general index of all issues discussed in this manual, with the |
| exception of the @command{sed} commands and command-line options. |
| |
| @printindex cp |
| |
| @page |
| @node Command and Option Index |
| @unnumbered Command and Option Index |
| |
| This is an alphabetical list of all @command{sed} commands and command-line |
| options. |
| |
| @printindex fn |
| |
| @contents |
| @bye |
| |
| @c XXX FIXME: the term "cycle" is never defined... |