Part of the Sus Filter Tools: Scans a text for keywords and outputs a sus file.
text2sus [<option>] [<label>|<label>=<regexp>|<regexp>] ...
'-'
)
'-'
)
--next=\n
~/.labrc
and ./labrc
@FILE or @ FILE
(Some) command-line options are read from FILE (see section ).
There are three ways to define a label:
"label", "label=label", "label:label[:= \t]*(\w+)", and
"label[:= \t]*(?P<label>\w+)"
"label[:= ]*(?P<label>[+-]?+?([eE][+-]?+)?|+)"
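The named-group form above can be tried out directly with Python's `re` module. The following is only a sketch: the label name "temperature" and the sample text are made up for illustration, and it does not claim to reproduce how text2sus itself processes a label.

```python
import re

# Hypothetical label "temperature", following the documented form
# label[:= \t]*(?P<label>\w+); the name and the text are illustrative only.
pattern = re.compile(r"temperature[:= \t]*(?P<temperature>\w+)")

text = "run 7\ntemperature = 300\npressure: 12\n"

match = pattern.search(text)
# The named group holds the value that follows the keyword.
value = match.group("temperature") if match else None
print(value)
```

The separator class `[:= \t]*` absorbs any mix of colons, equals signs, blanks and tabs between the keyword and its value.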
A regular expression (or REGEXP) specifies a set of strings that matches it.
Regular expressions can be concatenated to form new regular
expressions; if A and B are both regular expressions, then AB is also a
regular expression. If a string p matches A and another string q
matches B, the string pq will match AB.
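The concatenation rule can be checked with a small Python sketch. The two expressions A and B below are arbitrary examples, and wrapping each part in a non-capturing group is a precaution chosen here (so that an alternation inside A or B keeps its meaning), not something the text requires.

```python
import re

a = r"ab"      # regular expression A (arbitrary example)
b = r"[0-9]"   # regular expression B (arbitrary example)

# Concatenation AB; the (?:...) wrappers keep A and B intact as units.
ab = "(?:%s)(?:%s)" % (a, b)

p, q = "ab", "7"
p_matches_a = re.fullmatch(a, p) is not None
q_matches_b = re.fullmatch(b, q) is not None

# If p matches A and q matches B, then pq matches AB.
pq_matches_ab = re.fullmatch(ab, p + q) is not None
print(p_matches_a, q_matches_b, pq_matches_ab)
```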
A brief explanation of part of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO, accessible from http://www.python.org/doc/howto/.
Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like `A', `a', or `0', are the simplest
regular expressions; they simply match themselves. You can concatenate
ordinary characters, so "last" matches the string `last'. (In the rest
of this section, we'll write REGEXPs in "this special style", usually
without quotes, and strings to be matched `in single quotes'.)
Some characters, like `|' or `(', are special. Special characters
either stand for classes of ordinary characters, or affect how the
regular expressions around them are interpreted.
The special characters are:
`^'
(Caret.) Matches the start of the string and immediately after each
newline.
`$'
"foo" matches both 'foo' and 'foobar', while the regular expression
"foo$" matches only 'foo'.
`*'
"ab*" will match 'a', 'ab', or 'a' followed by any number of 'b's.
`+'
"ab+" will match 'a' followed by any non-zero number of 'b's; it will
not match just 'a'.
`?'
"ab?" will match either 'a' or 'ab'.
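The three repetition examples can be compared side by side in Python. This is a minimal sketch: the candidate strings and the helper name `accepted` are chosen here for illustration, and `re.fullmatch` is used so that the whole string has to match.

```python
import re

candidates = ["a", "ab", "abbb", "b"]

def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = accepted(r"ab*")   # 'a' followed by zero or more 'b's
plus = accepted(r"ab+")   # 'a' followed by one or more 'b's
opt = accepted(r"ab?")    # 'a' followed by at most one 'b'

print(star, plus, opt)
```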
`*?', `+?', `??'
The `*', `+', and `?' qualifiers are all "greedy"; they match as much
text as possible. Sometimes this behaviour isn't desired; if the REGEXP
"<.*>" is matched against `<H1>title</H1>', it will match the entire
string, and not just `<H1>'. Adding `?' after the qualifier makes it
perform the match in "non-greedy" or "minimal" fashion; as few
characters as possible will be matched. Using ".*?" in the previous
expression will match only `<H1>'.
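The difference between the greedy and the minimal form can be reproduced with the example string from the text. The variable names are illustrative.

```python
import re

html = "<H1>title</H1>"

# Greedy: ".*" runs to the last '>' in the string.
greedy = re.match(r"<.*>", html).group(0)

# Non-greedy: ".*?" stops at the first '>' that lets the match succeed.
minimal = re.match(r"<.*?>", html).group(0)

print(greedy)
print(minimal)
```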
`\'
Either escapes special characters (permitting you to match characters
like `*', `?', and so forth), or signals a special sequence; special
sequences are discussed below.
If you are not using a command-line file (see section ), remember that
most shells also use the backslash as an escape character on the
command line; therefore you have to put the regular expression in
single quotes (') to prevent interpretation by the shell.
`[]'
Used to indicate a set of characters. Characters can be listed
individually, or a range of characters can be indicated by giving two
characters and separating them by a `-'. Special characters are not
active inside sets. For example, "[!akm]" will match any of the
characters `a', `k', `m', or `!'; "[a-z]" will match any lowercase
letter, and "[a-zA-Z0-9]" matches any letter or digit. Character
classes such as `\w' or `\S' (defined below) are also acceptable inside
a set. If you want to include a `]' or a `-' inside a set, precede it
with a backslash, or place it as the first character. The pattern "[]]"
will match `]', for example.
You can match the characters not within a range by "complementing" the
set. This is indicated by including a `^' as the first character of the
set; `^' elsewhere will simply match the `^' character. For example,
"[^5]" will match any character except `5'.
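The set examples above can be verified with a short Python sketch. The test strings are made up for illustration; the patterns are the ones quoted in the text.

```python
import re

# "[!akm]" matches exactly the characters a, k, m and !.
in_set = [c for c in "akm!z5" if re.fullmatch(r"[!akm]", c)]

# "[a-z]" matches any lowercase letter.
lower = re.findall(r"[a-z]", "aB3z")

# "[^5]" is a complemented set: any character except 5.
not_five = re.findall(r"[^5]", "1525")

# A ']' placed first inside the set is taken literally.
bracket = re.fullmatch(r"[]]", "]") is not None

print(in_set, lower, not_five, bracket)
```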
`|'
`A|B', where A and B can be arbitrary REGEXPs, creates a regular
expression that will match either A or B. This can be used inside
groups (see below) as well. To match a literal `|', use "\|", or
enclose it inside a character class, as in "[|]".
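A brief Python sketch of alternation and of the two ways to match a literal `|'. The words and sample strings are arbitrary.

```python
import re

# Alternation: either "cat" or "dog".
found = re.findall(r"cat|dog", "cat, dog, bird")

# A literal '|' written with a backslash or inside a character class.
escaped_bar = re.findall(r"\|", "a|b|c")
class_bar = re.findall(r"[|]", "a|b|c")

print(found, escaped_bar, class_bar)
```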
`(...)'
To match the literals `(' or `)', use "\(" or "\)", or enclose them
inside a character class: "[(] [)]".
`(?...)'
This is an extension notation (a `?' following a `(' is not meaningful
otherwise). The first character after the `?' determines what the
meaning and further syntax of the construct is. Extensions usually do
not create a new group; "(?P<NAME>...)" is the only exception to this
rule. Following are some of the currently supported extensions.
`(?=...)'
Matches if "..." matches next, but doesn't consume any of the string.
This is called a lookahead assertion. For example, "Isaac (?=Asimov)"
will match `Isaac ' only if it's followed by `Asimov'.
`(?!...)'
Matches if "..." doesn't match next. This is a negative lookahead
assertion. For example, "Isaac (?!Asimov)" will match `Isaac ' only if
it's not followed by `Asimov'.
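Both lookahead assertions can be tried with the examples from the text. In this sketch the second name, "Newton", is an arbitrary counterexample added for illustration.

```python
import re

# Positive lookahead: "Asimov" must follow but is not consumed.
pos_hit = re.match(r"Isaac (?=Asimov)", "Isaac Asimov")
pos_miss = re.match(r"Isaac (?=Asimov)", "Isaac Newton")

# Negative lookahead: "Asimov" must not follow.
neg_hit = re.match(r"Isaac (?!Asimov)", "Isaac Newton")
neg_miss = re.match(r"Isaac (?!Asimov)", "Isaac Asimov")

# Only 'Isaac ' is part of the matched text in either case.
print(repr(pos_hit.group(0)), pos_miss, repr(neg_hit.group(0)), neg_miss)
```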
The special sequences consist of `\' and a character from the list
below. If the ordinary character is not on the list, then the resulting
REGEXP will match the second character. For example, "\$" matches the
character `$'.
`\A'
Matches only at the start of the string.
`\b'
Matches the empty string, but only at the beginning or end of a word. A
word is defined as a sequence of alphanumeric characters, so the end of
a word is indicated by whitespace or a non-alphanumeric character.
`\B'
Matches the empty string, but only when it is not at the beginning or
end of a word.
`\d'
Matches any decimal digit; this is equivalent to the set "[0-9]".
`\D'
Matches any non-digit character; this is equivalent to the set
"[^0-9]".
`\s'
Matches any whitespace character; this is equivalent to the set
"[ \t\n\r\f\v]".
`\S'
Matches any non-whitespace character; this is equivalent to the set
"[^ \t\n\r\f\v]".
`\w'
This is equivalent to the set "[a-zA-Z0-9_]", the alphanumeric
characters.
`\W'
This is equivalent to the set "[^a-zA-Z0-9_]", the non-alphanumeric
characters.
`\Z'
Matches only at the end of the string.
`\\'
Matches a literal backslash.
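A few of the special sequences listed above can be exercised with Python's `re` module. This is a sketch on made-up sample strings; the `re.ASCII` flag is an assumption added here so that `\d', `\w' and `\s' correspond to the ASCII sets given in the list when run under Python 3.

```python
import re

sample = "run_7 took 42 ms"

digits = re.findall(r"\d+", sample, re.ASCII)   # runs of [0-9]
words = re.findall(r"\w+", sample, re.ASCII)    # runs of [a-zA-Z0-9_]
spaces = re.findall(r"\s", sample, re.ASCII)    # single whitespace characters

# \b restricts the match to whole words.
whole_words = re.findall(r"\bto\b", "to took onto to")

# \A matches only at the very start of the string, not after a newline.
at_start = re.findall(r"\Afoo", "foo\nfoo", re.MULTILINE)

# \\ matches one literal backslash.
backslashes = re.findall(r"\\", "a\\b")

print(digits, words, len(spaces), whole_words, at_start, backslashes)
```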
A sus file containing data from the text.