Part of the Sus Filter Tools: Scans a text for keywords and outputs a sus file.
text2sus [<option>] [<label>|<label>=<regexp>|<regexp>] ...
@ FILE  (some) command-line options are read from
FILE (see section ).
There are three ways to define a label:
"label:label[:= \t]*(\w+)", and
A regular expression (or REGEXP) specifies a set of strings that matches it.
Regular expressions can be concatenated to form new regular
expressions; if A and B are both regular expressions, then AB is
also a regular expression. If a string *p* matches A and another
string *q* matches B, the string *pq* will match AB.
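The concatenation rule can be checked with a small sketch in Python, whose `re` module uses the syntax described here. The patterns `A` and `B` and the strings `p` and `q` are made-up illustrations, not anything text2sus defines.

```python
import re

# Hypothetical example patterns; any two regular expressions would do.
A = r"[0-9]+"   # one or more digits
B = r"[a-z]+"   # one or more lowercase letters

p, q = "42", "kg"

# p matches A and q matches B ...
p_matches_a = re.fullmatch(A, p) is not None
q_matches_b = re.fullmatch(B, q) is not None

# ... so the concatenated string pq matches the concatenated pattern AB.
AB = A + B
concat = re.fullmatch(AB, p + q)
```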
A brief explanation of part of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO, accessible from http://www.python.org/doc/howto/.
Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like
`A', `a', or `0', are the
simplest regular expressions; they simply match themselves. You can
concatenate ordinary characters, so
"last" matches the string
`last'. (In the rest of this section, we'll write REGEXPs in
"this special style", usually without quotes, and strings to be
matched `in single quotes'.)
Some characters, like
`|' or `(', are special. Special
characters either stand for classes of ordinary characters, or affect how the
regular expressions around them are interpreted.
The special characters are:
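`^' (Caret.) Matches the start of the string and immediately after each newline.
`$' Matches the end of the string. "foo" matches both `foo' and `foobar', while the regular expression "foo$" matches only `foo'.
`*' Causes the resulting REGEXP to match 0 or more repetitions of the preceding REGEXP, as many repetitions as are possible. "ab*" will match `a', `ab', or `a' followed by any number of `b's.
`+' Causes the resulting REGEXP to match 1 or more repetitions of the preceding REGEXP. "ab+" will match `a' followed by any non-zero number of `b's; it will not match just `a'.
`?' Causes the resulting REGEXP to match 0 or 1 repetitions of the preceding REGEXP. "ab?" will match either `a' or `ab'.
`*?', `+?', `??' The `*', `+', and `?' qualifiers are all "greedy"; they match as much text as possible. Sometimes this behaviour isn't desired; if the REGEXP "<.*>" is matched against `<H1>title</H1>', it will match the entire string, and not just `<H1>'. Adding `?' after the qualifier makes it perform the match in "non-greedy" or "minimal" fashion; as few characters as possible will be matched. Using ".*?" in the previous expression will match only `<H1>'.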
`\' Either escapes special characters (permitting you to match characters like `*', `?', and so forth), or signals a special sequence; special sequences are discussed below.
If you are not using a command-line file (see section ), note
that most shells also use the backslash as an escape character on
the command line; therefore you have to put the regular expression
in single quotes (') to prevent interpretation by the shell.
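`[]' Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a `-'. Special characters are not active inside sets. For example, "[!akm]" will match any of the characters `!', `a', `k', or `m'; "[a-z]" will match any lowercase letter, and "[a-zA-Z0-9]" matches any letter or digit. Character classes such as `\w' or `\S' (defined below) are also acceptable inside a range. If you want to include a `]' or a `-' inside a set, precede it with a backslash, or place it as the first character. The pattern "[]]" will match `]', for example.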
You can match the characters not within a range by "complementing"
the set. This is indicated by including a
`^' as the first
character of the set;
`^' elsewhere will simply match the `^'
character. For example,
"[^5]" will match any character except `5'.
`|' "A|B", where A and B can be arbitrary REGEXPs, creates a regular expression that will match either A or B. This can be used inside groups (see below) as well. To match a literal `|', use "\|", or enclose it inside a character class, as in "[|]".
`(...)' Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group. To match the literals `(' or `)', use "\(" or "\)", or enclose them inside a character class: "[(]", "[)]".
`(?...)' This is an extension notation (a `?' following a `(' is not meaningful otherwise). The first character after the `?' determines what the meaning and further syntax of the construct is. Extensions usually do not create a new group; "(?P<NAME>...)" is the only exception to this rule. Following are some of the currently supported extensions.
`(?=...)' Matches if "..." matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, "Isaac (?=Asimov)" will match `Isaac ' only if it's followed by `Asimov'.
`(?!...)' Matches if "..." doesn't match next. This is a negative lookahead assertion. For example, "Isaac (?!Asimov)" will match `Isaac ' only if it's not followed by `Asimov'.
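The behaviour of the special characters above can be tried out with the following sketch. It uses Python's `re` module on the assumption that text2sus follows the same semantics, as the reference to the Regular Expression HOWTO suggests; the sample strings other than those quoted above are invented for illustration, and `re.MULTILINE` is passed only to mirror the description of `^' given here.

```python
import re

# Greedy versus non-greedy qualifiers.
text = "<H1>title</H1>"
greedy = re.match(r"<.*>", text).group(0)    # matches the entire string
minimal = re.match(r"<.*?>", text).group(0)  # stops at the first '>'

# Repetition qualifiers: '*' allows zero 'b's, '+' needs at least one.
star = re.fullmatch(r"ab*", "a") is not None
plus = re.fullmatch(r"ab+", "a") is not None
opt = [re.fullmatch(r"ab?", s) is not None for s in ("a", "ab", "abb")]

# Caret matching at the start of the string and after each newline.
line_starts = re.findall(r"^\w+", "one two\nthree", re.MULTILINE)

# Sets and complemented sets.
in_set = re.findall(r"[!akm]", "make!")
not_five = re.findall(r"[^5]", "1525")

# Alternation.
either = re.findall(r"cat|dog", "cat, dog, bird")

# Lookahead assertions: the assertion itself is not part of the match.
ahead = re.match(r"Isaac (?=Asimov)", "Isaac Asimov").group(0)
neg_fail = re.match(r"Isaac (?!Asimov)", "Isaac Asimov")
neg_ok = re.match(r"Isaac (?!Asimov)", "Isaac Newton").group(0)
```

Raw string literals (`r"..."`) play the same role in Python that single quotes play on the shell command line: they keep the backslashes intact until the regular expression engine sees them.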
The special sequences consist of
`\' and a character from the list
below. If the ordinary character is not on the list, then the
resulting REGEXP will match the second character. For example,
"\$" matches the character `$'.
`\A' Matches only at the start of the string.
`\b' Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.
`\B' Matches the empty string, but only when it is not at the beginning or end of a word.
`\d' Matches any decimal digit; this is equivalent to the set "[0-9]".
`\D' Matches any non-digit character; this is equivalent to the set "[^0-9]".
`\s' Matches any whitespace character; this is equivalent to the set "[ \t\n\r\f\v]".
`\S' Matches any non-whitespace character; this is equivalent to the set "[^ \t\n\r\f\v]".
`\w' This is equivalent to the set
"[a-zA-Z0-9_]", the alphanumeric characters.
`\W' This is equivalent to the set
"[^a-zA-Z0-9_]", the non-alphanumeric characters.
`\Z' Matches only at the end of the string.
`\\' Matches a literal backslash.
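As an illustration of the special sequences, here is a short Python sketch, again assuming that text2sus behaves like Python's `re` module. The sample strings are invented. The `re.ASCII` flag is used so that `\d' and `\w' correspond exactly to the ASCII sets listed above; without it, Python 3 also accepts other Unicode digits and letters.

```python
import re

sample = "Order 66: ship_to=Mars 2 kg"

# \d and \w with ASCII semantics, matching the sets given in the list.
digits = re.findall(r"\d+", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)   # '_' counts as a word character

# \b only matches at word boundaries, so 'to' inside 'tomato' is skipped.
whole_word = re.findall(r"\bto\b", "go to tomato")

# \s matches any whitespace character, including tabs and newlines.
fields = re.split(r"\s+", "a \t b\nc")

# \A and \Z are tied to the string as a whole, not to individual lines.
at_start = re.search(r"\Aone", "one\ntwo") is not None
not_at_start = re.search(r"\Atwo", "one\ntwo", re.MULTILINE) is None
at_end = re.search(r"two\Z", "one\ntwo") is not None

# \\ matches a single literal backslash.
backslashes = re.findall(r"\\", "a\\b")
```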
A sus file containing data from the text.
See Section .