Tuesday, July 3, 2018

perl regexp


Here are some special RE characters and their meanings:
.    # Any single character except a newline
^    # The beginning of the line or string
$    # The end of the line or string
*    # Zero or more of the preceding character
+    # One or more of the preceding character
?    # Zero or one of the preceding character
And here are some example matches. Remember that a regular expression must be enclosed in /.../ slashes to be used.
t.e  # t followed by any single character followed by e
     # This will match the
     #                 tre
     #                 tle
     # but not te
     #         tale
^f   # f at the beginning of a line
^ftp # ftp at the beginning of a line
e$   # e at the end of a line
tle$ # tle at the end of a line
und* # un followed by zero or more d characters
     # This will match un
     #                 und
     #                 undd
     #                 unddd (etc)
.*   # Any string without a newline. This is because
     # the . matches anything except a newline and
     # the * means zero or more of these.
^$   # A line with nothing in it.
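A minimal sketch of how a few of the patterns above behave when tried against sample strings with the =~ match operator. The helper name matches and the sample strings are illustrative only and do not come from the original notes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $pattern, else 0.
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

# t.e needs exactly one character between the t and the e.
for my $word (qw(the tre tle te tale)) {
    printf "%-5s %s\n", $word, matches('t.e', $word) ? 'matches' : 'does not match';
}

# und* is "un" followed by zero or more d characters.
print "und* matches un\n"   if matches('und*', 'un');
print "und* matches undd\n" if matches('und*', 'undd');

# ^$ only matches when there is nothing between the start and the end.
print "^\$ matches the empty string\n" if matches('^$', '');
```

Note that the patterns are not anchored unless they contain ^ or $, so t.e would also match inside a longer word such as "other".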
There are even more options. Square brackets are used to match any one of the characters inside them. Inside square brackets a - indicates "between" and a ^ at the beginning means "not":
[qjk]      # Either q or j or k
[^qjk]     # Neither q nor j nor k
[a-z]      # Anything from a to z inclusive
[^a-z]     # Any character that is not a lower case letter
[a-zA-Z]   # Any letter
[a-z]+     # Any non-empty sequence of lower case letters
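As a rough illustration, the sketch below collects everything each bracket expression matches in one sample string. The sample string is made up, and the /g modifier (find all matches rather than just the first) is a standard Perl feature that these notes do not otherwise cover.

```perl
use strict;
use warnings;

# Illustrative sample string, not from the original notes.
my $text = "Quick jab, 42 kiwis!";

# [qjk] matches any single q, j or k. It is case-sensitive,
# so the capital Q is not matched.
my @qjk = $text =~ /[qjk]/g;

# [a-z]+ matches each run of lower case letters.
my @lower_runs = $text =~ /[a-z]+/g;

# [^a-zA-Z] matches any single character that is not a letter.
my @non_letters = $text =~ /[^a-zA-Z]/g;

print "qjk: @qjk\n";
print "lower case runs: @lower_runs\n";
print "non-letter count: ", scalar(@non_letters), "\n";
```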
At this point you can probably skip to the end and do at least most of the exercise. The rest is mostly just for reference.
A vertical bar | represents an "or" and parentheses (...) can be used to group things together:
jelly|cream  # Either jelly or cream
(eg|le)gs    # Either eggs or legs
(da)+        # Either da or dada or dadada or...
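The effect of the parentheses is easiest to see by comparing a grouped and an ungrouped pattern. This is only a sketch: the word lists are invented, and grep (a Perl built-in that keeps the list items for which the match succeeds) is used for brevity.

```perl
use strict;
use warnings;

# ^ and $ are added so that the whole word has to match.
my @eggs_or_legs = grep { /^(eg|le)gs$/ } qw(eggs legs begs egg);
my @da_repeats   = grep { /^(da)+$/ }     qw(da dada dad d);

# Without parentheses the | splits the entire pattern, so this
# means "eg at the start" or "legs at the end".
my @ungrouped    = grep { /^eg|legs$/ }   qw(eggs legs begs egg);

print "grouped:   @eggs_or_legs\n";
print "repeated:  @da_repeats\n";
print "ungrouped: @ungrouped\n";
```

The ungrouped pattern accepts "egg" because the first alternative only asks for eg at the start of the string.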
Here are some more special characters:
\n         # A newline
\t         # A tab
\w         # Any alphanumeric (word) character.
           # The same as [a-zA-Z0-9_]
\W         # Any non-word character.
           # The same as [^a-zA-Z0-9_]
\d         # Any digit. The same as [0-9]
\D         # Any non-digit. The same as [^0-9]
\s         # Any whitespace character: space,
           # tab, newline, etc
\S         # Any non-whitespace character
\b         # A word boundary, outside [] only
\B         # No word boundary
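A short sketch of these character classes at work on one made-up line. The sample line and variable names are illustrative; split and the /g modifier are standard Perl but are not introduced in these notes.

```perl
use strict;
use warnings;

# Illustrative sample line containing a tab, not from the original notes.
my $line = "Bathe the cat at 10:45\tthen rest";

my @numbers = $line =~ /\d+/g;       # runs of digits
my @words   = $line =~ /\w+/g;       # runs of word characters
my @fields  = split /\s+/, $line;    # split on runs of whitespace

# Without \b, "the" is also found inside "Bathe" and "then".
my @the_anywhere = $line =~ /the/g;
my @the_word     = $line =~ /\bthe\b/g;

print "numbers: @numbers\n";
print "word count: ", scalar(@words), "\n";
print "field count: ", scalar(@fields), "\n";
print "the anywhere: ", scalar(@the_anywhere), "\n";
print "the as a word: ", scalar(@the_word), "\n";
```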
Clearly characters like $, |, [, ), \, / and so on are peculiar cases in regular expressions. If you want to match one of those literally then you have to precede it with a backslash. So:
\|         # Vertical bar
\[         # An open square bracket
\)         # A closing parenthesis
\*         # An asterisk
\^         # A caret symbol
\/         # A slash
\\         # A backslash
and so on.
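The sketch below shows escaped characters being matched literally. The sample strings are made up. It also uses quotemeta, a Perl built-in not mentioned in these notes, which adds the backslashes for you when the text to match is held in a variable.

```perl
use strict;
use warnings;

# Illustrative sample strings, not from the original notes.
my $path = "usr/local/bin";
my $sum  = "2*3 (approx)";

my @slashes    = $path =~ /\//g;                   # each literal slash
my $has_star   = $sum =~ /\*/         ? 1 : 0;     # a literal asterisk
my $has_parens = $sum =~ /\(approx\)/ ? 1 : 0;     # literal parentheses

# Unescaped, a|b means "a or b"; escaped, it means the three characters a|b.
my $unescaped = 'a|b';
my $literal   = quotemeta($unescaped);             # the string a\|b
my $loose     = "ab" =~ /$unescaped/ ? 1 : 0;      # matches the a
my $strict    = "ab" =~ /$literal/   ? 1 : 0;      # no literal a|b present

print "slashes: ", scalar(@slashes), "\n";
print "escaped pattern: $literal\n";
print "unescaped matches ab: $loose, escaped matches ab: $strict\n";
```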

Exercise:
Previously your program counted non-empty lines. Alter it so that instead of counting non-empty lines it counts only lines with
  • the letter x
  • the string the
  • the string the which may or may not have a capital t
  • the word the with or without a capital. Use \b to detect word boundaries.
In each case the program should print out every line, but it should only number those specified. Try to use the $_ variable to avoid using the =~ match operator explicitly.
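As a hint rather than a solution, here is a sketch of the $_ idiom. It assumes the lines come from an in-memory list instead of a file, and the pattern /\d/ (lines containing a digit) is only a placeholder; substitute the pattern each case asks for.

```perl
use strict;
use warnings;

# Stand-in input; the original program presumably reads its lines from a file.
my @lines = ("no digits here\n", "route 66\n", "\n", "4 candles\n");

my $count  = 0;
my $output = '';
for (@lines) {                 # each line is placed in $_ in turn
    if (/\d/) {                # shorthand for: if ($_ =~ /\d/)
        $count++;
        $output .= "$count $_";    # number the lines that match
    } else {
        $output .= "  $_";         # print the rest without a number
    }
}
print $output;
```

A bare /.../ with no =~ is matched against $_, which is why the loop needs no explicit match operator.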
