Stanford CS Education #108
http://cslibrary.stanford.edu/108/
by Nick Parlante
copyright (c) 2000-2001
Revised 4/2001
This is document #108 in the Stanford CS Education Library -- see http://cslibrary.stanford.edu/108/ for this and other free educational CS materials. Permission is given for this document to be used, reproduced, or sold so long as this paragraph and the copyright are clearly reproduced.
unix% perl myprog.pl
The interpreter makes one pass of the file to analyze it and if there are no syntax or other errors, the interpreter runs the Perl code. There is no "main" function -- the interpreter just executes the statements in the file starting at the top.
Following the Unix convention, the very first line in a Perl file usually looks like...
#!/usr/bin/perl -w
This special line is a hint to Unix to use the Perl interpreter to execute the Perl code. The "-w" switch turns on warnings, which is generally a good idea. Use "chmod" to set the execute bit on a Perl file so it can be run right from the prompt without having to call the interpreter...
unix% chmod u+x foo.pl ## set the "execute" bit for the file once
unix%
unix% foo.pl ## automatically uses the perl interpreter to "run" this file
The second line in a Perl file is usually a "require" declaration that specifies what version of Perl the program expects...
#!/usr/bin/perl -w
require 5.004;
Other OS's have their own schemes to associate the Perl file with the perl interpreter. Matthias Neeracher makes a nice port of Perl to the Mac called MacPerl, and similarly Perl has been ported to just about every OS imaginable -- see the "ports" section of http://www.cpan.org/.
$x = 1; ## scalar var $x set to the number 1
$greeting = "hello"; ## scalar var $greeting set to the string "hello"
A variable that has not been given a value has the special value "undef" which can be detected using the "defined" operator. Undef looks like 0 when used as a number, or the empty string "" when used as a string.
$x = $x + 1 + $binky; ## $binky is effectively 0 here
if (!defined($binky)) {
  print "the variable 'binky' has not been given a value!\n";
}
$fname = "binky.txt";
$a = "Could not open the file $fname."; ## $fname evaluated and pasted in -- neato!
$b = 'Could not open the file $fname.'; ## single quotes (') do no special evaluation
## $a is now "Could not open the file binky.txt."
## $b is now "Could not open the file $fname."
The characters '$' and '@' are used to trigger interpolation into strings, so they need to be escaped if you want them as ordinary characters: "nick\@stanford.edu gets \$1".
The dot operator (.) concatenates two strings. If Perl has a number or other type when it wants a string, it just silently converts the value to string form and continues. It works the other way too -- a string such as "42" will evaluate to the integer 42 in an integer context.
$num = 42;
$string = "The " . $num . " ultimate" . " answer";
## $string is now "The 42 ultimate answer"
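The conversion in the other direction can be sketched the same way. The variable names below are made up for illustration, and "my" (which declares a local variable) is covered later along with functions.

```perl
my $text  = "42";           ## a string that happens to look like a number
my $total = $text + 1;      ## numeric context: "42" is treated as 42
my $label = $total . "";    ## string context: 43 is converted to "43"
my $msg   = "Total: " . $total;

print "$msg\n";             ## prints "Total: 43"
```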
The operators eq (equal) and ne (not equal) compare two strings. Do not use == to compare strings; use == to compare numbers.
$string = "hello";
($string eq ("hell" . "o")) ==> TRUE
($string eq "HELLO") ==> FALSE
$num = 42;
($num-2 == 40) ==> TRUE
The operator lc("Hello") returns the all lower-case version ("hello"), and uc("Hello") returns the all upper-case version ("HELLO").
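One common use of lc, sketched here with made-up variable names, is to compare two strings with eq while ignoring case.

```perl
my $input = "HeLLo";
my $lower = lc($input);     ## "hello"
my $upper = uc($input);     ## "HELLO"

## lower-case both sides so that eq ignores the difference in case
my $same = (lc($input) eq lc("hello")) ? "same" : "different";

print "$lower $upper $same\n";   ## prints "hello HELLO same"
```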
@array = (1, 2, "hello"); ## a 3 element array
@empty = (); ## the array with 0 elements
$x = 1;
$y = 2;
@nums = ($x + $y, $x - $y);
## @nums is now (3, -1)
Just as in C, square brackets [ ] are used to refer to elements, so $a[6] is the element at index 6 in the array @a. As in C, array indices start at 0. Notice that the syntax to access an element begins with '$' not '@' -- use '@' only when referring to the whole array (remember: all scalar expressions begin with $). Perl just makes up undef elements as needed if the program uses an index for which there is no element.
@array = (1, 2, "hello", "there");
$array[0] = $array[0] + $array[1] + $array[27];
## $array[0] is now 3, $array[27] was undef which acts like 0
## when used in arithmetic
When used in a scalar context, an array evaluates to its length. The "scalar" operator will force the evaluation of something in a scalar context, so you can use scalar to get the length of an array. As an alternative to using scalar, the expression $#array is the index of the last element of the array which is always one less than the length.
@array = (1, 2, "hello", "there");
$len = @array;
## $len is now 4 (the length of @array)
$len = scalar(@array); ## same as above, since $len represented a scalar
## context anyway, but this is more explicit
@b = ("a", "b", "c");
$y = $#b;
## $y is now 2
That scalar(@array) is the way to refer to the length of an array is not a great moment in the history of readable code. At least I haven't shown you the even more vulgar forms such as (0 + @a).
There's a variant of array assignment that is used sometimes to assign several variables at once. If an array on the left hand side of an assignment operation contains the names of variables, the variables are assigned the corresponding values from the right hand side.
($x, $y, $z) = (1, 2, "hello", 4);
## assigns $x=1, $y=2, $z="hello", and the 4 is discarded
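Two other consequences of the same rule, sketched here with made-up variables: the right hand side is built before anything is assigned, so two variables can be swapped in one line, and a right hand side that is too short leaves the extra variables undef.

```perl
my ($x, $y) = (1, 2);
($x, $y) = ($y, $x);       ## swap: the list (2, 1) is built first
print "$x $y\n";           ## prints "2 1"

my ($p, $q, $r) = (10, 20);   ## no third value, so $r is undef
my $state = defined($r) ? "defined" : "undef";
print "$p $q $state\n";    ## prints "10 20 undef"
```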
$dict{"bart"} = "I didn't do it";
$dict{"homer"} = "D'Oh";
$dict{"lisa"} = "";
## %dict now contains the key/value pairs ("bart" -> "I didn't do it"),
## ("homer" -> "D'Oh"), and ("lisa" -> "")
$string = $dict{"bart"}; ## Lookup the key "bart" to get
## the value "I didn't do it"
$dict{"homer"} = "Mmmm, scalars"; ## change the value for the key
## "homer" to "Mmmm, scalars"
Hash arrays are convertible with arrays of even length, where each key is immediately followed by its value. Each key is adjacent to its value, but the order of the key/value pairs depends on the hashing of the keys and so appears random. The "keys" operator returns an array of the keys from a hash array. The "values" operator returns an array of all the values, in an order consistent with the keys operator.
@array = %dict;
## @array will look something like
## ("homer", "D'Oh", "lisa", "", "bart", "I didn't do it");
##
## (keys %dict) looks like ("homer", "lisa", "bart")
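Because the key order appears random, a loop over a hash array usually sorts the keys first. The sketch below assumes the standard built-ins sort and push, which are not covered above, and builds the hash from an even-length list as just described.

```perl
my %dict = ("bart", "I didn't do it", "homer", "D'Oh", "lisa", "");

my @lines;
foreach my $key (sort(keys(%dict))) {     ## alphabetical: bart, homer, lisa
  push(@lines, "$key -> $dict{$key}");    ## append one line per key
}
my $count = scalar(keys(%dict));          ## number of keys, 3 here

print join("\n", @lines), "\n";
```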
You can use => instead of comma and so write a hash array value this cute way...
%dict = (
  "bart" => "I didn't do it",
  "homer" => "D'Oh",
  "lisa" => "",
);
In Java or C you might create an object or struct to gather a few items
together. In Perl you might just throw those things together in a hash array.
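As a rough sketch of that idea, here is a hypothetical "student" record kept in a hash array; the field names are invented for the example.

```perl
## a hypothetical record: each "field" is just a key in the hash
my %student = (
  "name"  => "Lisa",
  "grade" => 2,
);

$student{"grade"} = $student{"grade"} + 1;    ## update one field
my $summary = "$student{name} is in grade $student{grade}";

print "$summary\n";     ## prints "Lisa is in grade 3"
```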
unix% perl critic.pl -poetry poem.txt
%ENV contains the environment variables of the context that launched the Perl program. @ARGV and %ENV make the most sense in a Unix environment.
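A minimal sketch of looking at both. PATH is used only because it is commonly set on Unix; it may be missing elsewhere, so the code checks for that.

```perl
## report the command line arguments the program was launched with
my $argc = scalar(@ARGV);
print "got $argc command line arguments\n";
foreach my $arg (@ARGV) {
  print "  arg: $arg\n";
}

## look up one environment variable, which may not be set on every system
my $path = defined($ENV{"PATH"}) ? $ENV{"PATH"} : "(not set)";
print "PATH is $path\n";
```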
if (expr) { ## if + elsif + else
  stmt;
  stmt;
}
elsif (expr) { ## note the strange spelling of "elsif"
  stmt;
  stmt;
}
else {
  stmt;
  stmt;
}
unless (some-expr) { ## if variant which negates the boolean test
  stmt;
  stmt;
}
$x = 3 if $x > 3; ## equivalent to: if ($x > 3) { $x = 3; }
$x = 3 unless $some_variable;
For these constructs, the parentheses are not required around the boolean expression. This may be another case where Perl is using a structure from human languages. I tend to avoid this syntax because I just cannot get used to seeing the condition after the statement it modifies. If you were defusing a bomb, would you like instructions like this: "Locate the red wire coming out of the control block and cut it. Unless it's a weekday -- in that case cut the black wire."
while (expr) {
  stmt;
  stmt;
}
for (init_expr; test_expr; increment_expr) {
  stmt;
  stmt;
}
The "next" operator forces the loop to the next iteration. The "last" operator breaks out of the loop like break in C. This is one case where Perl (last) does not use the same keyword name as C (break).
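Here is a small sketch of both operators in one loop; the data is made up. Negative numbers are skipped with next, and the loop stops for good at the first 0.

```perl
my @nums = (3, -1, 4, -5, 0, 9);
my $sum = 0;
my $i;
for ($i = 0; $i < scalar(@nums); $i++) {
  if ($nums[$i] < 0) {
    next;               ## skip this element, go on to the next $i
  }
  if ($nums[$i] == 0) {
    last;               ## leave the loop entirely
  }
  $sum += $nums[$i];
}
print "$sum\n";         ## prints "7" (3 + 4; the 9 is never reached)
```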
foreach $var (@array) {
  stmt; ## use $var in here
  stmt;
}
Any array expression may be used in the foreach. The array expression is evaluated once before the loop starts. The iterating variable, such as $var, is actually a pointer to each element in the array, so assigning to $var will actually change the elements in the array.
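A short sketch of that last point, with a made-up array: assigning to the loop variable doubles the elements in place.

```perl
my @nums = (1, 2, 3);
foreach my $n (@nums) {
  $n = $n * 2;          ## $n stands for the array element itself
}
print "@nums\n";        ## prints "2 4 6"
```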
The standard file handles STDIN, STDOUT, and STDERR are automatically opened before the program runs. Surrounding a file handle with < > is an expression that returns one line from the file including the "\n" character. The < > operator returns undef when there is no more input. The "chop" operator removes the last character from a string, so it can be used just after an input operation to remove the trailing "\n". The "chomp" operator is similar, but only removes the character if it is the end-of-line character.
$line = <STDIN>; ## read one line from the STDIN file handle
chomp($line); ## remove the trailing "\n" if present
$line2 = <FILE2>; ## read one line from the FILE2 file handle
## which must have been opened previously
Since the input operator returns undef at the end of the file, the standard pattern to read all the lines in a file is...
## read every line of a file
while (defined($line = <STDIN>)) {
  ## do something with $line
}
open(F1, "filename"); ## open "filename" for reading as file handle F1
open(F2, ">filename"); ## open "filename" for writing as file handle F2
open(F3, ">>appendtome"); ## open "appendtome" for appending
close(F1); ## close a file handle
Open can also be used to establish a reading or writing connection to a separate process launched by the OS. This works best on Unix.
open(F4, "ls -l |"); ## open a pipe to read from an ls process
open(F5, "| mail $addr"); ## open a pipe to write to a mail process
Passing commands to the shell to launch OS processes in this way can be very convenient, but it's also a famous source of security problems in CGI programs, so be very careful what you ask the OS to launch when writing CGIs.
Open returns undef on failure, so the following phrase is often used to exit if a file can't be opened. The die operator prints an error message and terminates the program.
open(FILE, $fname) || die "Could not open $fname\n";
In this example, the logical-or operator || essentially builds an if
statement. It's a strange construction, but it's a common error-handling pattern
in Perl.
@a = <FILE>; ## read the whole file in as an array of lines
The behavior of <FILE> also depends on the special variable $/ which is what Perl thinks is the current end-of-line marker (usually "\n"). Setting $/ to undef causes <FILE> to read the whole file into a single string.
$/ = undef;
$all = <FILE>; ## read the whole file into one string
You can remember that $/ is the end-of-line marker because "/" is used to designate separate lines of poetry. I thought this mnemonic was silly when I first saw it, but sure enough, I have never forgotten that $/ is the end-of-line marker.
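Since $/ is a global setting, changing it affects every later read in the program. One way to limit the change, sketched below, is the built-in "local", which restores the old value at the end of the enclosing block. To keep the sketch self-contained it reads from a string instead of a real file, which assumes Perl 5.8 or later; the data is made up.

```perl
my $data = "line one\nline two\n";

## Perl 5.8 and later can open a file handle that reads from a string;
## it stands in for a real file here
open(my $in, "<", \$data) || die "Could not open in-memory file\n";

my $all;
{
  local $/ = undef;     ## $/ is changed only until the end of this block
  $all = <$in>;         ## the whole "file" as one string
}
close($in);

## outside the block, $/ is back to its usual value
my $restored = ($/ eq "\n") ? "yes" : "no";
print length($all), " chars read, restored: $restored\n";
```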
print "Woo Hoo\n"; ## print a string to STDOUT
$num = 42;
$str = " Hoo";
print "Woo", $str, " bbb $num", "\n"; ## print comma separated strings
An optional first argument to print can specify the destination file handle. There is no comma after the file handle, but I always forget to omit it.
print FILE "Klaatu", " barada", " nikto!", "\n"; ## no comma after FILE
#!/usr/bin/perl -w
require 5.004;
## Open each command line file and print its contents to standard out
foreach $fname (@ARGV) {
  open(FILE, $fname) || die("Could not open $fname\n");
  while(defined($line = <FILE>)) {
    print $line;
  }
  close(FILE);
}
The above uses "die" to abort the program if one of the files cannot be opened. We could use a more flexible strategy where we print an error message for that file but continue to try to process the other files. Alternately we could use the function call exit(-1) to exit the program with an error code. Also, the following shift pattern is a common alternative way to iterate through an array...
while($fname = shift(@ARGV)) {...
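The more flexible strategy mentioned above might look something like this sketch. The two file names are hypothetical stand-ins for @ARGV and are assumed not to exist, so both take the error path. The special variable $! is the system's description of the most recent error.

```perl
## stand-ins for @ARGV; both names are hypothetical and assumed not to exist
my @fnames = ("no-such-dir/a.txt", "no-such-dir/b.txt");
my $failures = 0;

foreach my $fname (@fnames) {
  if (!open(FILE, $fname)) {
    ## complain about this file, then go on to the next one
    print STDERR "Could not open $fname: $!\n";
    $failures++;
    next;
  }
  while (defined(my $line = <FILE>)) {
    print $line;
  }
  close(FILE);
}
print STDERR "$failures file(s) could not be opened\n";
```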
($string =~ /pattern/) ## true if the pattern is found somewhere in the string
("binky" =~ /ink/) ==> TRUE
("binky" =~ /onk/) ==> FALSE
In the simplest case, the exact characters in the regular expression pattern must occur in the string somewhere. All of the characters in the pattern must be matched, but the pattern does not need to be right at the start or end of the string, and the pattern does not need to use all the characters in the string.
The modifier "i" after the last / means the match should be case insensitive...
"PiIIg" =~ /pIiig/ ==> FALSE
"PiIIg" =~ /pIiig/i ==> TRUE
There is an optional "m" (for "match") that comes before the first /. If the "m" is used, then any character can be used for the delimiter instead of / -- so you could use " or + to delimit the expressions. This is handy if what you are trying to match has a lot of /'s in it. If the delimiter is the single quote (') then interpolation is suppressed. All of the following are the same...
"piiig" =~ m/piiig/
"piiig" =~ m"piiig"
"piiig" =~ m+piiig+
#### search for the RE 'iiig' in the string 'piiig'
"piiig" =~ m/iiig/ ==> TRUE
#### the RE may be anywhere inside the string
"piiig" =~ m/iii/ ==> TRUE
#### all of the RE must match
"piiig" =~ m/iiii/ ==> FALSE
#### . = any char but \n
"piiig" =~ m/...ig/ ==> TRUE
"piiig" =~ m/p.i../ ==> TRUE
"piiig" =~ m/p.i.../ ==> FALSE
#### \d = digit
"p123g" =~ m/p\d\d\dg/ ==> TRUE
"p123g" =~ m/p\d\d\d\d/ ==> FALSE
#### \w = letter, digit, or underscore
"p123g" =~ m/\w\w\w\w\w/ ==> TRUE
#### i+ = one or more i's
"piiig" =~ m/pi+g/ ==> TRUE
"piiig" =~ m/i+/ ==> TRUE
"piiig" =~ m/p+i+g+/ ==> TRUE
"piiig" =~ m/p+g+/ ==> FALSE
#### i* = zero or more i's
"piiig" =~ m/pi*g/ ==> TRUE
"piiig" =~ m/p*i*g*/ ==> TRUE
"piiig" =~ m/pi*X*g/ ==> TRUE
#### ^ = start, $ = end
"piiig" =~ m/^pi+g$/ ==> TRUE
"piiig" =~ m/^i+g$/ ==> FALSE
"piiig" =~ m/^pi+$/ ==> FALSE
"piiig" =~ m/^p.+g$/ ==> TRUE
"piiig" =~ m/^p.+$/ ==> TRUE
"piiig" =~ m/^.+$/ ==> TRUE
"piiig" =~ m/^g.+$/ ==> FALSE
#### Needs at least one char after the g
"piiig" =~ m/g.+/ ==> FALSE
#### Needs at least zero chars after the g
"piiig" =~ m/g.*/ ==> TRUE
#### | = one or the other
"cat" =~ m/^(cat|hat)$/ ==> TRUE
"hat" =~ m/^(cat|hat)$/ ==> TRUE
"cathatcatcat" =~ m/^(cat|hat)+$/ ==> TRUE
"cathatcatcat" =~ m/^(c|a|t|h)+$/ ==> TRUE
"cathatcatcat" =~ m/^(c|a|t)+$/ ==> FALSE
#### Matches and stops at 'cat' on the left; does not get to 'catcat' on the right
"cathatcatcat" =~ m/(c|a|t)+/ ==> TRUE
#### ? = optional
"<><><x><><x>" =~ m/^(<x?>)+$/ ==> TRUE
"aaaxbbbabaxbb" =~ m/^(a+x?b+)+$/ ==> TRUE
"aaaxxbbb" =~ m/^(a+x?b+)+$/ ==> FALSE
#### words separated by spaces -- \s = space, tab, or newline
"easy does it" =~ m/^\w+\s+\w+\s+\w+$/ ==> TRUE
#### Just matches "gates@microsoft" -- \w does not match the "."
"bill.gates@microsoft.com" =~ m/\w+@\w+/ ==> TRUE
#### add the .'s to get the whole thing
"bill.gates@microsoft.com" =~ m/^(\w|\.)+@(\w|\.)+$/ ==> TRUE
#### words separated by commas and possibly spaces
"Klaatu, barada,nikto" =~ m/^\w+(,\s*\w+)*$/ ==> TRUE
The parts of an email address on either side of the "@" are made up of letters, numbers plus dots, underbars, and dashes. As a character class that's just [\w._-].
"bill.gates_emporer@microsoft.com" =~ m/^[\w._-]+@[\w._-]+$/ ==> TRUE
if ("this and that" =~ /(\w+)\s+(\w+)\s+(\w+)/) {
  ## if the above matches, $1 eq "this", $2 eq "and", $3 eq "that"
}
This is a nice way to parse a string -- write a regular expression for the pattern you expect, putting parentheses around the parts you want to pull out. Only use $1, $2, etc. when the if =~ returns true. There are three other special variables: $& (dollar-ampersand) = the matched string, $` (dollar-back-quote) = the string before what was matched, and $' (dollar-quote) = the string following what was matched.
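Here is a small sketch of all of these variables after a single match. The input string and variable names are made up, and the values are copied into named variables right away since the next successful match will overwrite them.

```perl
my $str = "call 555-1212 today";
my ($prefix, $number, $matched, $before, $after);

if ($str =~ /(\d+)-(\d+)/) {
  ($prefix, $number) = ($1, $2);   ## "555" and "1212"
  $matched = $&;                   ## "555-1212"
  $before  = $`;                   ## "call "
  $after   = $';                   ## " today"
}
print "$before|$matched|$after\n"; ## prints "call |555-1212| today"
```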
The following loop rips through a string and pulls out all the email addresses. It demonstrates using a character class, using $1 etc. to pull out parts of the match string, and using $' after the match.
$str = 'blah blah nick@cs.stanford.edu, blah blah blah billg@microsoft.com blah blah';
while ($str =~ /(([\w._-]+)\@([\w._-]+))/) { ## look for an email addr
  print "user:$2 host:$3 all:$1\n"; ## parts of the addr
  $str = $'; ## set the str to be the "rest" of the string
}
output:
user:nick host:cs.stanford.edu all:nick@cs.stanford.edu
user:billg host:microsoft.com all:billg@microsoft.com
## Change all "is" strings to "is not" -- a sure way to improve any document
$str =~ s/is/is not/ig;
The "g" modifier after the last / means do the replacement repeatedly in the target string. The modifier "i" means the match should not be case sensitive. The following example finds instances of the letter "r" or "l" followed by a word character, and replaces that pattern with "w" followed by the same word character. Sounds like Tweety Bird...
## Change "r" and "l" followed by a word char to "w" followed
## by the same word char
$x = "This dress exacerbates the genetic betrayal that is my Legacy.\n";
$x =~ s/(r|l)(\w)/w$2/ig; ## r or l followed by a word char
## $x is now "This dwess exacewbates the genetic betwayal that is my wegacy."
m/{(.*)}/ -- pick up all the characters between {}'s
The problem is that if you match against the string "{group 1} xx {group 2}", the * will aggressively run right over the first } and match the second }. So $1 will be "group 1} xx {group 2" instead of "group 1". Fortunately Perl has a nice solution to the too-aggressive-*/+ problem. If a ? immediately follows the * or +, then it tries to find the shortest repetition which works instead of the longest. You need the ? variant most often when matching with .* or \S* which can easily use up more than you had in mind. Use ".*?" to skip over stuff you don't care about when there is something you do care about immediately to its right, such as...
m/{(.*?)}/ ## pick up all the characters between {}'s, but stop
## at the first }
The old way to skip everything up until a certain character, say }, uses the [^}] construct like this...
m/{([^}]*)}/ ## the inner [^}] matches any char except }
I prefer the (.*?) form. In fact, I suspect it was added to the language precisely as an improvement over the [^}]* form.
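To see the three forms side by side, here is a sketch that runs each against the example string from above. The variable names are made up, and the braces are backslash-escaped, which is always safe for a literal { or } in a pattern.

```perl
my $str = "{group 1} xx {group 2}";
my ($greedy, $shortest, $negated);

if ($str =~ m/\{(.*)\}/)    { $greedy   = $1; }   ## runs to the last }
if ($str =~ m/\{(.*?)\}/)   { $shortest = $1; }   ## stops at the first }
if ($str =~ m/\{([^}]*)\}/) { $negated  = $1; }   ## cannot cross a }

print "$greedy\n";      ## prints "group 1} xx {group 2"
print "$shortest\n";    ## prints "group 1"
print "$negated\n";     ## prints "group 1"
```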
$count = 0;
$pos = 0;
while (($pos = index($string, "binky", $pos)) != -1) {
  $count++;
  $pos++;
}
The function substr(string, offset, length) pulls a substring out of
the given string. Substr() starts at the given offset and continues for the
given length.
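A short sketch of substr working together with index, using a made-up file name. When the length argument is left out, substr continues to the end of the string.

```perl
my $string = "binky.txt";

my $base = substr($string, 0, 5);       ## "binky" -- 5 chars from offset 0
my $dot  = index($string, ".");         ## 5 -- offset of the first "."
my $ext  = substr($string, $dot + 1);   ## "txt" -- everything after the "."

print "$base $dot $ext\n";              ## prints "binky 5 txt"
```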
split(/\s*,\s*/, "dress , betrayal , legacy") ## returns the array ("dress", "betrayal", "legacy")
Split is often a useful way to pull an enumeration out of some text for processing. If the number -1 is passed as a third argument to split, then it will interpret an instance of the separator pattern at the end of the string as marking a last, empty element (note the comma after the last word)...
split(/\s*,\s*/, "dress , betrayal , legacy,", -1) ## returns the array ("dress", "betrayal", "legacy", "")
$string =~ tr/a/b/; -- change all a's to b's
$string =~ tr/A-Z/a-z/; -- change uppercase to lowercase (actually lc() is better for this)
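The tr operator changes the string in place and returns the number of characters it translated, so it can also be used to count. A small sketch with a made-up string:

```perl
my $copy = "Banana Bread";

my $count = ($copy =~ tr/a/o/);   ## changes every "a" and returns how many
print "$copy $count\n";           ## prints "Bonono Breod 4"

$copy =~ tr/A-Z/a-z/;             ## now "bonono breod"
```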
$x = Three(); ## call to Three() returns 3
exit(0); ## exit the program normally
sub Three {
  return (1 + 2);
}
sub Three {
  my ($x, $y); # declares "local" vars $x and $y
  $x = 1;
  $y = 2;
  return ($x + $y);
}
# Variant of Three() which inits $x and $y with the array trick
sub Three2 {
  my ($x, $y) = (1, 2);
  return ($x + $y);
}
sub Sum1 {
  my ($x, $y) = @_; # the first lines of many functions look like this
                    # to retrieve and name their params
  return($x + $y);
}
# Variant where you pull the values out of @_ directly
# This avoids copying the parameters
sub Sum2 {
  return($_[0] + $_[1]);
}
# How Sum() would really be written in Perl -- it takes an array
# of numbers of arbitrary length, and adds all of them...
sub Sum3 {
  my ($sum, $elem); # declare local vars
  $sum = 0;
  foreach $elem (@_) {
    $sum += $elem;
  }
  return($sum);
}
## Variant of above using shift instead of foreach
sub Sum4 {
  my ($sum, $elem);
  $sum = 0;
  while(defined($elem = shift(@_))) {
    $sum += $elem;
  }
  return($sum);
}
open(FILE, ">file.txt");
SayHello("FILE");
close(FILE);
## Here, the file handle FILE is passed as the string "FILE"
sub SayHello {
  my($file_handle) = @_;
  ## Prints to the file handle identified in $file_handle
  print $file_handle "I'm a little teapot, short and stout.\n";
}
Actually, the file handle doesn't even need to be quoted in the call, so the above call could be written as SayHello(FILE);. This is the "bareword" feature of Perl where a group of characters that does not have another syntactic interpretation is passed through as if it were a string. I prefer not to rely on barewords, so I write the call as SayHello("FILE");.
# Suppose this function returns a (num, string) array
# where the num is a result code and the string is
# the human readable form
sub DoSomething {
  # does something
  return(-13, "Core Breach Imminent!!"); # return an array len 2
}
# so a call would look like...
my ($num, $string) = DoSomething();
if ($num<0) {
  print "Panic:$string\n";
}
The values returned must be scalars -- if they themselves are arrays, they will be flattened into the return array, which is probably not what you want.
Sum3(1, 2, (3, 4)); ## returns 10 -- the arg array is flattened to (1, 2, 3, 4)
This flattening can hurt you if you try to assign to an element which is an array...
my(@nums, $three) = ((1, 2), 3);
You might think that this assigns (1, 2) to @nums and 3 to $three. But instead the right hand side gets flattened to (1, 2, 3) which is then assigned to @nums, and $three does not get a value. Only use the my($x, $y) = (...); form when assigning a bunch of scalar values. If any of the values are arrays, then you should separate out all the assignments, each on its own line...
my ($x, $y);
$x = ..;
$y = ..;
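The difference can be checked with a small sketch. The names @nums2 and $three2 are made up so that both versions can sit side by side.

```perl
## the tempting form: @nums swallows everything, $three gets nothing
my (@nums, $three) = ((1, 2), 3);
my $len = scalar(@nums);                         ## 3, not 2
my $got = defined($three) ? "defined" : "undef"; ## "undef"

## the safe form: one assignment per variable
my (@nums2, $three2);
@nums2  = (1, 2);
$three2 = 3;

print "$len $got\n";                  ## prints "3 undef"
print scalar(@nums2), " $three2\n";   ## prints "2 3"
```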
You can get around the 1-deep limitation by storing references to arrays in other arrays -- see the References section.
You can enable a feature in the interpreter to check for global variable references with the following declaration (usually up at the top of the code)...
use strict 'vars';
In that case, all references to global variables must be qualified with a package name. By default, global variables go in a package named "main" and qualified references to them look like..
$main::x = 0;
## qualified reference to global $x in the "main" package
$::x = 0;
## as a special case, "main" may be omitted
@foo::nums = (1, 2, 3);
## qualified reference to @nums in the "foo" package
You can put your code in a named package with the following line at the top of your file..
package Binky; ## now your globals are qualified as Binky::
Suppose there is a string...
$str = "hello"; ## original string
And there is a reference that points to that string...
$ref = \$str; ## compute $ref that points to $str
The expression to access $str is $$ref. Essentially, the alphabetic part of the variable, 'str', is replaced with the dereference expression '$ref'...
print "$$ref\n"; ## prints "hello" -- identical to "$str\n";
Here's an example of the same principle with a reference to an array...
@a = (1, 2, 3); ## original array
$aRef = \@a; ## reference to the array
print "a: @a\n"; ## prints "a: 1 2 3"
print "a: @$aRef\n"; ## exactly the same
Curly braces { } can be added in code and in strings to help clarify the stack of @, $, ...
print "a: @{$aRef}\n"; ## use { } for clarity
Here's how you put references to arrays in another array to make it look two dimensional...
@a = (1, 2, 3);
@b = (4, 5, 6);
@root = (\@a, \@b);
print "a: @a\n"; ## a: 1 2 3
print "a: @{$root[0]}\n"; ## a: 1 2 3
print "b: @{$root[1]}\n"; ## b: 4 5 6
scalar(@root) ## root len == 2
scalar(@{$root[0]}) ## a len: == 3
For arrays of arrays, the [ ] operations can stack together so the syntax is more C like...
$root[1][0] ## this is 4
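Putting those pieces together, here is a sketch that walks the two-dimensional structure with nested foreach loops. It also shows that @root holds references rather than copies: a change made through @root shows up in @a. The names $rowRef, $elem, and $total are made up.

```perl
my @a = (1, 2, 3);
my @b = (4, 5, 6);
my @root = (\@a, \@b);

my $total = 0;
foreach my $rowRef (@root) {          ## each element is an array reference
  foreach my $elem (@{$rowRef}) {     ## dereference to get the row itself
    $total += $elem;
  }
}

$root[0][2] = 30;          ## changes the last element of @a via the reference
print "$total $a[2]\n";    ## prints "21 30"
```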
while (defined($line = <FILE>)) {
  print $line;
}
Can be written as...
while (<FILE>) {
  print;
}
It turns out that <FILE> assigns its value into $_ when it is the only thing in the test of a while loop, and likewise print reads from $_ if nothing is specified. Perl is filled with little shortcuts like that, so many phrases can be written more tersely by omitting explicit variables. I don't especially like the "short" style, since I actually like having named variables in my code, but obviously it depends on personal taste and the goal for the code. If the code is going to be maintained or debugged by someone else in the future, then having named variables seems like a good idea.