Notes from Mastering Perl for Bioinformatics by James Tisdall. O’Reilly & Associates, 2003.

Perl language notes 1

Web page for book = www.oreilly.com/catalog/mperlbio

Basics:

Run perl programs using perl prog_name

Edit in MSWord saving as text only with line breaks.

Default file extension is .pl. Modules or classes must use .pm.

Edit data files the same way, but use text only if inputs are longer than one line, like sequence files.

Begin files with a #! line pointing to the perl interpreter: #!/usr/bin/perl (no trailing semicolon)

Current folder & standard perl folders checked when calls to access file made.

use lib '/usr/etc/etc/folder_name'; allows another local folder to be searched, or on command line by

perl -I/dir/dir/lastdir program.pl

use warnings; gives debugging info & use strict; forces variables to be declared (e.g. with my) before use.

# is used for notes (rest of line after # is ignored by program)

Variables & functions to manipulate them:

$var is scalar (auto parsed to # or text, use ‘###’ to treat # as text)

Can assign multiple things at once w/ ()’s ($a,$b,$c) = (1,2,3).

Arrays: @arr = (1, "2", 'cow', "frog", $var1) makes 1D array. @a=qw(a b c) omits need for quotes & commas.

Access elements by $elem = $arr[0] where 0 is 1st element

Using array where Perl expects a scalar value returns # of elements in array, e.g. if (@arr <5)

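Can provide list or range of positions to get multiple elements in array (a slice): @arr[1,2] @arr[0..4] @arr[3..$#arr]. (The 1../search_pattern/ form is a line-range test for use in conditions while reading a file, not an array slice; see File handling.)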

$count = @array gives # of elements in array, or use $#array: gives position of last element

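delete $hash{key} removes key & its value from hash (delete @hash{'k1','k2'} for several keys). delete on array elements is discouraged & does not close the gap; to remove array elements 2-4 use splice(@array, 2, 3).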

exists $array[x] true if array element (or $hash{key}) exists.

join expr, list joins elements of list/array separated by expr

pop @array remove & return last element of array

push @array, list put single or multiple elements of list onto end of array

reverse @array returns array in reverse order (w/o sorting), in scalar context reverses string

shift @array remove & return first element

sort @arr returns sorted copy of array (string order by default, sort {$a <=> $b} @arr for numeric); @arr itself is unchanged. reverse @arr just flips the order, without sorting.

splice @array, offset, ln, list remove ln elements of array from offset & replace w/ list

unshift @array, list add list to beginning of array
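A minimal sketch of the array functions above run on one small array. The data and variable names are made up for illustration; expected values are in the comments.

```perl
use strict;
use warnings;

my @arr = qw(d b a c);

my $count = @arr;             # scalar context: number of elements (4)
my $last  = $#arr;            # position of last element (3)

push @arr, 'e', 'f';          # add to the end
my $popped  = pop @arr;       # removes and returns 'f'
unshift @arr, 'z';            # add to the beginning
my $shifted = shift @arr;     # removes and returns 'z'

my @sorted   = sort @arr;           # new list in string order; @arr unchanged
my @reversed = reverse @arr;        # new list in reverse order, no sorting
my $joined   = join '-', @sorted;   # 'a-b-c-d-e'

# splice: remove 2 elements starting at position 1, put 'X' in their place
my @copy    = @arr;                       # (d b a c e)
my @removed = splice @copy, 1, 2, 'X';    # @removed = (b a), @copy = (d X c e)

print "$count $last $popped $shifted $joined | @removed | @copy\n";
```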

Hashes: %h = ('key1', 'val1', key2 => 'val2') makes hash, w/ => equiv to "," but allows you to not put quotes around key2.

Access elements by $value = $h{'key_name'}

Get all keys with @keys = keys %h or values with @vals = values %h
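A short sketch of the hash operations above, plus exists and delete on hash keys. Key and value names are placeholders.

```perl
use strict;
use warnings;

my %h = ('key1', 'val1', key2 => 'val2');

my $value = $h{key1};                      # look up one value
$h{key3}  = 'val3';                        # add a new key
my $had_key2 = exists $h{key2} ? 1 : 0;    # 1: key is present
delete $h{key2};                           # remove key and its value

my @keys = sort keys %h;                   # keys come back unordered, so sort
for my $k (@keys) {
    print "$k => $h{$k}\n";
}
```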

my $var defines $var locally within the block, subroutine or file where my is called, outside of which it is removed from memory UNLESS it is referred to by any subroutine (a "closure"). If that subroutine(s) is enclosed in the same block it is the only way to access that variable… e.g.

{ my $var; sub up_var {$var++}; sub get_var {print "$var"} } often used in OO modules
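The closure idea as a runnable sketch: $var is invisible outside the block, but the two subs keep it alive and are the only way to reach it. Here get_var returns the value rather than printing it, which is a small change from the one-liner above.

```perl
use strict;
use warnings;

{
    my $var = 0;                   # visible only inside this block
    sub up_var  { $var++ }         # both subs hold on to $var (closure)
    sub get_var { return $var }
}

up_var();
up_var();
print get_var(), "\n";             # prints 2
```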

Default variables: $_ (default scalar, e.g. current element in foreach or current line in while (<FH>)) & @_ (array of arguments passed to a subroutine)

$! (error message), $& (string matched by last successful =~ pattern match).

@ARGV is array of command line arguments to script.

Note for $obj = OO_Class->new calls, 1st element in @_ is 'OO_Class'; for any subsequent $obj->method(args) call, $obj (the reference) is the 1st argument passed.

Variables should be initialized (assigned w/ =) before they are used in calcs or print statements etc.; otherwise their value is undef & use warnings complains. For scalars ='' or 0 works, for arrays & hashes @a = () or %a = () OK. Or w/ refs $a = [] or {}.

Closure variables: see my above.

References:

$ref = \$var gives ref to memory location of $var

$value_of_var = $$ref (dereferenced by $, or @ or % for array or hash refs)

$array_ref = [0,1,2,3] makes ref to anonymous array

access whole array with @arr=@$array_ref

access element w/ $val =$$array_ref[0] (returns 0) OR =$array_ref->[0]

$hash_ref = {key => 'val', key2 => 'val2'} makes ref to anon hash

access whole hash with %$hash_ref, or data w/ $$hash_ref{key} OR $hash_ref->{key}

can, if desired, make this clearer with {}’s, e.g. $$ref equals ${$ref}

ref EXPR if EXPR is reference returns type of thing it points to e.g. SCALAR, ARRAY, REF, HASH, CODE, or OO_Class1 if it has been blessed by an OO module, else returns empty string (false).
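The reference notes above as one runnable sketch. Values are arbitrary; the last line shows what ref returns for each kind of thing, including the empty string for a non-reference.

```perl
use strict;
use warnings;

my $var = 'ACGT';
my $ref = \$var;                        # reference to a scalar
my $value_of_var = $$ref;               # dereference: 'ACGT'

my $array_ref = [0, 1, 2, 3];           # reference to anonymous array
my @arr   = @$array_ref;                # whole array
my $first = $array_ref->[0];            # same as $$array_ref[0]

my $hash_ref = { key => 'val', key2 => 'val2' };   # anonymous hash
my $val = $hash_ref->{key};             # same as $$hash_ref{key}

my @types = (ref($ref), ref($array_ref), ref($hash_ref), ref($var));
print join(',', @types), "\n";          # SCALAR,ARRAY,HASH,
```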

Complex data structures:

Matrices: Can specify 1) directly with $arr[x][y]=, or…

2) by filling array with refs to other arrays. Simplest @arr = ([1,2],[3,4]) which puts anon arrays into array. Better: define refs $a=[1,2]; $b=[3,4] & put in array @arr=($a,$b). Either way elements are pulled back normally as $arr[x][y] (short for $arr[x]->[y]).

3) Most flexibly, by making everything a reference, so $arr=[[1,2],[3,4]] or $arr=[$a,$b]. If so, need to dereference to access elements, as $$arr[x][y] or $arr->[x][y], or @{$arr->[x]} to get the array at posit x. If everything is always a reference, it's most flexible, allowing complex mixed data structures, such as $mess = [1,{k1=>'hi',k2=>["what","the","hell?"]},[1,2,[9,8.7]],"end"], that can be accessed like $mess->[0] gives 1, ${$mess->[1]}{k1} gives "hi", ${${$mess->[1]}{k2}}[2] gives "hell?" & ${${$mess->[2]}[2]}[0] is 9. Most sensibly these could also be written with arrows, $mess->[1]->{k2}->[2] (arrows between adjacent subscripts are optional, so $mess->[1]{k2}[2] also works), or @{$mess->[1]->{k2}} for the whole inner array.

Order appears to be from inside out or left to right. So pointer to top level array position 1st, etc.

Still darn confusing re when dereferencing is needed, etc.
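A sketch that may help with the confusion: the $mess structure from above, read with the arrow form only. Rule of thumb used here (my own summary, not from the book): start from the reference with ->, then keep adding [index] or {key} subscripts left to right; wrap in @{ } only when the whole inner array is wanted.

```perl
use strict;
use warnings;

my $mess = [ 1,
             { k1 => 'hi', k2 => [ 'what', 'the', 'hell?' ] },
             [ 1, 2, [ 9, 8.7 ] ],
             'end' ];

my $top  = $mess->[0];               # 1
my $hi   = $mess->[1]{k1};           # 'hi'
my $word = $mess->[1]{k2}[2];        # 'hell?'
my $nine = $mess->[2][2][0];         # 9
my @k2   = @{ $mess->[1]{k2} };      # whole inner array: what the hell?

# A matrix held in a real array (not a ref) needs no leading arrow
my @matrix = ([1, 2], [3, 4]);
my $cell   = $matrix[1][0];          # 3

print "$top $hi $word $nine $cell @k2\n";
```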

Operators:

Logical: not (or !) (returns true if something is false) see If statement

and (&&), or (||)- meaning either, xor (meaning only one not both)

Note statement1 or statement2 only executes statement 2 if 1 is false.

Comparison: == (for #’s) or eq (for strings), != (or ne for strings) also < (lt), <= (le), >= (ge), > (gt)

Assignment: $a = $b assigns value of $b to $a, $a++ or $a-- (increment or decrement)

$a+=$b ($a=$a+$b), also -=, *=, /= ($a=$a/$b), **= ($a raised to $b) & %= (remainder of $a/$b); for strings $a.=$b appends $b to $a, $a x=$b (repeat $a $b times)
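A quick sketch of the assignment and comparison operators above, with arbitrary values. It also shows why == vs eq matters: 10 and 10.0 are equal as numbers but '10' and '10.0' differ as strings.

```perl
use strict;
use warnings;

my $n = 7;
$n += 3;       # 10
$n **= 2;      # 100
$n %= 7;       # remainder of 100/7 = 2

my $s = 'AT';
$s .= 'G';     # 'ATG'
$s x= 2;       # 'ATGATG'

my $num_eq = (10 == 10.0)     ? 1 : 0;   # 1: same number
my $str_eq = ('10' eq '10.0') ? 1 : 0;   # 0: different strings

# right side of || is only evaluated because left side is false
my $fallback = 0 || 'default';

print "$n $s $num_eq $str_eq $fallback\n";
```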

Common programming functions:

die args end the program printing args

for (initial condition; continue so long as this is true; do each iteration) {statements;}

e.g. for($i=1;$i<10;$i++){} **Warning!! Must use semicolons!

also foreach $var (list or array) {block} runs block once for each element in list/array, w/ element assigned to $var

also while(condition) {}, until(condition){} & do {block} while/until (condition)

next; skips to next iteration early

if (logical test) {statement;} elsif (test2) {statement;} else {statement;} (note spelling elsif, not elseif). Logical test fails if # = 0, string eq "" or "0", arrays & hashes are empty, or value is undef; anything else is true

unless same as if(not test) {}
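The loop and test forms above in one sketch. Sequences and names are invented; the point is the syntax (semicolons in for, elsif spelling, next, until, unless).

```perl
use strict;
use warnings;

# C-style for with next: collect odd numbers below 10
my @odd;
for (my $i = 1; $i < 10; $i++) {
    next if $i % 2 == 0;           # skip to next iteration early
    push @odd, $i;
}

# foreach with if / elsif / else; empty string counts as false
my @sizes;
foreach my $seq ('ACGT', '', 'AC') {
    if (length($seq) > 3) { push @sizes, 'long'  }
    elsif ($seq)          { push @sizes, 'short' }
    else                  { push @sizes, 'empty' }
}

# until loops while the condition is false
my ($countdown, $ticks) = (3, 0);
until ($countdown == 0) { $countdown--; $ticks++; }

# unless is if (not ...); array in a test gives its element count
my $enough = 0;
unless (@sizes < 3) { $enough = 1; }

print "@odd | @sizes | $ticks | $enough\n";
```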

localtime gives local time, useful for timestamps; gmtime gives Greenwich Mean Time

print OPTIONAL_FILEHANDLE "text ","next text ", "text$variable\n", @arr, "@arr"

note, in ""'s (rather than ' ') processes \n (newline) \t (tab) & $variable contents

@arr w/o "" prints elements with no spaces between them, with "" separated by spaces. printf allows formatting, e.g. printf("%.2f\n", $num) prints $num to 2 decimal places

package Name; Establishes package outside of which the same $var can be given different values. Leave pkg when new package declaration made, or at end of {} or module where declaration made. Values from each pkg can be returned by $Name::var or $Name2::var, etc.

sub subroutine_name {} can call with subroutine_name(args) feeding args to @_. Note old syntax (still allowed) prepends & (e.g. &subroutine_name(args)); the & is optional in this form.

Returns value of last expression evaluated before the closing }, or earlier if use return (args);

Note subroutines are accessible even if hidden in a block of code that never executes & a global variable referred to in a sub (not marked by my within the subroutine) never goes out of scope.
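A sketch of the subroutine and package notes above. gc_count and double are made-up helper names; Name and Name2 follow the package example in the notes.

```perl
use strict;
use warnings;

# Arguments arrive in @_; an explicit return can end the sub early.
sub gc_count {
    my ($seq) = @_;
    return 0 unless length $seq;
    my $count = ($seq =~ tr/GCgc//);   # tr returns how many chars it matched
    return $count;
}

# No return statement: value of the last expression is returned.
sub double { my ($x) = @_; $x * 2 }

# Same variable name in two packages holds two separate values.
{ package Name;  our $var = 'from Name';  }
{ package Name2; our $var = 'from Name2'; }

my $gc      = gc_count('ATGCGC');   # 4
my $doubled = double(21);           # 42
my $old     = &double(5);           # old-style call with &: 10

print "$gc $doubled $old $Name::var $Name2::var\n";
```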

Math & other simple functions:

abs number returns absolute value, atan2 Y,X arctan of Y/X, cos EXPR (in radians), exp EXPR e to the EXPR, hex EXPR returns decimal val from hex string, int EXPR integer part, log EXPR natural log, rand EXPR pseudorandom val from 0 up to (but not including) EXPR, or 0-1 if no EXPR, sin EXPR, sqrt EXPR.

File handling:

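Files can be passed to program w/ perl program file1 file2 & read line by line with <> (perl can be omitted if the script is executable & starts with the #! line)

Or open(FILEHANDLE, "file_name"); then read with @arr=<FILEHANDLE> (<STDIN> reads standard input instead)

Can do open(FH, "<", "file_name") to indicate input, ">" output (overwrites) or ">>" output appended to existing file

Lines of file are returned by <FILEHANDLE>: all at once in list context, @arr = <FILEHANDLE>; or one line per call in scalar context

foreach $val (<FILEHANDLE>) {} steps thru file assigning each line to $val; while ($val = <FILEHANDLE>) {} does the same w/o reading the whole file into memory first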

When finished close(FILEHANDLE);

read (FH, scalar, length, offset) reads length characters from current file position into scalar; optional offset is the position within scalar at which to place the data

rename oldname, newname to rename file

seek FILEHANDLE, OFFSET, WHENCE positions file pointer: whence 0 sets it to offset bytes from start of file, 1 adds offset to current posit, 2 adds offset (usually negative) to end of file. seek(FH,0,0) resets pointer to start of file.

tell FH gives current position in bytes

Position in file can be reffed by range e.g. while (<FH>) { if (1 .. /search pattern/) {next;} }, where 1 is 1st line (test is true from line 1 through the 1st line matching the pattern)
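A sketch of the open / print / read / seek / tell / close cycle above. Assumptions: it writes to a throwaway temp file made with the core File::Temp module, and it uses lexical filehandles (my $fh) instead of the bareword FILEHANDLE form in the notes; both forms work the same way with these functions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway file name for the demo (removed when the program exits)
my ($tmp_fh, $file_name) = tempfile(UNLINK => 1);
close($tmp_fh);

# ">" overwrites, ">>" appends
open(my $out, '>', $file_name) or die "Cannot write $file_name: $!";
print $out ">seq1\n", "ACGT\n", "GGCC\n";
close($out);

open($out, '>>', $file_name) or die "Cannot append to $file_name: $!";
print $out "TTAA\n";
close($out);

# "<" reads
open(my $in, '<', $file_name) or die "Cannot read $file_name: $!";
my $header = <$in>;          # scalar context: one line
chomp $header;
my $pos = tell($in);         # byte position just after the header line
my @lines = <$in>;           # list context: all remaining lines
chomp @lines;

seek($in, 0, 0);             # back to start of file
my $first_again = <$in>;
chomp $first_again;
close($in);

print "$header | @lines | $first_again\n";
```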

Text handling & modification

Binding operators:

Search: $a =~ /pattern/ returns true if pattern found in $a, with matched text in $& special var (doesn't change $a), same as m/pattern/. pos $a gives position in string where last m//g search left off.

Substitute: $a =~ s/pattern/replacement/ replaces 1st match of pattern w/ replacement (add g to replace all)

Transliterate: $a =~ tr/123/567/ converts all 1's to 5's etc. in $a (can use for DNA complement)

Modifiers: / … /g (match all instances), //s (let . match newline), //i (ignore up/low case); d is for tr: tr/abc//d deletes the listed chars

Special chars & ranges:

. (any char except newline, unless //s) \s (whitespace), \S (nonwhitespace), \d (digit 0-9)

[1234] (any of this set), [^1234] (any not in set), (wd1|wd2|wd3) (any of these 3 words)

^ line start, $ line end.

Groups indicated by multiple ()’s will output to $1, $2 etc. w/ 1st “(“ encountered ->$1.

Prepending \ allows search for metacharacters \ | ( ) [ { ^ $ * + ? or . (e.g. /\\/ finds "\")

Quantifiers: * (0 or more of thing, e.g. x*), + (1 or more), ? (0 or 1) {3} 3, {3,6} 3 to 6, {3,} 3 or more. Generally return max, so ‘ABCCCCD’ =~ /A.*C/ gives ABCCCC. To get shortest string append ? to quantifier, so =~/A.*?C/ gives ABC.
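The binding operators and quantifiers above in a runnable sketch. The DNA string is invented; the greedy vs. shortest-match case reuses the 'ABCCCCD' example from the notes.

```perl
use strict;
use warnings;

my $dna = 'ATGCCCGGGTAA';

# Search: matched text lands in $&, 1st () group in $1; $dna is unchanged
my ($whole, $codon) = ('', '');
if ($dna =~ /ATG(...)/) {
    $whole = $&;                     # 'ATGCCC'
    $codon = $1;                     # 'CCC'
}

# Substitute: 1st match only, or all matches with g
my $first_only = $dna;
$first_only =~ s/C/c/;               # 'ATGcCCGGGTAA'
my $rna = $dna;
$rna =~ s/T/U/g;                     # 'AUGCCCGGGUAA'

# Transliterate: reverse complement, and counting with an empty replacement
my $revcomp = reverse $dna;          # scalar context reverses the string
$revcomp =~ tr/ACGT/TGCA/;           # 'TTACCCGGGCAT'
my $n_gc = ($dna =~ tr/GC//);        # 7, $dna not modified

# Greedy vs. shortest match
my ($greedy) = 'ABCCCCD' =~ /(A.*C)/;    # 'ABCCCC'
my ($lazy)   = 'ABCCCCD' =~ /(A.*?C)/;   # 'ABC'

# i modifier ignores case
my $ci = ('atg' =~ /ATG/i) ? 1 : 0;      # 1

print "$whole $codon $first_only $rna $revcomp $n_gc $greedy $lazy $ci\n";
```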

chomp $str or list/array removes terminal newlines from $str or array, chop removes last char

index string, substring returns position of 1st instance (-1 if not found), rindex gives last instance

lc EXPR returns lower case

length EXPR gives length in characters

reverse $string reverses

split /pattern/,$str returns array of $str split at every /pattern/. if pattern is omitted uses white spaces as pattern

substr($string, offset, length, replacement) offset = start posit-1 (e.g. substr($s,0,1) gives 1st char). Negative offset = distance from right end. Length omitted -> to $string end. Length negative, leave that # of chars off the end. So… substr("ABCDEF", 2, -2) gives 'CD'. If replacement specified, replace substr w/ it.

uc($str) returns upper case of $str
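The string functions above applied to short made-up strings, with expected values in the comments.

```perl
use strict;
use warnings;

my $str = "ABCDEF\n";
chomp $str;                               # drop the trailing newline

my $len   = length $str;                  # 6
my $first = substr($str, 0, 1);           # 'A'
my $mid   = substr($str, 2, -2);          # 'CD' (leave 2 chars off the end)
my $tail  = substr($str, -2);             # 'EF' (negative offset: from right)

my $copy = $str;
substr($copy, 0, 2, 'xy');                # 4th arg replaces: 'xyCDEF'

my $idx  = index($str, 'CD');             # 2
my $none = index($str, 'ZZ');             # -1 when not found
my $ridx = rindex('ABCABC', 'BC');        # 4 (last instance)

my @fields = split /,/, 'gene1,chr2,1500';
my $lower  = lc $str;                     # 'abcdef'
my $upper  = uc 'acgt';                   # 'ACGT'

print "$len $first $mid $tail $copy $idx $none $ridx @fields $lower $upper\n";
```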

Modules

Must end with 1; as last statement and file name must end with .pm (e.g. module1.pm)

Made accessible by use module1; can specify subdirectory like so: use dir::subdir::module1;

Subroutines in module called by program using module1::subrout(args)

Built-in modules

AUTOLOAD; use vars '$AUTOLOAD'; or our $AUTOLOAD in Perl 5.6 or greater, then any call to an undefined subroutine calls up AUTOLOAD, passing subroutine name (typically in "operation_attribute" form) in $AUTOLOAD + any args in @_

sub AUTOLOAD { my ($self, @args) = @_; my ($operation, $attribute) = ($AUTOLOAD =~ /(get|set)(_\w+)$/); if ($operation eq 'get' and exists $self->{$attribute}) {

no strict 'refs'; *{$AUTOLOAD} = sub { shift->{$attribute} }; use strict 'refs';

(this turns off strict refs briefly, uses * to put an anonymous subroutine into the symbol table under the name held in $AUTOLOAD; that sub uses shift (on default @_) to pull $obj & accesses $attribute key in $obj; strict refs is then turned back on)

return $self->{$attribute}; } } (performs proper subroutine function 1st time called, in later calls subroutine will have been defined by AUTOLOAD).
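A self-contained sketch of the AUTOLOAD pattern above, handling get_ calls only. Gene_demo is a hypothetical class name, not one from the book; an empty DESTROY is defined so that object cleanup does not fall through to AUTOLOAD.

```perl
use strict;
use warnings;

package Gene_demo;                  # hypothetical class for illustration
our $AUTOLOAD;

sub new {
    my ($class, %arg) = @_;
    return bless { _name => $arg{name} }, $class;
}

sub AUTOLOAD {
    my ($self) = @_;
    my ($operation, $attribute) = ($AUTOLOAD =~ /(get|set)(_\w+)$/);
    die "No such method: $AUTOLOAD"
        unless $operation and ref($self) and exists $self->{$attribute};
    if ($operation eq 'get') {
        no strict 'refs';           # lexically scoped: ends with this block
        *{$AUTOLOAD} = sub { $_[0]->{$attribute} };   # install real method
        return $self->{$attribute};                   # answer this 1st call
    }
    die "set is not implemented in this sketch";
}

sub DESTROY { }                     # keep cleanup away from AUTOLOAD

package main;

my $obj       = Gene_demo->new(name => 'BRCA1');
my $first     = $obj->get_name;                       # goes through AUTOLOAD
my $installed = Gene_demo->can('get_name') ? 1 : 0;   # now a real sub
my $second    = $obj->get_name;                       # direct call

print "$first $installed $second\n";
```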

Carp: use Carp gives carp(statement) prints more detailed error message & croak(statement) does this & dies.

DESTROY need not be brought in by a use statement; Perl calls it automatically when the last reference to an object goes out of scope (e.g. object held in a my variable in {}'s), just before the object's memory is freed. Can define sub DESTROY {thing to do;} in the class to have other things, such as decreasing running count of data objects, done at that point.
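A small sketch of the running-count idea above. Counted is a hypothetical class name; the count goes up in new and back down when DESTROY fires at the end of the block.

```perl
use strict;
use warnings;

package Counted;                    # hypothetical class for illustration
my $count = 0;                      # running count of live objects

sub new     { my ($class) = @_; $count++; return bless {}, $class }
sub DESTROY { $count-- }            # called when the last reference is gone
sub count   { return $count }

package main;

my $inside;
{
    my $obj = Counted->new;
    $inside = Counted->count;       # 1 while $obj is in scope
}
my $after = Counted->count;         # 0 once $obj has gone out of scope

print "$inside $after\n";
```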

DB_File: use DB_File; allows hash to be stored in a database file on disk so it persists after exit from program, see perldoc DB_File for details. Uses tie (%hash, 'DB_File', $file_name, flags, mode, $DB_HASH) where 'DB_File' & $DB_HASH must be verbatim, flags can be of type O_RDWR | O_CREAT & mode is the file permission, e.g. 0644 (0444 would be read-only). DB files are in binary Berkeley DB format, not plain text.

CPAN modules

For more info use perldoc CPAN, for info about installed module perldoc mod_dir::mod_name

Find modules by browsing www.cpan.org

Install using perl -MCPAN -e 'install module_dir::module_name'

Objects, Classes, Methods & Object Oriented Programming

Object is a datastructure blessed by OO module (& thus part of a Class) by calling new

Methods are defined in the class & are only “legit” way of accessing data in objects

Classes of objects are defined & managed in special module .pm files

Begin with package Class1; statement where Class1 is filename without .pm

After other use calls then…

{ my %_attribute_table = (_name => ['default_value', 'permissions e.g. read.write'], _dat => etc.);

sub _all_attributes {keys %_attribute_table;} } (this sub call w/in same {} sets closure on %_attribute_table, etc.; note prepended underscores indicate something accessible only in module & not intended to be accessed by calling program)

sub new {

my ($class, %arg) = @_; (1st arg when OO module sub called is always class name)

… (tests to confirm keys in %arg are defined in %_attribute_table etc.)

return bless( { _name => $arg{name} || die('no name'), _dat => $arg{dat} || '?' }, $class );

} (note, here use || to stop on a missing required value or to give a default)

sub get_name { $_[0] -> {_name} }; (a method: default is to return resulting value ->assoc’d with key)

sub get_dat { $_[0]-> {_dat} }; (note, $obj->get_dat call, receives $obj ref as 1st arg, then any others)

sub set_name {

my ($input_obj, $name) = @_;

if ($name) { $input_obj->{_name} = $name; }

}

(changes ‘keyname’ that might be specified in calling program to “_keyname”, which will prevent their direct access except through subs in class1. Bless associates object with Class1 so that $obj->subroutine() calls use Methods in Class1.)
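The class pieces above assembled into one runnable file, as a sketch. Assumptions: the class is declared inline rather than in its own Class1.pm, the attribute defaults and the '?' fallback are placeholders, and new does none of the argument checks mentioned above.

```perl
use strict;
use warnings;

package Class1;

{
    # Closure: only subs inside this block can see the table
    my %_attribute_table = (
        _name => [ 'unknown', 'read.write' ],
        _dat  => [ '',        'read'       ],
    );
    sub _all_attributes { return keys %_attribute_table }
}

sub new {
    my ($class, %arg) = @_;                 # 1st arg is the class name
    return bless( { _name => $arg{name} || die('no name'),
                    _dat  => $arg{dat}  || '?' }, $class );
}

sub get_name { $_[0]->{_name} }             # $_[0] is the object
sub get_dat  { $_[0]->{_dat} }

sub set_name {
    my ($input_obj, $name) = @_;
    if ($name) { $input_obj->{_name} = $name; }
}

package main;

my $obj = Class1->new(name => 'gene1');
my $before = $obj->get_name;                # 'gene1'
$obj->set_name('gene2');
my $after  = $obj->get_name;                # 'gene2'
my $dat    = $obj->get_dat;                 # '?' (default)
my $class  = ref $obj;                      # 'Class1'
my @attrs  = sort(Class1::_all_attributes());

print "$before $after $dat $class @attrs\n";
```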

=head1 Documentation 1

Descriptors

=head1 More documentation

Examples etc.

=cut (this is "POD"/"plain old documentation"; everything between =head1 & =cut is ignored by program, but is called up with perldoc module_name)

Program that makes Class 1 objects & manipulates them