Perl for Science Majors
Designed to support CMPS 1023
V .61
Image from National Human Genome Research Institute
By
Richard Simpson and Tina Johnson
Midwestern State University
2010
CONTENTS
- What is Perl and DOS
- Installing and using Notepad++
- Installing and running Perl
- Variables and Data Types in Perl
- Input\Output
- Mathematical operators
- Simple Programs
- The IF statement and Logical operators
- The WHILE statement
- File Input\Output
- Regular Expressions
- Searching Text files using RE’s
- More Arrays in Perl
- Hashes and their uses
1. What is Perl
Perl (not PERL) is a programming language developed by Larry Wall in 1987. Although it was originally intended as a scripting language for UNIX systems, it has evolved into a widely used programming language for text processing. This is not to say that it cannot handle numerical applications, but rather that it has features that make the processing of text files relatively easy. Computer scientists refer to a language that can be used to solve almost any problem as a general-purpose language, and Perl is one. Besides its applications in graphics programming, system administration, database systems, the Web and networks, Perl has been embraced by the field of Bioinformatics as one of its most popular programming tools for DNA and other text processing. This little book will restrict its focus to applications in Biology, Chemistry, Physics, Mathematics and Geology, the fields in which the majority of science students major.
We will be working in an environment that is probably unfamiliar to you this semester. This environment, which is our interface to the operating system, is called command line DOS. It looks very much like the command line interface found in the Unix and Linux operating systems, so what you learn here will help you if and when you work on those systems.
DOS (Disk Operating System) has been around since 1980 or so. It was developed to work with the new desktop microcomputers that hit the shelves about that time. The version that Bill Gates sold was called MS-DOS. Its popularity, together with that of the subsequent Windows operating systems, is what made Microsoft and its co-founder Bill Gates so wealthy. The fact that DOS is command line based means that the user interacts with the OS by entering commands on a line, one at a time. There is no mouse and no GUI (graphical user interface) with icons to click on. It’s all done through the keyboard. Although the original DOS was displayed on the entire screen, modern DOS is normally run within a window. To create this DOS window, click on the Start button of Windows, select Run from the menu, enter cmd in the box that pops up and click OK. A DOS window should open.
The line
c:\Documents and Settings\richard.simpson
is called the prompt. It indicates the directory that DOS is presently accessing, also known as the present working directory (pwd). In order to display the actual contents of this directory, just type in the command dir. The results of executing dir within DOS on a laptop are given below.
The directory contents and the initial directory will most likely be different on your computer. In this case let’s look at what is displayed. Note the line
06/20/2010 12:04 PM 11,149 gsview32.ini
in the display. This line gives the date and time at which the file gsview32.ini was last modified, as well as its size in bytes. The date is updated each time the file is modified. Other lines such as
02/02/2009 11:46 AM <DIR> mydir
refer to directories as indicated by the <DIR> field.
As you first learn to use DOS it might be wise to move your mouse out of easy reach. Remember that you are not supposed to use the mouse while interacting with the DOS interface. Of course, if you need to work in another window, you can use the mouse to click on some other part of the Windows GUI in the background.
So what can we do in this command line window? The basic process is to type a command and press Enter to see its effect, and to do this over and over again. Although there are quite a few commands you can use, as listed in the DOS command appendix, we will look at a few really useful ones here. As we discuss these commands, don’t forget that pwd is shorthand for the present working directory.
There is one command that allows the user to move around the directory tree. This command is called cd (change directory) and can be used in several ways, as shown by the following examples:
cd .. change to the parent dir (the .. is shorthand for the parent)
cd species change to the species dir of the pwd (note that there is no slash)
cd \bin change to the bin directory of the root
cd \Simpson\Files\Perl\ change to the indicated directory if it exists within the dir tree
Each time you execute one of the above you should probably follow it with a dir command to see the contents of the new pwd. The command cd species shown above will only work if a species directory is displayed when the dir command is executed, i.e. species is a dir in the pwd. As an aside, if you want to change drives (for example from C: to D:, where D: is your thumb drive), just type D: at the prompt without the cd.
The creation and deletion of directories is straightforward as well. Just type mkdir dir_name to create a new dir within the pwd; rmdir dir_name removes a directory, provided it is empty. You can create as many directories and subdirectories as you like with this method. If you want to create a subdirectory in, say, the directory FilesList, you must make FilesList the pwd before you execute mkdir.
Another useful command is type (as in type file_name). This command displays the contents of text files (those made from ASCII codes only) that you see within the pwd. It will not work properly on .exe files, .doc files, .pdf files and many others. If you don’t believe me, just type a .exe file and see what you get. If things go crazy while attempting something like this, just press Ctrl-C to kill the confusion and get back to the prompt.
The final commands that we will discuss here are move and copy. The move command moves a file from one directory to another, so the original is deleted. The copy command does the same but keeps BOTH copies. For the examples, assume that we have the tree displayed in homework 1.3 below and that the pwd is Exams. We will only discuss copy, since move works similarly. Note that DOS paths are written with backslashes and that a path containing a space must be enclosed in double quotes. The command that would copy A.txt to the Exams directory (i.e. the pwd) would be
copy "\Text files\Letters\A.txt" . (Note: the . is shorthand for the pwd)
In this case the entire path \Text files\Letters\A.txt, starting at the root, was used to select the file we want to copy. You could also have given the full path of the receiving directory, as in
copy "\Text files\Letters\A.txt" "\Text files\Exams"
Assuming that we are in Letters here is a command that will copy a file to its parent directory.
copy b.txt .. (Remember that .. always represents the parent directory of the pwd)
There are many variations of this command. What do you think this means?
copy ..\Letters\A.txt . (assuming that we are in Exams)
You start where you are and back up one directory, then go down into Letters to retrieve A.txt, copying it to the pwd. (GOT that?)
Homework
1.1 Open a DOS window and note the pwd. Remember that its path is displayed in the prompt. Now move to the root by running cd \ and note that the new prompt should be C:\>. Now move around the directory tree by executing dir and cd commands: cd name to go down into a directory and cd .. to back up. Move around the tree until you become comfortable with this process.
1.2 Insert a thumb drive (AKA geek stick)(AKA computer sticky thingy from a recent movie) into one of the USB ports. Use one that has no important information you want to keep. Move DOS to this drive by entering the drive letter at the prompt. For example, if D is the drive letter that the OS assigned to the thumb drive, then just type D: at the prompt and press Enter. After running dir to see what’s on the drive, delete all the files and directories on the drive, one at a time. Note that if you try to delete a directory that is not empty, DOS will inform you of this. If so, change to that dir and delete the files in it first, and then back out, via cd .. , to the parent again. Now you can delete it.
1.3 Starting with an empty thumb drive, build the following dir tree. The rectangles are directories and the circles are files. Go to drive C, find a small .exe file and copy it to the bin directory as shown in the tree. Also copy a .doc (or .docx) file from drive C to your Exams directory. Within the Letters directory run Notepad from the command line and create two files, each with a sentence or two of data. Now go to the root directory and run the tree command (i.e. type tree and press Enter). You should see this tree drawn on its side.
1.4 Starting with the above tree do the following
a) copy the a.txt file to the Perl programs dir
b) move the .exe file to the root dir and to the Other files directory.
c) copy the B.txt file to Text files.
2. Installing and using Notepad++
In order to write and run Perl programs, there are two programs that will need to be downloaded and installed on your home system. The computers in our class and in many of the labs here at MSU already have these tools loaded for your convenience.
The first tool we need to download is called Notepad++. This is a FREE text editor that we will use to write Perl programs. Many of you are already familiar with Notepad, the editor that comes with Windows and that was used in a previous homework. Notepad++ is a greatly improved alternative. At this point in time you may download the software from http://notepad-plus-plus.org. If the link goes away, just Google notepad++ and you will be given a variety of download sites that you can use. On the above site, just click on the Download tab and then click Download the current version. Select the Installer.exe version. This will grab the most recent release. Release 6.1.8 is the current one at the time of this writing. By the time you read this the release number may well have increased. No worries, just click on the most recent offering. Once the installer is downloaded you may execute it. This should install Notepad++. When you first execute this little editor you should get a screen that looks like the following.
Don’t be discouraged by all the options; we will use only a few of them. Notepad++ basically works just like any other editor that you have worked with. You type in some text and then save the file after giving it a unique name. Check out the installation by typing in the following Perl program and saving it under the name check.pl, where the pl extension tells us that this is a Perl program.
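Any short program will do for this check. As a minimal sketch (the wording of the messages and the variable name $course are our own choices for illustration, not a required listing), something like the following contains a few comments, a variable and two print instructions:

```perl
# check.pl - a first program used to try out the editor
# Everything after a # symbol on a line is a comment; Perl ignores it.

my $course = "CMPS 1023";    # a variable holding a piece of text

print "Hello from Perl!\n";               # \n ends the line of output
print "This program is for $course.\n";   # the value of $course is filled in here
```

Once Perl has been installed, running this program should display the two lines Hello from Perl! and This program is for CMPS 1023.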
In order to get the instructions displayed with colored syntax, you might need to click on the Language menu and select Perl. This editor has been designed to work with a lot of different languages. In this case you will note that comments (everything after a # symbol) are colored green and the print instructions are colored blue. Other things such as variable names will be colored differently. This is a great help to those trying to read a Perl program. In order to save the above code, click on File, then Save As, and select the correct directory. Normally in class we will use D:\Students or something similar. I don’t have a drive D: at the moment so I am using drive C:. Use drive D: in the lab during class. After saving the file it should look like the following. Note the name at the top. C:\Students\check.pl gives you both the directory the file is in and its name. Saving to a jump drive is also an option.
We will assume that you have had enough experience with editors such as Microsoft Word to use the above applications. If you have any problems please ask the Lab assistant or the Instructor. In order to run the above program we will need to install Perl which is what we will do in the next section.
3. Installing and running Perl
There are many versions of Perl on the web, but we will be using ActivePerl in the labs in class. The website is http://www.activestate.com/activeperl/downloads. Here you can download either the regular 32 bit version or the 64 bit one. If you have Windows 7 (64 bit) or Windows XP 64, then you can download the 64 bit version. If you don’t have a clue what you have, just download the 32 bit version (i.e. x86) and it should work in either case. If you want to know whether you have a 64 bit system, just open Computer and click System properties at the top of the window. The system type will indicate a 64 bit Operating System if you have one. The file you will download is a Microsoft Windows Installer package (note the .msi extension). It should be named something like ActivePerl-5.12.2.1202-MSWin32-x86-293621.msi if you are downloading the 32 bit (x86) version, although the version number may well have changed by the time you read this. Save the file and run it. It will lead you through the installation process.
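Once the installer has finished, one quick way to confirm that Perl is available is a very small program that reports its own version. The sketch below is only an illustration: the file name version.pl is a placeholder of our choosing, and it assumes the installer has made the perl command available at the DOS prompt. It prints the built-in variable $], which holds the version number of the Perl that is running the script.

```perl
# version.pl - placeholder file name; reports which Perl is running this script
my $version = $];    # $] is built into Perl and holds its version as a number

print "This is Perl version $version\n";
```

Save the file in your working directory, change to that directory in the DOS window and type perl version.pl at the prompt. If a version number is displayed, Perl is installed and ready to use.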