Regular Expression Tutorial

Version 1.1.0.0
Rev A

How to Contact Us

OSIsoft, Inc.
777 Davis St., Suite 250
San Leandro, CA 94577 USA
Telephone
(01) 510-297-5800 (main phone)
(01) 510-357-8136 (fax)
(01) 510-297-5828 (support phone)

Houston, TX
Johnson City, TN
Mayfield Heights, OH
Phoenix, AZ
Savannah, GA
Seattle, WA
Yardley, PA

Worldwide Offices
OSIsoft Australia
Perth, Australia
Auckland, New Zealand
OSI Software GmbH
Altenstadt, Germany
OSI Software Asia Pte Ltd.
Singapore
OSIsoft Canada ULC
Montreal, Canada
OSIsoft, Inc. Representative Office
Shanghai, People’s Republic of China
OSIsoft Japan KK
Tokyo, Japan
OSIsoft Mexico S. De R.L. De C.V.
Mexico City, Mexico
Sales Outlets and Distributors
  • Brazil
  • Middle East/North Africa
  • Republic of South Africa
  • Russia/Central Asia
  • South America/Caribbean
  • Southeast Asia
  • South Korea
  • Taiwan


OSIsoft, Inc. is the owner of the following trademarks and registered trademarks: PI System, PI ProcessBook, Sequencia, Sigmafine, gRecipe, sRecipe, and RLINK. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Any trademark that appears in this book that is not owned by OSIsoft, Inc. is the property of its owner and use herein in no way indicates an endorsement, recommendation, or warranty of such party’s products or any affiliation with such party of any kind.
RESTRICTED RIGHTS LEGEND
Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013
Unpublished – rights reserved under the copyright laws of the United States.
© 2002-2007 OSIsoft, Inc.
RegExpTutorial.doc

Table of Contents

Introduction

RegExp Tester

The Basics

Wildcards

Sets

Escaped Characters

Position Characters

Repeats

Substitutions

Whole Text Substitution

Reordering Text

Example Searches

Revision History


Introduction

This document is intended to help users of PI interfaces that make use of Regular Expressions.

RegExp is a relatively old tool for searching text and making substitutions. The main concept behind RegExp is matching a generic pattern that the user supplies against the specific text that is given. Windows users already rely on a very simple form of pattern matching, the wildcard character. If you open a command prompt and issue the command dir c:\winnt\system32\*.dll, you get a list of all the files whose full file name, including path, starts with c:\winnt\system32\, has any amount of text after that, and ends with .dll. The string c:\winnt\system32\*.dll can be considered the pattern, and all the files that are returned are matches.


RegExp Tester

You may find the following discussion easier to understand if you follow along using the RegExp Tester program. This utility lets you enter the text to search, a search pattern, and a substitution pattern. It performs the search and replace in the same way as any product that uses the RegExp implementation built into Internet Explorer.

The Search Text field is where you enter the text you want to search. The Search pattern field is where you enter a pattern in the text you want to find. The Substitution pattern field is where you can put a pattern that will be substituted for the search results. If you just want to perform a search without a replace, leave this field blank. Press the Execute button to perform the search and replace.


The Basics

This section will show you the basics of RegExp using the RegExp object built into Internet Explorer. The product you use may or may not be based on the implementation of Regular Expressions built into Internet Explorer.

First, here is a simple example:

Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal.

This is the first sentence of Lincoln’s Gettysburg Address. The main parameters to a RegExp search are the text being searched (the text above) and the pattern we are trying to find. Let’s first use a pattern of fathers. The RegExp engine would match the eighth word in the sentence, fathers. Pretty simple; we searched for fathers and found it.

Now, let’s try to match the word apple. The RegExp engine would return a blank, because the word apple does not appear in the text.
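The two searches above can be tried outside the RegExp Tester as well. The following is a minimal sketch that uses Python's re module as a stand-in for the RegExp object, so details may differ from the product you use. The helper name first_match is hypothetical and is not part of any PI interface.

```python
import re

# First sentence of the example text, as quoted in this tutorial.
TEXT = ("Four score and seven years ago, our fathers brought forth upon "
        "this continent a new nation: conceived in liberty, and dedicated "
        "to the proposition that all men are created equal.")

def first_match(pattern, text):
    """Return the first match as a string, or "" when nothing matches.

    Hypothetical helper that mimics a search returning a blank on failure.
    """
    found = re.search(pattern, text)
    return found.group(0) if found else ""

print(first_match("fathers", TEXT))        # fathers
print(repr(first_match("apple", TEXT)))    # '' (blank, no match)
```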

Wildcards

Now, to add a little complexity, let’s try to find some text using wildcards. The wildcard character is the period (.). The period represents any single character. So if we were to try to search for d.dicated, the search would return the word dedicated.

In this search, the period was allowed to be any character, so d.dicated matched dedicated, because the letter e certainly counts as “any character”. A search for ur....r. would return ur score from Four score: the u and r match the last half of the word Four, the four periods match the space, the s, the c, and the o, the r matches the r in score, and the final wildcard matches the e at the end of score. In this instance, the space between Four and score counts as a character, so it is matched by the wildcard. A search for . alone would return the F at the beginning of the sentence. Searches only return the first match. Since the wildcard character can match anything, the F is the first match found.
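The three wildcard searches described above can be reproduced with a short sketch. It assumes Python's re module behaves like the RegExp object for these simple patterns; re.search returns only the first match, as the text describes.

```python
import re

# First sentence of the example text, as quoted in this tutorial.
TEXT = ("Four score and seven years ago, our fathers brought forth upon "
        "this continent a new nation: conceived in liberty, and dedicated "
        "to the proposition that all men are created equal.")

# One period stands for exactly one character of any kind.
one_wildcard = re.search(r"d.dicated", TEXT).group(0)

# The four periods match the space, s, c, and o; the last one matches e.
several_wildcards = re.search(r"ur....r.", TEXT).group(0)

# A lone period matches the very first character of the text.
lone_wildcard = re.search(r".", TEXT).group(0)

print(one_wildcard)        # dedicated
print(several_wildcards)   # ur score
print(lone_wildcard)       # F
```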

There are several specialized wildcard characters you can use in specifying a pattern.

Wildcard / What it matches
\s / Whitespace (tab or space)
\w / Word characters (digits, letters, and “_”)
\d / Digits

Searching the example text above for the pattern ..r\sf. would return the sequence our fa.

The first two wildcards match any characters, o and u in this case. The r matches the r in our. The \s matches the space between our and fathers. The f matches the f at the beginning of fathers. The final period matches the a in fathers. If the word sheriff had somehow appeared in the sentence, the sub-word heriff would not have matched, because the i between the r and the f does not match the whitespace wildcard.

The word and digit wildcards are similar. A \w will only match a digit, letter, or the underscore character. A \d will only match a digit.

Capitalizing one of these wildcard characters instructs the search to match the opposite of its lowercase counterpart. So a search for the pattern ..r\Sf. in the Gettysburg Address sentence would not match our fa, because the \S instructs RegExp to match anything other than whitespace. It would match heriff in the word sheriff, because the i does match the non-whitespace wildcard \S.
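The following sketch checks the lowercase and uppercase whitespace wildcards against the example sentence and against a made-up sentence that contains the word sheriff. It uses Python's re module as an assumed stand-in for the RegExp object; the sheriff sentence is hypothetical and exists only for illustration.

```python
import re

# First sentence of the example text, as quoted in this tutorial.
TEXT = ("Four score and seven years ago, our fathers brought forth upon "
        "this continent a new nation: conceived in liberty, and dedicated "
        "to the proposition that all men are created equal.")

# Hypothetical sentence containing the word "sheriff".
SHERIFF = "The sheriff rode in"

# \s needs whitespace between the r and the f.
lower_in_text = re.search(r"..r\sf.", TEXT).group(0)
lower_in_sheriff = re.search(r"..r\sf.", SHERIFF)   # no whitespace there

# \S needs anything except whitespace between the r and the f.
upper_in_text = re.search(r"..r\Sf.", TEXT)         # nothing qualifies
upper_in_sheriff = re.search(r"..r\Sf.", SHERIFF).group(0)

# \w matches letters, digits, and underscore; \d matches digits only.
word_chars = re.findall(r"\w", "a_1 -")
digits = re.findall(r"\d", "wind-25")

print(lower_in_text)      # our fa
print(upper_in_sheriff)   # heriff
print(word_chars)         # ['a', '_', '1']
print(digits)             # ['2', '5']
```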

Sets

It is possible to specify sets of characters to match in a search. For example, to search for the pattern vowel-space-vowel in the Gettysburg Address sentence, you would need to use a set. There is no vowel wildcard character. So the set of the letters a, e, i, o, and u would constitute the set of vowels. A set is represented using brackets ([ and ]). The pattern vowel-space-vowel would look like: [aeiou]\s[aeiou]. This search’s results would be e a from the words score and.

The e matches with the set [aeiou], the space matches with the whitespace wildcard \s, and the a matches with the set [aeiou].

The characters in the set can be excluded by adding the caret (^) just inside the brackets. For example, to search the sentence for the pattern not-a-vowel-space-vowel, you could use the pattern: [^aeiou]\s[aeiou]. This would match s a from years ago. The letter s is in the set “anything but a, e, i, o, or u”, the space matches the \s, and the a matches the set “a, e, i, o, or u”. Note that [^aeiou] will match anything other than a, e, i, o, or u. This includes digits, whitespace, and punctuation.
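Both set searches can be verified with a brief sketch, again assuming Python's re module as a stand-in for the RegExp object.

```python
import re

# First sentence of the example text, as quoted in this tutorial.
TEXT = ("Four score and seven years ago, our fathers brought forth upon "
        "this continent a new nation: conceived in liberty, and dedicated "
        "to the proposition that all men are created equal.")

# vowel, whitespace, vowel: first found in "score and".
vowel_space_vowel = re.search(r"[aeiou]\s[aeiou]", TEXT).group(0)

# not-a-vowel, whitespace, vowel: first found in "years ago".
other_space_vowel = re.search(r"[^aeiou]\s[aeiou]", TEXT).group(0)

# A negated set also accepts digits, whitespace, and punctuation.
accepted = [c for c in "7 ,e" if re.fullmatch(r"[^aeiou]", c)]

print(vowel_space_vowel)   # e a
print(other_space_vowel)   # s a
print(accepted)            # ['7', ' ', ',']
```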

Another modification on sets is the range character. A hyphen indicates a range in the set context. For example, the range [a-h] will match any lowercase letter between a and h, inclusive. The range [a-mo-z] will match any lowercase letter except the lowercase n. When determining if one character is between two others, ASCII representations are used. For example, the space character is represented in ASCII by the number 32. Uppercase A is 65, and lowercase a is 97. So the range [A-a] would include B (ASCII 66) and ^ (ASCII 94), but not b (ASCII 98).

You can combine the not-in-set character and the range character. To search for anything other than the uppercase letters, you could specify [^A-Z]. This will match anything other than A through Z.
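The range rules can be checked one character at a time. This is a minimal sketch in Python, assumed to order characters the same way as the RegExp object for plain ASCII text; the helper name in_set is hypothetical.

```python
import re

def in_set(char_set, character):
    """Hypothetical helper: True when the single character is in the set."""
    return re.fullmatch(char_set, character) is not None

# The ASCII codes quoted in the text.
print(ord(" "), ord("A"), ord("^"), ord("a"))   # 32 65 94 97

print(in_set("[a-h]", "c"))      # True, c lies between a and h
print(in_set("[a-mo-z]", "n"))   # False, n is the one letter left out
print(in_set("[A-a]", "B"))      # True, 66 lies between 65 and 97
print(in_set("[A-a]", "^"))      # True, 94 lies between 65 and 97
print(in_set("[A-a]", "b"))      # False, 98 is above 97
print(in_set("[^A-Z]", "5"))     # True, a digit is not an uppercase letter
print(in_set("[^A-Z]", "Q"))     # False, Q is excluded by the negated range
```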

Escaped Characters

Some characters hold special meaning in RegExp pattern matching. For example, the brackets delimit set definitions, and the caret indicates a not-in-set declaration. For that reason, if you actually need to search for a bracket, a caret, or any of the other special characters, you’ll need to “escape” the character by putting a backslash (\) in front of it.

The following table shows the escaped form of each special character:

Literal Character / Escaped Character
. (period) / \.
* (asterisk) / \*
+ (plus sign) / \+
? (question mark) / \?
| (pipe character) / \|
\ (backslash) / \\
^ (caret) / \^
$ (dollar sign) / \$
( (left parenthesis) / \(
) (right parenthesis) / \)
[ (left bracket) / \[
] (right bracket) / \]
{ (left curly brace) / \{
} (right curly brace) / \}
New Line (LF) / \n
Carriage Return (CR) / \r
Tab (HT) / \t
Vertical Tab (VT) / \v
Form Feed (FF) / \f
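The effect of escaping can be seen by comparing an escaped and an unescaped period. The sketch below assumes Python's re module as a stand-in for the RegExp object; the sample strings are made up for illustration.

```python
import re

# Hypothetical sample text containing several special characters.
text = "Total cost (USD): $25.00"

# Unescaped, the period is a wildcard, so 25.00 also matches 25x00.
loose = re.search(r"25.00", "25x00") is not None
# Escaped, the period matches only a literal period.
strict = re.search(r"25\.00", "25x00") is not None

# Escaped dollar sign, period, and parentheses match themselves.
price = re.search(r"\$\d\d\.\d\d", text).group(0)
label = re.search(r"\(USD\)", text).group(0)

# \t in a pattern stands for the tab character itself.
tab_found = re.search(r"\t", "name\tvalue") is not None

print(loose, strict)   # True False
print(price)           # $25.00
print(label)           # (USD)
print(tab_found)       # True
```

In Python specifically, re.escape can add the backslashes to a literal string for you; whether your product offers a similar helper is not covered here.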

Position Characters

There are three patterns that do not match a character, but a position. These three are start of line (^), end of line ($), and word boundary (\b).

Start of Line (^)

The start-of-line pattern, the caret, restricts a match to the beginning of a line. For example, to match the word February only if it appears at the beginning of a line, the search pattern would be ^February. Suppose you want to find a date (assume the month is always February) in a format with the full month first, then the day, then a comma, then the year, and the date you want is at the beginning of a line. Your pattern could look like: ^February \d\d, \d\d\d\d. The ^ matches the beginning of the line, February matches itself, the next two \d match the day of the month (assuming a 2-digit day), and the final four \d match the year.

Do not confuse the caret in this context with the caret in the set context, which is the not-in-set character. That caret will always be inside brackets.

End of Line ($)

The end-of-line pattern, the dollar sign, works in exactly the same way, except on the end of the line instead of the beginning of the line.
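The start-of-line and end-of-line patterns can be compared on a small multi-line input. This sketch uses Python's re module, where ^ and $ apply to every line only when the re.MULTILINE flag is passed; how your product treats multi-line text is an assumption you should verify. The three lines of input and their dates are invented for illustration.

```python
import re

# Hypothetical three-line input.
text = ("Draft dated February 14, 2007 (unsigned)\n"
        "February 28, 2007 is the deadline\n"
        "Signed on February 29, 2008")

date = r"February \d\d, \d\d\d\d"

# No position character: the first date anywhere in the text.
anywhere = re.search(date, text).group(0)

# ^ in front: only a date at the beginning of a line qualifies.
at_line_start = re.search("^" + date, text, re.MULTILINE).group(0)

# $ behind: only a date at the end of a line qualifies.
at_line_end = re.search(date + "$", text, re.MULTILINE).group(0)

print(anywhere)        # February 14, 2007
print(at_line_start)   # February 28, 2007
print(at_line_end)     # February 29, 2008
```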

Boundary (\b)

The word boundary pattern (\b) works in a similar way. A word is defined as a series of letters, numbers, and underscores. \b matches anywhere there is a word break (the beginning or end of a word). Searching the text this is a sentence for \bsent would return sent. Note that the space is not included. \b does not match the space; it matches the position between the space and the next word.
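A short sketch shows that the word boundary consumes no characters. It assumes Python's re module as a stand-in for the RegExp object.

```python
import re

text = "this is a sentence"

boundary = re.search(r"\bsent", text)

# The match holds only "sent"; the preceding space is not part of it.
matched = boundary.group(0)
# The match starts at the s of "sentence", not at the space before it.
starts_at_word = text[boundary.start()] == "s"

# "ent" occurs only in the middle of a word, so \bent finds nothing.
mid_word = re.search(r"\bent", text)

print(matched)          # sent
print(starts_at_word)   # True
print(mid_word)         # None
```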

Repeats

You can modify your search to look for repeating characters. In the simplest repeating pattern, use curly braces ({ and }) after a pattern to search for that pattern repeated N times. For example, there are two ways to search Look at that! for two o’s: you could use the pattern oo, or you could use the pattern o{2}. The pattern o{2} will match two consecutive o’s. A slightly more complicated example is to search for any word that has 4 letters, where the middle two are o’s. For example, search You are a fool! for such a pattern. The search pattern would be \b\wo{2}\w\b. The first \b indicates a word boundary. Then, the \w indicates any word character (letter, number, or underscore). The o{2} indicates two o’s. The second \w indicates another word character. The final \b indicates another word boundary. The word fool would match this search pattern.

A variation on matching exactly N times is matching N to M times. To search for a pattern N to M times, use the following notation: {N,M}. In the previous example, change the pattern to \b\wo{2,4}\w\b. The word fool will still match, but foool and fooool would also match, because the o is repeated between 2 and 4 times in each of those words. fol and foooool would not match, because the o only appears once and 5 times, respectively, in each word.

Another variation on matching a pattern exactly N times is matching a pattern at least N times. This is denoted by putting a comma after N: {N,}. \b\wo{2,}\w\b would match fool, foool, fooool, foooool, etc. Any word that has some word character, at least two o’s, and one more character will match this pattern.
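The three brace forms can be compared against the same list of words. This is a minimal sketch using Python's re module as an assumed stand-in for the RegExp object; the helper name matching is hypothetical.

```python
import re

words = ["fol", "fool", "foool", "fooool", "foooool"]

def matching(pattern):
    """Hypothetical helper: list the words in which the pattern is found."""
    return [word for word in words if re.search(pattern, word)]

# Exactly N times.
print(re.search(r"o{2}", "Look at that!").group(0))                # oo
print(re.search(r"\b\wo{2}\w\b", "You are a fool!").group(0))      # fool
print(matching(r"\b\wo{2}\w\b"))     # ['fool']

# N to M times.
print(matching(r"\b\wo{2,4}\w\b"))   # ['fool', 'foool', 'fooool']

# At least N times.
print(matching(r"\b\wo{2,}\w\b"))    # ['fool', 'foool', 'fooool', 'foooool']
```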

The * character will match the preceding pattern 0 or more times. This is very useful in matching an unknown number of characters. The pattern fo*l will match any part of the searched text that has an f, any number of o’s (including zero), and an l. fl, fol, fool, etc. would all match the pattern. Combining the asterisk with the period (any character) is an extremely useful way of finding unknown data. For example, if there is a sentence that starts with Today’s temperature and ends in a number and then a period, and you want to extract this whole sentence out of a paragraph or a page, the pattern Today’s temperature.*\d\. would be the pattern to search for. The beginning of the pattern is the phrase Today’s temperature. The period denotes any character and the following asterisk indicates that we’re looking for any character any number of times. The \d and \. indicate that we want the pattern to end at a digit and a period. This pattern would match the following sentences:

  • Today’s temperature in the San Leandro area will be 55.
  • Today’s temperature in Oakland and the East Bay was supposed to be 67.
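The asterisk examples above can be sketched as follows, assuming Python's re module as a stand-in for the RegExp object. The surrounding paragraph is invented for illustration, and the sketch uses a straight apostrophe in both the text and the pattern.

```python
import re

# Hypothetical paragraph that contains the sentence of interest.
paragraph = ("It rained all week. "
             "Today's temperature in the San Leandro area will be 55. "
             "Bring a jacket.")

# .* matches any characters; \d\. ends the match at a digit and a period.
sentence = re.search(r"Today's temperature.*\d\.", paragraph).group(0)

# o* allows zero or more o's between the f and the l.
variants = re.findall(r"fo*l", "fl fol fool")

print(sentence)   # Today's temperature in the San Leandro area will be 55.
print(variants)   # ['fl', 'fol', 'fool']
```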

There is another important property of the asterisk pattern. By default, the search results will return the longest match that fits the pattern. So if you search the text This is the end of the line! for the pattern This.*n, your search results would be This is the end of the lin. To match This is the en instead, add a question mark after the asterisk: the pattern This.*?n would return This is the en.

The .* combination is very powerful and will be discussed further in the Substitutions section. * is equivalent to {0,}.

The plus ( + ) character is identical to the asterisk ( * ) character, except that the plus indicates that we want the search to match the preceding pattern at least once. It is equivalent to {1,}.
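The difference between the longest and the shortest match can be checked with a short sketch, assuming Python's re module as a stand-in for the RegExp object.

```python
import re

text = "This is the end of the line!"

# .* takes as much as it can, so the match runs to the last n.
longest = re.search(r"This.*n", text).group(0)

# .*? takes as little as it can, so the match stops at the first n.
shortest = re.search(r"This.*?n", text).group(0)

print(longest)    # This is the end of the lin
print(shortest)   # This is the en
```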

The question mark ( ? ) character indicates that we want the search to match the preceding pattern once or none at all. It is equivalent to {0,1}.
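The three shorthand repeat characters and their brace equivalents can be compared side by side. This sketch assumes Python's re module as a stand-in for the RegExp object.

```python
import re

text = "fl fol fool"

# Each shorthand is paired with its brace equivalent.
star = re.findall(r"fo*l", text)        # zero or more o's
star_braces = re.findall(r"fo{0,}l", text)

plus = re.findall(r"fo+l", text)        # at least one o
plus_braces = re.findall(r"fo{1,}l", text)

optional = re.findall(r"fo?l", text)    # zero or one o
optional_braces = re.findall(r"fo{0,1}l", text)

print(star)       # ['fl', 'fol', 'fool']
print(plus)       # ['fol', 'fool']
print(optional)   # ['fl', 'fol']
```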

Substitutions

Searching for a pattern is useful, but the pattern needed to find the right text often matches more than you want to keep. For example, suppose the following line is the search target:

Forecast: temperature-76 degrees, wind-25 mph, humidity-60%.

If the goal is to extract the wind speed from this line, the search might look only for digits, as with the pattern \d+. This pattern searches for one or more digits in a row. However, this search would return the temperature, not the wind speed, because the temperature is matched first. The pattern might instead include the word wind, then a hyphen, and then the actual wind speed digits: wind-\d+. However, there is a problem with this as well. This search would return wind-25, which cannot be converted to an integer because it includes the word wind. This is where substitution is required.
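The two unsuccessful attempts can be reproduced with a short sketch. It assumes Python's re module as a stand-in for the RegExp object, and Python's int() as a stand-in for whatever integer conversion your product performs.

```python
import re

line = "Forecast: temperature-76 degrees, wind-25 mph, humidity-60%."

# Digits alone: the temperature comes first, so it is what gets returned.
first_digits = re.search(r"\d+", line).group(0)

# Anchoring on the word wind finds the right place but keeps the word.
wind_match = re.search(r"wind-\d+", line).group(0)

# The match still contains letters, so it cannot become an integer.
try:
    int(wind_match)
    converted = True
except ValueError:
    converted = False

print(first_digits)   # 76
print(wind_match)     # wind-25
print(converted)      # False
```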

Whole Text Substitution

Substitution replaces the entire matched pattern with whatever you specify. The match can be replaced with hard-coded text, although this is rarely useful. For example, when searching the weather forecast above for the pattern wind-\d+, the match could be replaced with the number 60. The substitution would take your match, wind-25, replace it with 60, and keep 60 as the final substitution result. However, a much more useful application of substitution is using parts of the search result to replace the entire search result.

In this example, we want to replace wind-25 with just the number 25, which is part of the search result. To do this, first modify the search pattern by adding parentheses around the part of the search pattern to use as the replacement. In this case, put parentheses around the \d+ part of the search pattern so it becomes wind-(\d+). This does not affect the search. Technically, the \d+ is marked as a “group”. Now, in the substitution pattern, the search result wind-25 is to be replaced with the contents of the group, 25. To indicate the contents of a group from the search pattern in the substitution pattern, use a dollar sign ($) followed by the number of the group. In this case, we only have one group, so that number is 1. The substitution pattern would be $1.
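Both kinds of substitution can be sketched in Python, with one notation difference to keep in mind: Python's re module writes a group reference in the replacement as \1, where the RegExp object described here uses $1. The sketch applies the substitution to the matched text only, which is an assumption about how the search-and-replace result is kept.

```python
import re

line = "Forecast: temperature-76 degrees, wind-25 mph, humidity-60%."

# The parentheses mark \d+ as group 1 without changing what is matched.
match = re.search(r"wind-(\d+)", line)
whole_match = match.group(0)

# Hard-coded substitution: the whole match becomes the fixed text 60.
hard_coded = re.sub(r"wind-(\d+)", "60", whole_match)

# Group substitution: \1 in Python plays the role of $1 in the tutorial.
from_group = match.expand(r"\1")

# The result now contains digits only, so it converts cleanly.
wind_speed = int(from_group)

print(whole_match)   # wind-25
print(hard_coded)    # 60
print(from_group)    # 25
print(wind_speed)    # 25
```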