files and regular expressions

posted by tom / February 13, 2006

Back to Python! Our previous installments involved installing Python and variables, control structures and functions. I know that the control structure entry might have been intimidating to the non-programmers who are reading this. All I can suggest is that you ask questions — and not worry about it too much. Programming languages are meant to be readable by humans; with enough exposure to them, they do eventually begin to make an intuitive sort of sense. In the short term, I'd suggest that you just try to figure out what this (intentionally inefficient) code does:

i = 1
while(i<10):
   if((i%2)==0):
      print i
   i = i + 1

(remember: % is the modulus sign — x % y gives you the remainder from dividing x by y)

If you feel comfortable reading that code, I'd say we can proceed. If not, ask me some questions via comments or email.

But first, a note about where we're not going. This tutorial is intentionally limited — I just want to show folks how to manipulate the internet with a scripting language, not how to become l33t Python programmers. Among other things that I don't plan to adequately cover:

As you may have guessed from those links, I'm a fan of Mark Pilgrim's Dive Into Python. If you want more detail than I'm providing, I suggest checking it out.

Now then: file handling. Scripting languages are great for automating tasks that your computer-using self is too slow or easily-bored to do. But the fact that you won't be paying attention to the program's run means that you need some way to preserve its actions — you need to give your program a memory, in other words. The simplest way to do this is to read from and write to text files.

This is pretty easy to do. First, let's read from a file. Go ahead and save this page's source code: just go to File|Save As. Change the format to exclude images and other crap (you'll want the "HTML Only" option or something similar). It'll also help to save the file somewhere easy to refer to. For Windows users, I'd suggest saving it as C:\tutorial.htm. For Mac users, save it in your home directory as tutorial.htm.

Reading a file into a string is pretty easy. Check it out:

f = open('C:\\tutorial.htm')
fileContents = f.read()
f.close()

We open the file, read its contents into a string, then close the file (this is important to do — otherwise there may be problems when other programs try to access it). Simple, right?

One thing to note is the file path. First, note the double backslash. Remember when we talked about escape characters? To refresh your memory: there are some special characters that can be embedded in strings (line breaks, quote marks, tabs, etc) that are designated by a preceding backslash. Because of this, a lone backslash doesn't work normally; to get a literal backslash into a string you have to use two, as in the example above. It can get a little confusing. An alternative is to turn escaping off and use a raw string. We do this by placing an 'r' in front of the string:

f = open(r'C:\tutorial.htm')

If you're using a Mac, your filesystem doesn't have a C drive. If you put the file in your home directory, you would refer to it like so:

f = open('/Users/YourUsername/tutorial.htm')

Where YourUsername is replaced by, well, your user name (it's the name of your home directory in Finder). In order to avoid writing two copies of every example, I'll just refer to the path as '/path/to/file' from now on. If you cut and paste any of these examples, remember to replace that path with one appropriate to your system.
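One more aside on reading: if you'd rather not have to remember the close() call, newer versions of Python offer the with statement, which closes the file for you when the indented block ends. Here's a minimal sketch. The scratch file, its name and its contents are all made up for this example, and print is written with parentheses so that it also works on Python 3.

```python
import os
import tempfile

# Hypothetical scratch file, created here so the example is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'tutorial.htm')
with open(path, 'w') as f:
    f.write('<html><body>hello</body></html>')

# The file is closed automatically when the with block ends,
# even if something inside the block goes wrong.
with open(path) as f:
    fileContents = f.read()

print(fileContents)
print(f.closed)

# Tidy up the scratch file.
os.remove(path)
os.rmdir(folder)
```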

So what about writing to a file? It's also pretty simple.

f = open('/path/to/file.txt','w')
f.write('Here is some text\n') # note the line break
f.write('Here is some more text')
f.close()

Note the 'w' passed as the second argument to open(). That puts us in write mode. If a file called /path/to/file.txt already exists, it will be overwritten. If it doesn't, it will be created. If you'd prefer to append your content to the end of an existing file, you should replace that 'w' with 'a'. As in write mode, if the file doesn't exist it will be created.
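If the difference between the two modes seems abstract, here's a small sketch that shows it. It assumes a scratch file in a temporary directory (the file name and the text written to it are invented for the example), and print is written with parentheses so it also works on Python 3.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temp directory.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'file.txt')

f = open(path, 'w')   # 'w' creates the file (or wipes an existing one)
f.write('first run\n')
f.close()

f = open(path, 'w')   # opening with 'w' again throws away 'first run'
f.write('second run\n')
f.close()

f = open(path, 'a')   # 'a' keeps what is there and adds to the end
f.write('third run\n')
f.close()

f = open(path)
fileContents = f.read()
f.close()
print(fileContents)

# Tidy up the scratch file.
os.remove(path)
os.rmdir(folder)
```

Only the second and third lines survive: the second open() in write mode discarded the first one.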

Here's an example that puts both together — and demonstrates how you can read and write files using variables:

sourceFile = open('/path/to/sourcefile.txt')
destinationFile = open('/path/to/destinationfile.txt','w')
fileContents = sourceFile.read()
destinationFile.write(fileContents)
sourceFile.close()
destinationFile.close()

This will copy the contents of sourcefile.txt to destinationfile.txt. Pretty straightforward, right?

But dumping files into gigantic strings won't get you very far — particularly when you're trying to manipulate the internet. If you aren't familiar with HTML, you just need to go to View|Source in your browser to see what a godawful mess most webpages are.

In an ideal world every site would be written in fully-formed, standards-compliant XHTML and we could process it using XPATH and XSLT. But this isn't such a world. Instead, we need a more general tool for finding stuff in HTML — and text files in general.

We could just use a method of the string object called find. It takes in another string as a parameter, and returns the index where it occurs in the calling string (or -1 if it's not found). For example:

exampleString = 'abc'
print exampleString.find('a') # this will print 0
print exampleString.find('bc') # this will print 1
print exampleString.find('c') # this will print 2
print exampleString.find('d') # this will print -1

But this doesn't always work very well. Let's use an example.

Say you're an entrepreneur with a revolutionary new penis enlargement product. The only problem: making sure people know about your wares. I bet a lot of people would appreciate it if you took the initiative and found their email addresses in order to get the word out. Let's also say that you've used some kind of web spidering application to download a bunch of HTML pages. You want to automatically process them, find all the email addresses, and spit them out for future spamming correspondence.

We could try to do this with the find() method. We'd probably search for all the '@' characters. You'd get out an index of the first @. Then you'd search for every space character, and find the indexes that occur immediately before and after that @, then pull that part out of the string. Sounds like a pain, but it's doable — except, wait! What if the file looked like this:

Boy, my friend Pete (pete@somedomain.com) sure is bummed about the size of his genitals.

Matching against spaces in this case will lead us to include the parentheses. And they might not be all — who knows what kind of crap will surround the email address? We could revise our find()-based method, but it's already pretty unwieldy. We need a better tool.
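For the curious, here's roughly what that find()-based attempt might look like. It's only a sketch of the approach described above, using a trimmed version of the sample sentence, and it handles just the first @ it sees. The variable names are my own invention.

```python
text = 'Boy, my friend Pete (pete@somedomain.com) sure is bummed.'

at = text.find('@')                  # index of the first @
start = text.rfind(' ', 0, at) + 1   # just past the last space before the @
end = text.find(' ', at)             # first space after the @
if end == -1:                        # no space after it: run to the end
    end = len(text)

candidate = text[start:end]
print(candidate)   # the parentheses come along for the ride
```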

Regular expressions are that tool. They're pretty unbelievably useful, yet also something that a surprising number of programmers don't know how to use. RegExes are one of the many, many practical tools that you won't learn about in school, thanks to CS departments' stubborn refusal to admit that programming is more craft than science.

RegExes allow you to match patterns in text. In their simplest form, they work like the find() method. But they also allow for special "metacharacters", many of them designated by a preceding backslash, each of which has a special meaning:

\w
A single word character: a letter, a digit or an underscore.
Example: \w matches 1 or _ but not ?

\W
A single non-word character.
Example: a\Wb matches a!b but not a2b

\b
A word boundary, the spot between word (\w) and non-word (\W) characters.
Example: \bfred\b matches fred but not alfred or frederick

\B
Any spot that is not a word boundary.
Example: fred\B matches frederick but not fred

\d
A single digit character.
Example: a\db matches a2b but not acb

\D
A single non-digit character.
Example: a\Db matches acb but not a2b

\s
A single whitespace character (such as a space, tab or line break).
Example: a\sb matches a b but not ab

\S
A single non-whitespace character.
Example: a\Sb matches a2b but not a b

\t
The tab character (ASCII 9).
Example: \t matches a tab.

^
The pattern has to appear at the beginning of a string.
Example: ^cat matches any string that begins with cat

$
The pattern has to appear at the end of a string.
Example: cat$ matches any string that ends with cat

.
Matches any single character (except, by default, a line break).
Example: sp.t matches spot and spit but not spt or spoot

But that's not all! There are also modifiers that can be placed after any character to determine how many times it needs to appear:

?
Preceding item must match one or zero times.
Example: colou?r matches color or colour but not colouur

+
Preceding item must match one or more times.
Example: be+ matches be or bee but not b

*
Preceding item must match zero or more times.
Example: be* matches b or be or beeeeeeeeee

{n}
Bound. Specifies exact number of times for the preceding item to match.
Example: [0-9]{3} matches any three digits

{n,}
Bound. Specifies minimum number of times for the preceding item to match.
Example: [0-9]{3,} matches any three or more digits

{n,m}
Bound. Specifies minimum and maximum number of times for the preceding item to match.
Example: [0-9]{3,5} matches any three, four, or five digits

Finally, there are groupings:

()
Parentheses. Creates a substring or item to which metacharacters can be applied.
Example: a(bee)?t matches at or abeet but not abet

|
Alternation. One of the alternatives has to match.
Example: July (first|1st|1) will match July 1st but not July 2

[]
Bracket expression. Matches one of any characters enclosed.
Example: gr[ae]y matches gray or grey

[^]
Negates a bracket expression. Matches one of any characters EXCEPT those enclosed.
Example: 1[^02] matches 13 but not 10 or 12

[-]
Range. Matches any characters within the range.
Example: [1-9] matches any single digit EXCEPT 0
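You don't have to take these tables on faith. Here's a small sketch that checks a handful of the examples against Python's re module. The matches() helper and the list of checks are my own scaffolding rather than anything built in, and print is written with parentheses so it also works on Python 3.

```python
import re

def matches(pattern, text):
    # Hypothetical helper: True if the pattern is found anywhere in text.
    return re.search(pattern, text) is not None

# (pattern, text to search, whether we expect a match)
checks = [
    (r'\bfred\b', 'fred', True),
    (r'\bfred\b', 'alfred', False),
    (r'fred\B', 'frederick', True),
    (r'fred\B', 'fred', False),
    (r'a\db', 'a2b', True),
    (r'a\db', 'acb', False),
    (r'colou?r', 'colour', True),
    (r'colou?r', 'colouur', False),
    (r'a(bee)?t', 'abeet', True),
    (r'a(bee)?t', 'abet', False),
    (r'gr[ae]y', 'grey', True),
    (r'1[^02]', '12', False),
]

results = [matches(p, t) == expected for p, t, expected in checks]
print(all(results))
```

Note the r in front of each pattern: raw strings keep Python from interpreting the backslashes before the RegEx engine gets to see them.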

I won't lie to you: composing a good RegEx can be tricky. It usually involves a lot of trial and error, not only because the syntax can be confusing, but also because of differences between systems. A line break is \r\n on Windows, \n on Unix (including Mac OS X) and \r on older Macs. Different programming languages and programs also sometimes implement RegExes in subtly different ways. The good news is that folks have been at this longer than you or I have. There are plenty of examples already written: just google for what you need and there's a good chance you'll find it, or something like it. I'd also suggest taking a visit to this site, which lets you play around with your RegEx and text to match, with the changes showing up in real time. Remember to begin and end your RegEx with forward slashes when you use it (a convention used by lots of languages, including the Javascript that runs that webpage, but not by Python).
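One way to sidestep the line-break differences is to let a RegEx accept any of the three conventions. This is just a sketch with a made-up sample string; print is written with parentheses so it also works on Python 3.

```python
import re

# Four lines separated by Windows, Unix and old-Mac line breaks.
text = 'one\r\ntwo\nthree\rfour'

# Try \r\n first so a Windows line break is not split into two.
lines = re.split(r'\r\n|\r|\n', text)
print(lines)
```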

So, finally: how do we actually use RegExes in Python? There are various ways of doing this, most of which are shortcuts. I'll try to avoid going over more than we need to.

First, we import the module that provides RegEx functionality:

import re

Then we can do useful things. For instance...

Testing for a Match

import re

r = re.compile(r'sp.t') # build the RE
stringToMatchAgainst = 'see spot run' # The string we'll be searching
m = r.search(stringToMatchAgainst) # this returns a match object

# here's what we can do with that match
if (m):
   print m.group() # prints the matching string — 'spot'
   print m.start() # prints the start position of the match (4)
   print m.end() # prints the end position of the match (8)
else:
   print 'no match found'

Iterating Through Many Matches

You might also want to find more than one match. Here's a solution to the email-finding problem that I talked about before. It's a deeply imperfect solution, for simplicity's sake — it only tries to match .com, .org and .net email addresses, for one thing. For another, it'll fail if the user has so much as a period in the first part of their email address. I've intentionally kept things simple — for an idea of why, see here.

import re

f = open('/path/to/somefile.txt')
fileContents = f.read()

r = re.compile(r'\w+@\w+\.(com|org|net)',re.MULTILINE)
i = r.finditer(fileContents)
for m in i:
   print m.group()
   # all the other properties of the match
   # object are available here, too

Note the MULTILINE flag that is passed to the compile() function. It isn't actually doing any work here: finditer() searches the whole string, line breaks and all, whether or not the flag is set. What MULTILINE changes is the meaning of ^ and $. Normally they match only at the very beginning and end of the string; with MULTILINE they also match at the start and end of each line. Our pattern uses neither, so the flag is harmless but unnecessary. (A related flag, DOTALL, lets . match line breaks, which it otherwise refuses to do.)
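A quick way to see what MULTILINE does and doesn't do is to try it on a short multi-line string. The sample text below is invented for the purpose, and print is written with parentheses so it also works on Python 3.

```python
import re

text = 'first line: a@one.com\nsecond line: b@two.org\nthird: c@three.net'

# No flags at all: matches on every line are still found.
emails = [m.group() for m in re.finditer(r'\w+@\w+\.(com|org|net)', text)]
print(emails)

# ^ without MULTILINE only matches at the very start of the string...
starts_plain = re.findall(r'^\w+', text)
# ...with MULTILINE it also matches just after each line break.
starts_multi = re.findall(r'^\w+', text, re.MULTILINE)
print(starts_plain)
print(starts_multi)
```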

Grouping

Now for the advanced RegEx stuff: grouping. Sometimes it's useful to be able to pull out particular parts of the match. For instance, let's say we're trying to find all of the links on a webpage in order to extract their URLs. We'll want to construct some kind of regular expression to match HTML that looks like any one of these:

<A HREF="http://domain.com/document.html">link</A>
<A CLASS="something" HREF="http://domain.com/document.html">link</A>
<a href="http://domain.com/document.html" style="somethingelse">link</A>

We can come up with a regular expression to handle this (or steal someone else's). We don't really care about anything from "link" onward, so we'll just try to match the first part. This ought to do it:

<a\s[^>]*href=["'][^"']+["'][^>]*>

Let's walk through it. The first part matches just what it says: <a, the start of any hyperlink, followed by \s, a single whitespace character. We know that whitespace has to be there, since something must separate the a from its first attribute (it also keeps us from matching other tags that merely start with the letter a).

The next part, [^>]*, accounts for all the miscellaneous attributes that can be applied within a hyperlink. The user might have placed a CLASS attribute in there, or a javascript call, or an inline style... we just don't know. So this expression says match zero or more characters that aren't '>', which keeps us inside the tag. The RegEx engine will grab as many of those characters as it can, then back up until the rest of the pattern (starting with href=) fits.

It's tempting to write [^(href=)]+ here instead and read it as "one or more characters that are not 'href='". That isn't how brackets work. Parentheses have no special meaning inside a bracket expression, so it really means "match any character that is not ( OR h OR r OR e OR f OR = OR )". A pattern like that gives up at the = in CLASS="something", and would miss the second example above.

Finally, we've gotten to the href. The href=["'] says match the characters 'href=', and then either a double or single quote character. The [^"']+ is what corresponds to the URL: it says match any one or more characters that aren't single or double quotes. Then we have a ["'] to represent the closing single or double quote. Finally, we close up the anchor tag. It might end right there, for all we know, or it might have some more attributes before the closing >. We take care of this with the clause [^>]*>, which means match any zero or more characters that aren't '>', then match a single '>'.

When we actually put this RegEx into Python, we'll have to stick backslashes in front of either those single or double quotes in order to get them into a string (depending on whether we have put single or double quotes around the outside of the string that holds the RegEx). Raw strings make it easier to write RegExes, but we still have to make some concessions to the way strings are defined.

This RegEx will only match against the first half of the hyperlink (up to "link" in the above examples), but that's fine. We're just looking for the URLs. We can use this, but there are a couple of other things to think about. Hyperlinks can span multiple lines, but that's not a problem: a negated bracket expression like [^>] matches line breaks as readily as any other character. (MULTILINE only changes where ^ and $ match, and this pattern uses neither, so we can leave it out here.) Also, regular expressions are case-sensitive: 'A' is different than 'a'. HTML isn't case-sensitive, though, and we have no guarantees about how the original author wrote this document. So we need to add a flag: IGNORECASE. If you ever need more than one flag, you can combine them by separating them with a vertical pipe (|), as in re.IGNORECASE|re.MULTILINE.

import re

f = open('/path/to/somefile.html')
fileContents = f.read()
f.close()

r = re.compile(r'<a\s[^>]*href=["\'][^"\']+["\'][^>]*>',re.IGNORECASE)
i = r.finditer(fileContents)
for m in i:
   print m.group()

This will print out something like:

<A HREF="http://domain.com/document.html">
<A CLASS="something" HREF="http://domain.com/document.html">
<a href="http://domain.com/document.html" style="somethingelse">

Wouldn't it be nice if we didn't have to strip off all of that extra junk? We can accomplish just that by using groups. Groups are defined by the parentheses in a RegEx. We've already used them for alternation, as in (com|org|net). Here we'll put a pair around [^"']+, the part of the pattern that corresponds to the URL.

Each parenthesized group is numbered from left to right, starting at 1, and its contents are available through the match object's group() method. group(0) returns the entire matching string; group(1) returns whatever matched the first group, group(2) the second, and so on. Like so:

import re

f = open('/path/to/somefile.html')
fileContents = f.read()
f.close()

r = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\'][^>]*>',re.IGNORECASE)
i = r.finditer(fileContents)
for m in i:
   print m.group(1)

The slight changes to this code will make it just spit out the URL part. Give it a try.
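If you don't have an HTML file handy, here's a self-contained sketch that runs against the three sample links from earlier, held in a string instead of a file (with the document names changed so you can tell the results apart). The pattern below, which uses \s and [^>]* to skip over any attributes before the href, is one way to write the link matcher; treat it as an illustration rather than a bulletproof HTML parser. print is written with parentheses so it also works on Python 3.

```python
import re

html = '''<A HREF="http://domain.com/one.html">link</A>
<A CLASS="something" HREF="http://domain.com/two.html">link</A>
<a href="http://domain.com/three.html" style="somethingelse">link</A>'''

# Group 1 (the parenthesized part) captures just the URL.
r = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\'][^>]*>', re.IGNORECASE)

tags = [m.group(0) for m in r.finditer(html)]   # the whole opening tag
urls = [m.group(1) for m in r.finditer(html)]   # only the URL inside it
for url in urls:
    print(url)
```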

Substitutions

Last but not least: RegExes can be used to perform search-and-replace operations on strings. Here's an example that will change all the email addresses in a string into text reading 'EMAIL REMOVED'.

import re

f = open('/path/to/somefile.txt')
fileContents = f.read()

r = re.compile(r'\w+@\w+\.(com|org|net)')
print r.sub('EMAIL REMOVED',fileContents)

Pretty straightforward. Just oooone last cool thing: using groups in substitutions. Yup, it's possible. Say we want to create a spam-proof(ish) version of a string by automatically converting somebody@somedomain.com to somebody(AT)somedomain.com. It's no sweat — putting a backslash in front of the group number (e.g. \1, \2) lets us refer to groups matched in the initial string.

import re

f = open('/path/to/somefile.txt')
fileContents = f.read()

r = re.compile(r'(\w+)@(\w+\.(com|org|net))')
print r.sub(r'\1(AT)\2',fileContents)
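Here are both substitutions again in a form you can run without a file on disk. The sample sentence and its addresses are made up, and print is written with parentheses so it also works on Python 3.

```python
import re

text = 'Write to somebody@somedomain.com or other_person@example.org today.'

r = re.compile(r'(\w+)@(\w+\.(com|org|net))')

# Plain replacement: every match becomes the same fixed text.
removed = r.sub('EMAIL REMOVED', text)

# Replacement with groups: \1 is the part before the @, \2 the domain.
obscured = r.sub(r'\1(AT)\2', text)

print(removed)
print(obscured)
```

Note that sub() returns a new string and leaves the original untouched.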

Make sense? I sure as hell hope so — if it doesn't, I've wasted an otherwise-lovely Incredibly Hungover Sunday.

Comments

while(i<10):

This ain't C, pal. "while i<10:" will do you just fine. (Same goes for the if on the next line.)

Posted by: ben wolfson on February 13, 2006 12:30 PM

aesthetics, Ben! I thought they were supposed to be teaching you about this stuff.

Posted by: tom on February 13, 2006 12:38 PM

It looks far better my way, so I'm not sure how that helps you.

Posted by: ben wolfson on February 13, 2006 01:49 PM

I guess I'm just not equipped to appreciate your avant-garde coding style. Consider me the dogs playing poker of programming standards.

(mostly, though, I just didn't want to train people to avoid a convention that many other programming languages require)

Posted by: tom on February 13, 2006 03:25 PM

Matlab does not require the extraneous parentheses.
Example: while i

Pascal (a lovely language that was one of my firsts) also did not require paras...
Example: while i

And if I recall correctly, BASIC also does not use paras...

Posted by: Tomas on February 13, 2006 04:41 PM

Those are educational programming languages (save VB, which is an educational programming language gone wildly out of control, which now ought to be killed -- also, Pascal "lovely"?!?). Most real programming languages do require parens:

C#
Java
ECMAScript
C/C++
Perl

You'd admittedly be fine in PHP, though (although the docs say you ought to use them).

It's not a bad habit to get into. Personally, I think it makes the code more readable.

Posted by: tom on February 13, 2006 04:51 PM

(mostly, though, I just didn't want to train people to avoid a convention that many other programming languages require)

But they aren't learning those languages. Oughtn't you learn the style of the language you're learning? Isn't the "I can write C [or what have you] in any language" attitude not the best?

Anyway, other languages that don't require parens include Scheme, LISP (on a certain understanding of precisely which parens we're talking about), Haskell (also has significant whitespace), SML, sh, and Smalltalk, real languages all. Another language which does require them, notable for not being directly related to C, is Dylan.

I think that writing like this: "while(condition)" is actually less readable than "while (condition)" (though I actually use the former when parens are required), because all the text is jammed together, and the reading of the line becomes choppier.

Given that in C/C++, if you didn't need the parens, you could write something like "while x = getchar() *y++=x;", which is horribly unreadable, they certainly provide a service there.

Posted by: ben wolfson on February 13, 2006 06:36 PM

Of course, I could be a jackass and point out that the following is valid in C/C++:
while 1
{
// do something
}

No parens necessary! Why would you want to do this? Perhaps you're listening to a TCP socket and waiting for incoming data. When a connection is received, you "do something" and then keep listening to the socket. To exit this infinite loop, a break statement can be placed or an interrupt can be specified. I'm sure the same is valid in Java, but I've never had to do this in that language.

Posted by: Tomas on February 13, 2006 10:37 PM

From a YACC grammar for ANSI C (don't know what version):
iteration_statement
: WHILE '(' expression ')' statement
| DO statement WHILE '(' expression ')' ';'
| FOR '(' expression_statement expression_statement ')' statement
| FOR '(' expression_statement expression_statement expression ')' statement
;
This suggests rather strongly that parens are necessary. How could it be otherwise? Unless "1" is special-cased, you could have any expression where "1" is in Tomas' comment, and parens would never be necessary.

Posted by: ben wolfson on February 13, 2006 10:43 PM

The grammar. It's from 1995 but I assume this hasn't changed in C99.

Posted by: ben wolfson on February 13, 2006 10:45 PM

And, of course, gcc complains about it ("foo.c:5: error: parse error before numeric constant"). Conclusion: Tomas was being a jackass, but not in the way he thought he was.

Posted by: ben wolfson on February 13, 2006 10:47 PM

Not valid in Java, Tomas. The parens are required and the condition must evaluate to boolean "true" instead of nonzero.

Posted by: Becks on February 13, 2006 11:50 PM

Admittedly I was wrong about the 1 sans parens in C... I spend so much time with different languages that I forgot. However, the while (1) is still a useful trick. At least everyone learned something from this...
In any case, I was probably thinking about shell scripting (csh or sh). Tom, I know that if you type while 1 at a Terminal on your Mac (sans parens) it will work (at least it does on my system).
Try this:
% while 1
while? echo "This will work"
while? end

And then watch as your screen fills up with the echo statement. Just press Ctrl-c to abort.

Last post, I promise.

Posted by: Tomas on February 14, 2006 12:34 AM

That may work in csh (though as we all know, csh programming is considered harmful), but it doesn't work in sh. (Actually I just checked and my sh is a symlink to bash. Whatever.) Observe!
[coelacanth ~ 09:50:53]$ while 1; do echo hi; done
sh: 1: command not found

This, however, has the expected effect:
[coelacanth ~ 09:51:10]$ while true; do echo hi; done
(output not rendered for obvious reasons.)

Posted by: ben wolfson on February 14, 2006 12:53 AM

I think it's funny that this post contains both the statement that programming languages should be human readable and then a discussion of regular expressions, the most unreadable (yet powerful) part of any programming language.

I think the regexp section could do with a few examples before getting into the tables of strings. Also, those tables could probably stand to be ordered a bit more sensibly (e.g. the very first thing, \b, references \w and \W).

The basics of regexp are probably '.', '*', '^' and '$'. You can get some pretty easy to understand examples out of these four:

avoid blank lines: line doesn't match /^$/
skip comments: line doesn't match /^#/
remove comments: replace /#.*$/ with nothing

etc.

I think the reason more people don't use regexp is because it can be very overwhelming; there's so much that the system has to provide that it's hard to sort through all of it. The multiline href thing is a pretty complex example, and I can't even totally get my head around the expression. It ties together a lot of concepts, and then isn't even totally correct for trapping all hrefs (most browsers will accept an href without quotes I think, and you can also create an href with an empty body, e.g. href="")

Posted by: Here's a Hint on February 14, 2006 11:17 AM

BTW, the section on file operations was easy to follow and very straightforward. And I'm all in favor of parentheses :)

Posted by: Here's a Hint on February 14, 2006 11:19 AM

Good points, HaH. I basically started itching to just get through the damn thing -- regexes are an admittedly big topic, and I was getting into double-digit hours spent on the tutorial. But you're right -- the result may be less than clear. And the tables are just chopped up versions of a resource I found elsewhere. I'll plan to revisit and revise that section, expanding the introduction. For now, folks who are confused may want to check out this quickstart guide.

And for the record: to trap non-delimited hrefs, change each ["'] to ['"]? and [^"'] to [^'"\s]. That makes the opening and closing quotes optional, and tells the system that the "meat" of the href is anything that's not a single or double quote or whitespace.

In general you'd want to look at some examples of what you're trying to match to see how it's structured. It's fairly rare that you need to pull apart web pages from many different sites at once, so the HTML will generally follow a consistent style.

Posted by: tom on February 14, 2006 11:33 AM

okay, you've shamed me into it: I've expanded the RegEx section. I might change the structure around more significantly, but now there's at least a detailed walk-through of the link-matching example.

Posted by: tom on February 14, 2006 11:57 AM

Nice. The color coding helps quite a bit in breaking it down.

And, as someone who's written a lot of regexp HTML parsing hacks, consistency of format is not as prevalent as I would've hoped. There are some choice unreadable regexps in the code running my site. It's almost such that if I need to change something, I rewrite the regexp rather than figure out what's broken.

Posted by: Here's a Hint on February 14, 2006 04:12 PM

So, of course, the first line after "r = re.compile(r'''" should read <a, but I forgot to escape the less-than, resulting in the mess you see before you.

Posted by: ben wolfson on February 14, 2006 05:08 PM

Whoops! Good catch on the compilation flags, Ben. I've made the change.

Posted by: tom on February 14, 2006 10:49 PM
