Tokenization using regular expression sub patterns
A while back I was writing some stuff on this blog about regular expressions. While that remains unfinished, here is a mini regex example - nothing earth shattering but a useful technique if you hadn’t already seen it.
Prompted by a real world example: one often-overlooked feature of most regular expression engines is how subpatterns can be used to whip up tokenizers relatively easily. The problem? I needed to match any of the words “Canton”, “Region” or “Group” in a string and perform a follow-up action depending on which matched.
Dealing with four main languages in Switzerland (German, French, Italian and English), it gets a bit more interesting; “Canton” translates to “Kanton” in German and “Cantone” in Italian, while “Region” is “Regione” in Italian, and “Group” is “Gruppe”, “Groupe” and “Gruppo” in German, French and Italian respectively. Composing those into three straightforward regular expressions I have;
- Canton: /cantone?|kanton/i
- Region: /regione?/i
- Group: /groupe?|grupp(?:o|e)/i
Now on examining some input string, I could try testing each of those regexes individually against the string but that’s a) inefficient and b) likely to lead to lengthier code. Instead I make a single regular expression using sub patterns: /(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i …then figure out which sub pattern matched after a match is made.
Note that technically this problem is not really one of tokenization but rather just classifying the input with a common name, but the technique can be fairly easily extended. In PHP the solution is courtesy of the third argument to preg_match(), for example;
$inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');
foreach ( $inputs as $input ) {
    preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches);
    print_r($matches);
}
I get output like;
Array
(
    [0] => Kanton
    [1] => Kanton
)
Array
(
    [0] => Regione
    [1] =>
    [2] => Regione
)
Array
(
    [0] => Gruppe
    [1] =>
    [2] =>
    [3] => Gruppe
)
Notice how the first element of this array is always what I matched, while elements indexed 1+ correspond to the position of the subpattern I matched against, from left to right in the pattern - this I can use to tell me what I actually matched e.g.;
$inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');
$tokens = array('canton','region','group'); // the token names
foreach ( $inputs as $input ) {
    if ( preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches) ) {
        foreach ( array_keys( $matches) as $key) {
            if ( $key == 0 ) { continue; } // skip the first element
            // Look for the subpattern we matched...
            if ( $matches[$key] != "" ) {
                printf("Input: '%s', Token: '%s'\n", $input, $tokens[$key-1]);
            }
        }
    }
}
Which gives me output like;
Input: 'Kanton Zuerich', Token: 'canton'
Input: 'Frauenfeld Regione', Token: 'region'
Input: 'Fricktal Gruppe', Token: 'group'
…so I’m now able to classify the input to one of a set of known tokens and react accordingly. Most regex APIs provide something along these lines; for example, here’s the same (and much cleaner) in Python, which is what I actually used on this problem;
import re

p = re.compile(r'(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))', re.I)
inputs = ('Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe')
tokens = ('canton', 'region', 'group')

for text in inputs:
    m = p.search(text)
    if not m:
        continue
    # groups() holds one entry per subpattern; only the one that matched is not None
    for group, token in zip(m.groups(), tokens):
        if group is not None:
            print("Input: '%s', Token: '%s'" % (text, token))
This could be reduced further using list comprehensions, but I don’t think that helps readability in this case.
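One possible shortening that avoids the inner loop, as a minimal sketch assuming the same pattern and token names used above: because the three alternatives are the only capturing groups in the pattern, the match object’s lastindex should be the number of the subpattern that matched, and it can index straight into the token tuple. The classify() helper is a name made up for this illustration.

```python
import re

PATTERN = re.compile(r'(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))', re.I)
TOKENS = ('canton', 'region', 'group')

def classify(text):
    """Return the token name for the first keyword found in text, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Only one of the three capturing groups takes part in a match, so
    # lastindex (1-based) is the number of the subpattern that matched.
    return TOKENS[m.lastindex - 1]

for text in ('Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe'):
    print("Input: '%s', Token: '%s'" % (text, classify(text)))
```

This only works while every capturing group sits at the top level of the alternation; adding a nested capturing group would shift the numbering, which is why the inner (?:o|e) stays non-capturing.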
An alternative problem to give you a feel for how this technique can be applied: let’s say you want to parse an HTML document and list a subset of the block level vs. the inline level tags it contains. You might do this with two sub-patterns e.g. (</?(?:div|h[1-6]{1}|p|ol|ul|pre).*?>)|(</?(?:span|code|em|strong|a).*?>) (note this regex as-is is geared to Python’s idea of greediness - you’d need to change it for PHP) leading to something like this in Python;
import re

p = re.compile(r'(</?(?:div|h[1-6]{1}|p|ol|ul|pre).*?>)|(</?(?:span|code|em|strong|a).*?>)')
# Sample markup; the tags used here are an assumption for illustration
html = '<div>foo<p>test <em>bar</em> test 1</p>bar</div>'
for match in p.finditer(html):
    print("[pos: %s] matched %s" % (match.start(), str(match.groups())))
The call to match.groups() returns a tuple which tells you which sub pattern matched while match.start() tells you the character position in the document where the match was made, allowing you to pull substrings out of the document.
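To show how the same idea stretches to actual tokenization, here is a rough sketch rather than a definitive implementation. It assumes named subpatterns, so the match object’s lastgroup reports the token name directly; the token names and patterns (NUMBER, WORD, SPACE) and the tokenize() helper are invented for this example.

```python
import re

# Hypothetical token names and patterns, chosen only for this illustration.
TOKEN_SPEC = (
    ('NUMBER', r'\d+'),
    ('WORD', r'[A-Za-z]+'),
    ('SPACE', r'\s+'),
)
# Build one alternation of named subpatterns: (?P<NUMBER>\d+)|(?P<WORD>...)|...
TOKEN_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Yield (position, token name, matched text), skipping whitespace."""
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != 'SPACE':
            yield m.start(), m.lastgroup, m.group()

for pos, name, value in tokenize('Kanton 42'):
    print("[pos: %s] %s %r" % (pos, name, value))
```

Note that finditer() silently skips characters that no subpattern matches, so a stricter tokenizer would probably want a catch-all subpattern at the end to report them.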
This article provided by sitepoint.com.
Out of Office
I’ll be out of the office until December.
If you have questions related to content development for your Case Web site, event postings, news stories or announcements in Case Daily feel free to contact the Office of Marketing and Communications.
If you have questions regarding the username and password for your account on the main campus Web server (Aurora) contact webmaster@case.edu. If you need to transfer your account to a new maintainer, fill out an account application form and indicate that it is a transfer.
If you have questions about uploading files to your account, review the entry on Uploading your files with Dreamweaver.
If you need to download Dreamweaver from the software center it is now part of the Adobe Creative Suite.
Happy Thanksgiving!
Web Content: Not just YOUR words and pictures
If they read what you write, they may also want to read what you read. Webmasters and bloggers know this. That’s why we’ll embed links within our text, build pages with links to recommended sites and/or add linkblogs to our side bars. In a world where most of us don’t have the time to research and write everything we’d like to share, such resources add value to our existing content and give readers guidance on where to look for additional information.
Recently I’ve come across some other good ways to share what you read, so I thought I’d share those with you today.
Publish your OPML file to share your blog subscriptions
A few weeks back I was reading an entry on David Bradley’s blog, Sciencebase, when I noticed something interesting in his footer. There, at the bottom, he has a section called “Geeky Fun Stuff” in which he shares, among other things, a link to his OPML file. That, I thought, is a really good idea. For those of you who don’t know what this is, an OPML file is basically a type of XML file that includes the links to the RSS feeds of the various blogs one reads through RSS Readers such as Google Reader, Bloglines, etc. Such services allow you to import and export these files so that you can easily switch services or add a batch of feeds to your existing service. Thus, if I wanted to subscribe to all of David’s feeds I could just save that file and import it into Google Reader myself. Or if I wanted to subscribe to only a few I could edit the file (in Dreamweaver or any plain text editor) to delete any I didn’t want.
If you are already using an RSS reader, sharing such a file is fairly easy. Just export your file from your reader and save it to your computer. If you don’t want to share everything, just open the file in a text editor and delete the extraneous feeds: lolcats, curling news from In the Hack and anything else that may not be of interest to your readership. Once the file is ready, just upload it to your site and link to it as you would any other page.
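If you have more than a handful of feeds to trim, the same edit can be scripted. The following is only a sketch: it assumes your exported file stores each feed as an outline element with an xmlUrl attribute (the usual OPML layout), and the remove_feeds() helper, the sample feed names and the example.com addresses are all invented for illustration.

```python
import xml.etree.ElementTree as ET

def remove_feeds(opml_text, unwanted_urls):
    """Return OPML text with every feed whose xmlUrl is in unwanted_urls removed."""
    root = ET.fromstring(opml_text)
    # ElementTree keeps no parent pointers, so visit each element as a
    # possible parent and drop the matching children from it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == 'outline' and child.get('xmlUrl') in unwanted_urls:
                parent.remove(child)
    return ET.tostring(root, encoding='unicode')

# Made-up sample export with two feeds.
SAMPLE = """<opml version="1.0">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="Example Science Blog" type="rss" xmlUrl="http://example.com/science.xml"/>
    <outline text="Example Cat Pictures" type="rss" xmlUrl="http://example.com/cats.xml"/>
  </body>
</opml>"""

cleaned = remove_feeds(SAMPLE, {'http://example.com/cats.xml'})
print(cleaned)
```

Feeds grouped into folders appear as outline elements nested inside other outline elements, which is why the sketch checks the children of every element rather than only those directly under the body.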
Using Google Reader to share specific stories
Google Reader recently added some enhancements to its sharing features. I first noticed this when Robert Scoble posted a note on Pownce with a link to his Google Reader shared items page. When I went to view the page I realized that this could be a useful feature, one that made me want to revisit Google Reader.
When you visit a shared items page you will see a site that looks pretty much like a typical blog. Stories are posted on the left, information about the page owner is on the right—along with links to other resources, a feed, etc. The main difference is that the stories are things the page owner has read rather than written (though, if you subscribe to your own feed, you can share your own entries as well). Each story also includes a link to the original entry and the original source—so the material is not mistakenly attributed to you.
After viewing Scoble’s page, I immediately thought of my friend X, who says she wants to establish an online presence, but isn’t quite ready to blog yet. Sharing stories on a page she can link to might be a good way to get her feet wet and let people know what she is thinking about. For those of us who already have one or more blogs and Web sites, the shared items page adds to our online mix and provides an easy way to share stories with our readership.
Getting started with Google Reader
Getting started with this is pretty easy. Just go to the Google Reader site and sign-in. If you don’t already have a Google account for Gmail, analytics, etc. you can create one there. Once your account is set up, just subscribe to some of your favorite blogs and start reading. A menu at the bottom of each story gives you the option to share the story so it will appear on your public shared items page. (There is some controversy about this, but you just have to understand that it’s a public page that can be seen by anyone who has, or discovers, the address. For us, that is what we want, so it’s not a big concern.) If you change your mind later, you can unshare the story the same way. You can also organize your subscriptions into topic folders and share topics rather than individual items. To learn more about using Google Reader visit the Reader Help Center.
If you don’t want to send readers to your Google page, but still want to share stories, you can also share a clip from the feed on your own site, as I have done on my “What I’ve Been Reading in the Blogosphere” page.
More sharing options
Google isn’t the only service that allows sharing, but aside from the hubbub regarding privacy settings (pertaining to how and with whom one is sharing—see links below) it’s very easy to use and will be familiar to a large audience. I’ve been sharing blog stories with groups on Streamy since last summer, but my Streamy shares aren’t fully public. Another friend has recently recommended Feed Each Other which looks promising and also produces a public page. StumbleUpon, while not a reader, is also a great way to share blog stories and other Web sites. (Stumbling is quite popular with insomniacs and is a great way to learn about other sites.)
These are all useful services, but how you share is less important than what you share. If you can find articles and sites that offer additional information on the topics you discuss, or even stories that add insight to your personality or world view, you’ll be providing a helpful resource to your readers.
OPML, Google Reader and Sharing Resources
- fav.or.it - favorit RSS Reader and Blogging Platform
- Google’s new Reader Features
- Google Reader needs GPC (Granular Privacy Controls)
- Google Reader “Share With Friends” Feature Gets Privacy Complaints
- Google Reader Sharing FAQ
- Is Google Reader Sharing Too Much?
- OPML (Outline Processor Markup Language)
Tips for Nonprofits Meme
Elizabeth Able, of Able Reach Arts and Web Development, recently started a blog meme in support of nonprofits that have an online presence. She asks that we write one tip on ways nonprofits can benefit from having an online presence and have others do the same. Tips can have similarities so long as each offers new insight into the topic.
This meme comes with four guidelines:
- Offer one tip
- Tag three people. Bonus points for including blogs that support or represent nonprofits.
- Please link back to the original entry page. If you link, Elizabeth will contact you about including your tip in a compilation of tips generated by this meme.
- Remember to pass on the guidelines
Now that we know the rules, here is my tip.
Share your knowledge and expertise in the form of educational resources
Nonprofits come in many shapes and sizes. Whether they are confronting issues relating to poverty, arts & culture, health care, education or public policy, each is likely to have specific and in-depth knowledge relating to their mission. While their Web sites will often focus on their core mission, volunteerism, fund-raising and related issues, much can be gained from sharing their broader knowledge base as well.
In this case, when I speak of knowledge, I’m not referring to the facts and figures used in support of the cause, but the more in-depth knowledge or data related to the topic. Thus a public art organization, that uses its site to announce projects and explain how art benefits society, may also want to publish related resources such as:
- A walking tour of public art in the region served by the organization.
- Interviews with artists explaining how they came to the field, what education this required and where they seek their inspiration.
- A history of the role of public art from ancient times to the present with images and links to more specific resources.
- Pages explaining how sculptures are made, from the design process to the casting of metal and other techniques.
Benefits of knowledge sharing
Sharing such knowledge can support an organization in many ways. In most cases the expertise and knowledge is already in the minds of the staff—who draw on this information in their own work. Sharing it with others benefits society by providing information resources, but also supports marketing and fundraising.
- Educational resources geared to K-12 students and/or the general public help the organization to reach a wider audience geographically and demographically. This builds name recognition and supports the organization’s brand, enhancing the reputation of both the organization and its staff through the quality of its content.
- Sites providing educational outreach may be eligible for additional funding from foundations and government agencies that support such programming.
- Informative, and fun, resources help to stimulate interest in the topic thus cultivating readers towards becoming future donors, volunteers and champions to the cause.
Bastions of Knowledge: Examples
Many faculty and staff here at Case have heard me discuss sites I call “Bastions of Knowledge,” places where faculty and staff can share their expertise with the public. As mentioned above, such sites provide educational outreach and support marketing. A site that becomes known as one of the leading resources in a given field bolsters the organization’s reputation, and can also draw additional traffic to the rest of the organization’s site. Two of my favorite examples are:
- Snowcrystals.com, produced by Kenneth G. Libbrecht, chairman of the Physics Department at Caltech
- This site has anything you could possibly want to know about snowflakes, from the physics of how they develop and the impact of temperature on crystal formation, to some stunning photographs of individual snow crystals. If you Google the term “snowflake,” this site shows up as the number 2 result—out of 9,050,000. A search on “snow crystal” puts them 1st out of 366,000 results. When you consider the number of children studying snow in school, the adults who are curious to learn more and scientists interested in crystal formation and/or considerations of temperature, you have to imagine that this site gets a lot of traffic. Professor Libbrecht didn’t have to share his research with all of us, but in doing so he has provided a fascinating resource and made more people familiar with his department and Caltech.
- The eSkeletons Project, University of Texas at Austin
- The eSkeletons Project doesn’t rank quite as high in Google; it comes in 8th out of 6,760,000 on a search of the word “skeleton,” but that’s still very impressive. So is the content. This site provides images of individual bones, from all orientations, from 12 primate species including humans. Animations, FAQs and other information make this a terrific resource for teachers and students alike. As a K-12 educational resource, the site also receives both corporate and government support.
In Conclusion
If you’re working for a non-profit or similar organization, go ahead and share the information in your head. You’ll provide a service to others as well as yourself.
As per the instructions of the meme guidelines, I’ll tag Mano Singham, Jeremy Smith, Lev Gonick and Gina Prodan, as I’m curious to hear what they have to say on the matter.
Linkbait: Tasty morsels to entice readers
As restaurants display fresh seafood to entice diners, you can create linkbait to increase your readership.
This is the fourth in a series of posts that discuss Search Engine Optimization (SEO) and other Web marketing strategies.
What is linkbait?
It sounds nefarious doesn’t it? Makes one think of “bait and switch” or that run-down old bait store by the lake—the one where they store the containers of nightcrawlers in the same cooler as the egg salad sandwiches. Blech.
In reality linkbait is simply online content designed to attract an audience who will link to your site. But isn’t all content supposed to do that? In theory yes, but linkbait goes one step further. Instead of supplying the usual insight that your readers have come to depend upon, linkbait reaches out beyond your core audience, offering content that is topical, controversial or in some manner more exciting than the usual fare.
Linkbait is like your favorite birthday present. While you appreciated and needed the new sweater, books and CDs, the Wii/Xbox/bicycle/train set/new puppy/or other object of desire was the one you told your friends about. Linkbait is the content that people tell others about through their blogs, Web sites, Facebook pages, Twitters, Pownces, StumbleUpons, etc.
Examples of linkbait
Linkbait is more than supercharged content. It’s content with an edge. This edge could be something like a Top 10 list on a popular topic, a controversial opinion such as a vilification of Firefox (who would do such a thing?!), or a contest offering a popular prize. For example, Fetch Softworks has just announced their Take Fetch Back to School, Win a MacBook Contest (4 runners-up win new iPod Nanos). This contest should be great linkbait because it is geared to students, staff and faculty just beginning the academic year; offers great prizes; and is happening at a time when some of those prizes, the new iPods, are making a lot of news. As a user of their product I’ve already started pondering what to write, and as a blogger I’ve already linked to them. So I think it’s working!
When to use linkbait?
Like your birthday present, linkbait is for special occasions, meant to add to your content rather than replace it. The bait is only part of the overall mix. If you tried to use linkbait in every blog post you would soon end up with a site lacking in continuity. That wouldn’t really serve your goals. But on occasion, if you come up with a clever idea that is related to your goals, adds value to your regular content, and attracts attention, then go for it. Strategic bits of linkbait can help you expand your readership, acquire more incoming links and raise your rankings while adding a bit of excitement for your regular readers.
I could proceed to bore you with more details, caveats, pros and cons, but plenty of others have already written on the topic. If you are considering adding linkbait to your marketing toolkit, the following resources should give you much of what you need to know.
Linkbait Resources
- Wikipedia: Link bait
- Matt Cutts: SEO Advice: linkbait and linkbaiting
- An Introduction to Linkbaiting
- Golden Rules of Linkbaiting
- Andy Hagans’ Ultimate Guide to Linkbaiting and SMM
- The Art of Linkbaiting
- 2007 Guide To Linkbaiting: The Year Of Widgetbait?
- Leveraging Linkbait
- 5 Link Baiting Methods
A picture is worth a thousand words, but that’s not always enough
How to add captions to images in Photoshop

On the Web it is preferable to place your caption in the HTML. If that won’t work and your captions are long, you should also link to a place providing a description of the image and an alternative rendering of the text.
Colleagues of mine are involved in a project that involves adding captions to photographs. Like many of you, they aren’t full-time designers and haven’t spent a great deal of time using Photoshop. While they know how to crop and resize photos, they’ve not yet worked with type. For those of you who may someday face the same situation, here is a quick tutorial on adding text to images.
Establish your project parameters—size matters
Are your captioned images going to be used on the Web, on hand-outs produced by your office printer or in commercial print work such as a magazine? At what size will they be used? When editing your photos you will want to start with the largest image file available, crop it as necessary then resize it to your project specifications before adding your text.
As I mentioned when discussing image formats, your usage will impact your size specifications. Generally you will want an image that is 300 pixels per inch (ppi) for commercially printed pieces, one that is 125-250 for desktop printing (refer to your user manual to determine the maximum dots per inch (dpi) your printer will produce) and somewhere around 72 to 100 for the Web.
Note: measurements for print are exact; if your photo is 300 dpi and 1 inch square, it will be printed to be exactly 1 inch square. If you print it at 72 dpi and 1 inch square it will still be exactly 1 inch, but will have less detail. Measurements for the Web are relative because they are determined by your display. On my Dell there are 77 pixels in an inch, while on my Mac there are 98 pixels in an inch. Your display may be different. As a rule of thumb I just use 72 (which was common for most monitors back in the 1990s) and keep in mind the fact that a 3 inch wide photo at 72 ppi will appear smaller on the Mac than it will on the PC. Either way it is 216 pixels, but the pixels on my PC are bigger than those on my Mac.
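For anyone who prefers to see the arithmetic spelled out, here is a small sketch of the two calculations described above. The function names are made up for this illustration, and the 77 and 98 ppi figures are simply the two displays mentioned in the text; your own screen will likely differ.

```python
def pixels_for(inches, ppi):
    """Pixel length needed for a given physical size at a given resolution."""
    return int(round(inches * ppi))

def displayed_inches(pixels, screen_ppi):
    """Physical size at which a pixel length appears on a given display."""
    return pixels / float(screen_ppi)

width_px = pixels_for(3, 72)   # a 3 inch wide photo prepared at 72 ppi
print(width_px)                # 216

# The same 216 pixels come out at different physical widths on each screen.
print(round(displayed_inches(width_px, 77), 2))  # the 77 ppi display
print(round(displayed_inches(width_px, 98), 2))  # the 98 ppi display
```

The pixel count stays fixed at 216; only the physical width changes, coming out at roughly 2.8 inches on the 77 ppi display and roughly 2.2 inches on the 98 ppi one.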
Consider the amount of text you are supposed to add to the image. Try to make this as brief as possible, especially if your project will be viewed online. While you may be able to use tiny type on printed matter, that type will be harder to read online. Fewer pixels mean fewer details, so 6 point type online will be tiny and jagged.
Also ask yourself if the text needs to be on the picture itself or if it can be read as a caption underneath the image. If the project is for the Web you can include captions underneath a photo in the text rather than in the image. For situations where that won’t work, such as HTML e-mail, just be sure to repeat your caption text in the alt attribute of the image.
How to add text in Photoshop

For this example I’ll walk you through the steps used to caption the image used in this entry. We’ll add text on top of the image and below.
- Open your image file in Photoshop. For practice you are welcome to use this sculpture photo.
- Resize your photo by going to the image menu and choosing image size. Make sure that “constrain proportions” and “resample image” are checked. Set the resolution appropriate to your project. I’m using 72 for the Web. Choose the appropriate width for your image. I’m using 240 pixels.
- Click on the foreground color and use the picker to select a color for your type.
- Select the type tool, click and drag on the image to create your type box, and start typing your content.
- If it is not already open, go to the window menu to open your layers window. Note that your type was created on a new layer.
- Switch to the selection tool to reposition your type as desired.
- If your background is too busy, you may find that your type is hard to read. Try adding a drop shadow or outline to it. To do this double click on the type layer in the layers menu (click to the right of the layer name). This will open up the layers style menu. Check drop shadow, then click on the words “drop shadow” to see your parameters. Drag the menu somewhere to the side, so you can still see your type, then adjust the angle, spread, size and distance until your type looks clear. You can experiment with drop shadow and other options to create different effects. You can also try changing the color. Just try to keep it simple. (Hot pink type with a lime drop shadow is usually a no-no, unless your competition is “Hello Kitty.”)
- Sometimes a drop shadow isn’t enough. In this case you may want to experiment with darkening the background behind the type. You can use the burn tool to just darken an area (paint over your background with this). Another option is to create a rectangular area behind your type that is darker than the rest. To do this, create a new layer above your background image, and create a rectangle with your selection tool. Using the paint can, fill it with black. Now you can leave it as is to call greater attention to your caption, or you can adjust the opacity of the layer to make it semi-transparent. In the sculpture photo I’ve set the opacity of the black square to 50%.
- If you’ve tried a few of these options and your type still doesn’t look right, you may want to put the caption below, instead of on, the image. If your project is going on the Web you can do this in your HTML. If you are sending an HTML e-mail, though, you will want to include it in the image file. To do this, you will need to increase the size of the image. First set your background color to be the same as that of your document. In this example I’ll use white. Next go to the image menu and select canvas size. Click on the center top square in the grid then increase your height measurement to an appropriate size. I’ve added .5 inches. Now just add your type to this area. If you’ve added too much space you can crop accordingly.
- Save the file in Photoshop format (in case you want to make edits) then go to the file menu and choose “Save for Web.” Select JPEG as your file format then click save. If you would prefer a .tif file (for print) you would instead flatten image (under the layers menu) and save as .tif.
Alternative Text for Captions
Captioning images can add value, but will also pose accessibility challenges. If your caption is short, you should copy it into the alt attribute of your image. This will make it available to those who use screen readers or other user agents that don’t show images. If your caption is too long, you may also want to link to an alternative copy of the text, either on the same page, as a footnote, or wherever you deem appropriate. Read Andy Clarke’s article, Accessible alternatives, to learn more about these techniques. For this example I’ve linked to a description of the image and text and placed it here on the page:
Photograph of part of a sculpture featuring a man holding an umbrella next to a dog whose nose is pointed at the man’s knee. Captions built into the image read as follows:
- Spot, can’t you find a squirrel to chase? You’ve been sitting here panting on my leg for years now.
- What, and you think this is my idea of the perfect view?
- This caption is part of the image file, but sits below the picture.
In Conclusion
As you’ve seen, it’s pretty easy to add text to an image. The tricky part is making it look right and ensuring that the content is available to all. But with a bit of experimentation you can accomplish both tasks.

