Tokenization using regular expression sub patterns


A while back I was writing some stuff on this blog about regular expressions. While that remains unfinished, here’s a mini regex example: nothing earth shattering, but a useful technique if you haven’t already seen it.

Prompted by a real world example: one often-overlooked feature of most regular expression engines is how subpatterns can be used to whip up tokenizers relatively easily. The problem? I needed to match any of the words “Canton”, “Region” or “Group” in a string and perform a follow-up action depending on which matched.

Dealing with four main languages in Switzerland (German, French, Italian and English), it gets a bit more interesting; “Canton” translates to “Kanton” in German and “Cantone” in Italian, while “Region” is “Regione” in Italian, and “Group” is “Gruppe”, “Groupe” and “Gruppo” in German, French and Italian respectively. Composing those into three straightforward regular expressions I have;

  • Canton: /cantone?|kanton/i
  • Region: /regione?/i
  • Group: /groupe?|grupp(?:o|e)/i

Now on examining some input string, I could try testing each of those regexes individually against the string but that’s a) inefficient and b) likely to lead to lengthier code. Instead I make a single regular expression using sub patterns: /(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i …then figure out which sub pattern matched after a match is made.
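To make the composition step concrete, here is a minimal Python sketch (not the code I used) that builds the combined expression from the three separate ones and compares it with the one-at-a-time approach. The names PATTERNS, classify and classify_one_by_one are made up for illustration.

```python
import re

# The three per-token patterns from the list above, keyed by token name.
PATTERNS = [
    ("canton", r"cantone?|kanton"),
    ("region", r"regione?"),
    ("group", r"groupe?|grupp(?:o|e)"),
]

# Wrap each pattern in its own capturing group and join them with "|".
combined = re.compile("|".join("(%s)" % p for _, p in PATTERNS), re.I)


def classify(text):
    """Return the token name for the first match in text, or None."""
    m = combined.search(text)
    if m is None:
        return None
    # lastindex is the 1-based index of the group that matched. This works
    # because the individual patterns contain no nested capturing groups.
    return PATTERNS[m.lastindex - 1][0]


def classify_one_by_one(text):
    """The lengthier alternative: test each pattern separately."""
    hits = []
    for name, pattern in PATTERNS:
        m = re.search(pattern, text, re.I)
        if m:
            hits.append((m.start(), name))
    # Keep the leftmost hit, which is what a single combined search finds.
    return min(hits)[1] if hits else None
```

Both functions should agree on the sample inputs used in this post; the combined version just needs one pass over the string and less code.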

Note that technically this problem is not really one of tokenization but rather just classifying the input with a common name, but the technique can be fairly easily extended. In PHP the solution is courtesy of the third argument to preg_match(), for example;

    $inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');

    foreach ( $inputs as $input ) {
        preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches);
        print_r($matches);
    }

I get output like;

    Array
    (
        [0] => Kanton
        [1] => Kanton
    )
    Array
    (
        [0] => Regione
        [1] =>
        [2] => Regione
    )
    Array
    (
        [0] => Gruppe
        [1] =>
        [2] =>
        [3] => Gruppe
    )

Notice how the first element of this array is always what I matched, while elements indexed 1+ correspond to the position of the subpattern I matched against, from left to right in the pattern - this I can use to tell me what I actually matched e.g.;

    $inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');
    $tokens = array('canton','region','group'); // the token names

    foreach ( $inputs as $input ) {
        if ( preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches) ) {
            foreach ( array_keys( $matches) as $key) {
                if ( $key == 0 ) { continue; } // skip the first element

                // Look for the subpattern we matched...
                if ( $matches[$key] != "" ) {
                    printf("Input: '%s',  Token: '%s'\n", $input, $tokens[$key-1]);
                }
            }
        }
    }

Which gives me output like;

    Input: 'Kanton Zuerich',  Token: 'canton'
    Input: 'Frauenfeld Regione',  Token: 'region'
    Input: 'Fricktal Gruppe',  Token: 'group'

…so I’m now able to classify the input to one of a set of known tokens and react accordingly. Most regex APIs provide something along these lines, for example here’s the same (and much cleaner) in Python, which is what I actually used on this problem;

    import re

    p = re.compile('(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))', re.I)
    inputs = ('Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe')
    tokens = ('canton','region','group')

    for text in inputs:
        m = p.search(text)
        if not m:
            continue
        # groups() holds one entry per sub pattern; unmatched ones are None
        for group, token in zip(m.groups(), tokens):
            if group is not None:
                print("Input: '%s', Token: '%s'" % (text, token))

This could be reduced further using list comprehensions, but I don’t think it helps readability in this case.
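Another way to shorten it, offered as a sketch rather than what I ran in production, is to name the sub patterns and let the match object report which one fired. Python's match objects expose lastgroup for this; the function name token_for is just a placeholder.

```python
import re

# Same expression as before, but each sub pattern carries its token name.
p = re.compile(
    r"(?P<canton>cantone?|kanton)"
    r"|(?P<region>regione?)"
    r"|(?P<group>groupe?|grupp(?:o|e))",
    re.I,
)


def token_for(text):
    """Return the name of the sub pattern that matched, or None."""
    m = p.search(text)
    # lastgroup is the name of the last capturing group that took part in
    # the match; with one top-level group per alternative, that is the token.
    return m.lastgroup if m else None


for text in ('Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe'):
    print("Input: '%s', Token: '%s'" % (text, token_for(text)))
```

This drops the separate tokens tuple, at the cost of relying on the inner alternations staying non-capturing.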

An alternative problem, to give you a feel for how this technique can be applied: let’s say you want to parse an HTML document and list a subset of the block level vs. the inline level tags it contains. You might do this with two sub-patterns e.g. ()|() (note this regex as-is is geared to Python’s idea of greediness - you’d need to change it for PHP) leading to something like this in Python;

    p = re.compile('()|()')

    for match in p.finditer('foo test bar test 1 bar'):
        print("[pos: %s] matched %s" % (match.start(), str(match.groups())))

The call to match.groups() returns a tuple which tells you which sub pattern matched while match.start() tells you the character position in the document where the match was made, allowing you to pull substrings out of the document.
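As a self-contained sketch of that idea: the tag subsets, the sample markup and the names BLOCK, INLINE and classify_tags below are all assumptions made for illustration, not the expressions from my original code.

```python
import re

# Hypothetical tag subsets, chosen only for this sketch.
BLOCK = r"</?(?:div|p|ul|li|h[1-6])\b[^>]*>"
INLINE = r"</?(?:span|a|em|strong)\b[^>]*>"

# One capturing sub pattern per category; the inner groups are non-capturing.
p = re.compile("(%s)|(%s)" % (BLOCK, INLINE), re.I)
KINDS = ("block", "inline")


def classify_tags(html):
    """Return a (position, kind, tag text) tuple for each recognised tag."""
    found = []
    for m in p.finditer(html):
        # lastindex is 1 for the block sub pattern and 2 for the inline one.
        found.append((m.start(), KINDS[m.lastindex - 1], m.group(0)))
    return found


# Made-up sample document.
sample = "foo<div>test <span>bar</span> test 1</div>bar"
for pos, kind, tag in classify_tags(sample):
    print("[pos: %s] %s tag %s" % (pos, kind, tag))
```

The positions can then be used to slice substrings out of the document, for example everything between an opening and a closing block tag.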

This article provided by sitepoint.com.


CSS Gallery Attacks
53 of the most updated CSS galleries with the best designs. Vote for the best ones.

WordPress.com Still Growing

Compete has released a list of the fastest growing and declining sites of 2007. These stats are made up of the top 1,000 domains between December of 2006 and December of 2007. The domains that grew the most (and that are safe for work) include iamfreetonight.com, podshow.com and techcrunch.com. The domains that saw a negative change of at least 90% include bolt.com (due to bankruptcy), broadcaster.com and octanetv.com.

However, WordPress.com appears to have grown by 523% with 24,393,457 visits. WordPress doesn’t appear to be slowing down anytime soon and that’s some positive news.

Site Rankings From 2007

WordPress Plugins and Theme Releases for 1/17

Theme Releases

Two Column Themes

GreenTech


GreenTech is a two column theme with pleasant colors. It makes use of a mix of brown and green colors. The background is cream in color. There are plenty of advertisement options available in this theme.

Author comments on the blog are styled differently than the other comments. Overall a good looking theme.

Widget Ready: Yes

Compatibility: There were no issues that I saw with this theme on Firefox 2+, IE6, IE7, Flock. The header section appears broken in Opera.

Validations: Invalid XHTML 1.0 Transitional with 29 errors | Invalid CSS with 1 error

Demo | Release Page | Download

Three Column Themes

My Starcraft 2


My Starcraft 2 is a theme based on Starcraft 2. The theme uses dark and vibrant colors with a mix of black and gray. The links are orange, making them more visible on the darker background.

Overall the looks are quite good, with ample advertising options. The theme is available in English and German versions.

Widget Ready: Yes

Compatibility: There were no issues that I saw with this theme on Firefox 2+, IE6, IE7, Flock and Opera browsers.

Validations: Invalid XHTML 1.0 Transitional with 11 errors | Invalid CSS with 3 errors

Demo | Release Page | Download

Four Column Themes

Techicon


Techicon is a wide 4 column theme with beautiful use of colors. It is simple and classy at the same time. The theme comes with a sub header with 3 columns which can be used to display information such as Latest posts, popular posts etc.

The main content area is quite wide and can easily accommodate large images, the sidebar is huge and is made up of three smaller sidebars. Overall a great theme with ample advertising options.

Widget Ready: Yes

Compatibility: There were no issues that I saw with this theme on Firefox 2+, IE6, IE7, Flock and Opera browsers.

Validations: Valid XHTML 1.0 Transitional | Invalid CSS with 3 errors

Demo | Release Page | Download

Plugin Releases

Set Email “From” Address

By default, emails sent out by WordPress usually come from wordpress@yourdomain.com. This plugin allows you to change the from address to any email you want, so that all outgoing emails go out with your personalized email address.

Compatible Versions: WordPress 2.0 and above

Category: Administration

Release page | Download

SmartLinks Widget

This plugin allows you to automatically display content from Amazon, Netflix, Last.fm, IMDB etc. The items have SmartLinks rather than plain links; a SmartLink opens up a small inline window with the best information from around the web, without the user having to leave your site. (Disclosure: SmartLinks is an advertiser on this blog)

[EDIT] from the comments:

Smartlinks offers personalized Widgets for Netflix Queue, Amazon Wishlist, and Last.fm playlists. Each contains a SmartLink that allows you to monetize the content in a number of ways.

Compatible Versions: Up To WordPress 2.3.2

Category: Widgets

Release Page | Download

Berri Technorati Reactions on Dashboard

WordPress 2.3 and above displays incoming links from Google Blog Search; this plugin shows you the Technorati Reactions to your blog in the Admin Dashboard.

Compatible Versions: WordPress 2.3 and above

Category: Administration

Release Page | Download

WordPress Plugin Releases for 1/25

AutoInfo

Autoinfo is a plugin which allows you to show information such as users online, registered users, feed subscribers, number of posts, number of ping backs, top 3 commented posts, comments, comments per post, top three commentators and more.

Release Page | Download

Socialize Me

Socialize Me is a plugin which allows you to show custom messages to users visiting your site from Social Networking sites like StumbleUpon, Facebook, Digg, Delicious, Pownce, Twitter, Bebo and more.

You can customize each of the messages that will be shown to the user.

Release Page | Download

OutOfDate

OutOfDate is a plugin which shows a message above all the posts older than the specified number of months. It provides an option to customize the message, layout and number of months beyond which posts should carry the message.

Release Page | Download

Blogger to WordPress Redirection

The plugin allows you to redirect individual blogger posts to their respective posts in WordPress. The redirection will help you send search engine users to the right post on your new WordPress blog.

Release Page | Download

Admin Favicon

The plugin allows you to add a custom favicon for your WordPress Admin panel. Can help you to easily distinguish between admin and non admin tabs for your site.

Release Page | Download

AfLinks

AfLinks allows you to insert affiliate links into WordPress content. The plugin shows a little popup on mouse hover with an image and description of the product. The plugin is intended for webmasters who have an account with affili.net.

Release Page | Download
