Thursday, April 09, 2009

RegexConverter on Github!

Update: two more days of work and a huge text file of regular expressions later, I know a lot more about the syntax of .NET Regex than even MSDN :)
The third (Apr 11) release of the library has a huge load of bug fixes and new features and it is now... [Tadaaaaa]... a stable version. At least I think of it that way :) Go download it!

I've been working for two days on an idea. What can I do to make those long regular expressions that I always leave in my code more readable and easier to understand, without having to compile automata in my head?

I first searched Google for regular expression converters. Nothing was even close to what I wanted. Also, on the more scientific side, people were talking a lot about BNF as a superset of regular expression syntax. I looked at BNF. Yes, it describes anything in a human-readable form and it is used in RFCs, but I hate it even more than XML! So XML it is. Most of the inspiration for the code came from this link: Regular expressions and regular grammars

I give you RegexConverter. It is a library plus demo that transforms a regex into XML and then back again. The demo application demonstrates (duh!) this with two panes, one in which to write regex and the other in which to write XML. Changing one also changes the other. It warns you of errors reported by both Regex and RegexConverter, and then checks whether what it got can be safely converted back into your changed string!
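To make the round trip concrete, here is a minimal Python sketch of the idea. It is not the library's C# code. The element names (`regex`, `group`, `or`, `option`, `text`), the `capture` and `quantifier` attributes, and the functions `parse` and `to_regex` are all made up for illustration and are not RegexConverter's actual schema or API. It only understands literals, escapes, plain and non-capturing groups, alternation and quantifiers, and it rejects character classes and lookarounds instead of handling them.

```python
import re
import xml.etree.ElementTree as ET

# Quantifier: *, +, ? or {m}, {m,}, {m,n}, optionally followed by ? (lazy).
QUANTIFIER = re.compile(r"(?:[*+?]|\{\d+(?:,\d*)?\})\??")


def parse(pattern):
    """Turn a regex from the supported subset into an XML element tree."""
    root = ET.Element("regex")
    pos = _parse_alternation(pattern, 0, root)
    if pos != len(pattern):
        raise ValueError(f"unexpected {pattern[pos]!r} at position {pos}")
    return root


def _parse_alternation(p, i, parent):
    # Collect the branches separated by "|" up to ")" or the end of input.
    branches = []
    while True:
        branch = ET.Element("option")
        i = _parse_sequence(p, i, branch)
        branches.append(branch)
        if i < len(p) and p[i] == "|":
            i += 1
        else:
            break
    if len(branches) == 1:
        parent.extend(list(branches[0]))  # no "|": keep the tree flat
    else:
        ET.SubElement(parent, "or").extend(branches)
    return i


def _parse_sequence(p, i, parent):
    while i < len(p) and p[i] not in "|)":
        if p[i] == "[" or (p.startswith("(?", i) and not p.startswith("(?:", i)):
            raise ValueError(f"unsupported syntax at position {i}")
        if p[i] == "(":
            node = ET.SubElement(parent, "group")
            capturing = not p.startswith("(?:", i)
            node.set("capture", "true" if capturing else "false")
            i = _parse_alternation(p, i + (1 if capturing else 3), node)
            if i >= len(p) or p[i] != ")":
                raise ValueError("missing closing parenthesis")
            i += 1
        else:
            node = ET.SubElement(parent, "text")
            width = 2 if p[i] == "\\" else 1  # keep an escape with its character
            if i + width > len(p):
                raise ValueError("dangling backslash")
            node.text = p[i:i + width]
            i += width
        match = QUANTIFIER.match(p, i)
        if match:  # store the quantifier string on the node it applies to
            node.set("quantifier", match.group())
            i = match.end()
    return i


def to_regex(node):
    """Rebuild the regex string from the XML tree."""
    if node.tag == "text":
        body = node.text
    elif node.tag == "or":
        body = "|".join(to_regex(option) for option in node)
    else:  # "regex", "option" and "group" are plain sequences of children
        body = "".join(to_regex(child) for child in node)
        if node.tag == "group":
            prefix = "(" if node.get("capture") == "true" else "(?:"
            body = prefix + body + ")"
    return body + node.get("quantifier", "")


xml = ET.tostring(parse(r"(?:ab|c)+\d{2,3}?"), encoding="unicode")
print(xml)
print(to_regex(ET.fromstring(xml)))
```

The last three lines go regex to XML to regex, which is the same check the demo performs when it verifies that the converted text can be turned back into the original. The sketch does no real validation beyond parentheses, so treat it as a picture of the data flow rather than a regex parser.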

Please tell me what you think. I believe it can be a real help in understanding regular expressions, some specific ones or regular expression syntax in general, whether one is a pro or a complete noob.

I've worked hard to design the library source in a way that is understandable, and I added comments everywhere. I tried to implement all the specifications of .NET regular expressions from the MSDN site, so if you have a regex that is valid but can't be turned into XML, or the conversion is not perfect, let me know.

The link is: RegexConverter on Github.

I will update this post with more science and links to my places of inspiration for this, so stay tuned.

RegexConverter Update: OK, so I haven't updated this post much. Shame on me. I was reminded of this project when I got an email from the FamousWhy site which said RegexConverter "has been granted the Famous Software Award". I know it's probably an automatically generated message that went to many or all of the Codeplex people, but still: automated attention is still attention :)


Centribumble said...

Hi Siderite, just downloaded this and had a play - looks fantastic. Great work!

Siderite said...

You've made my day! A comment only three and a half hours after publication! Thanks!
Oh, and your blog is nice, but it needs an RSS feed ;)

Dev said...

Good Article.....


Note: Please remove word verification

Siderite said...

Thank you, Dev. I cannot remove the word verification because then the comments would be spammed continuously. I am spammed just as well by human bots entering all kinds of stupid links.

Steven L. said...

Dude, this is awesome. I'm very interested in this kind of thing because I'm interested in constructing something similar for JavaScript (where a regex pattern can be converted into an Object representation of it, and vice versa).

I'd love to spend more time playing with this, but since I can't right now I'll just give you my notes from my five minutes playing with it so far.

- It doesn't support groups in lookaround.
- It claims that groups cannot be empty, although empty groups are not a syntax error in any regular expression flavor (there can be good reasons for empty groups like () or (?:)).
- You should probably clear out the XML when there is a regex parsing error.
- Using, e.g., <group name="2"> when there are not 2 capturing groups returns "Original xml and regenerated xml are not the same. Please contact the author." Perhaps it should result in (?<2>), instead.
- Unnamed capturing groups are still given a name (e.g., <group name="1">) when Explicit Capture is enabled.
- Atomic group (?>x) is converted to <greedy>x</greedy>, although it should probably be <atomicGroup>x</atomicGroup> or similar (atomic groups aren't greedy, although they could be described as possessive, which is a different thing in regex lingo).
- I'm not sure what I think about representing quantifiers as just a property with the quantifier string as its value, since both humans and applications using your XML representation would then need to parse the quantifier themselves. What about using, e.g., quantifierMin="0" quantifierMax="infinity" quantifierType="greedy" for the asterisk quantifier? (I used a quantifier type string instead of a boolean for whether it's greedy in case .NET adds support for possessive quantifiers in the future.)
- Conditionals should not be given group names.
- Properties like conditional="?=x" don't sit well with me for the same reasons as I mentioned for quantifiers. The condition type could be specified (e.g., positive lookahead or group participation), and the name of the group could be noted separately for group-participation-based conditionals and the full pattern broken down for lookaround-based conditionals.
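To illustrate the quantifier suggestion in the list above, here is a rough Python sketch of the parsing that every consumer of a raw quantifier string would otherwise have to do for themselves. The function name `describe_quantifier` is hypothetical, and the attribute names and the "infinity" value are taken from the example above rather than from anything RegexConverter actually emits.

```python
import re

# {m}, {m,} or {m,n}; group 2 tells us whether a comma was present.
_BRACES = re.compile(r"\{(\d+)(,(\d*))?\}")
_SHORTHAND = {"*": (0, None), "+": (1, None), "?": (0, 1)}


def describe_quantifier(quantifier):
    """Split a quantifier string into min, max and type attributes."""
    # A trailing "?" after the quantifier itself means lazy matching.
    lazy = len(quantifier) > 1 and quantifier.endswith("?")
    core = quantifier[:-1] if lazy else quantifier
    if core in _SHORTHAND:
        low, high = _SHORTHAND[core]
    else:
        match = _BRACES.fullmatch(core)
        if match is None:
            raise ValueError(f"not a quantifier: {quantifier!r}")
        low = int(match.group(1))
        if match.group(2) is None:    # {m}: exactly m times
            high = low
        elif match.group(3) == "":    # {m,}: no upper bound
            high = None
        else:                         # {m,n}
            high = int(match.group(3))
    return {
        "quantifierMin": str(low),
        "quantifierMax": "infinity" if high is None else str(high),
        "quantifierType": "lazy" if lazy else "greedy",
    }


print(describe_quantifier("*"))
print(describe_quantifier("{2,5}?"))
```

Keeping the type as a string rather than a boolean leaves room for a third value such as "possessive" without changing the shape of the attributes.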

But yeah, great stuff so far! Thanks for the demo, since I'm not a C# guy. :)

Siderite said...

Thanks for the comment, Steve!
Here is my answer:

The thing is, I implemented the lookaround and other less used features in a pretty patchy way. I will probably have to rewrite that part. That's why the inside of a lookaround doesn't really support anything other than escaped characters. I will also count the brackets, though :)

I already noticed the empty group error, I was planning to work on it.

In this implementation the name is the group index if numeric. Maybe I should use a separate attribute called index.

Explicit capture is not implemented at this moment, you are correct.

Conditional names: indeed.

- clearing the XML (or regex, you can change both!) is a good idea
- atomicGroup does sound better than greedy

About the quantifier in another format, I don't think it is a good idea. The concept of my converter is to increase readability. Adding all kinds of attributes becomes troublesome. In the end, you need to know what a quantifier is in order to use regex anyway.

About the conditionals and lookarounds being split into constituent expressions, it is a good idea, but as I mentioned, the implementation is a bit naive.

And I am open to doing a JavaScript version, although with all the differences in implementation I wonder if it won't be more trouble than it is worth.

Anyway, I really appreciate your support and I will work on improving this.

NoSlack913 said...

Have you ever thought about abstracting this as a way to convert (if possible) a regex into a search predicate, like, say, a Lucene search? Imagine being able to convert a very cool regex into a simpler, human-readable Lucene query (and back).

Great work, I look forward to digging around in the C#.

Siderite said...

The purpose of the Regex Converter was to make regular expressions convertible and humanly readable. I've never worked with Lucene, but it seems like just what I was looking for.

My plan was to create either utilities or XSLT to take that XML and convert it into other things, like Lucene maybe or like BNF or EBNF. Got lazy though.