Daniel Barnes

Regular Expression building based on user input

Sometimes a program has to interpret data whose format varies from user to user, so the user needs a way to configure that format. However, it's unreasonable to expect somebody to give you a full-blown regular expression with capturing groups that you can interpret.

I've come up with a small solution which builds a regular expression with named capturing groups from a simple template. It's very basic at the moment, with some features missing (such as the ability to have a literal \Q, \E, or curly bracket in the format, or angle brackets in field names) -- but for basic uses where users give you data in a consistent but arbitrary layout, it lets that layout be configured.

So, for example, if somebody has a bunch of files on their system which are named similarly to "Look what you made me do - Taylor Swift.mp3", this program allows those users to specify where the details in that filename are: {title} - {artist}.mp3.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormatInterpreter {

    private Pattern regex;

    public FormatInterpreter(String format){
        // formats look like this:
        // "{lastname}, {firstname} lives at {number} {street} in {city}, {state} {zip}"
        // this is a user-readable format that may be entered into a config file
        // we convert this into a regex in order to return this data.
        StringBuilder regex = new StringBuilder("^");
        char[] chars = format.toCharArray();
        int i = 0;
        while(i < chars.length){
            if(chars[i] == '{'){
                StringBuilder capture = new StringBuilder();
                // read the field name up to the closing bracket
                while (++i < chars.length && chars[i] != '}') {
                    capture.append(chars[i]);
                }
                if (i >= chars.length) {
                    // ran off the end without finding '}'
                    throw new IllegalArgumentException("Format string is malformed (unbalanced curly brackets).");
                }
                regex.append("(?<" + capture.toString() + ">.*)");
                i++;
            } else {
                regex.append("\\Q");
                while(i < chars.length && chars[i] != '{'){
                    regex.append(chars[i++]);
                }
                regex.append("\\E");
            }
        }
        regex.append("$");
        this.regex = Pattern.compile(regex.toString());
    }

    public FormatInterpretation read(String s){
        return new FormatInterpretation(regex.matcher(s));
    }

    public static void main(String[] args){
        FormatInterpreter f = new FormatInterpreter("{lastname}, {firstname} lives at {address} in {city}, {state} {zip}");
        System.out.println(f.regex.toString());
        FormatInterpretation fi = f.read("Barnes, Daniel lives at 665 Candyland Dr. in Basalt, CO 81621");
        System.out.print(fi.matched());
        if(fi.matched()) {
            System.out.println(": " + fi.get("firstname") + " in " + fi.get("city"));
        }
    }
}

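class FormatInterpretation {

    private final Matcher matcher;
    // matches() runs once up front, so get() is safe to call without calling matched() first
    private final boolean matched;

    public FormatInterpretation(Matcher matcher){
        this.matcher = matcher;
        this.matched = matcher.matches();
    }

    public String get(String s){
        // returns null when the input did not match the format
        return matched ? matcher.group(s) : null;
    }

    public boolean matched(){
        return matched;
    }
}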

We build a regular expression which ends up looking something like:

^(?<lastname>.*)\Q, \E(?<firstname>.*)\Q lives at \E(?<address>.*)\Q in \E(?<city>.*)\Q, \E(?<state>.*)\Q \E(?<zip>.*)$
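To make the shape of the generated pattern concrete, here is a minimal sketch that applies the same construction by hand to the {title} - {artist}.mp3 format from the music example. The pattern string is written out manually rather than produced by FormatInterpreter, and the class name FilenameExample and its parse helper are made up for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilenameExample {

    // Hand-written equivalent of the format "{title} - {artist}.mp3":
    // each {field} becomes a named group, each literal run is wrapped in \Q...\E.
    static final Pattern FILENAME =
            Pattern.compile("^(?<title>.*)\\Q - \\E(?<artist>.*)\\Q.mp3\\E$");

    // Returns {title, artist}, or null when the filename does not fit the format.
    static String[] parse(String filename) {
        Matcher m = FILENAME.matcher(filename);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("title"), m.group("artist") };
    }

    public static void main(String[] args) {
        String[] parts = parse("Look what you made me do - Taylor Swift.mp3");
        System.out.println(parts[0] + " by " + parts[1]);
    }
}
```

Because the ".mp3" suffix sits inside \Q...\E, the dot is matched literally rather than as "any character".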

Notice that every field in the regex is captured with .* -- this causes ambiguity whenever a field's value contains its own separator:

{number} {street} => 123 Candyland Dr.

Here {number} could end up as either 123 Candyland or 123, depending on whether the quantifier is greedy. With the greedy .* used above, {number} captures 123 Candyland and {street} captures only Dr.
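The difference is easy to see by running both variants side by side. This is a small sketch, not part of the code above: the GreedyDemo class and its split helper are hypothetical, and the reluctant pattern (.*?) is an alternative that the interpreter does not currently generate.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {

    // Greedy .* grabs as much as possible, so the LAST space acts as the separator.
    static final String GREEDY = "^(?<number>.*)\\Q \\E(?<street>.*)$";
    // Reluctant .*? grabs as little as possible, so the FIRST space acts as the separator.
    static final String RELUCTANT = "^(?<number>.*?)\\Q \\E(?<street>.*?)$";

    // Matches input against the regex and returns "number|street" for easy comparison.
    static String split(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return m.group("number") + "|" + m.group("street");
    }

    public static void main(String[] args) {
        String input = "123 Candyland Dr.";
        System.out.println(split(GREEDY, input));    // 123 Candyland|Dr.
        System.out.println(split(RELUCTANT, input)); // 123|Candyland Dr.
    }
}
```

Neither choice is right for every format; reluctant matching happens to suit this address, but it would split a two-word first field in the wrong place.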

However, the whole pattern is anchored, so if you have several fields and unique separators that never appear inside the field values, the match is unambiguous:

{firstname}&{lastname}|{phone}

This is a great way to let users specify a format and then use that format as a regex for extracting their data.
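As a possible next step, some of the limitations mentioned at the top could be reduced by leaning on the standard library. The sketch below is one way to do it, under my own assumptions: it quotes literal text with Pattern.quote (which copes with an embedded \E) and rejects field names that Java would not accept as group names. The SaferFormat class and the sample input are invented for illustration, and it still does not support literal curly brackets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SaferFormat {

    // Java group names must start with a letter and contain only letters and digits.
    private static final Pattern FIELD_NAME = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");

    // Converts a "{field} literal {field}" template into an anchored regex.
    static Pattern compile(String format) {
        StringBuilder regex = new StringBuilder("^");
        int i = 0;
        while (i < format.length()) {
            if (format.charAt(i) == '{') {
                int close = format.indexOf('}', i);
                if (close < 0) {
                    throw new IllegalArgumentException("Unbalanced curly brackets in: " + format);
                }
                String name = format.substring(i + 1, close);
                if (!FIELD_NAME.matcher(name).matches()) {
                    throw new IllegalArgumentException("Invalid field name: " + name);
                }
                regex.append("(?<").append(name).append(">.*)");
                i = close + 1;
            } else {
                int next = format.indexOf('{', i);
                if (next < 0) {
                    next = format.length();
                }
                // Pattern.quote escapes the literal text, including any embedded \E.
                regex.append(Pattern.quote(format.substring(i, next)));
                i = next;
            }
        }
        return Pattern.compile(regex.append("$").toString());
    }

    public static void main(String[] args) {
        Matcher m = compile("{firstname}&{lastname}|{phone}").matcher("Daniel&Barnes|555-0100");
        if (m.matches()) {
            System.out.println(m.group("firstname") + " " + m.group("lastname") + ": " + m.group("phone"));
        }
    }
}
```

Validating the name up front turns what would otherwise be a PatternSyntaxException from deep inside Pattern.compile into an error message that mentions the offending field.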


By Daniel, on July 17, 2018, 12:05 pm