Counting items with mysql and regex in the group by statement

Yesterday I needed to do a little report for the boss but the data that I needed to group together wasn’t exactly the same. I hit up the “Goog” for ways to use regular expressions in a count query, but that is kinda tricky to find so this is what I was able to come up with….

So let’s say you had a table where you were attaching the landing url a site visiting is entering your website on, and you wanted to track that landing url the whole way to the end of some web based procedure. Of course certain urls will be different but some could be common for the purpose of counting them together on a report. In a practical example, lets say you are doing pay-per-click traffic with Google and Yahoo. Both of these providers use unique ids in the url to help distinguish between multiple clicks from the same person (so you end up with many urls that are different but mean the same thing). And of course you were a good little developer when you decided to send Google traffic to /google/ and Yahoo to /yahoo/!

With that premise and the following fields in a database table called “signup”, you can group the unique Google and Yahoo landing pages together with a simple regular experssion.

signup table

+----------------------+
| Field
+----------------------+
| id
| signup_date
| landing_page
| etc....
+----------------------+

SELECT
DATE_FORMAT(signup_date,’%m-%d-%Y’) as signup_date,
count(landing_page) as signup_count,
IF(landing_page REGEXP ‘^/(google|yahoo).*’,SUBSTRING(landing_page,1,7),landing_page) as lp
FROM signup
WHERE
1=1
AND signup_date >= ‘” . $istartdate . “‘
AND signup_date <= ‘” . $ienddate . “‘
GROUP BY DATE_FORMAT(signup_date,’%m-%d-%Y’), IF(landing_page REGEXP ‘^/(google|yahoo).*’,SUBSTRING(landing_page,1,7),landing_page)

If that makes no sense whatsoever…. awesome!

When I make a report based on grouping data by day, I select the date for each row, then the count of the column I am looking at, and in this example the field that I am grouping to look at. (This query is sitting behind a web form so the php variables for start and end date are passed in.) In the GROUP BY clause, if you specify the first grouping as the formatted date, you can base the format on the way you are outputting the information on the report. In this example I am doing the report by day. If you wanted to grab the values by month, you would just change the date format pattern from this: ‘%m-%d-%Y’ to ‘%m-%Y’. The tricky part in this query was I need to have one grouping condition that would pull together all the like landing page urls as each value. I chose to use an if statement IF(Condition,true,false) where the condition was a regular expression pattern. Using the condition = landing_page REGEXP ‘^/(google|yahoo).* ‘ tells MySQL that IF the landing_page starts with ‘/google’ or ‘/yahoo’ return the true value, otherwise return the false value. I was not able to return the entire value, so using SUBSTRING will chop off the characters that you want and ignore the rest. If the pattern fails to match it will return the normal value for landing_page. And that is all there is to it.

Let me know if that was helpful to anyone!

3 Responses to “Counting items with mysql and regex in the group by statement”


  1. 1 Jim D

    Regular expressions are cool and all, but man… that is nightmare for the query optimizer. If you do an EXPLAIN on that bad boy, you’ll see what I mean.

    If you have the ability, I’d highly recommend adding an extra field to that table that stores the referrer as either a numeric id that maps to a separate referrer table, or an enum that you can tack onto later. Do the regular expression magic in PHP and store the value during INSERT. MUCH faster in the long run.

    If you take the enum route (google, yahoo, other), your query can simplify to:

    SELECT signup_date, COUNT(landing_page) AS signup_count, referrer
    FROM signup
    WHERE (signup_date BETWEEN :start_date AND :end_date)
    GROUP BY referrer, signup_date
    ORDER BY NULL

    If signup_date is not already a DATE type, I would add another column that just stores the date portion of the timestamp. It makes the index much more efficient. (You ARE using indices, right? :-))

    ALTER IGNORE TABLE `signup` ADD INDEX `referrer` (`referrer`, `signup_date`);

  2. 2 Alex Barger

    Awesome stuff man! You are learning a lot at that company you are at now. I miss the old days of white boarding these concepts out and really optimizing the hell out of the process!

  3. 3 David Woods

    That’s slick, I love using REGEX pretty much wherever I can.

    For gleaning info from tables that you don’t have control of, that’s a good way of doing things.

    If you have control over the table, and this is going to be a frequently run piece of SQL, then you’d take some serious load off of the server by adding a new data field that will contain either nothing or, in the case of website that you have verified match the regex in the scripting language, the landing page id you strip from the url with the script language.

    Has MySQL implemented triggers yet? I’ve never used them, but I would imagine that processing and saving that extra field would be a good place to use them. I should check out what’s come out of the latest MySQL updates…

Comments are currently closed.