Official MySQL documentation provides information that using regular expressions is "powerful way of specifying a pattern for a complex search". Is it really such a powerful way of filtering and should be used, or is it a solution that should be avoided? As it usually happens in real life, there are many opinions and no universal answer. Unfortunately, it often turns out that the truth lies somewhere in the middle.
One of my clients asked me yesterday for a little help with a query badly hitting performance of their production server. They were complaining about performance of REGEXP
powered query and asked for advice on how to make it efficient.
Original query used in customer's application:
SELECT id FROM list WHERE user_name REGEXP '^bulba[0-9]+$';
Of course `list` table was properly indexed.
Explain:
mysql> explain SELECT SQL_NO_CACHE id FROM list WHERE user_name REGEXP '^bulba[0-9]+$'\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: list type: index possible_keys: NULL key: id7 key_len: 24 ref: NULL rows: 4175201 Extra: Using where; Using index 1 row in set (0.00 sec)
Execution plan pasted above shows that MySQL was using index scan (type: index
), which is faster way to get your data than table scan, but still, it's relatively slow and overall performance strongly depends on amount of data it has to parse. Customer case is quite a good example of it. Even assuming that InnoDB buffer pool is big enough to cover all index pages and almost no IO will be needed, it's still over four million rows to examine.
Can we deal with it somehow?
I was playing with it a bit trying different alternatives and here is what I found as much more efficient replacement to original query:
SELECT id FROM list WHERE user_name LIKE 'bulba%' and user_name REGEXP 'bulba[0-9]';
Execution plan again:
mysql> explain SELECT SQL_NO_CACHE id FROM list WHERE user_name LIKE 'bulba%' AND user_name REGEXP 'bulba[0-9]'\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: list type: range possible_keys: id7 key: id7 key_len: 19 ref: NULL rows: 31 Extra: Using where; Using index 1 row in set (0.00 sec)
Explain looks better. Now, instead of index scan MySQL will use 'id7' to find rows matching criteria with minimum overhead. This is because MySQL can't really use index when REGEXP
is the only filtering used. New one, in fact, is doing primary filtering with LIKE
, REGEXP
just extends it.
And finally, optimization benefit (timings for both queries):
Original query by customer:
19 rows in set (4.96 sec)
Fixed one:
19 rows in set (0.01 sec)
It seems that we were able to achieve the goal.
Of course it does not mean that all the queries using the REGEXP
are "bad". That is only a reminder that special care should be taken when writing SQL code with the use of this very nice functionality. You should aware of queries with REGEXP
as the only filtering role.