MxBlog

SQL is about what data you want not how to get it

Does it matter that you duplicate some where conditions in your queries? It shouldn't as SQL queries should state what data you want to see, not how the database should optimally retrieve that data. This means that it shouldn't matter if you specify that you want all records for customer 'X' once, or that you state it 100 times.

Let examine if it really works that way. The following examples have been tested with postgresql 9.1. To know how the database queries your data, you can look at the plan table. We'll look at the impact of the sql statement on the plan determined by PostgreSQL.

First some test data, this creates a table t1 with 1 million records:

create table t1 as 
  select uuid_in(  md5((random()::text) || ((now())::text) || (random()::text))::cstring) as id
  ,      md5(random()::text) as s1
  ,      md5(random()::text) as s2
  ,      n1
  ,      n2 
  from   generate_series(1,10000) as n1
  ,      generate_series(1,100) as n2;

We need a primary key and some indexes:

alter table t1 add primary key (id);
create index t1_n1_idx on t1(n1);
create index t1_n2_idx on t1(n2);

Some simple tests to see if duplicate where expressions make any difference. First the query plan without any duplicate conditions:

mx=# explain analyze select * from t1 where n1 = 10 and n2 = 10 ;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Index Scan using t1_n1_idx on t1  (cost=0.00..11.34 rows=1 width=90) (actual time=0.031..0.068 rows=1 loops=1)
   Index Cond: (n1 = 10)
   Filter: (n2 = 10)
 Total runtime: 0.103 ms
(4 rows

Adding some duplicate conditions doesn't seem to impact the query plan:

mx=# explain analyze select * from t1 where n1 = 10 and n2 = 10 or (n1 = 10 and n2 = 10) and n1 = 10 and n1 = 10 and n2 = 10;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Index Scan using t1_n1_idx on t1  (cost=0.00..11.34 rows=1 width=90) (actual time=0.023..0.042 rows=1 loops=1)
   Index Cond: (n1 = 10)
   Filter: (n2 = 10)
 Total runtime: 0.073 ms
(4 rows)

Even adding an in condition doesn't change anything:

mx=# explain analyze select * from t1 where n1 = 10 and n2 = 10 or (n1 = 10 and n2 = 10) and n1 = 10 and n1 = 10 and n2 = 10 and n1 in (10);
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Index Scan using t1_n1_idx on t1  (cost=0.00..11.34 rows=1 width=90) (actual time=0.024..0.043 rows=1 loops=1)
   Index Cond: (n1 = 10)
   Filter: (n2 = 10)
 Total runtime: 0.075 ms

No real difference, actual duration may be different, depending on how busy your system is, and what data is already in memory. But for the actual query plan, how the database will get all the data, the duplicate expression in the where clause make no difference.

Now for something a bit more complex, joining 2 more tables. Let's copy the data from table t1 to tables t2 and t3:

create table t2 as (select * from t1);
create table t3 as (select * from t1);
alter table t2 add primary key (id);
create index t2_n1_idx on t2(n1);
create index t2_n2_idx on t2(n2);
alter table t3 add primary key (id);
create index t3_n1_idx on t3(n1);
create index t3_n2_idx on t3(n2);

Simple case, joining table t1, t2 and t3 on the columns n1 and n2:

mx=# explain analyze select * from t1,t2,t3 where t1.n1 = t2.n1 and t1.n2 = t2.n2 and t2.n1 =t3.n1 and t2.n2 = t3.n2 and t3.n1 = 10 and t1.n2 = 10;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..34.00 rows=1 width=270) (actual time=0.070..0.143 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..22.67 rows=1 width=180) (actual time=0.047..0.092 rows=1 loops=1)
         ->  Index Scan using t1_n1_idx on t1  (cost=0.00..11.34 rows=1 width=90) (actual time=0.024..0.043 rows=1 loops=1)
               Index Cond: (n1 = 10)
               Filter: (n2 = 10)
         ->  Index Scan using t2_n1_idx on t2  (cost=0.00..11.32 rows=1 width=90) (actual time=0.013..0.031 rows=1 loops=1)
               Index Cond: (n1 = 10)
               Filter: (n2 = 10)
   ->  Index Scan using t3_n1_idx on t3  (cost=0.00..11.32 rows=1 width=90) (actual time=0.012..0.032 rows=1 loops=1)
         Index Cond: (n1 = 10)
         Filter: (n2 = 10)
 Total runtime: 0.191 ms
(12 rows)

Duplicating all where conditions makes no difference:

mx=# explain analyze select * from t1,t2,t3 where t1.n1 = t2.n1 and t1.n2 = t2.n2 and t2.n1 =t3.n1 and t2.n2 = t3.n2 and t3.n1 = 10 and t1.n2 = 10 and t1.n1 = t2.n1 and t1.n2 = t2.n2 and t2.n1 =t3.n1 and t2.n2 = t3.n2 and t3.n1 = 10 and t1.n2 = 10;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..34.00 rows=1 width=270) (actual time=0.104..0.179 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..22.67 rows=1 width=180) (actual time=0.048..0.095 rows=1 loops=1)
         ->  Index Scan using t1_n1_idx on t1  (cost=0.00..11.34 rows=1 width=90) (actual time=0.025..0.045 rows=1 loops=1)
               Index Cond: (n1 = 10)
               Filter: (n2 = 10)
         ->  Index Scan using t2_n1_idx on t2  (cost=0.00..11.32 rows=1 width=90) (actual time=0.014..0.034 rows=1 loops=1)
               Index Cond: (n1 = 10)
               Filter: (n2 = 10)
   ->  Index Scan using t3_n1_idx on t3  (cost=0.00..11.32 rows=1 width=90) (actual time=0.046..0.066 rows=1 loops=1)
         Index Cond: (n1 = 10)
         Filter: (n2 = 10)
 Total runtime: 0.227 ms
(12 rows)

Some extra conditions, including an in expressions and an or also has no impact on the query plan:

mx=# explain analyze select * from t1,t2,t3 where t1.n1 = t2.n1 and t1.n2 = t2.n2 and t2.n1 =t3.n1 and t2.n2 = t3.n2 and t3.n1 = 10 and t1.n2 = 10 and t1.n1 = t2.n1 and t1.n2 = t2.n2 and t2.n1 =t3.n1 and t2.n2 = t3.n2 and t3.n1 = 10 and t1.n2 = 10 and t3.n1 in (10,10,10) and t1.n2 = 10 and t1.n2 = 10 and t1.n2 = 10 and (t1.n2 = 10 or t1.n2 = 10) and t2.n2 = t3.n2;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..34.37 rows=1 width=270) (actual time=0.068..0.141 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..22.67 rows=1 width=180) (actual time=0.048..0.093 rows=1 loops=1)
         ->  Index Scan using t1_n1_idx on t1  (cost=0.00..11.34 rows=1 width=90) (actual time=0.024..0.043 rows=1 loops=1)
               Index Cond: (n1 = 10)
               Filter: (n2 = 10)
         ->  Index Scan using t2_n1_idx on t2  (cost=0.00..11.32 rows=1 width=90) (actual time=0.014..0.032 rows=1 loops=1)
               Index Cond: (n1 = 10)
               Filter: (n2 = 10)
   ->  Index Scan using t3_n1_idx on t3  (cost=0.00..11.69 rows=1 width=90) (actual time=0.013..0.032 rows=1 loops=1)
         Index Cond: (n1 = 10)
         Filter: (n2 = 10)
 Total runtime: 0.226 ms
(12 rows)

What about an extra join between tables t1 and t3?

mx=# explain analyze select * from t1,t2,t3 where t1.n1 = t2.n1 and t1.n2 = t2.n2 and t2.n1 =t3.n1 and t2.n2 = t3.n2 and t3.n1 = 10 and t1.n2 = 10 and t1.n1 = t2.n1 and t1.n2 = t2.n2 and t2.n1 =t3.n1 and t2.n2 = t3.n2 and t3.n1 = 10 and t1.n2 = 10 and t3.n1 in (10,10,10) and t1.n2 = 10 and t1.n2 = 10 and t1.n2 = 10 and (t1.n2 = 10 or t1.n2 = 10) and t2.n2 = t3.n2 and t1.n1 = t3.n1;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..34.37 rows=1 width=270) (actual time=0.068..0.141 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..22.67 rows=1 width=180) (actual time=0.047..0.093 rows=1 loops=1)
         ->  Index Scan using t1_n1_idx on t1  (cost=0.00..11.34 rows=1 width=90) (actual time=0.024..0.043 rows=1 loops=1)
               Index Cond: (n1 = 10)
               Filter: (n2 = 10)
         ->  Index Scan using t2_n1_idx on t2  (cost=0.00..11.32 rows=1 width=90) (actual time=0.014..0.034 rows=1 loops=1)
               Index Cond: (n1 = 10)
               Filter: (n2 = 10)
   ->  Index Scan using t3_n1_idx on t3  (cost=0.00..11.69 rows=1 width=90) (actual time=0.012..0.031 rows=1 loops=1)
         Index Cond: (n1 = 10)
         Filter: (n2 = 10)
 Total runtime: 0.196 ms
(12 rows)

As you can see, duplicate where conditions have no real negative impact on the performance of the query.

Bonus info: we can add some indexes to make the queries above even faster. If you look at the query plan, you can see that all tables are first accessed through an index and then filtered. We can combine both actions into one by creating indexes on both columns.

This is what you need to add:

mx=# create index t1_n1_n2_idx on t1(n1,n2);
mx=# create index t2_n1_n2_idx on t2(n1,n2);
mx=# create index t3_n1_n2_idx on t3(n1,n2);

Now that last query looks like this:

mx=# explain analyze select * from t1,t2,t3 where t1.n1 = t2.n1 and t1.n2 = t2.n2 and t2.n1 =t3.n1 and t2.n2 = t3.n2 and t3.n1 = 10 and t1.n2 = 10 and t1.n1 = t2.n1 and t1.n2 = t2.n2 and t2.n1 =t3.n1 and t2.n2 = t3.n2 and t3.n1 = 10 and t1.n2 = 10 and t3.n1 in (10,10,10) and t1.n2 = 10 and t1.n2 = 10 and t1.n2 = 10 and (t1.n2 = 10 or t1.n2 = 10) and t2.n2 = t3.n2 and t1.n1 = t3.n1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..25.16 rows=1 width=270) (actual time=0.054..0.077 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..16.77 rows=1 width=180) (actual time=0.038..0.051 rows=1 loops=1)
         ->  Index Scan using t1_n1_n2_idx on t1  (cost=0.00..8.38 rows=1 width=90) (actual time=0.017..0.019 rows=1 loops=1)
               Index Cond: ((n1 = 10) AND (n2 = 10))
         ->  Index Scan using t2_n1_n2_idx on t2  (cost=0.00..8.38 rows=1 width=90) (actual time=0.010..0.013 rows=1 loops=1)
               Index Cond: ((n1 = 10) AND (n2 = 10))
   ->  Index Scan using t3_n1_n2_idx on t3  (cost=0.00..8.38 rows=1 width=90) (actual time=0.007..0.010 rows=1 loops=1)
         Index Cond: ((n1 = 10) AND (n2 = 10))
 Total runtime: 0.129 ms
(9 rows)

Total cost of the query went down from 34 to 25. That's about a 25% performance improvement.