Monday, October 4, 2010

Left Outer Join

I am not going in detail about the definition of 'Left Outer join' , hopefully every one is aware of that but wanted to mention in this post about a mistake(bug) in a query I had seen somewhere on the net recently.Explained below is a similar situation .

The user wanted to know the total number of employees in each department and used the below query (Q1) for this. If we observe the query-Q1 , Its clear that GROUP BY will be done AFTER the Left Outer JOIN is done in this case and also observe that the Count(*) was used in this query. Actually uisng Count(*) in this context was a bug and becuase of this -we would definitely get AT LEAST one row though were no employees at all in that department in the emp table.The solution is to rewrite Query1 as Query2 - just replace '*' by e.dept_no in the count() .this is a sort of a logical mistake we all often do .

Q1 (Wrong):
SQL> select d.dept_no, count(*) from dept d, emp e
  2  where d.dept_no=e.dept_no(+)
  3  group by d.dept_no;


   DEPT_NO   COUNT(*)
---------- ----------
         1          1
         6          1
         2        547
         4        456
         5          1
         3       452


6 rows selected.
Elapsed: 00:00:00.04
Execution Plan
----------------------------------------------------------
Plan hash value: 2197210491

---------------------------------------------------------------------------------------
| Id  | Operation              | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                             |         3 |    18 |    43   (7)| 00:00:01 |
|   1 |  HASH GROUP BY                |                             |         3 |    18 |    43   (7)| 00:00:01 |
|*  2 |   HASH JOIN OUTER           |                             | 65535 |   383K|    41   (3)| 00:00:01 |
|   3 |    INDEX FULL SCAN           | SYS_C0024561  |         6 |    18 |     1   (0)| 00:00:01 |
|   4 |    INDEX FAST FULL SCAN| DEPT_NO_IDX  | 65535 |   191K|    39   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("D"."DEPT_NO"="E"."DEPT_NO"(+))

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         25  consistent gets
          0  physical reads
          0  redo size
        167  bytes sent via SQL*Net to client
        237  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          6  rows processed



Q2(Correct):
SQL> select d.dept_no, count(e.dept_no) from dept d, emp e
  2  where d.dept_no=e.dept_no(+)
  3  group by d.dept_no;


   DEPT_NO COUNT(E.DEPT_NO)
---------- ----------------
         1                0
         6                0
         2             547
         4             456

         5                0
         3             452

6 rows selected.
Elapsed: 00:00:00.06
Execution Plan
----------------------------------------------------------
Plan hash value: 2197210491

---------------------------------------------------------------------------------------
| Id  | Operation              | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                             |     3 |    18 |    43   (7)| 00:00:01 |
|   1 |  HASH GROUP BY               |                             |     3 |    18 |    43   (7)| 00:00:01 |
|*  2 |   HASH JOIN OUTER          |                              | 65535 |   383K|    41   (3)| 00:00:01 |
|   3 |    INDEX FULL SCAN          | SYS_C0024561   |     6 |    18 |     1   (0)| 00:00:01 |
|   4 |    INDEX FAST FULL SCAN| DEPT_NO_IDX  | 65535 |   191K|    39   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("D"."DEPT_NO"="E"."DEPT_NO"(+))

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        25 consistent gets
          0  physical reads
          0  redo size
        167 bytes sent via SQL*Net to client
        237 bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          6  rows processed



By the way - there is another way of writing this query Q2 using 'Subquery' as shown in Q3:

Q3:
SQL> select dept_no,(select count(*) from emp where emp.dept_no=dept.dept_no) count
  2  from dept ;


   DEPT_NO      COUNT
---------- ----------
         3             452        

         1                0
         5                0
         
6                0
         2             547
         4             456

                 
6 rows selected.
Elapsed: 00:00:00.06
Execution Plan
----------------------------------------------------------
Plan hash value: 2535931830

--------------------------------------------------------------------------------------
| Id  | Operation             | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                              |     6 |    18 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE             |                             |     1 |     3 |            |          |
|*  2 |   INDEX FAST FULL SCAN| DEPT_NO_IDX  | 21845 | 65535 |    39   (0)| 00:00:01 |
|   3 |  INDEX FULL SCAN             | SYS_C0024561 |     6 |    18 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("EMP"."DEPT_NO"=:B1)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       112  consistent gets
          0  physical reads
          0  redo size
       287 bytes sent via SQL*Net to client
       189 bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          6  rows processed


See the logical i/os - so which query is efficient ???
By looking at the 'nature' of the two queries Q2 and Q3 - its clear that

Q3: would be 'nice and efficient' if dept is large and we want the 'first records' as soon as possible .
(It would take one department_no from dept table and go to employee table , get the count of employees for the department_no and give to us , then again take one more department_no from dept table and go to employee table ,get the count and give to us and so on ..........)
Q2 :would be efficient - if we want ALL the rows at a time as soon as possible regardless the table size.
(It will get the two tables first ,JOIN the two tables (left outer join) and then count the count the no of employees and output ALL the rows at a time)

Initial response (initial rows) is quick with Q3 but whole response (all rows) is fast with Q2.

In the above example - dept is very small and we were looking for ALL rows really -
so Q2 (left outer join ) is efficient than Q3( subquery).So there is no 'one for all rule' in Oracle - it all 'depends' on many factors which one is good approach and which one is a bad one .




ORACLE and ANSI Syntax for Left Outer JOIN :


ORACLE Syntax :

Q1:
select  *  from emp e, dept d
where e.dept_no=d.dept_no and d.dept_no=7

Q2:
select  *  from emp e , dept d
where e.dept_no=d.dept_no (+) and d.dept_no=7

Q3:
select  *  from emp e , dept d
where e.dept_no=d.dept_no (+) and d.dept_no(+)=7



ANSI Syntax:

Q4:
select  * from emp e left outer join dept d 
on (e.dept_no = d.dept_no and d.dept_no = 7) 

Q5:
select  *  from emp e left outer join dept d 
on (e.dept_no = d.dept_no) where d.dept_no = 7

Observations after looking at the test results that :

1.Q1=Q2  ->Pretty clear that you will get the records from EMP and DEPT where the  
                 dept_no=7 only


2. Q3-> Will filter the DEPT table first(dept_no=7) then EMP table will be OUTER   
          JOINed with DEPT (every record from EMP table will appear in the out put 
          but only the recods with dep_no=7 appear from DEPT table and the rest will be NULLs from DEPT)

3.Q3=Q4


4. Q1=Q2=Q5


I always recommend -Use ANSI Syntax instead of Oracle's own Syntax as ANSI's syntax is Standardised and is supposed to be used irrespective of the Database Vendor , portable and universally accepted and recognised .

No comments:

Post a Comment