7.4 Summarizing with SUM( ) and AVG( )
SUM( ) and AVG( ) are strictly numeric functions, so they can't be used with strings or
temporal values. On the other hand, sometimes you can convert nonnumeric values to useful
numeric forms. Suppose a table stores TIME values that represent elapsed time:
mysql> SELECT t1 FROM time_val;
+----------+
| t1       |
+----------+
| 15:00:00 |
| 05:01:30 |
| 12:30:20 |
+----------+
To compute the total elapsed time, use TIME_TO_SEC( ) to convert the values to seconds
before summing them. The result also will be in seconds; pass it to SEC_TO_TIME( ) should
you wish the sum to be in TIME format:
mysql> SELECT SUM(TIME_TO_SEC(t1)) AS 'total seconds',
    -> SEC_TO_TIME(SUM(TIME_TO_SEC(t1))) AS 'total time'
    -> FROM time_val;
+---------------+------------+
| total seconds | total time |
+---------------+------------+
|        117110 | 32:31:50   |
+---------------+------------+
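If you'd like to check the arithmetic outside the database, the conversion is easy to mirror in a client program. The following is only a sketch in Python; the helper names time_to_sec and sec_to_time are hypothetical stand-ins for the MySQL functions, and the sketch assumes non-negative values written as HH:MM:SS.

```python
def time_to_sec(value: str) -> int:
    """Convert an 'HH:MM:SS' elapsed-time string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def sec_to_time(total: int) -> str:
    """Convert non-negative seconds to 'HH:MM:SS' (hours may exceed 23)."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# The three t1 values from the time_val table shown above.
t1_values = ["15:00:00", "05:01:30", "12:30:20"]

total_seconds = sum(time_to_sec(v) for v in t1_values)
total_time = sec_to_time(total_seconds)
print(total_seconds, total_time)  # 117110 32:31:50
```

The point is the same as in the query: sum the values in a numeric form, then convert the total back for display.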
7.4.4 See Also
The SUM( ) and AVG( ) functions are especially useful in applications that compute statistics.
They're explored further in Chapter 13, along with STD( ), a related function that calculates
standard deviations.
7.5 Using DISTINCT to Eliminate Duplicates
7.5.1 Problem
You want to know which values are present in a set of values, without listing duplicate values
a bunch of times. Or you want to know how many distinct values there are.
7.5.2 Solution
Use DISTINCT to select unique values, or COUNT(DISTINCT) to count them.
7.5.3 Discussion
A summary operation that doesn't use aggregate functions is to determine which values or
rows are contained in a dataset by eliminating duplicates. Do this with DISTINCT (or
DISTINCTROW, which is synonymous). DISTINCT is useful for boiling down a query result,
and often is combined with ORDER BY to place the values in more meaningful order. For
example, if you want to know the names of the drivers listed in the driver_log table, use
the following query:
mysql> SELECT DISTINCT name FROM driver_log ORDER BY name;
+-------+
| name  |
+-------+
| Ben   |
| Henry |
| Suzi  |
+-------+
A query without DISTINCT produces the same names, but is not nearly as easy to
understand:
mysql> SELECT name FROM driver_log;
+-------+
| name  |
+-------+
| Ben   |
| Suzi  |
| Henry |
| Henry |
| Ben   |
| Henry |
| Suzi  |
| Henry |
| Ben   |
| Henry |
+-------+
If you want to know how many different drivers there are, use COUNT(DISTINCT):
mysql> SELECT COUNT(DISTINCT name) FROM driver_log;
+----------------------+
| COUNT(DISTINCT name) |
+----------------------+
|                    3 |
+----------------------+
COUNT(DISTINCT) ignores NULL values. If you want NULL counted as one of the values
in the set whenever it's present, do this:
COUNT(DISTINCT val) + IF(COUNT(IF(val IS NULL,1,NULL))=0,0,1)
The same effect can be achieved using either of the following expressions:
COUNT(DISTINCT val) + IF(SUM(ISNULL(val))=0,0,1)
COUNT(DISTINCT val) + (SUM(ISNULL(val))!=0)
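To see the adjustment at work, here is a small sketch that uses Python's sqlite3 module as a stand-in for a MySQL server (an assumption made only so the example is self-contained). SQLite has no ISNULL( ) function, so the sketch writes the test as val IS NULL; the table t and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val TEXT)")
# Hypothetical sample data: two distinct non-NULL values plus two NULLs.
conn.executemany("INSERT INTO t (val) VALUES (?)",
                 [("a",), ("b",), ("a",), (None,), (None,)])

# COUNT(DISTINCT val) skips NULL values entirely.
distinct_non_null = conn.execute(
    "SELECT COUNT(DISTINCT val) FROM t").fetchone()[0]

# (val IS NULL) evaluates to 1 or 0, so SUM() counts the NULL values and
# the comparison with 0 adds 1 only when at least one NULL is present.
distinct_with_null = conn.execute(
    "SELECT COUNT(DISTINCT val) + (SUM(val IS NULL) != 0) FROM t"
).fetchone()[0]

print(distinct_non_null, distinct_with_null)  # 2 3
conn.close()
```

Note that on an empty table SUM( ) returns NULL, so the adjusted expression is NULL as well; the sketch assumes the table has at least one row.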
COUNT(DISTINCT) is available as of MySQL 3.23.2. Prior to that, you have to use some kind
of workaround based on counting the number of rows in a SELECT DISTINCT query. One way
to do this is to select the distinct values into another table, then use COUNT(*) to count the
number of rows in that table.
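The workaround just described might look something like the following. This is a sketch only: it uses Python's sqlite3 module as a stand-in for MySQL, the temporary table name tmp is hypothetical, and SQLite requires the AS keyword in CREATE TABLE ... AS SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE driver_log (name TEXT)")
# The name values from the driver_log listing shown earlier.
names = ["Ben", "Suzi", "Henry", "Henry", "Ben",
         "Henry", "Suzi", "Henry", "Ben", "Henry"]
conn.executemany("INSERT INTO driver_log (name) VALUES (?)",
                 [(n,) for n in names])

# Step 1: select the distinct values into another table.
conn.execute(
    "CREATE TEMPORARY TABLE tmp AS SELECT DISTINCT name FROM driver_log")

# Step 2: count the rows in that table.
driver_count = conn.execute("SELECT COUNT(*) FROM tmp").fetchone()[0]

print(driver_count)  # 3
conn.close()
```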
DISTINCT queries often are useful in conjunction with aggregate functions to obtain a more
complete characterization of your data. For example, applying COUNT(*) to a customer table
indicates how many customers you have, using DISTINCT on the state values in the table
tells you which states you have customers in, and COUNT(DISTINCT) on the state values
tells you how many states your customer base represents.
When used with multiple columns, DISTINCT shows the different combinations of values in
the columns and COUNT(DISTINCT) counts the number of combinations. The following
queries show the different sender/recipient pairs in the mail table, and how many such pairs
there are:
mysql> SELECT DISTINCT srcuser, dstuser FROM mail
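    -> ORDER BY srcuser, dstuser;
+---------+---------+
| srcuser | dstuser |
+---------+---------+
| barb    | barb    |
| barb    | tricia  |
| gene    | barb    |
| gene    | gene    |
| gene    | tricia  |
| phil    | barb    |
| phil    | phil    |
| phil    | tricia  |
| tricia  | gene    |
| tricia  | phil    |
+---------+---------+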
mysql> SELECT COUNT(DISTINCT srcuser, dstuser) FROM mail;
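+----------------------------------+
| COUNT(DISTINCT srcuser, dstuser) |
+----------------------------------+
|                               10 |
+----------------------------------+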
DISTINCT works with expressions, too, not just column values. To determine the number of
hours of the day during which messages in the mail table were sent, count the distinct
HOUR( ) values:
mysql> SELECT COUNT(DISTINCT HOUR(t)) FROM mail;
+-------------------------+
| COUNT(DISTINCT HOUR(t)) |
+-------------------------+
|                      12 |
+-------------------------+
To find out which hours those were, list them:
mysql> SELECT DISTINCT HOUR(t) FROM mail ORDER BY 1;
+---------+
| HOUR(t) |
+---------+
|       7 |
|       8 |
|       9 |
|      10 |
|      11 |
|      12 |
|      13 |
|      14 |
|      15 |
|      17 |
|      22 |
|      23 |
+---------+
Note that this query doesn't tell you how many messages were sent each hour. That's covered
in Recipe 7.16.
7.6 Finding Values Associated with Minimum and Maximum Values
7.6.1 Problem
You want to know the values for other columns in the row containing the minimum or
maximum value.
7.6.2 Solution
Use two queries and a SQL variable. Or use the "MAX-CONCAT trick." Or use a join.
7.6.3 Discussion
MIN( ) and MAX( ) find the endpoints of a range of values, but sometimes when finding a
minimum or maximum value, you're also interested in other values from the row in which the
value occurs. For example, you can find the largest state population like this:
mysql> SELECT MAX(pop) FROM states;
+----------+
| MAX(pop) |
+----------+
| 29760021 |
+----------+
But that doesn't show you which state has this population. The obvious way to try to get that
information is like this:
mysql> SELECT name, MAX(pop) FROM states WHERE pop = MAX(pop);
ERROR 1111 at line 1: Invalid use of group function
Probably everyone attempts something like that sooner or later, but it doesn't work, because
aggregate functions like MIN( ) and MAX( ) cannot be used in WHERE clauses. The intent of
the statement is to determine which record has the maximum population value, then display
the associated state name. The problem is that while you and I know perfectly well what we'd
mean by writing such a thing, it makes no sense at all to MySQL. The query fails because
MySQL uses the WHERE clause to determine which records to select, but it knows the value of
an aggregate function only after selecting the records from which the function's value is
determined! So, in a sense, the statement is self-contradictory. You could solve this problem
using a subselect, except that MySQL won't have those until Version 4.1. Meanwhile, you can
use a two-stage approach involving one query that selects the maximum value into a SQL
variable, and another that refers to the variable in its WHERE clause:
mysql> SELECT @max := MAX(pop) FROM states;
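mysql> SELECT @max AS 'highest population', name FROM states WHERE pop = @max;
+--------------------+------------+
| highest population | name       |
+--------------------+------------+
|           29760021 | California |
+--------------------+------------+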
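The same two-stage idea carries over to a client program, where an ordinary program variable can play the role of the SQL variable. The sketch below uses Python's sqlite3 module as a stand-in for MySQL (an assumption made for self-containment) and loads only the five states rows that appear in this section. It also shows the subselect form for servers that support it; per the discussion above, that means MySQL 4.1 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states (name TEXT, pop INTEGER)")
# Only the rows shown in this section; the real table has more.
conn.executemany("INSERT INTO states (name, pop) VALUES (?, ?)", [
    ("Alabama", 4040587), ("Alaska", 550043), ("Arizona", 3665228),
    ("Arkansas", 2350725), ("California", 29760021),
])

# Stage 1: fetch the maximum into a program variable (the analog of @max).
max_pop = conn.execute("SELECT MAX(pop) FROM states").fetchone()[0]

# Stage 2: refer to the saved value in the WHERE clause via a placeholder.
two_stage = conn.execute(
    "SELECT pop, name FROM states WHERE pop = ?", (max_pop,)).fetchall()

# Single-statement alternative where subselects are available.
subselect = conn.execute(
    "SELECT pop, name FROM states"
    " WHERE pop = (SELECT MAX(pop) FROM states)").fetchall()

print(two_stage)  # [(29760021, 'California')]
print(subselect)  # [(29760021, 'California')]
conn.close()
```

Both forms return every row that matches the maximum, so if two states tied for the largest population you'd get both.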
This technique also works even if the minimum or maximum value itself isn't actually
contained in the row, but is only derived from it. If you want to know the length of the
shortest verse in the King James Version, that's easy to find:
mysql> SELECT MIN(LENGTH(vtext)) FROM kjv;
+--------------------+
| MIN(LENGTH(vtext)) |
+--------------------+
|                 11 |
+--------------------+
If you want to ask "What verse is that?," do this instead:
mysql> SELECT @min := MIN(LENGTH(vtext)) FROM kjv;
mysql> SELECT bname, cnum, vnum, vtext FROM kjv WHERE LENGTH(vtext) = @min;
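+-------+------+------+-------------+
| bname | cnum | vnum | vtext       |
+-------+------+------+-------------+
| John  |   11 |   35 | Jesus wept. |
+-------+------+------+-------------+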
Another technique you can use for finding values associated with minima or maxima is found
in the MySQL Reference Manual, where it's called the "MAX-CONCAT trick." It's pretty
gruesome, but can be useful if your version of MySQL precedes the introduction of SQL
variables. The technique involves appending a column to the summary column using
CONCAT( ), finding the maximum of the resulting values using MAX( ), and extracting the
non-summarized part of the value from the result. For example, to find the name of the state with
the largest population, you can select the maximum combined value of the pop and name
columns, then extract the name part from it. It's easiest to see how this works by proceeding
in stages. First, determine the maximum population value to find out how wide it is:
mysql> SELECT MAX(pop) FROM states;
+----------+
| MAX(pop) |
+----------+
| 29760021 |
+----------+
That's eight characters. It's important to know this, because each column within the combined
population-plus-name values should occur at a fixed position so that the state name can be
extracted reliably later. (By padding the pop column to a length of eight, the name values will
all begin at the ninth character.)
However, we must be careful how we pad the populations. The values produced by
CONCAT( ) are strings, so the population-plus-name values will be treated as such by
MAX( ) for sorting purposes. If we left justify the pop values by padding them on the right
with RPAD( ), we'll get combined values like the following:
mysql> SELECT CONCAT(RPAD(pop,8,' '),name) FROM states;
+------------------------------+
| CONCAT(RPAD(pop,8,' '),name) |
+------------------------------+
| 4040587 Alabama              |
| 550043  Alaska               |
| 3665228 Arizona              |
| 2350725 Arkansas             |
...
Those values will sort lexically. That's okay for finding the largest of a set of string values with
MAX( ). But pop values are numbers, so we want the values in numeric order. To make the
lexical ordering correspond to the numeric ordering, we must right justify the population
values by padding on the left with LPAD( ):
mysql> SELECT CONCAT(LPAD(pop,8,' '),name) FROM states;
+------------------------------+
| CONCAT(LPAD(pop,8,' '),name) |
+------------------------------+
|  4040587Alabama              |
|   550043Alaska               |
|  3665228Arizona              |
|  2350725Arkansas             |
...
Next, use the CONCAT( ) expression with MAX( ) to find the value with the largest
population part:
mysql> SELECT MAX(CONCAT(LPAD(pop,8,' '),name)) FROM states;
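+-----------------------------------+
| MAX(CONCAT(LPAD(pop,8,' '),name)) |
+-----------------------------------+
| 29760021California                |
+-----------------------------------+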
To obtain the final result (the state name associated with the maximum population), extract
from the maximum combined value the substring that begins with the ninth character:
mysql> SELECT SUBSTRING(MAX(CONCAT(LPAD(pop,8,' '),name)),9) FROM states;
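+------------------------------------------------+
| SUBSTRING(MAX(CONCAT(LPAD(pop,8,' '),name)),9) |
+------------------------------------------------+
| California                                      |
+------------------------------------------------+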
Clearly, using a SQL variable to hold an intermediate result is much easier. In this case, it's
also more efficient because it avoids the overhead for concatenating column values for sorting
and decomposing the result for display.
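If the reason for padding on the left rather than the right still seems murky, it may help to watch the string comparison happen outside the database. The following Python sketch imitates RPAD( ) with ljust( ) and LPAD( ) with rjust( ), using only the five states rows shown in this section. It illustrates the comparison logic and is not MySQL code; Python compares strings by character code, which here gives the same ordering of spaces and digits that the trick relies on.

```python
# (name, pop) pairs taken from the states rows shown in this section.
states = [
    ("Alabama", 4040587), ("Alaska", 550043), ("Arizona", 3665228),
    ("Arkansas", 2350725), ("California", 29760021),
]
WIDTH = 8  # the width of the largest population value

# Left-justified numbers, like RPAD(pop,8,' '): "550043  Alaska" sorts
# highest because "5" is greater than the first digit of every other value.
right_padded = [str(pop).ljust(WIDTH) + name for name, pop in states]
wrong_name = max(right_padded)[WIDTH:]

# Right-justified numbers, like LPAD(pop,8,' '): a leading space sorts
# before any digit, so string order now matches numeric order.
left_padded = [str(pop).rjust(WIDTH) + name for name, pop in states]
right_name = max(left_padded)[WIDTH:]

print(wrong_name)  # Alaska
print(right_name)  # California
```

The slice [WIDTH:] corresponds to SUBSTRING(...,9), because SQL string positions start at 1 and Python's start at 0.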
Yet another way to select other columns from rows containing a minimum or maximum value
is to use a join. Select the value into another table, then join it to the original table to select
the row that matches the value. To find the record for the state with the highest population,
use a join like this:
mysql> CREATE TEMPORARY TABLE t
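    -> SELECT MAX(pop) as maxpop FROM states;
mysql> SELECT states.* FROM states, t WHERE states.pop = t.maxpop;
+------------+--------+------------+----------+
| name       | abbrev | statehood  | pop      |
+------------+--------+------------+----------+
| California | CA     | 1850-09-09 | 29760021 |
+------------+--------+------------+----------+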
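Here is how the join technique might look when driven from a program. As a sketch, it uses Python's sqlite3 module in place of MySQL, omits the statehood column, and loads only the rows that appear in this section; SQLite also needs the AS keyword in CREATE TABLE ... AS SELECT. The join is written with JOIN ... ON, which selects the same rows as the comma join with a WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states (name TEXT, abbrev TEXT, pop INTEGER)")
# Only the rows shown in this section; statehood is left out for brevity.
conn.executemany("INSERT INTO states (name, abbrev, pop) VALUES (?, ?, ?)", [
    ("Alabama", "AL", 4040587), ("Alaska", "AK", 550043),
    ("Arizona", "AZ", 3665228), ("Arkansas", "AR", 2350725),
    ("California", "CA", 29760021),
])

# Select the maximum value into another (temporary) table...
conn.execute(
    "CREATE TEMPORARY TABLE t AS SELECT MAX(pop) AS maxpop FROM states")

# ...then join it to the original table to pick out the matching row.
joined = conn.execute(
    "SELECT states.name, states.abbrev, states.pop"
    " FROM states JOIN t ON states.pop = t.maxpop").fetchall()

print(joined)  # [('California', 'CA', 29760021)]
conn.close()
```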
7.6.4 See Also
For more information about joins, see Chapter 12.
7.7 Controlling String Case Sensitivity for MIN( ) and MAX( )
7.7.1 Problem
MIN( ) and MAX( ) select strings in case sensitive fashion when you don't want them to, or
vice versa.
7.7.2 Solution
Alter the case sensitivity of the strings.
7.7.3 Discussion
When applied to string values, MIN( ) and MAX( ) produce results determined according to
lexical sorting rules. One factor in string sorting is case sensitivity, so MIN( ) and MAX( ) are
affected by that as well. In Chapter 6, we used a textblob_val table containing two
columns of apparently identical values:
mysql> SELECT tstr, bstr FROM textblob_val;
+------+------+
| tstr | bstr |
+------+------+
| aaa  | aaa  |
| AAA  | AAA  |
| bbb  | bbb  |
| BBB  | BBB  |
+------+------+
However, although the values look the same, they don't behave the same. bstr is a BLOB
column and is case sensitive. tstr, a TEXT column, is not. As a result, MIN( ) and MAX( )
will not necessarily produce the same results for the two columns:
mysql> SELECT MIN(tstr), MIN(bstr) FROM textblob_val;
+-----------+-----------+
| MIN(tstr) | MIN(bstr) |
+-----------+-----------+
| aaa       | AAA       |
+-----------+-----------+
To make tstr case sensitive, use BINARY:
mysql> SELECT MIN(BINARY tstr) FROM textblob_val;
+------------------+
| MIN(BINARY tstr) |
+------------------+
| AAA              |
+------------------+
To make bstr not case sensitive, you can convert the values to a given lettercase:
mysql> SELECT MIN(LOWER(bstr)) FROM textblob_val;
+------------------+
| MIN(LOWER(bstr)) |
+------------------+
| aaa              |
+------------------+
Unfortunately, doing so also changes the displayed value. If that's an issue, use this technique
instead (and note that it may yield a somewhat different result):
mysql> SELECT @min := MIN(LOWER(bstr)) FROM textblob_val;
mysql> SELECT bstr FROM textblob_val WHERE LOWER(bstr) = @min;
+------+
| bstr |
+------+
| aaa  |
| AAA  |
+------+
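The difference between the two kinds of comparison, and the reason the last query returns two rows, can be imitated in a few lines of Python. This is only an analogy: Python's default string comparison is by character code, which resembles the case-sensitive BLOB behavior, and lower( ) stands in for LOWER( ).

```python
# The four values stored in each column of textblob_val.
values = ["aaa", "AAA", "bbb", "BBB"]

# Case-sensitive minimum: uppercase letters sort before lowercase ones.
case_sensitive_min = min(values)

# Minimum after folding to lowercase; the original lettercase is lost.
lowered_min = min(v.lower() for v in values)

# To recover the original values, select everything whose lowercase form
# equals that minimum. Both "aaa" and "AAA" qualify, so there are two.
matches = [v for v in values if v.lower() == lowered_min]

print(case_sensitive_min)  # AAA
print(lowered_min)         # aaa
print(matches)             # ['aaa', 'AAA']
```

That's the "somewhat different result" mentioned above: once lettercase is ignored, more than one stored value can be the minimum.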
7.8 Dividing a Summary into Subgroups
7.8.1 Problem