[leetcode/db] 595 big countries

Write a SQL solution to output big countries’ name, population and area.

Problem

There is a table World

+-----------------+------------+------------+--------------+---------------+
| name            | continent  | area       | population   | gdp           |
+-----------------+------------+------------+--------------+---------------+
| Afghanistan     | Asia       | 652230     | 25500100     | 20343000      |
| Albania         | Europe     | 28748      | 2831741      | 12960000      |
| Algeria         | Africa     | 2381741    | 37100000     | 188681000     |
| Andorra         | Europe     | 468        | 78115        | 3712000       |
| Angola          | Africa     | 1246700    | 20609294     | 100990000     |
+-----------------+------------+------------+--------------+---------------+

A country is big if it has an area of bigger than 3 million square km or a population of more than 25 million.
Write a SQL solution to output big countries' name, population and area.
For example, according to the above table, we should output:

+--------------+-------------+--------------+
| name         | population  | area         |
+--------------+-------------+--------------+
| Afghanistan  | 25500100    | 652230       |
| Algeria      | 37100000    | 2381741      |
+--------------+-------------+--------------+

Solution

This is quite an easy one. We need to select the rows which meet either of the two requirements. The keyword OR or UNION coud be used to solve the problem.
Solution 1: OR

SELECT name, population, area
FROM World
WHERE area > 3000000 OR population > 25000000
3

Solution 2: UNION

SELECT name, population, area
FROM World
WHERE area > 3000000

UNION

SELECT name, population, area
FROM World
WHERE population > 25000000

In this case, UNION is faster than OR. The difference is caused by the use of index, which we will discuss below in details.

Index

First, lets look at the definition for index

“An index is an on-disk structure associated with a table or view that speeds retrieval of rows from the table or view. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.”

Usually, the use of index can boost the performance of query:

“When performing a table scan, the query optimizer reads all the rows in the table, and extracts the rows that meet the criteria of the query. When the query optimizer uses an index, it searches the index key columns, finds the storage location of the rows needed and extracts the matching rows from that location.Generally, searching the index is much faster than searching the table because unlike a table, an index frequently contains very few columns per row and the rows are in sorted order.”

Clustered Index vs. Nonclustered Index

We can classify indexes from different angles:

  • By storage type: Clustered Index, Nonclustered Index
  • By uniqueness: Unique Index, Normal Index
  • By number of keys: Single-column Index, Multi-column Index

Here we only focus on comparing Clustered Index and Nonclustered Index:

Clustered Index

  • Sort and store the data rows in the table or view based on their key values(columns included in the index definition).
  • There can be only one clustered index per table, because the data rows themselves can be stored in only one order.
  • When a table has a clustered index, the table is called a clustered table. If it doesn’t, its data rows are stored in an unordered structure called a heap.

Nonclustered Index

  • Have a structure separate from the data rows. A nonclustered index composed of a key value and a pointer(row locator) to the data row that contains the key value.
  • The structure of the row locator depends on whether the data pages are stored in a heap or a clustered table. For a heap, a row locator is a pointer to the row. For a clustered table, the row locator is the clustered index key.
  • There can be only multiple nonclustered index per table.

Conclusion

Using UNION is faster when it comes to cases like this, when we need to scan two different columns.
Given that MySQL usually uses one index per table in a given query:
When using OR, it only uses the 1st index rather than 2nd index, it would still have to do a table-scan to find rows that fit the 2nd index.
When using UNION, each sub-query can use the index of its search, then the sub-queries are combined.

The performance statistics can be found in here:

Scenario 4: Selecting Clustered index columns for different fields
            CPU       Reads        Duration       Row Counts
OR           0         319           366           1228
UNION        0          50           193           1228

Reference

1. Union and OR and the Explanation
2. Whether to use UNION or OR in SQL Server Queries
3. Clustered and Nonclustered Indexes Described
4. MySQL: index and key (Chinese)