User:OrenBochman/Efficent SQL

Coding with efficient SQL

The tutorial will cover:

Why optimization matters.
Why indexes are the most important tool.
How to avoid unindexed and unlimited queries.
Using EXPLAIN.

The slide deck for this tutorial has specific examples of optimized queries and simple practices that you can use to speed up your queries.

Simple Prep You Need in order to Take this Tutorial

For the practice exercise, you will access a database with sample data in labs. All you need to access it is a labs account and membership in the 'bastion' project (all users who are members of any project are also members of the 'bastion' project). You don't need to be a member of the tutorial project. Before the tutorial, we suggest that you be sure you can access this database. You need to ssh into bastion (ssh bastion.wmflabs.org) and, once you're in, run

mysql -h tutorial-mysql -u tutorial commonswiki_partial

You may also want to suggest a query to be used in the demo, via the list below.

Introduction

To many MediaWiki developers, SQL query performance and optimization are shrouded in mystery. Most know that there are efficient and inefficient queries, and that if they write an inefficient query, it will either be noticed during code review, or it will be noticed because it takes down a wiki, which will prompt an ops person to fix the breakage and yell at the developer who caused it. But few people seem to really understand how query performance works.

In this unit we will answer the following questions:

How can you tell if a query is inefficient?
How do you write efficient queries, and avoid inefficient ones?
What are the best practices for SQL that developers should be aware of?

This tutorial will cover the basics of how database engines in general, and MySQL specifically, execute different kinds of queries, and explain why certain queries are executed more efficiently than others and what role indexes play in this process. We will demonstrate better practices by writing efficient queries, and showing you how to use tables and indexes so they facilitate efficient queries, and discuss common pitfalls that result in inefficient queries and how to address them. We will also demonstrate how to obtain a query analysis from MySQL and how to make sense of it.

Concept 1 Indexes

Indexes are pre-sorted lists of rows and are the key to speeding up queries.

page_id PRIMARY KEY
page_id	page_len
1	10775
67	1305
69	548
70	34
74	50085
77	10292
...	...

INDEX foo (page_len)
page_id	page_len
1320629	15
9609464	15
340226	16
725948	16
940255	16
1065064	16
...	...

SELECT * FROM page
ORDER BY page_len LIMIT 5;

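By using an index on page_len we avoid the cost of sorting the table, which is on the order of N*log(N) operations, where N is the number of rows in the table. For English Wikipedia, N is about 4,000,000 pages, so a full sort would cost roughly 88 million comparisons (taking the logarithm base 2). With the index, the database simply reads the first five entries of the pre-sorted list.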
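
As a rough check of the N*log(N) arithmetic and of the claim that an index removes the sort, here is a small sketch. It is an illustration under assumptions, not the tutorial's demo: it uses Python's built-in sqlite3 module as a stand-in for MySQL (so the plan text comes from SQLite's EXPLAIN QUERY PLAN, not from MySQL's EXPLAIN), the miniature page table and its data are made up, and the logarithm is taken base 2, which is the usual convention for comparison sorts (other bases give smaller figures).

```python
import math
import sqlite3

# Rough cost of sorting N rows with a comparison sort: N * log2(N).
N = 4_000_000
sort_cost = N * math.log2(N)
print(f"sorting {N:,} rows: about {sort_cost / 1e6:.0f} million comparisons")


def uses_temp_sort(conn, sql):
    """Return True if SQLite's plan contains a separate sort step for ORDER BY."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of every plan row is a human-readable description.
    return any("TEMP B-TREE FOR ORDER BY" in row[-1] for row in plan)


# Hypothetical miniature 'page' table with made-up lengths.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_len INTEGER)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(i, (i * 7919) % 1000) for i in range(1, 2001)],
)

query = "SELECT * FROM page ORDER BY page_len LIMIT 5"

sorted_without_index = uses_temp_sort(conn, query)
conn.execute("CREATE INDEX foo ON page (page_len)")
sorted_with_index = uses_temp_sort(conn, query)

print("sort step without index:", sorted_without_index)
print("sort step with index:   ", sorted_with_index)
```

On MySQL the equivalent check is to run EXPLAIN on the query and look for "Using filesort" in the Extra column.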

Concept 2 Limit

Limiting a query allows the database to return the result earlier without calculating the results for the whole table. This can be done in two ways.

- Limit the number of rows returned (using LIMIT).
- Limit the number of rows scanned by making a smarter query, e.g. by using an indexed field.

SELECT * FROM page
ORDER BY page_touched LIMIT 5;

page_id PRIMARY KEY
page_id	page_touched
1	20120522185124
67	20120515122531
69	20120115072601
70	20120123124730
74	20120520174854
...	...

INDEX foo (page_touched)
page_id	page_touched
10212	20050626044452
9609464	20050626044503
340226	20050626044503
725948	20050626044522
31045	20050626044523

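When page_touched is indexed, this query is very efficient: the database reads the first five entries of the index and stops. Another advantage of limited queries can be seen in the query below, which does not require the table to be sorted at all. The database can stop scanning as soon as it has found five matching rows.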

SELECT * FROM page
WHERE page_is_new=0 LIMIT 5;

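If the column in the WHERE clause is indexed (here page_id, the primary key), the following two SELECT queries work like a binary search: the database jumps straight to the right position in the index instead of scanning the table.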

SELECT * FROM page
WHERE page_id = 94109 LIMIT 1;

SELECT * FROM page
WHERE page_id >= 94109
ORDER BY page_id LIMIT 3;
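
The difference between an index lookup and a scan can be made visible by asking the database for its plan. The sketch below again assumes SQLite (via Python's sqlite3) in place of MySQL, with a made-up page table. In SQLite's plan output, SEARCH means the engine seeks to a position in an index, while SCAN means it walks the rows one by one.

```python
import sqlite3

# Hypothetical miniature 'page' table; page_is_new has no index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_is_new INTEGER)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(i, i % 2) for i in range(94000, 94200)],
)


def plan_details(sql):
    """Return the human-readable lines of SQLite's query plan for sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Point lookup and range query on the primary key: both seek into the index.
point_plan = plan_details("SELECT * FROM page WHERE page_id = 94109 LIMIT 1")
range_plan = plan_details(
    "SELECT * FROM page WHERE page_id >= 94109 ORDER BY page_id LIMIT 3"
)
# Filter on an unindexed column: rows are scanned until 5 matches are found.
scan_plan = plan_details("SELECT * FROM page WHERE page_is_new = 0 LIMIT 5")

range_rows = conn.execute(
    "SELECT page_id FROM page WHERE page_id >= 94109 ORDER BY page_id LIMIT 3"
).fetchall()

for name, plan in [("point", point_plan), ("range", range_plan), ("scan", scan_plan)]:
    print(name, plan)
print(range_rows)
```

The range query needs no sort step because the primary key already stores the rows in page_id order.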

Concept 3 Indexing using two fields

Phone book, sorted by (lastname, firstname)
lastname	firstname	phone
Stevens	Daniel	526-3401
Stevenson	Amy	945-8547
Stevenson	John	324-0625
Stevenson	John	324-0625

INDEX foo (page_namespace, page_title)
page_namespace	page_title	page_id
0	Main_Page	3401
0	Nanobots	526
1	Main_Page	8547
2	Catrope	9

Recommendations

Think about how the database will run your query.
- Ask: how can I limit scanned rows?
- Ask: how can I limit returned rows?
- Ask: what can I precalculate via an index, summary table, counter table, etc.?
Add indexes where needed.
Use batch queries (when it makes sense).
In some cases, denormalize for performance.
- Add information duplicated from other tables.
- Summary tables, counter tables, cache tables, etc.
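
To illustrate the batch query recommendation, the sketch below fetches the same five rows first with one query per id and then with a single IN query. It is a minimal example under assumptions: SQLite (via Python's sqlite3) stands in for MySQL, and the table contents and the wanted_ids list are made up. With an in-memory database the saving is invisible, but against a real server every separate query costs a network round trip.

```python
import sqlite3

# Hypothetical miniature 'page' table with made-up lengths.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_len INTEGER)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(i, i * 10) for i in range(1, 101)],
)

wanted_ids = [3, 14, 15, 92, 65]

# Many queries: one round trip per id.
one_by_one = {}
for page_id in wanted_ids:
    row = conn.execute(
        "SELECT page_len FROM page WHERE page_id = ?", (page_id,)
    ).fetchone()
    one_by_one[page_id] = row[0]

# Batch query: one round trip, with one placeholder per id.
placeholders = ", ".join("?" for _ in wanted_ids)
batched = dict(
    conn.execute(
        f"SELECT page_id, page_len FROM page WHERE page_id IN ({placeholders})",
        wanted_ids,
    )
)

print(one_by_one == batched)
```

Only the placeholders are interpolated into the SQL string; the ids themselves are still passed as bound parameters.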

Avoid running unindexed queries.
- Caveat: a WHERE on a rarely false condition is usually OK.
- An unindexed ORDER BY (filesort) is never OK.
- ORDER BY on an expression also forces a filesort, so avoid it.
Avoid paging with OFFSET (or LIMIT 50,50).
- LIMIT 10 OFFSET 5000 scans 5010 rows
- Use WHERE foo_id >= 12345 instead
Avoid using COUNT(), SUM(), GROUP BY, etc...
- These do not limit rows scanned.
- MAX()/MIN() of indexed field on entire table is OK.
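
The paging recommendation can be sketched as follows. The example assumes SQLite (via Python's sqlite3) as a stand-in for MySQL and a made-up table foo with gap-free ids, so that both approaches return the same page. In general OFFSET counts rows while the WHERE version continues from a remembered key value.

```python
import sqlite3

# Hypothetical table 'foo' with ids 1..6000 and no gaps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (foo_id INTEGER PRIMARY KEY, foo_text TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 6001)],
)

PAGE_SIZE = 10

# OFFSET paging: the database has to walk past 5000 rows before returning 10.
offset_page = conn.execute(
    "SELECT foo_id FROM foo ORDER BY foo_id LIMIT ? OFFSET ?",
    (PAGE_SIZE, 5000),
).fetchall()

# Key-based paging: remember where the next page starts and seek there directly.
start_id = 5001  # first foo_id of the wanted page, remembered from the previous page
keyset_page = conn.execute(
    "SELECT foo_id FROM foo WHERE foo_id >= ? ORDER BY foo_id LIMIT ?",
    (start_id, PAGE_SIZE),
).fetchall()

print(offset_page == keyset_page)
print(keyset_page[0], keyset_page[-1])
```

One way to obtain start_id is to fetch PAGE_SIZE + 1 rows for each page: the extra row is not displayed, and its id becomes the starting point of the next page.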

Suggest a Query to be Used in the Demo

Below, add a query you want to see optimized in the tutorial Demo (list query suggestions here):

ContributionScores query

SELECT user_id,
   user_name,
   user_real_name,
   page_count,
   rev_count,
   page_count+SQRT(rev_count-page_count)*2 AS wiki_rank
FROM `bw_user` u
JOIN (
  (
    SELECT rev_user,
      COUNT(DISTINCT rev_page) AS page_count,
      COUNT(rev_id) AS rev_count
    FROM `bw_revision`
    WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
    GROUP BY rev_user
    ORDER BY page_count DESC
    LIMIT 50
  ) UNION (
    SELECT rev_user,
      COUNT(DISTINCT rev_page) AS page_count,
      COUNT(rev_id) AS rev_count
    FROM `bw_revision`
    WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
    GROUP BY rev_user
    ORDER BY rev_count DESC
    LIMIT 50
  )
) s ON (user_id=rev_user)
ORDER BY wiki_rank DESC LIMIT 50;

MW:Extension:ContributionScores polls the wiki database to locate contributors with the highest contribution volume - this has NOT been tested on a high-volume wiki. The extension is intended for fledgling Wikis looking to add a fun metric for Contributors to see how much they are helping out.

It is used at translatewiki.net and occasionally causes (very) slow queries there. Example query:

# Time: 120525  6:51:18
# User@Host: twn[twn] @ localhost []
# Query_time: 28.124669  Lock_time: 0.000105 Rows_sent: 50  Rows_examined: 18860849
SELECT user_id,
   user_name,
   user_real_name,
   page_count,
   rev_count,
   page_count+SQRT(rev_count-page_count)*2 AS wiki_rank
FROM `bw_user` u
JOIN (
  (
    SELECT rev_user,
     COUNT(DISTINCT rev_page) AS page_count,
     COUNT(rev_id) AS rev_count
    FROM `bw_revision`
    WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
    GROUP BY rev_user
    ORDER BY page_count DESC
    LIMIT 50
  ) UNION (
    SELECT rev_user,
      COUNT(DISTINCT rev_page) AS page_count,
      COUNT(rev_id) AS rev_count
    FROM `bw_revision`
    WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
    GROUP BY rev_user
    ORDER BY rev_count DESC
    LIMIT 50
  )
) s ON (user_id=rev_user)
ORDER BY wiki_rank DESC LIMIT 50;

Uses

2. Batch query vs. many queries: I don't have an example off the top of my head, but this may be a less obvious optimisation with potentially big rewards.

Discussion

If there are any questions you have about this lesson, ask them! My job, as your adopter, is to help you with any problem you may have. If you don't have any questions that you need to ask, your next step is to take a short test regarding this lesson. If you are ready to take the test, simply tell me (either on this page or on my talk page) and I will hand it out to you.