User:OrenBochman/Efficent SQL
This is an assignment for Simone's adoption program. You are welcome to edit this page if you notice any errors or have any additional information to add, but as a courtesy, please notify OrenBochman if you make any major changes to avoid any possible confusion between him and his adoptee(s). Thanks! |
Coding with efficent SQL
editThe tutorial will cover:
- Why Optimization Matters?
- Why Indices are most important?
- How to avoid Unindexed and Unlimited Queries?
- Using EXPLAIN.
The slide deck for this tutorial has specific examples of optimized queries and simple practices that you can use to speed up your queries.
- Simple Prep You Need in order to Take this Tutorial
For the practice exercise, you will access a database with sample data in labs. All you need to access it is a labs account and membership in the 'bastion' project (all users who are members of any project are also members of the 'bastion' project). You don't need to be a member of the tutorial project. Before the tutorial, we suggest that you be sure you can access this database. You need to ssh into bastion (ssh bastion.wmflabs.org) and, once you're in, run
mysql -h tutorial-mysql -u tutorial commonswiki_partial
You may also want to suggest a query to be used in the demo, via the list below.
Introduction
editTo many MediaWiki developers, SQL query performance and optimization is shrouded in mystery. Most know that there are efficient and inefficient queries, and that if they write an inefficient query, it will either be noticed during code review, or it will be noticed because it takes down a wiki, which will prompt an ops person to fix the breakage and yell at the developer who caused it. But few people seem to really understand how query performance works.
In this unit we will answer the following questions:
- How can you tell if a query is inefficient?
- How do you write efficient queries, and avoid inefficient ones?
- What are the best pracatices for SQL that developers should be aware of?
This tutorial will cover the basics of how database engines in general, and MySQL specifically, execute different kinds of queries, and explain why certain queries are executed more efficiently than others and what role indexes play in this process. We will demonstrate better practices by writing efficient queries, and showing you how to use tables and indexes so they facilitate efficient queries, and discuss common pitfalls that result in inefficient queries and how to address them. We will also demonstrate how to obtain a query analysis from MySQL and how to make sense of it.
Concept 1 Indexes
editIndexes are Pre-sorted lists of rows and are the key to speeding up queries.
|
|
SELECT * FROM page
ORDER BY page_len LIMIT 5;
by using an index in our table we can the cost of avoid sorting the table which amounts to saving N*Log(N), where N is the number of items in the the table. For english wikipedia we have about N=4,000,000 pages which would cost about 26 million operations.
Concept 2 Limit
editLimiting a query allows the database to return the result earlier without calculating the results for the whole table. This can be done in two ways.
- Limit the number or rows returned using (LIMIT)
- Limit the number or rows scanned by making a smarter query, e.g. by using an indexed field.
SELECT * FROM page
ORDER BY page_touched LIMIT 5;
|
|
When the table is index it is very efficent. Another advantage of limited queries can be seen in the query bellow which does not require the table to be sorted.
SELECT * FROM page
WHERE page_is_new=0 LIMIT 5;
If the table has an index then the following two select queries works as a binary search.
SELECT * FROM page
WHERE page_id = 94109 LIMIT 1;
SELECT * FROM page
WHERE page_id >= 94109
ORDER BY page_id LIMIT 3;
Concept 3 Indexing using two fields
edit
|
|
Recoomendations
edit- Think about how the Database will run your query.
- Ask how can I limit scaned rows ?
- Ask how can I limit returned rows ?
- What can I precalculate via an Index, Summary table, Counter table, etc...?
- Add indexes where needed.
- Use batch queries (when it makes sense).
- In some cases, denormalize for performance.
- Add information duplicated from other tables.
- Summary tables, counter tables, cache tables, etc.
- Avoid running unindexed queries.
- caveat: WHERE on rarely false conditions usually OK
- Unindexed ORDER BY (filesort) never OK
- ORDER BY expression --> filesort == bad
- QAvoid Page with OFFSET (or LIMIT 50,50)
- LIMIT 10 OFFSET 5000 scans 5010 rows
- Use WHERE foo_id >= 12345 instead
- Avoid using COUNT(), SUM(), GROUP BY, etc...
- These do not limit rows scanned.
- MAX()/MIN() of indexed field on entire table is OK.
Suggest a Query to be Used in the Demo
editBelow, add a query you want to see optimized in the tutorial Demo (list query suggestions here):
- ContributionScores query
SELECT user_id,
user_name,
user_real_name,
page_count,
rev_count,
page_count+SQRT(rev_count-page_count)*2 AS wiki_rank
FROM `bw_user` u
JOIN (
(
SELECT rev_user,
COUNT(DISTINCT rev_page) AS page_count,
COUNT(rev_id) AS rev_count
FROM `bw_revision`
WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
GROUP BY rev_user
ORDER BY page_count DESC LIMIT 50
)UNION(
SELECT rev_user,
COUNT(DISTINCT rev_page) AS page_count,
COUNT(rev_id) AS rev_count
FROM `bw_revision`
WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
GROUP BY rev_user
ORDER BY rev_count DESC LIMIT 50
)
) s ON (user_id=rev_user) ORDER BY wiki_rank DESC LIMIT 50;
MW:Extension:ContributionScores polls the wiki database to locate contributors with the highest contribution volume - this has NOT been tested on a high-volume wiki. The extension is intended for fledgling Wikis looking to add a fun metric for Contributors to see how much they are helping out.
It is used at translatewiki.net and occasionally causes (very) slow queries there. Example query:
# Time: 120525 6:51:18
# User@Host: twn[twn] @ localhost []
# Query_time: 28.124669 Lock_time: 0.000105 Rows_sent: 50 Rows_examined: 18860849
SELECT user_id,
user_name,
user_real_name,
page_count,
rev_count,
page_count+SQRT(rev_count-page_count)*2 AS wiki_rank
FROM `bw_user` u
JOIN (
(
SELECT rev_user,
COUNT(DISTINCT rev_page) AS page_count,
COUNT(rev_id) AS rev_count
FROM `bw_revision`
WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
GROUP BY rev_user
ORDER BY page_count DESC
LIMIT 50
) UNION (
SELECT rev_user,
COUNT(DISTINCT rev_page) AS page_count,
COUNT(rev_id) AS rev_count
FROM `bw_revision`
WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
GROUP BY rev_user
ORDER BY rev_count DESC
LIMIT 50
)
) s ON (user_id=rev_user)
ORDER BY wiki_rank DESC LIMIT 50;
- Uses
2. Batch query vs. many queries Don't have an example off the top of my head, but this may be a less obvious optimisation with potentially big rewards.
Discussion
editIf there are any questions you have about this lesson, ask them! My job, as your adopter, is to help you with any problem you may have. If you don't have any questions that you need to ask, your next step is to take a short test regarding this lesson. If you are ready to take the test, simply tell me (either on this page or on my talk page) and I will hand it out to you.