Friday, May 30, 2008

Optimizing a large MySQL table that gets a lot of inserts

When you develop a web application that is database driven that is going to log a lot of data, you really have to consider the database structure carefully.

Here's an example of how a log table was optimized. This table had over 60 million rows and getting another million rows a day!

--

The problem with this table (in terms of speed) is that

It’s quite large (62M rows)
It’s heavily indexed.
Large PK (Primary Key) and fields (VARCHARS vs CHAR/INT)

The first one will make searching the data expensive and the second one will make inserting data also expensive. So basically its all bad in terms of read writes.

Having a large primary key might be an issue with Innodb. Also since all the data is an MD5, using CHAR(32) would provide faster lookups.

Things that can be done:

Partition the data to smaller tables like ‘log_01_08’, ‘log_02_08’, etc. And have your application hit the appropriate table when pulling reports.

Create summary tables with Triggers where a trigger will update a table upon an insert. This way if you need to pull stats on the table, you don’t have to query the huge table but the summary table. So on any insert in log table, update summary table with new counter if hits (for example). Of course this depends on the type reports that you’re pulling off this large table.

Trim the indexes to the ones you only need. Indexing all the fields in a table will make INSERTS (and restoring) a really expensive operation, as it has to update the index as well. You can also use multi-field indexes. If you have a query like:

SELECT hashid FROM logs WHERE hashid = ‘xxxx’ AND querydate > ‘date’

Then consider creating an index like (hashid,queryid) and not 2 separate indexes. Trim your indexes as much as you can. If these cannot be converted to numeric hashes, then consider only indexing the first 12-16 characters and not all 32 chars. This assumes that the first 12 chars are unique enough. Of course this does not apply to the primary key.

I see that your hash your fields, maybe consider using a numeric function to hash your text like CRC32. The advantage of this is that in your columns you would have numeric values and not alphanumeric like you have right now (MD5). This not only translates to a smaller filesystem footprint but a much faster index search.

0 Comments:

Post a Comment

<< Home