Sunday, February 15, 2009

Statement-based versus row-based replication

I'm no database expert and actually only recently became aware of the challenges in programming in a replicated environment. As I stated in another post, you need to insert data on the master, read from the slaves... and deal with some exceptions to this.

Anyhow, MySQL has been using up until recently a replication method called Statement-based which actually gives the whole SQL statement to the slave so that the slave can execute it to keep in sync with the master. It works well... in most cases. There are a couple of problems. Let's say you have an auto-increment table, and you're sending a bunch of insert statements to the slaves... there is *no* garantee that the statements will be executed in order on the slave. It could be multi-CPU where different inserts are done in different threads, or many other reasons... and you'll end up with the wrong incremented IDs. Also, let's say you'd like to use the NOW() function to get the current date... you might end up with a different date on all slaves for the same record. MySQL is trying to help you as much as it can: if you assign a random value to a field using the rand() function, it will also send the seed information to the slaves to that the random number ends up being the same.

We were going to implement some Oracle-style sequences to deal with auto-increment, but some hero at work made this amazing discovery: MySQL now supports row-based replication since version 5.1. What is row-based replication? Well, the idea is that the master will simply give the slave *data* information, not a statement. You will propragate the content of your modifications. This is so very cool. Most other databases work that way, but MySQL has been lagging in implementing this. Why is row-based so cool? Well, first of all, it fixes all problems mentioned above. Auto-increment table will be allright, using NOW() is OK... and also, if you have a bunch of triggers on the master, there's no need to replicate them on the slave because no trigger will get triggered since you're not executing statements anymore. Row-based will also create less table lockings and replicate faster.

I've always said: to any technique that offers loads of advantages, there's often at least one disadvantage that comes with it. Row-based has one of them: when you use a statement that will mass delete or mass insert records (insert into ... select from) it will will actually fill the replication log with one entry for each row that has been changed. So if you clear a table that has one million records... you're going to get massive replication going on there. At our company, we decided that this was an acceptable drawback, as in our system mass deletes, inserts or updates are very rare.

Row-based replication is relatively new to MySQL, you can expect many changes related to this mode in the few next releases but so far with the limited tests we've been conducting it seems to work very nicely.

No comments:

Post a Comment