Sunday, September 19, 2010

2010 Oracle OpenWorld - MySQL Sunday

Just came back from the MySQL Sunday sessions at 2010 Oracle OpenWorld. I want to share some thoughts while the memory is still fresh. The two Facebook sessions were pretty good: Harrison shared some insight into how Facebook runs MySQL replication from an operations perspective, and Mark's session was similar to the one he gave at the MySQL Conference this April.

Global Transaction IDs
  We recently experienced some network issues between our data center and our cloud provider. Finding the right binlog position for the slaves in the cloud after a network issue can be tricky. In many cases the relay log is totally toasted, and we need to CHANGE MASTER back to the previous position based on Exec_Master_Log_Pos. The catch is that we can't guarantee the integrity of the slave data, because some corrupted statement could have been applied. In most cases the slave SQL thread stops because it reads an invalid statement from the corrupted relay log, but who's to say no corrupted yet 'valid' statement got applied?
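For the record, the re-point we do looks roughly like this (the file name and position are made up; on a real slave they come from its own SHOW SLAVE STATUS output):

-- See where the SQL thread actually got to in the master's binlog.
SHOW SLAVE STATUS\G
--   Relay_Master_Log_File: mysql-bin.000117
--   Exec_Master_Log_Pos:   73422111
-- CHANGE MASTER TO with explicit coordinates discards the existing
-- (corrupted) relay logs and re-fetches from the master starting at
-- the last executed position.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_LOG_FILE='mysql-bin.000117',  -- Relay_Master_Log_File
  MASTER_LOG_POS=73422111;             -- Exec_Master_Log_Pos
START SLAVE;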
  Well, we are still considered lucky in this case, as the master is operational and all binlogs are in good shape. Imagine if the master crashes and we need to promote one slave to be the new master and sync the other slaves to it. Theoretically we need to find the slave closest to the master, promote it, and then sync the rest of the slaves to it. That is easier said than done in a real-world crash, because MySQL doesn't provide an easy mapping between a master binlog position and a slave relay log position. On top of that, each slave maintains its own relay log, which means relay log file names and positions can't be shared globally among the slaves. This makes the task of syncing all slaves with the new master particularly difficult, if not impossible. What the Facebook and Google MySQL teams use is a Global Transaction ID (similar to the SCN in Oracle); they map this Global Transaction ID to relay log and binlog positions, which makes slave/master synchronization a lot easier. Unfortunately, this Global Transaction ID only exists in the Google and Facebook MySQL patches, not in mainline MySQL.
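Without it, the promotion dance is all manual; a sketch of the stock-MySQL version (host names and coordinates below are made up):

-- 1. On each surviving slave, see how far it got in the dead
--    master's binlog, and pick the most advanced one as new master.
SHOW SLAVE STATUS\G
--   Relay_Master_Log_File: mysql-bin.000231
--   Exec_Master_Log_Pos:   10734422
-- 2. On every other slave, point replication at the promoted box.
--    The painful part: the coordinates above are in the DEAD
--    master's binlog, so the matching position in the NEW master's
--    binlog has to be hunted down by hand (mysqlbinlog and grep),
--    which is exactly the mapping a Global Transaction ID gives
--    you for free.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='new-master.example.com',
  MASTER_LOG_FILE='mysql-bin.000004',  -- found manually
  MASTER_LOG_POS=4107;                 -- found manually
START SLAVE;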

DRBD Master 
   The second thought is derived from the first: switching between master and slaves is not an easy task, especially in a multi-slave setup, and it's particularly difficult when the master crashes instead of shutting down gracefully. Even with a Global Transaction ID it still involves manual work and prolonged downtime. So some users actually set up a protected master using DRBD (http://www.drbd.org/): a pair of servers in a cluster with block-to-block mirroring, so that if one crashes, the other can quickly take over with minimal data loss.
Thus we get a protected, guaranteed master, and we can build the slaves against the cluster. This greatly reduces the need to re-synchronize slaves, since the master, as far as the slaves can tell, pretty much never dies.
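A minimal DRBD resource definition for such a pair might look like the sketch below (host names, devices, and addresses are all made up; protocol C is the synchronous mode you want if minimum data loss is the goal):

resource mysql {
  protocol C;                 # synchronous block-level replication
  on db1 {
    device    /dev/drbd0;     # the device MySQL's datadir lives on
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on db2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

MySQL runs only on the active node; on failover the standby promotes the DRBD device, mounts the filesystem, and starts mysqld, typically under a cluster manager such as Heartbeat.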
Of course, this is not a solution for everything. By eliminating the need to switch between master and slave, we also lose some of the advantages of master/slave rotation. For example, we can't run an expensive DDL like ALTER TABLE on a slave, swap it in as master, and repeat the DDL on the former master, because the DRBD standby is only a block-level copy, not a live slave we can run statements on.

Seconds_Behind_Master
   I think most of you have already noticed that this is not a reliable indicator of whether a slave is running behind the master. The problem is that it is not heartbeat-based: it's only updated when the slave receives replication events. So a slave will happily report 0 seconds behind master even when it hasn't received any event for a long time. Sometimes that's legitimate, because there really were no events on the master, but a lot of the time it's because of a network issue between master and slave: the connection simply got dropped. The workaround is to update a heartbeat table on the master and check the difference on the slave manually. MySQL 5.5 has a built-in heartbeat mechanism, which is definitely a big plus for MySQL replication.
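A minimal version of that heartbeat trick might look like this (table and column names are made up, and the check trusts NTP, since it compares the master's timestamp against the slave's clock):

-- On the master: one row, refreshed every second from cron or an event.
CREATE TABLE heartbeat (id INT PRIMARY KEY, ts DATETIME NOT NULL);
REPLACE INTO heartbeat (id, ts) VALUES (1, NOW());

-- On the slave: the age of the replicated row is the real lag.
SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS real_lag_seconds
FROM heartbeat WHERE id = 1;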

Tuesday, September 7, 2010

log-bin.~rec~ in BINLOG directory

Has anyone else noticed a log-bin.~rec~ file under the MySQL binlog directory? I didn't notice this file until recently, when our rsync script that syncs binlogs to a backup server failed with this error:


building file list ... rsync: link_stat "/DB/MYPROD/arch/MPROD_arch.~rec~" failed: No such file or directory (2)
rsync error: some files could not be transferred (code 23) at main.c(892) [sender=2.6.8]


A little research revealed that this is actually a by-product of a MySQL bug fix: it's a temporary register file MySQL uses to record binlog files before adding or purging them. It only exists briefly in the binlog directory; our rsync script was lucky enough to catch it in the file list, then failed because it was gone before it could be synced.
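Our quick fix was simply to have rsync skip the transient file (the backup host name here is a stand-in for ours):

rsync -av --exclude='*.~rec~' /DB/MYPROD/arch/ backuphost:/DB/MYPROD/arch/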

http://bugs.mysql.com/bug.php?id=45292

The fix, quoting the bug report, records the files to be purged or created before really removing or adding them, so that if a failure happens such records can be used to automatically remove dangling files. The new steps are outlined as follows:
(purge routine - sql/log.cc - MYSQL_BIN_LOG::purge_logs)
1 - register the files to be removed in the log-bin.~rec~ placed in the data directory.
2 - update the log-bin.index.
3 - flush the log-bin.index.
4 - delete the log-bin.~rec~.
(create routine - sql/log.cc - MYSQL_BIN_LOG::open)
1 - register the file to be created in the log-bin.~rec~ placed in the data directory.
2 - open the new log-bin.
3 - update the log-bin.index.
4 - delete the log-bin.~rec~.
(recovery routine - sql/log.cc - MYSQL_BIN_LOG::open_index_file)
1 - open the log-bin.index.
2 - open the log-bin.~rec~.
3 - for each file in log-bin.~rec~.
3.1 Check if the file is in the log-bin.index and if so ignore it.
3.2 Otherwise, delete it.
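To make the recovery idea concrete, it boils down to something like this shell sketch (an illustration of the logic only, not MySQL's actual code):

# Anything listed in log-bin.~rec~ but absent from log-bin.index
# is a dangling file left over by a crash and can be removed.
while read f; do
  grep -qxF "$f" log-bin.index || rm -f "$f"
done < log-bin.~rec~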
The third issue can be described as follows: the purge operation allowed removing a file that was still in use, leading to data loss and possible inconsistencies between the master and slave. Roughly, the routine only took the dump threads into account, so if a slave was not connected, the file might be deleted even though it was still in use.