Wednesday, September 6, 2017

Write-heavy workloads with MyRocks

MyRocks is based on RocksDB and RocksDB is write-optimized, so why don't write-heavy workloads always run faster on MyRocks than on InnoDB? I will start with a summary of MyRocks versus InnoDB: MyRocks has better space & write efficiency, frequently has better write latency and sometimes has better read latency. Better space & write efficiency means you use less SSD and it will last longer.

Don't forget that better write and space efficiency with MyRocks can lead to better read latency. When a database engine does fewer writes to storage there is more IO capacity available for reads which is a big deal with disk and can be a big deal with slower SSD. When a database engine uses less space in storage then it is likely to cache more data in RAM and have better cache hit rates on reads. Many of the benchmarks that I run use uniform distribution for key generation and won't show the benefit from better cache hit rates.

With RocksDB write usually means a call to Put and Put is a blind-write. As shown by many benchmark results, a blind write can be very fast with RocksDB - insert data into memtable, optionally flush the WAL to the OS page cache, optionally force the WAL to persistent storage. But SQL update and insert statements usually need much more than a blind-write and the reads done by MyRocks can explain why some write-heavy workloads are faster with InnoDB. Things that get in the way include:
  • Pluggable storage engine APIs are slow to adopt blind-write optimizations
  • The modified row count must be returned from an update statement 
  • Enforcement of the PK and unique constraints
  • Secondary index maintenance
Pluggable storage engine APIs are slow to adopt blind-write optimizations. I don't blame MySQL and MongoDB for this because such optimizations are limited to write-optimized engines. But MyRocks and MongoRocks are here and I expect that WiredTiger/MongoDB will eventually have an LSM in production. M*Rocks engines can use the merge operator for this. I don't expect blind-write optimizations to ever be implemented for an update-in-place b-tree.

The modified row count must be returned for an update statement and that requires evaluation of the where clause. For RocksDB this requires reads -- from the memtable, maybe from the LSM tree, OS page cache and storage. Reads from storage and the OS page cache might require decompression. This is a lot more work than a blind-write. The usage of blind-write optimizations will result in statements that are update-like and insert-like because the application programmers must be aware of the semantic differences -- modified row count won't be returned, constraint violations won't be acknowledged. I think it is worth doing.

Unique constraints must be enforced for the PK and unique secondary indexes. With the exception of an auto-increment column, this requires a read from the index to confirm the value does not exist. In the best case updates & inserts are in key order and the structures to be searched are cached in RAM -- otherwise this requires reads from the OS page cache and/or storage and might require decompression.

Secondary indexes must be maintained as part of the update & insert statement processing. For an insert this means that a new index entry will be inserted. For an update that requires index maintenance the existing index entry will be deleted and a new one will be inserted. With a b-tree the leaf page(s) will be read (possibly from storage), made dirty in the buffer pool and eventually those pages will be written back to storage. With RocksDB non-unique secondary index maintenance is read-free and blind-writes are done to put the new index entry and possibly delete-mark the old index entry. See the previous paragraph if there is a unique secondary index.

No comments:

Post a Comment