
MatrixOne Logtail

Posted by Han Feng

The earlier TAE analysis article discussed the three primary responsibilities of TN (Transaction Node) in the system:

  1. Handling committed transactions;
  2. Providing the Logtail service for CN (Compute Node);
  3. Transferring the latest transaction data to object storage and managing the log window.

Upon the completion of tasks 1 and 3, state changes occur, such as data being successfully written to memory or object storage. Logtail is a log synchronization protocol that synchronizes part of TN's state to CN in a cost-effective manner. CN acquires Logtail to locally reconstruct necessary, readable data. As a key protocol in MatrixOne's storage-compute separation architecture, Logtail has the following features:

  • It links and coordinates CN and TN; it is essentially a log replication state machine that allows CN to synchronize part of TN's state.
  • In push mode, logtail offers stronger real-time behavior, continuously synchronizing incremental logs from TN to CN.
  • In pull mode, a table snapshot can be synchronized for any specified time interval, along with incremental log information as needed.
  • Logtail supports subscription and collection at the table level, which makes supporting multiple CNs more flexible and balances CN load more effectively.

Logtail protocol

In simple terms, the main content of the Logtail protocol is divided into two parts: in-memory data and metadata. The core difference lies in whether the data has been transferred to object storage.

The updates generated by a transaction commit, before being transferred to object storage, are presented as in-memory data in the logtail.

Any modification to data can be categorized into two forms: insertion and deletion. For insertion, logtail information includes the row-id, commit-timestamp, and columns defined in the table; for deletion, it includes the row-id, commit-timestamp, and primary key columns.
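The two entry shapes described above can be sketched as minimal Go structs. The type and field names below are illustrative, not MatrixOne's actual definitions:

```go
package main

import "fmt"

// Timestamp and RowID stand in for MatrixOne's internal types (hypothetical).
type Timestamp uint64
type RowID uint64

// InsertEntry carries the row-id, the commit timestamp, and the values of
// all columns defined in the table.
type InsertEntry struct {
	RowID    RowID
	CommitTS Timestamp
	Columns  []any // one value per table column
}

// DeleteEntry needs only the row-id, the commit timestamp, and the primary
// key, which is enough for CN to locate and mask the deleted row.
type DeleteEntry struct {
	RowID      RowID
	CommitTS   Timestamp
	PrimaryKey any
}

func main() {
	ins := InsertEntry{RowID: 1, CommitTS: 100, Columns: []any{int64(42), "hello"}}
	del := DeleteEntry{RowID: 1, CommitTS: 105, PrimaryKey: int64(42)}
	fmt.Println(len(ins.Columns), del.CommitTS > ins.CommitTS)
}
```

Note that a delete does not repeat the full row: the primary key is sufficient for CN to apply it against its local view.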

Once such in-memory data is transferred to CN, it is organized in memory as a Btree, serving queries from the upper layer.

Clearly, in-memory data cannot be retained indefinitely, as it increases memory pressure. When time or capacity limits are reached, in-memory data is flushed to object storage to form an object. An object consists of one or more blocks. A block is the smallest storage unit of table data; the number of rows in a block cannot exceed a fixed limit, currently a default of 8192 rows.

When the flush completes, logtail transfers the block metadata to CN. CN then filters out a visible block list based on the transaction timestamp, reads the contents of those blocks, and merges them with the in-memory data to obtain a complete view of the data at a specific point in time.
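The two mechanics in this paragraph, splitting a flush into fixed-size blocks and filtering blocks by the reader's transaction timestamp, can be sketched as follows (function names and the flat slice of commit timestamps are illustrative):

```go
package main

import "fmt"

const blockMaxRows = 8192 // the current default per-block row limit

// blocksNeeded returns how many blocks a flush of n rows produces,
// given the fixed per-block row limit.
func blocksNeeded(n int) int {
	return (n + blockMaxRows - 1) / blockMaxRows
}

// visibleBlocks filters block metadata by the reader's transaction
// timestamp: only blocks committed at or before readTS belong to the
// snapshot the reader sees.
func visibleBlocks(commitTS []uint64, readTS uint64) []int {
	var ids []int
	for i, ts := range commitTS {
		if ts <= readTS {
			ids = append(ids, i)
		}
	}
	return ids
}

func main() {
	fmt.Println(blocksNeeded(20000))                     // 3 blocks
	fmt.Println(visibleBlocks([]uint64{10, 20, 30}, 25)) // blocks 0 and 1
}
```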

The above is just the basic process. With the introduction of some performance optimizations, more details will emerge, such as:

Checkpoint

When TN has been running for a while and performs a checkpoint at some moment, all data before that moment has already been transferred to object storage. The corresponding metadata is therefore collected and condensed into a "compressed package". When a newly started CN connects to TN and fetches its first logtail, if the subscription timestamp is greater than the checkpoint timestamp, the checkpoint metadata (a string) can be transmitted via logtail. CN can then directly read the block information generated before the checkpoint, which avoids transferring block metadata from scratch over the network and spares TN the extra IO pressure.
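The decision TN makes for a fresh subscription can be sketched as below. All names here are illustrative, not MatrixOne's actual API; the checkpoint metadata is modeled as an opaque string, as in the text:

```go
package main

import "fmt"

// serveFirstLogtail decides what the first response to a freshly
// subscribing CN contains: if the subscription timestamp is past the
// latest checkpoint, TN hands over the checkpoint's condensed metadata
// and only needs to collect logtail for the window after the checkpoint.
func serveFirstLogtail(subTS, checkpointTS uint64, ckpLocation string) (ckp string, collectFrom uint64) {
	if subTS > checkpointTS {
		// Everything before the checkpoint is covered by its metadata.
		return ckpLocation, checkpointTS
	}
	// No usable checkpoint: collect from the beginning.
	return "", 0
}

func main() {
	ckp, from := serveFirstLogtail(1000, 800, "ckp/meta@800")
	fmt.Println(ckp, from)
}
```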

Memory Cleanup

When CN receives block metadata produced by TN, it cleans up previously delivered in-memory data based on the block-id. However, while a flush transaction runs on the TN side, updates to the same data may arrive concurrently, such as new deletions on the block being flushed. If the strategy were to roll back and retry, the data already written would become entirely invalid, causing numerous rollbacks under update-intensive workloads and wasting TN resources. To avoid this, TN commits the flush anyway, which means the in-memory data generated after the flush transaction started must not be deleted from CN. A timestamp is therefore carried in logtail's block metadata, and only the in-memory data belonging to the block before this timestamp may be cleared from memory. The uncleared updates are written out in the next flush and then pushed to CN for deletion.
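The cleanup rule can be sketched as a truncation by timestamp. This is a hypothetical model, not CN's real data structure: in-memory entries per block are represented simply as a list of commit timestamps.

```go
package main

import "fmt"

// truncateInMemory models CN applying block metadata that carries a
// truncation timestamp: only in-memory entries for this block with a
// commit time at or before maxTS may be dropped; later entries (committed
// while the flush was in flight) must survive until the next flush.
func truncateInMemory(entries map[uint64][]uint64, blockID, maxTS uint64) {
	var kept []uint64
	for _, ts := range entries[blockID] {
		if ts > maxTS {
			kept = append(kept, ts)
		}
	}
	entries[blockID] = kept
}

func main() {
	mem := map[uint64][]uint64{7: {10, 20, 30}}
	truncateInMemory(mem, 7, 20) // the flush covered commits up to ts=20
	fmt.Println(mem[7])          // only the entry at ts=30 remains
}
```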

Faster Reading

Blocks that have been transferred to object storage may continue to receive deletions. When reading these blocks, the deletion data in memory must be merged in. To quickly identify which blocks need to be combined with in-memory data, CN maintains an additional Btree index over blocks. This index must be updated carefully when applying logtail: entries are added when processing in-memory data and removed when processing block metadata. Only blocks present in this index are checked against in-memory data, which is particularly beneficial when there are many blocks.
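The bookkeeping can be illustrated with a plain set in place of the Btree (the index type and method names below are invented for the sketch):

```go
package main

import "fmt"

// DirtyBlockIndex is a simplified stand-in for the extra index CN keeps:
// it records which flushed blocks still have deletions living in memory,
// so readers only merge in-memory data for blocks found here.
type DirtyBlockIndex struct{ dirty map[uint64]struct{} }

func NewDirtyBlockIndex() *DirtyBlockIndex {
	return &DirtyBlockIndex{dirty: map[uint64]struct{}{}}
}

// OnInMemoryDelete: applying an in-memory delete entry adds the block.
func (x *DirtyBlockIndex) OnInMemoryDelete(blockID uint64) {
	x.dirty[blockID] = struct{}{}
}

// OnBlockMeta: applying block metadata (the deletes were flushed) removes it.
func (x *DirtyBlockIndex) OnBlockMeta(blockID uint64) {
	delete(x.dirty, blockID)
}

// NeedsMerge tells a reader whether this block must consult in-memory data.
func (x *DirtyBlockIndex) NeedsMerge(blockID uint64) bool {
	_, ok := x.dirty[blockID]
	return ok
}

func main() {
	idx := NewDirtyBlockIndex()
	idx.OnInMemoryDelete(3)
	fmt.Println(idx.NeedsMerge(3), idx.NeedsMerge(4))
	idx.OnBlockMeta(3)
	fmt.Println(idx.NeedsMerge(3))
}
```

A real implementation keys the index on more than the block id (and the Btree keeps it ordered), but the add-on-delete / remove-on-flush life cycle is the essential invariant.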

More

There is more room for improvement in the logtail protocol:

  • More control information will be added to Logtail to enable more timely cleanup of outdated data in CN and control memory usage.
  • Although multiple blocks may be merged into one object by TN, CN currently organizes data according to a block list. In the future, metadata will be transmitted at the object level to reduce the size of the block list.

The Generation of Logtail

As previously mentioned, logtail can be acquired through two methods: pull and push. These two modes have different characteristics, which are described below.

Pull

As mentioned earlier, pull is essentially the synchronization of table snapshots and subsequent incremental log information.

To achieve this, TN maintains a txn handle list, the logtail table, sorted by transaction prepare time. Given any moment, a binary search is used to find the range of txn handles. From the txn handle, it's determined on which blocks the transaction has made updates. By traversing these blocks, a complete log can be obtained. To speed up the search, txn handles are paginated. The bornTs of a page is the minimum prepare time of the txn handle in that page. The first layer of the binary search targets these pages.
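The first, page-level layer of that binary search can be sketched as follows (the page layout is simplified to a slice of bornTs values; names are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// findStartPage locates the first page that may contain txn handles with
// prepare time >= from. Pages are sorted by bornTs, the minimum prepare
// time of the handles in that page.
func findStartPage(pageBornTs []uint64, from uint64) int {
	// First page whose bornTs is strictly greater than from...
	i := sort.Search(len(pageBornTs), func(i int) bool { return pageBornTs[i] > from })
	// ...but the page before it may still contain handles inside the
	// range, so step back one page when possible.
	if i > 0 {
		i--
	}
	return i
}

func main() {
	pages := []uint64{0, 100, 200, 300} // bornTs per page
	fmt.Println(findStartPage(pages, 150)) // the range starts inside page 1
	fmt.Println(findStartPage(pages, 300)) // page 3
}
```

A second binary search inside the chosen page then narrows down to the individual txn handles.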

Based on the logtail table, the main workflow upon receiving a pull request is as follows:

  1. Adjust the requested time range based on existing checkpoints; earlier ranges can be provided through checkpoints.
  2. Take a snapshot of the logtail table and use RespBuilder in a visitor pattern to iterate over the relevant txn handles in this snapshot, collecting committed log information.
  3. Convert the collected log information according to the logtail protocol format and return it as a response to CN.

```go
type RespBuilder interface {
	OnDatabase(database *DBEntry) error
	OnPostDatabase(database *DBEntry) error
	OnTable(table *TableEntry) error
	OnPostTable(table *TableEntry) error
	OnPostSegment(segment *SegmentEntry) error
	OnSegment(segment *SegmentEntry) error
	OnBlock(block *BlockEntry) error
	BuildResp() (api.SyncLogTailResp, error)
	Close()
}
```

There are still many optimizations that can be made in this process, and here are a few examples:

  • If the requested time range is not the most recent, and data flushing has occurred within the right side of the requested range, how should the changes within the requested range be provided? We believe it's unnecessary to read changes from object storage again for this interval. Instead, providing the block metadata directly to CN, which can then use timestamps to filter data for the specified range, is a more efficient approach.
  • If there are too many in-memory changes within a time range, sending them through memory is impractical. Therefore, the response size is checked during collection; if it's too large, a forced flush to disk and recollection can be performed.
  • The logtail table also needs to be cleaned regularly based on checkpoints to prevent memory bloat. Additionally, data aggregation for earlier txn handles can be implemented to avoid iterating at the txn handle level each time.
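The response-size safeguard from the second bullet can be sketched as below. The cap value and names are assumptions for illustration; the real collector works on protocol entries rather than raw byte counts:

```go
package main

import "fmt"

const maxRespBytes = 64 << 20 // assumed 64 MiB cap, purely illustrative

// collectWithSizeCap models the safeguard: while collecting committed log
// entries for a pull response, the running size is checked; once it would
// exceed the cap, collection stops and the caller should force a flush to
// object storage and re-collect (block metadata is far smaller than row
// data, so the retried response fits).
func collectWithSizeCap(entrySizes []int, limit int) (collected int, needFlush bool) {
	total := 0
	for _, sz := range entrySizes {
		if total+sz > limit {
			return collected, true
		}
		total += sz
		collected++
	}
	return collected, false
}

func main() {
	n, flush := collectWithSizeCap([]int{30, 30, 30}, 64)
	fmt.Println(n, flush) // stopped after 2 entries, flush requested
}
```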

Push

The main purpose of push is to synchronize incremental logs from TN to CN in real-time. The overall process is divided into three phases: subscription, collection, and pushing.

  • Subscription: A necessary process after a new CN starts is to establish an RPC stream as a client with the server TN, and subscribe to catalog-related tables. Only after synchronizing basic information such as database, table, and column can CN provide services. When TN receives a request to subscribe to a table, it actually goes through a pull process first, including all logtails up to the last push timestamp in the subscription response. For a CN, the subscription, unsubscription, and data sending of logtail all occur on a single RPC stream connection. If there are any exceptions, CN will enter a reconnection process until recovery. Once subscribed, subsequent logtails involve pushing incremental logs.
  • Collection: In TN, after a transaction completes WAL writing, a callback is triggered to collect logtail in the current transaction. The main process is to traverse TxnEntry in the workspace (a basic container for transaction updates, directly involved in the commit pipeline), and based on its type, convert the corresponding log information into the logtail protocol data format. This collection runs through a pipeline, concurrently with the WAL's fsync, to reduce blocking.
  • Pushing: The pushing phase mainly involves a filtering process. If it is found that a certain CN has not subscribed to a table, that CN is skipped to avoid unnecessary pushing.

A significant issue with collecting by transaction is what happens when a table receives no updates for a long time: how does CN know whether there were no updates, or whether it simply has not received them yet? To address this, a heartbeat mechanism is implemented, typically at a 2 ms interval. TN's commit queue includes heartbeat empty transactions that perform no substantive work but consume a timestamp. Each one triggers a heartbeat logtail transmission, informing CN that all subscribed tables are up to date as of that timestamp, prompting CN to advance its timestamp watermark.
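The watermark bookkeeping on the CN side can be sketched as follows; the type and methods are invented for illustration:

```go
package main

import "fmt"

// Watermark tracks, on the CN side, the timestamp up to which the
// subscribed tables are known to be complete. A heartbeat logtail carries
// no table data but still advances the watermark, letting CN distinguish
// "no updates" from "updates not yet received".
type Watermark struct{ ts uint64 }

// OnLogtail advances the watermark for data logtails and heartbeats alike.
func (w *Watermark) OnLogtail(commitTS uint64) {
	if commitTS > w.ts {
		w.ts = commitTS
	}
}

// CompleteBefore reports whether a read at readTS can trust local state.
func (w *Watermark) CompleteBefore(readTS uint64) bool { return readTS <= w.ts }

func main() {
	w := &Watermark{}
	w.OnLogtail(100) // a data logtail
	w.OnLogtail(102) // a heartbeat: consumed a timestamp, no payload
	fmt.Println(w.CompleteBefore(101), w.CompleteBefore(103))
}
```

Without heartbeats, the watermark of an idle table would stall at its last real commit, and CN could never tell silence apart from lag.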

Conclusion

This has been a general introduction to logtail. Due to limited space, not every detail can be covered. Logtail is a module that demands close attention to detail, and the protocol's fitness is exercised on both the CN and TN sides: insufficient control over these details can lead to data being missed or read twice, and the variability of timestamps often produces intermittent errors. In the future, we will continue to iterate on logtail, improving its stability, standardizing its processes, and striving to make it more transparent and understandable.