Database | Apr 19th, 2024
How to Use Spark to Write Batch Data into MatrixOne
Apache Spark is a distributed computing engine designed for efficient processing of large-scale data. It uses distributed parallel computing, spreading the work of splitting, computing, and merging data across multiple machines to process and analyze data efficiently.

Database | Apr 12th, 2024
Advanced Shuffle Optimization Techniques: Join reorder
The execution plan for shuffle is a very important part of the optimizer. Due to space limitations, only some key aspects have been introduced here. For more details and related implementation code, feel free to directly check the MatrixOne source code.

Database | Mar 31st, 2024
Advanced Shuffle Optimization Techniques: How do we handle uneven data?
Earlier examples always used the TPC-H dataset, but TPC-H represents an ideal scenario in which all data is evenly distributed. In actual production environments, many datasets are unevenly distributed. For uneven data, a straightforward approach is to use a hash shuffle to ensure that the data is evenly distributed after bucketing.

Database | Mar 19th, 2024
Introduction to Colocate Shuffle
Although the execution plan of shuffle on multiple CNs can reduce the overhead of hash tables, it may also increase the cost of data transmission over the network. A good shuffle execution plan must minimize data transmission over the network as much as possible. Therefore, the introduction of colocate shuffle optimization is crucial for performance.

Database | Mar 8th, 2024
Why is shuffle support needed?
The basic principle of the shuffle algorithm is to bucket the input data so that each bucket can be processed independently and produce its own output. This reduces the size of the hash table when aggregating data.

Database | Mar 1st, 2024
Quickly Start: MatrixOne Database Kernel Front-End
We will start with a simple introduction to the structure of a database kernel.

Database | Feb 2nd, 2024
MatrixOne Logtail
In simple terms, the main content of the Logtail protocol is divided into two parts: in-memory data and metadata. The core difference lies in whether the data has been transferred to object storage.

Database | Jan 26th, 2024
Optimizing Log Management in Distributed Systems: Advanced Log Backend and Transaction Handling Mechanisms
The initial Log Backend was based on a local file system. To accommodate distributed characteristics, we developed a highly reliable and low-latency Log Service as the new Log Backend. We abstracted a virtual backend to adapt to different log backends, and connect to each one through a lightweight driver.

Database | Jan 25th, 2024
What is WAL? and How to Apply it? (Commit Pipeline & Checkpoint)
The Write Ahead Log (WAL) is a technology related to the atomicity and durability of databases. It functions by converting random writes into sequential read-writes during transaction commits. Changes in transactions occur randomly across various pages, which are scattered.

Database | Jan 18th, 2024
What is the Binder in Database Kernels: A useful guide might helps you.
A database is a "warehouse that organizes, stores, and manages data according to data structure." It is a collection of large amounts of data that are organized, shareable, and uniformly managed, and stored long-term in a computer.