# MatrixOne, a promising HTAP database for real-time data warehousing

Author | Yang Tian

Yang Tian is a Deputy Chief Engineer at Shenzhen Urban Transport Planning Center, a leading transportation science company in China. His current technical expertise focuses on data and algorithm platform implementations and cloud-native transformation. He has been involved in IT architecture consulting projects for top logistics enterprises in China.

I work as the chief data and algorithm platform architect at Shenzhen Urban Transport Planning Center. The company was established in 1996 as a national high-tech enterprise in the field of urban traffic research. Based in Shenzhen, it provides urban transport solution services to 29 provinces and more than 100 cities in Mainland China, Hong Kong, and Macao. The company analyzes more than 750 million records of data daily to provide integrated solutions and advanced services and products for governments, enterprises, and the public. Through this work, the company has already helped advance mobility for over 200 million people. In this article, I'll mainly discuss three subjects:
## Evolution of Our Data Warehousing

The transportation of the modern age, from ride-hailing apps to autonomous vehicles, is built on big data. By their very nature, intelligent transportation applications require capturing massive amounts of data. From the identification of vehicles entering highways, to the real-time positions of city buses and taxis, to the number of passengers entering and exiting transit platforms, data is helping to solve the most complex challenges faced by the public and private transportation sectors in such a giant nation and its megacities. In our key projects, our data warehousing has evolved through three phases.

At first, our data warehouse was built upon a traditional Oracle solution. It was powerful enough for handling purely transactional data at a limited data size (under the 10 TB level). But as we introduced more semi-structured data sources such as GPS tracing logs and traffic volume sensors, we quickly hit Oracle's processing limits due to its scaling limitations, and end users could not accept its query performance: some queries lasted for hours.

Therefore, in the second phase we introduced a Hadoop-based batch-processing architecture. With the help of CDH (Cloudera's Distribution including Apache Hadoop), we were able to process PB-level data sets, and massive batch jobs computing up to one thousand tables became possible. We also built many machine learning algorithms upon this batch data warehouse, such as the Traffic Performance Index and the Traffic Emission Model. However, newly emerging ITS (intelligent transportation system) projects started to demand real-time analytics, as traffic management departments need to know the traffic congestion status in real time and make real-time decisions.

We entered the third phase by building a complete Lambda architecture. With a stream-processing layer built on Kafka and Clickhouse, the real-time GPS and sensor data are consumed and aggregated in near real time. The Traffic Performance Index is refreshed every five seconds. Real-time analytics has brought us an unprecedented level of user experience and satisfaction.

## Our ideal real-time data warehouse

The current Lambda architecture has been running for over two years. However, as ITS evolves from a road-based system to a whole-domain system (road-human-vehicle-map), some pain points are impeding our focus on business logic.
We did quite some investigation into the modern data stack and found ourselves in a dilemma. Upgrading the current architecture would probably require introducing more data components, and we would have to manage even more data fragmentation on the application side. Then we realized that what we truly need is a one-stop real-time data warehouse product, rather than drowning ourselves in the data stack pool. With limited talent working on big data, we want a simple and performant product instead of an in-house handicraft. More specifically, it should meet the following requirements:
## MatrixOne Testing Summary

We came across the MatrixOne project on GitHub Trending about a year ago. The team describes it as a hyper-converged database with better performance than Clickhouse. A very intriguing point for us is that it could merge transactional, analytical, and streaming workloads in a single database while promising the best performance. It looked like a good fit for our next-generation real-time data warehouse. I've watched the project grow for a year. At the time of writing, MatrixOne is under development toward its 0.6 version. From their published articles and technical talks, we found three key points of MatrixOne that could benefit us:
To better understand how MatrixOne performs, we ran some tests with its development version and compared it with Clickhouse and MySQL. Our test machine is a virtual machine running CentOS 8.4 with a 16-core CPU and 64 GB of memory. The versions we tested are MatrixOne 0.6.0, Clickhouse 22.11.1, and MySQL 8.0.30, all three installed as standalone instances. The test data set is 140 GB of GPS data across 23 tables.
We deployed a simple Java web application with MySQL and MatrixOne. MatrixOne has limited support for the Spring Data JPA framework, but with a little help from the community on the configuration, we were able to connect. Basic CRUD operations work fine with MatrixOne; apart from some missing built-in functions, it behaves the same as MySQL. We didn't modify a single line of application code, so MatrixOne gets a positive point for CRUD support and MySQL compatibility.
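As a rough illustration of what that connectivity looks like (a minimal sketch, not our actual application: the host, port, credentials, database, and the `vehicle` table are placeholders), MatrixOne can be reached with the standard MySQL Connector/J driver over plain JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MatrixOneCrudSmokeTest {
    public static void main(String[] args) throws Exception {
        // MatrixOne speaks the MySQL wire protocol, so the stock MySQL Connector/J works.
        // Host, port, database and credentials below are placeholders for a local
        // standalone test instance -- adjust them to your own deployment.
        String url = "jdbc:mysql://127.0.0.1:6001/gps_demo";
        try (Connection conn = DriverManager.getConnection(url, "dump", "111");
             Statement st = conn.createStatement()) {

            // The vehicle table is purely illustrative, not our production schema.
            st.execute("CREATE TABLE IF NOT EXISTS vehicle (id INT PRIMARY KEY, plate VARCHAR(16))");
            st.execute("INSERT INTO vehicle VALUES (1, 'B-12345')");          // create
            try (ResultSet rs = st.executeQuery(
                    "SELECT plate FROM vehicle WHERE id = 1")) {              // read
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
            st.execute("UPDATE vehicle SET plate = 'B-54321' WHERE id = 1");  // update
            st.execute("DELETE FROM vehicle WHERE id = 1");                   // delete
        }
    }
}
```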
We prepared five simple queries (aggregation, point query, range query, order by, and distinct) against a table of 80 million rows. Clickhouse is faster, but MatrixOne is decent as well.

| Database Product | Q1 (aggregation) | Q2 (point query) | Q3 (range query) | Q4 (order by) | Q5 (distinct) |
| ---------------- | ---------------- | ---------------- | ---------------- | ------------- | ------------- |
| Clickhouse       | 0.005s           | 0.497s           | 0.071s           | 0.164s        | 0.04s         |
| MatrixOne        | 0.042s           | 0.21s            | 0.134s           | 0.25s         | 0.15s         |
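For reference, the sketch below approximates the five query shapes and the crude wall-clock timing we used for comparison; the `gps_track` table and its columns are hypothetical stand-ins for our real schema, and the connection details are the same placeholders as above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SimpleQueryBenchmark {
    // Shapes analogous to Q1-Q5; gps_track and its columns are hypothetical.
    private static final String[] QUERIES = {
        "SELECT COUNT(*), AVG(speed) FROM gps_track",                              // Q1 aggregation
        "SELECT * FROM gps_track WHERE vehicle_id = 'B-12345'",                    // Q2 point query
        "SELECT * FROM gps_track WHERE ts BETWEEN '2022-11-01' AND '2022-11-02'",  // Q3 range query
        "SELECT vehicle_id, ts FROM gps_track ORDER BY ts DESC LIMIT 100",         // Q4 order by
        "SELECT DISTINCT vehicle_id FROM gps_track"                                // Q5 distinct
    };

    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0] : "jdbc:mysql://127.0.0.1:6001/gps_demo";
        try (Connection conn = DriverManager.getConnection(url, "dump", "111");
             Statement st = conn.createStatement()) {
            for (String sql : QUERIES) {
                long start = System.nanoTime();
                try (ResultSet rs = st.executeQuery(sql)) {
                    while (rs.next()) {
                        // drain the result set so fetch time is included in the measurement
                    }
                }
                System.out.printf("%-75s %.3fs%n", sql, (System.nanoTime() - start) / 1e9);
            }
        }
    }
}
```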
This test is mainly between Clickhouse and MatrixOne. We prepared eight queries drawn from real use cases of a key transportation application, with many subqueries, joins, and aggregations. The results are roughly at the same level; MatrixOne certainly has room to improve.

| Database Product | Q1     | Q2     | Q3     | Q4     | Q5     | Q6     | Q7     | Q8    |
| ---------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ----- |
| Clickhouse       | 0.078s | 0.712s | 0.196s | 1.207s | 1.655s | 1.124s | 0.054s | 1.27s |
| MatrixOne        | 0.2s   | 7.56s  | 0.23s  | 1.41s  | 0.71s  | 0.25s  | 0.14s  | 1.9s  |
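The real queries and schema are proprietary, but a representative shape looks like the hypothetical example below: a subquery computes per-vehicle aggregates, which are then joined to a dimension table and re-aggregated. All table and column names are made up for illustration.

```java
public class ComplexQueryShape {
    // A representative shape only; the actual queries and schema are proprietary.
    // gps_track, vehicle_dim and their columns are hypothetical.
    public static final String EXAMPLE_QUERY =
        "SELECT d.district, COUNT(*) AS vehicles, AVG(t.avg_speed) AS avg_speed " +
        "FROM ( " +
        "  SELECT vehicle_id, AVG(speed) AS avg_speed " +
        "  FROM gps_track " +
        "  WHERE ts BETWEEN '2022-11-01' AND '2022-11-02' " +
        "  GROUP BY vehicle_id " +
        ") t " +
        "JOIN vehicle_dim d ON d.vehicle_id = t.vehicle_id " +
        "GROUP BY d.district " +
        "ORDER BY vehicles DESC";
}
```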
We inserted 10 million rows in 10 batches into Clickhouse, MySQL, and MatrixOne. MatrixOne doesn't show an advantage in data ingestion, but in theory a distributed setup would allow insert performance to scale out as the number of CNs (compute nodes) increases.

| Database Product | 10 million rows insert time |
| ---------------- | --------------------------- |
| Clickhouse       | 53s                         |
| MatrixOne        | 133s                        |
| MySQL            | 124s                        |

In summary, MatrixOne showed its basic capabilities as an HTAP database, but due to the lack of documentation, we weren't able to test the distributed version of MatrixOne or its scaling and concurrency capacity. For a real-time data warehouse use case, MatrixOne has pros such as good performance, update/delete support, ACID guarantees, and compatibility with MySQL syntax; all of these would make our life much easier. But to be honest, as an early-stage project, MatrixOne still has a long way to go to become a real product. Many features are waiting to be implemented: a daemon service, online schema change, support for Apache Kafka and Flink, and, more importantly, a consistent storage layout to avoid losing everything after an upgrade. There is also much room to improve MatrixOne's performance; as their chief engineers told me, they haven't spent much time on performance tuning yet. We'll keep watching the progress of MatrixOne and hopefully upgrade our real-time data warehouse with this state-of-the-art HTAP solution.