If you like MatrixOne, give it a star on Github and join the conversation on Slack.

MatrixOne, a promising HTAP database for real-time data warehouse

Author | Yang Tian

Yang Tian is a Deputy Chief Engineer of Shenzhen Urban Transport Planning Center, a leading transportation science company in China. His current technical expertise focuses on Data Algorithms Platform implementations, and Cloud Native transformation. He has been involved in IT architecture consulting projects for top logistic enterprises in China.

I work as the chief data and algorithm platform architect in Shenzhen Urban Transport Planning Center. The company was established in 1996 as a national high-tech enterprise in the field of urban traffic research. Based in Shenzhen, and provides urban transport solution services to 29 provinces and more than 100 cities in Mainland China, Hong Kong, and Macao. The company analyzes more than 750 million records of data daily to provide integrated solutions and the most advanced services and products for government, enterprises, and the public. With its work, the company has already helped to advance mobility for over 200 million people.

In this article, I'll mainly discuss about three subjects:

  • The evolution of data warehousing in our key projects and related applications.

  • Our understanding of an ideal real-time data warehouse.

  • Some test results, benchmarks, and thoughts with MatrixOne.

Evolution of Our Data Warehousing

The transportation of the modern age, from ride-hailing apps to autonomous vehicles are built on big data. By its very nature, intelligent transportation applications require to capture massive amounts of data. From the identification of vehicles entering highways, the real-time positions of city buses and taxies, the number of passengers entering and exiting transit platforms, data is helping to solve the most complex challenges faced by the public and private transportation sectors in such giant nation and mega cities.

In our key projects, our evolution of data warehousing has been through 3 phases.

At first, our data warehouse was built upon traditional Oracle solution. It was powerful enough for handling pure transactional data and a limited data size (under 10 TB level). But as we are introducing more semi-structured data sources such as GPS tracing logs, traffic volume sensors, we quickly reached the processing limit of Oracle due to a scaling limitation, and the end users can not accept the oracle's query performance, sometimes the query lasted for hours.

Therefore, we've introduced a Hadoop-based batch-processing architecture as a second phase. With the help of CDH (Cloudera Distributed Hadoop), we were able to process PB level data set. Some massive batch processing up to one thousand tables computing was possible. We also built many machine learning algorithms upon this batch data warehouse, to know about the Traffic Performance Index , Traffic Emission Model etc. However, new emerging ITS (intelligent transportation system) projects started to demand for real-time analytics, as traffic management department require to know in real-time about the traffic congestion status and make real-time decisions.

We enter the third phase by building a complete Lambda architecture. With a streaming processing layer with Kafka and Clickhouse, the real-time GPS and sensor data are consumed and aggregated in near real-time. The Traffic Performance Indexes gets refreshed every five seconds. The real-time analytics has brought us an unprecedented user experience and satisfaction.

Our ideal real-time data warehouse

The current Lambda architecture has been running for over two years. However, as ITS evolves from a road-based system to a whole domain-based system(road-human-vehicle-map), some pain points are impeding our focus on business logic.

  • The serving concurrency is a big bottleneck. The current system works currently as a centralized service with tens of users. But we plan to publish the Traffic Performance Indexes as a data service for our partners as a data service. The concurrent users will be thousands.

  • Our current data warehouse doesn't support update and delete, due to a Clickhouse limit. To serve our dashborad application, we have to manage an addtional PostgreSQL instance to manage these transactional operations and split part of our data set to PostgreSQL. The application code needs to handle the data merge from Clickhouse and PostgreSQL.

  • The current queries only concern with coordinates data and some static data, table join is less than 5. But as a whole domain ITS,   more complex queries with 30+ table join will be ordinary. The requirement of performance needs to be equal to the current level.

  • We cannot guarantee data consistency in the current system. Every time we consume streaming data from Kafka by Spark, and insert into Clickhouse after some transformations, we would find the inserted data rows larger than the original ones. This has no harm for GPS data as it will be sampled anyway. But it will be unaccpetable for critical data as ETC payments, Vehicle and bus tracing records.

We did quite some investigation on the modern data stack, we find ourselves in a dilemma. An upgrade from the current architecture would probably require to introduce more data components and we have to manage on the application side even more data fragmentation. Then we realized what we truly need is a one-stop real-time data warehouse product instead of drowning ourselves in the data stack pool. With limited talents working on big data, we want a simple and performant product instead of an in-house handicraft, more specifically it should meet following requirements:

  • Stream Ingest and Aggregation.

  • Standard SQL and complex queries.

  • Low-Latency Queries.

  • High Concurrency.

  • Dynamic Scaling.

  • On-premise deployment.

  • Easy to maintain.

MatrixOne Testing Summary

We came across the MatrixOne project in the Github trending about a year ago. They declared themselves being a hyper-converged database and having a better performance than Clickhouse. A very intriguing point for us is that they could merge transactional, analytical and streaming in a single database and promises best performance. We find it appealing to be a good fit for our next generation real-time data warehouse.

I've watched the project growing for a year. Until the time I write this article, MatrixOne is under its 0.6 version development. From their published articles and technical talks, we find out three key points of MatrixOne which could benefit us:

  • A disaggregated compute and storage and a cloud-native architecture:

  • In general, MatrixOne is built upon S3 and Kubernetes, this leverages cloud computing advantages, delivering auto-provisioning, high availability and auto-scaling. As we understand, the computing and storage gets unlimited scaling capacity and they scale indepently. We are also big supporters of Kubernetes and microservice, our ITS application has fully migrated to a cloud native architecture.

  • A single engine which handles OLAP and OLTP:

  • Apart from a disaggregated compute and storage, MatrixOne has also a unique design of separating Transaction in a single database node layer, which allows MatrixOne to handle both OLTP and OLAP in a decent performance and with a good isolation. Our current real-time data warehouse could get Update/Delete back, and in the meantime, data consistency with ACID guarantee. This is significant for critical data processing.

  • A universal design for both public cloud and on-premise deployment:

  • The compnay mainly serve public transportation sectors, on-premise deployment is a must, regardless of what technology stack you are using. But we are also developping more private clients whose IT assets are mostly in public cloud, it would be favorable for us to use a product which allows us to deploy on-premise and keeping the possibility to scale to public cloud.

To understand better about how MatrixOne performs, we've made some tests with its develop version and have a comparison with Clickhouse and MySQL. Our test machine is a virtual machine with CentOS 8.4 and 16 Core CPU + 64GB Memory for hardware. The versions we tested are MatrixOne 0.6.0, Clickhouse 22.11.1, MySQL 8.0.30, all three are installed as a standalone instance. The test data set is a GPS data of 140GB with 23 tables.

  • CRUD functional test:

We deployed a simple Java web application with MySQL and MatrixOne. MatrixOne has limited support with Spring JPA framework support, with some little community help on the configuration, we were able to connect. The basic CRUDs work fine with MatrixOne, apart from some lack of builtin functions, it seems equal with MySQL. We didn't modify a single line of application code, so MatrixOne gets a positive point for CRUD support and MySQL compatibility.

  • Simple query performance:

We prepared 5 simple queries, for aggregation, point query, range query, order by and distinct with a table of 80 million rows. Clickhouse is faster, but MatrixOne is decent as well.

| Database Product | Q1(aggregation) | Q2(point query) | Q3(range query) | Q4(Order by) | Q5(distinct) |
| ---------------- | --------------- | --------------- | --------------- | ------------ | ------------ |
| Clickhouse       | 0.005s          | 0.497s          | 0.071s          | 0.164s       | 0.04s        |
| MatrixOne        | 0.042s          | 0.21s           | 0.134s          | 0.25s        | 0.15s        |

  • Complex query performance:

This test is mainly between Clickhouse and MatrixOne, we have prepared 8 queries which are real use cases of key transportation application, with a lot of subqueries, joins and aggregations. The test results are relatively at the same level, MatrixOne has for sure more room to improve.

| Database Product | Q1     | Q2     | Q3     | Q4     | Q5     | Q6     | Q7     | Q8    |
| ---------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ----- |
| Clickhouse       | 0.078s | 0.712s | 0.196s | 1.207s | 1.655s | 1.124s | 0.054s | 1.27s |
| MatrixOne        | 0.2s   | 7.56s   | 0.23s   | 1.41s   | 0.71s   | 0.25s   | 0.14s   | 1.9s   |

  • Batch insert performance:

We insert 10 millions rows by 10 batches into Clickhouse, MySQL and MatrixOne. MatrixOne doesn't reveal advantage in data ingestion, but theoretically a distributed setting will allow insert performance scale out with CN number increases.

| Database Product | 10 millions rows insert time |
| ---------------- | ---------------------------- |
| Clickhouse       | 53s                          |
| MatrixOne        | 133s                         |
| MySQL            | 124s                         |

As a summary, MatrixOne showed its basic capacity for a HTAP database, but due to lack of documentations, we weren't able to test the distributed version of MatrixOne, its scaling and concurrency capacity. In a real-time data warehouse use case, MatrixOne has pros such as good performance, update/delete support, ACID guarantee, and its compatibility with MySQL syntax, all these would make our life much easier.

But to be honest, as a early stage project, MatrixOne still has a long way to be a real product. Many features are waiting to be implemented: deamon service, online schema change, support of Apache Kafka and Flink, and more importantly, a consistent storage layout to avoid losing everything after upgrade. There is also much room to improve for MatrixOne's performance, as their chief engineers told me: they haven't had much time in performance tuning yet. We'll keep watching the progress of MatrixOne, and hopefully upgrade our real-time data warehouse with this state-of-art HTAP solution.