As you probably know, data durability is one of the hardest things to guarantee in databases. There are many databases that claim to be ACID, but in reality are not, just check the Jepsen test results .
Although S1Search, the database technology engine we developed is not ACID compliant, but eventually consistent, SlicingDice as the whole platform, that includes all the components we use to provide our service, guarantees all the ACID properties for your databases.
Although speed is really important for a database dealing with a huge amount of data like we do, nothing is more important than data integrity and consistency. It doesn’t really matter how fast the database can answer queries if the data is missing or corrupted and the query is answered with the wrong numbers.
Actually, for a service like SlicingDice and an Analytics Database as S1Search, where the main purpose of queries is to analyze the data, wrong or incomplete answers can lead to wrong business decisions, which can end up being really expensive and damaging.
Every time you send an insertion request to your database, our platform (API) receives it and immediately sends it to one of our Kafka clusters. Our platform will hold the insertion request confirmation until we are able to confirm that your insertion request was correctly stored on at least three nodes (3 replicas) from at least two of our Kafka clusters, one from the same datacenter that received the insertion request and another cluster on a remote/different datacenter.
As you can check on the SlicingDice infrastructure page, we currently have three completely independent data centers from different providers in different countries that operate simultaneously in a high-availability configuration. That means that two data centers can fail and our service will continue to support data insertion and querying.
Because of that, in order for SlicingDice to lose your data, all our three data centers would need to be completely offline.
Once your insertion was correctly inserted on one of the S1Search nodes, your data is automatically replicated to more two nodes, located on the other two datacenters.
As you can check in the data backup page, we constantly perform remote backups of all data stored on SlicingDice, so in a event of major hardware failures affecting all our datacenters, we can still recover our customer's data.
Unfortunately, data and database corruption are very common while moving or modifying it, for all types of databases and technology providers. But this is not acceptable for us.
SlicingDice’s and S1Search’s code coverage are higher than 98% and we take it very seriously in our development process. However we all know that unit tests and code coverage won’t be enough to get rid of integrity and consistency issues, specially considering a complex, parallel, concurrent and multithreaded system like a database.
We had to had a way of feeling confident on what we were building, so we decided to take a radical approach: build a database testing framework to be used as the source of truth when validating our system.
Remember that S1Search was build to perform analytical queries, so we don’t know in advance what our customer’s queries will look like. For example:
- How many columns they would use in a query.
- What combination of column types they would use in a same query.
- What if they try to make multiple boolean operation on top of multiple time-series columns, also combining non-time-series columns, how the system would behave.
So we decided to build a database testing framework, that is basically a simpler and lighter version of the S1Search database that could generate testing data and also store them for comparison purposes.
This database testing framework works like this:
You define the types of columns you want to test, how many different values you want to be inserted (whether they will be really used in queries or just be there to stress the system) and finally for how many Entity IDs you want this generated data to be inserted to.
For each type of column you defined, the database testing framework will first generate all the data and send it to be inserted on S1Search, also storing for itself a copy of the generated data for further comparison purposes.
Once the all the data was completely inserted on S1Search, the framework will then automatically generate all the possible combinations of supported queries based on the columns you have declared to be tested.
These queries will then be issued to S1Search and the obtained results compared to the expected results based on the data stored on the test database.
In order to S1Search version be declared ready for production, it must be tested with all the existing column types and supported query operations. If a single query fails with a difference of even a single ID, we reprove the version until we correct it.
Some numbers of this testing database framework:
Consider this test configuration below:
- Entity IDs: 1,000
- Matched Values: 1,000
- Garbage Values: 1,000
- Column Types: All (we currently have 11 column types)
- Query Types: All (we currently have 11 query types)
- Days: 7 (distributing the generated data in 7 different days, as this affects the time-series queries)
The test configuration above results in:
- 3,646,986 unique insertion messages sent to S1Search (520,998 messages per day)
- 45,696 unique queries, each expecting a different result (6,528 queries per day)
Here is an example of the test output after running insertion for 1 day:
========== Insertion Statistics ========== INFO: Quantity of insertion commands: 520998 INFO: Quantity of columns inserted: 4164994 INFO: Quantity of columns per type: string_test_column: 440000 time_series_decimal_test_column: 494998 time_series_string_test_column_2: 16000 boolean_test_column: 456000 decimal_not_overwrite_test_column: 4000 time_series_decimal_test_column_2: 16000 time_series_numeric_test_column: 494998 bitmap_test_column: 120000 numeric_not_overwrite_test_column: 4000 numeric_test_column: 482000 string_not_overwrite_test_column: 4000 time_series_string_test_column: 464998 decimal_test_column: 258000 range_test_column: 456000 uniqueid_test_column: 208000 date_not_overwrite_test_column: 4000 date_test_column: 222000 time_series_numeric_test_column_2: 16000 bitmap_not_overwrite_test_column: 4000
We insert data and run queries for multiple days because between them we also test other things that could affect consistency, such as: restarting the server, moving shards between nodes, killing the process unsafely (kill -9) and so on.
We believe creating the database testing framework was one of the best technical decisions we ever had. Although it took a long time to build it and make it stable and trustworthy, it saved us hundreds of hours during the development and, more important, gave us the necessary confidence we need to create a platform like SlicingDice.