About SlicingDice Technology
Learn more about which technologies power SlicingDice, and how it handles your data.
SlicingDice is built on a combination of home-grown and third-party technologies. These components and their purposes are listed below.
- API - Python-based API that handles all of SlicingDice's orchestration for incoming requests.
- S1Search - The home-grown analytics database that stores and queries all the data.
- Apache Kafka - Message broker and buffer for all data received by the API.
- Apache Zookeeper - Distributed tool used for reliably managing shared configurations and storing their metadata.
- MySQL - Relational database used to store basic customer information, the databases and columns created by users, user permissions, and access groups.
- Aerospike - Key-value NoSQL database used to store API information that needs fast responses. Also used to cache all queries performed on SlicingDice, as well as the results of queries created through the saved-queries API endpoint.
- Backblaze B2 - External storage provider used to back up all SlicingDice data.
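The query-result caching described for Aerospike can be sketched as a cache-aside pattern. This is a minimal illustration with hypothetical names, using an in-memory dict as a stand-in for Aerospike and an arbitrary TTL:

```python
import json
import time

# In-memory stand-in for the Aerospike cache; the TTL is illustrative,
# not SlicingDice's actual setting.
_cache = {}
CACHE_TTL_SECONDS = 60

def run_query(database, query, execute):
    """Cache-aside lookup: return a cached result when present and fresh,
    otherwise execute the query and cache its result."""
    key = (database, json.dumps(query, sort_keys=True))
    entry = _cache.get(key)
    if entry is not None and time.time() - entry["at"] < CACHE_TTL_SECONDS:
        return entry["result"]
    result = execute(query)
    _cache[key] = {"result": result, "at": time.time()}
    return result
```

With this pattern, a repeated identical query is answered from the cache instead of hitting the database again.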
Network and monitoring
- CloudFlare - Used for DNS, load balancing, and DDoS protection.
- Pingdom - Used to test and monitor the availability of the API and internal services.
- Sentry - Used for automatic error reporting.
- PagerDuty - Used to test and monitor internal services and to dispatch support engineers.
- Stripe - Used to handle all customer billing information and processing.
As SlicingDice developed its own database technology from the ground up, all data is stored in its own hashed and encoded binary format, making it harder for anyone without authorization to recover the original form of the stored data.
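The idea can be illustrated with a toy encoding (hypothetical, not S1Search's actual format): column names are reduced to hashes and values packed as binary, so the stored blob is opaque without knowledge of the original schema.

```python
import hashlib
import struct

def encode_record(record):
    """Encode a {column: numeric value} record into an opaque binary blob:
    each column name becomes an 8-byte hash prefix, each value a packed double."""
    blob = b""
    for column, value in sorted(record.items()):
        key_hash = hashlib.sha256(column.encode()).digest()[:8]
        blob += key_hash + struct.pack(">d", float(value))
    return blob

def decode_record(blob, known_columns):
    """Decoding requires knowing the column names in advance: the 8-byte
    hashes cannot be reversed without the original names."""
    lookup = {hashlib.sha256(c.encode()).digest()[:8]: c for c in known_columns}
    record = {}
    for offset in range(0, len(blob), 16):
        key_hash = blob[offset:offset + 8]
        (value,) = struct.unpack(">d", blob[offset + 8:offset + 16])
        record[lookup[key_hash]] = value
    return record
```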
From an infrastructure perspective, SlicingDice strictly follows recommended practices for server hardening and sensitive-information management. One such practice is that it does not allow any SSH access to its servers; all servers are accessed exclusively through KVM over IP provided by its infrastructure partners.
It also stores all data on external object storage providers, such as Backblaze B2, so even if every server goes down, your data remains safely stored.
SlicingDice is compliant with the highest security standards and regulations.
Infrastructure and Redundancy
SlicingDice uses several infrastructure providers, such as OVH, Hetzner, Amazon Web Services, Alibaba and Microsoft Azure.
As it uses bare-metal dedicated servers for cost and performance reasons, server nodes can fail at any time. This makes data redundancy absolutely necessary for SlicingDice's operations.
SlicingDice currently achieves a high level of redundancy and availability by:
- Replicating its customers' data across 3 different datacenters (or availability zones);
- Making hourly backups and storing them on a local backup server;
- Storing a full daily copy of its backed-up data on remote backup providers, such as Backblaze B2 Cloud Storage.
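The replication rule in the first bullet can be sketched as picking replica nodes so that no two copies share a datacenter. This is a simplified model with a hypothetical node inventory, not SlicingDice's actual placement logic:

```python
def place_replicas(nodes, replicas=3):
    """Pick one node per datacenter until `replicas` copies are placed,
    so no two replicas share a datacenter (or availability zone)."""
    chosen, used_datacenters = [], set()
    for node in nodes:
        if node["datacenter"] not in used_datacenters:
            chosen.append(node)
            used_datacenters.add(node["datacenter"])
        if len(chosen) == replicas:
            return chosen
    raise RuntimeError("not enough distinct datacenters for %d replicas" % replicas)
```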
Besides all these redundancy measures, SlicingDice also constantly performs unexpected actions and shutdowns on its production environment, similar to Netflix's Chaos Monkey approach, in order to test the resiliency of its services.
Data durability is one of the hardest things to guarantee in databases. There are many databases that claim to be ACID, but in reality are not.
Wrong or incomplete query answers can lead to wrong business decisions, which can end up being really expensive and damaging. Because of that, SlicingDice adopts several measures to ensure data durability.
Every time you send an insertion request to your database, SlicingDice's platform (API) receives it and immediately forwards it to one of its Kafka clusters. The platform withholds the insertion confirmation until it can verify that the request was correctly stored on at least three nodes (3 replicas) across at least two Kafka clusters: one in the same datacenter/availability zone that received the request, and another in a remote/different datacenter/availability zone.
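The confirmation rule above can be sketched as a simple predicate. This is a simulation with hypothetical names rather than actual Kafka producer code:

```python
def confirm_insertion(acks, required_replicas=3):
    """Return True only when the insertion is stored on at least
    `required_replicas` nodes spanning at least two Kafka clusters,
    including one local and one remote cluster.

    `acks` is a list of (cluster_name, is_local_datacenter) tuples,
    one per node that acknowledged the write."""
    if len(acks) < required_replicas:
        return False
    clusters = {cluster for cluster, _ in acks}
    has_local = any(is_local for _, is_local in acks)
    has_remote = any(not is_local for _, is_local in acks)
    return len(clusters) >= 2 and has_local and has_remote
```

For example, three acknowledgements spread over a local and a remote cluster confirm the insertion, while three acknowledgements from a single cluster do not.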
SlicingDice currently has several independent data centers from different providers, in different countries and availability zones, operating simultaneously in a high-availability configuration. This means that two data centers or availability zones can fail and the service will continue to support data insertion and querying.
Once your data has been correctly inserted on one of the S1Search nodes, it is automatically replicated to two other nodes, located in the other two datacenters or availability zones.
Additionally, SlicingDice constantly performs remote backups of all stored data, so in the event of major hardware failures affecting all of its datacenters, it is still able to recover all data.
Unfortunately, data and database corruption are very common when data is moved or modified, across all types of databases and technology providers. This is not acceptable for SlicingDice.
SlicingDice's data durability testing framework
Code coverage for SlicingDice and S1Search is higher than 98%, and it is taken very seriously in the development process. To reach that level, the SlicingDice development team took a radical approach: building a database testing framework to serve as the source of truth when validating the system.
S1Search was built to perform analytical queries, so the team did not know in advance what users' queries would look like. For example:
- How many columns they would use in a query;
- What combination of column types they would use in the same query;
- How the system would behave if they performed multiple boolean operations on top of multiple time-series columns, while also combining non-time-series columns.
So the team decided to build a database testing framework: essentially a simpler, lighter version of the S1Search database that can generate test data and also store it for comparison purposes.
The database testing framework works like this:
- Define the types of columns to test, how many different values to insert (whether they will actually be used in queries or just be there to stress the system), and finally how many Entity IDs this generated data will be inserted for.
- For each column type defined, the database testing framework first generates all the data and sends it to be inserted on S1Search, also storing a copy of the generated data for later comparison.
- Once all the data has been completely inserted on S1Search, the framework automatically generates all possible combinations of supported queries based on the previously declared columns.
- These queries are then issued to S1Search and the obtained results compared against the expected results based on the data stored in the test database.
- For an S1Search version to be declared ready for production, it has to be tested with all existing column types and supported query operations. If a single query fails with a difference of even a single ID, the version is rejected until corrected.
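The loop above can be sketched like this. It is a toy model with hypothetical names: the `insert` and `query` callbacks stand in for S1Search, and only single-column equality queries are generated rather than all supported query types:

```python
def run_test_cycle(columns, values_per_column, entity_ids, insert, query):
    """Generate data, keep a reference copy, insert it into the system
    under test, then compare every generated query against the reference."""
    # 1. Generate deterministic test data and keep a local copy (the "truth").
    reference = {}
    for entity_id in range(entity_ids):
        record = {col: f"{col}-value-{entity_id % values_per_column}"
                  for col in columns}
        reference[entity_id] = record
        # 2. Send the same data to the system under test.
        insert(entity_id, record)

    # 3. Generate every supported query (here: single-column equality only).
    failures = []
    for col in columns:
        for v in range(values_per_column):
            value = f"{col}-value-{v}"
            expected = {eid for eid, rec in reference.items() if rec[col] == value}
            # 4. Issue the query to the system under test and compare results.
            got = query(col, value)
            # 5. A single differing ID is enough to reject the version.
            if got != expected:
                failures.append((col, value, expected, got))
    return failures
```

An S1Search build would only be accepted when a cycle like this returns no failures.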
Some numbers from the testing framework, starting with the test configuration:
- Entity IDs: 1,000
- Matched Values: 1,000
- Garbage Values: 1,000
- Column Types: All
- Query Types: All
- Days: 7 (distributing the generated data in 7 different days, as this affects the time-series queries)
Resulting workload:
- 3,646,986 unique insertion messages sent to S1Search (520,998 messages per day)
- 45,696 unique queries, each expecting a different result (6,528 queries per day)
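The per-day figures are consistent with the seven-day totals, as a quick check confirms:

```python
days = 7
insertions_per_day = 520_998
queries_per_day = 6_528

assert insertions_per_day * days == 3_646_986  # total unique insertion messages
assert queries_per_day * days == 45_696        # total unique queries
```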
```
========== Insertion Statistics ==========
INFO: Quantity of insertion commands: 520998
INFO: Quantity of columns inserted: 4164994
INFO: Quantity of columns per type:
    string_test_column: 440000
    time_series_decimal_test_column: 494998
    time_series_string_test_column_2: 16000
    boolean_test_column: 456000
    decimal_not_overwrite_test_column: 4000
    time_series_decimal_test_column_2: 16000
    time_series_numeric_test_column: 494998
    bitmap_test_column: 120000
    numeric_not_overwrite_test_column: 4000
    numeric_test_column: 482000
    string_not_overwrite_test_column: 4000
    time_series_string_test_column: 464998
    decimal_test_column: 258000
    range_test_column: 456000
    uniqueid_test_column: 208000
    date_not_overwrite_test_column: 4000
    date_test_column: 222000
    time_series_numeric_test_column_2: 16000
    bitmap_not_overwrite_test_column: 4000
```
The team inserted data and ran queries for multiple days, and in between also tested other things that could affect consistency, such as restarting servers, moving shards between nodes, killing processes unsafely (`kill -9`), and so on.