If you want to start spinning the Flywheel of your machine learning systems, you can’t be locked into an Oracle database or a Teradata data warehouse.
In this chapter, we will discuss the shift in thinking about your data that is required to use the unique data your business is generating to boost your future business dramatically. In previous chapters, we discussed the nature of the Flywheel and examined a few examples from Amazon’s experience of thinking big about your customers’ problems and of using machine learning and AI to solve these problems almost magically.
The nature of the Flywheel is that once it is spinning, it gets faster and faster; the downside is the long time and the massive energy and wisdom you need to invest to get it spinning past the tipping point. There are many ways to get it wrong, and the proliferation of terms for the era of data beyond the simple database, such as data warehouse, data lake, data swamp, and data ocean, hints at that complexity.
The name I prefer is data mine. This name captures both sides of the data: the risks and the gains. Last week I visited the small town of Roslyn on a trip to the northwest of the US. Today the town is famous as a filming location for “Northern Exposure” and “The Man in the High Castle,” but more than 100 years ago it was famous for its coal mines and for a disaster that claimed the lives of 45 miners (the above image is from the memorial for the lost miners). Large economies were built on these mines, but many people also paid with their lives or their health to get the coal, gold, iron, and other resources out of them. You need to dig in the right places, have the right tools, and have the right people to make your fortune from data mines.
Where is my data?
There are many reasons why Oracle was a great company with a great set of products. It made the concept of the database (DB) a reality in hundreds of thousands of businesses. Relational database management systems (RDBMS) were not the first or only DB option before Oracle took over the market. I still remember the days when pre-relational databases such as ADABAS were more popular, and other commercial options such as Sybase or DB2 were available. Nevertheless, Oracle was the dominant player in the DB market for many years; almost every IT system had an RDBMS as its data store, and most of them were using an Oracle DB (or MS-SQL).
Life was comfortable when all the data was in a single place (Oracle), had a single schema, and was accessible with a standard method (SQL). Why change if it is not broken?
The First Crack: Transactional vs. Analytics
The first step to getting the most out of your data is to realize that you need different data formats for transactional workloads (moving funds between bank accounts, for example) and analytical workloads (computing a credit score for each customer, for example). This difference was the first crack in the RDBMS fortress. The problems we wanted to solve with data systems were no longer only balancing our bank accounts with transactions and two-phase commits. Many problems require scanning historical data and enriching it with context from other sources. For example, if we want to calculate a fraud risk score for every transaction, we can’t just look at each transaction in isolation; the machine learning models we want to build to make these predictions must learn from millions of previous transactions. Of course, you can store the historical records in your Oracle-based system, and over the years Oracle purchased and developed hundreds of add-ons to its DB to support these advanced workloads. A wide set of experts emerged: system analysts, who design the schema of the data; database administrators (DBAs), who optimize the DB and make sure it works correctly; and data analysts, who write SQL queries to generate business intelligence (BI) reports and dashboards.
Nevertheless, with the explosion of data, it is now clear that you can’t keep all your data in one schema and one form.
“No one size fits all in data.”
Data that needs random access (changing a specific bank account, for example) requires completely different storage than data you need to scan across billions of records quickly. Your Oracle can do both, but it will be optimized for neither. The good old days when all the data lived in a single place, with a single schema and a single query interface, do not provide a good foundation for the modern requirements of more scalable systems (internet scale), more analytical systems, and, above all, the new machine learning systems. The “solution” of creating a single dedicated data warehouse for analytical workloads, whether from Oracle, Microsoft, or Teradata, is only the first step toward using your data better.
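To make the contrast concrete, here is a minimal Python sketch of the two access patterns, using sqlite3 purely as a stand-in for any relational database; the table and column names are illustrative, not taken from any real system.

```python
import sqlite3

# Stand-in relational database with a table of historical transactions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (tx_id INTEGER PRIMARY KEY, "
    "account_id TEXT, amount REAL, tx_time TEXT)"
)

# Transactional workload: touch a single row by key, inside a transaction.
with conn:
    conn.execute(
        "UPDATE transactions SET amount = amount - 100 WHERE account_id = ?",
        ("acct-42",),
    )

# Analytical workload: scan the whole history to build per-account features
# (for example, inputs to a fraud-risk or credit-score model).
features = conn.execute(
    "SELECT account_id, COUNT(*) AS tx_count, SUM(amount) AS total "
    "FROM transactions GROUP BY account_id"
).fetchall()
```

The first pattern wants an index and fast random access; the second wants cheap sequential scans over huge tables, which is exactly why one engine optimized for both ends up optimized for neither.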
The Second Crack: SQL vs. NoSQL
Even if you stick with the traditional transactional usage of the database, an RDBMS can fail at scale. One of my more complex projects as a solutions architect at Amazon was to move Amazon teams away from Oracle. This decision was not petty revenge against a competitor in the cloud business, or even a way to save money on expensive Oracle database licenses. It was simply the fact that Oracle couldn’t scale to the size of Amazon’s business needs.
Some of you may have felt the technical problems Amazon had at the beginning of last week’s Prime Day sale, and you might remember other issues with the websites, mobile applications, Alexa devices, or AWS services in the past. All these issues are investigated deeply by the teams to learn how to prevent them in the future, in a process called “Correction of Error,” or COE. These investigations are painful, not just because of the large business impact of the incident, but also because of the technical detail and depth that must be gathered in the process. The trigger for a COE is any “Order Drop,” meaning that fewer people placed orders than expected due to a technical problem in any of the thousands of services that serve each order. For a company with 178 billion US dollars in sales last year, every minute of downtime is more than a third of a million dollars in lost sales. And since these issues don’t happen at average times but at peak times, the loss can be a few million US dollars per minute. A COE report is always a fascinating read for system architects: you learn how sensitive these systems are, and how many designs that look great on the whiteboard fail the real-life test. Across many COE reports, one resource stood out significantly:
“At the root of almost every COE, there is an Oracle DB.”
It is not fair to blame Oracle for every technical problem that happened at Amazon, as it was used in almost every Amazon service, at a high degree of complexity and scale. Even after Oracle was replaced with other DB technologies, “Order Drops” still happen, as we experienced in the recent Prime Day (this time it was Sable, an Amazon internal DB technology, that had issues). Even when Oracle was used, the issues were not mainly in the DB software itself but in “out of disk space,” “schema change,” “hardware failure,” or “slow query time,” and they could have been fixed or prevented with better design of the database or of the clients calling it. Nevertheless, it was decided to migrate all tier-one services (services that can cause an “Order Drop”) away from Oracle and from RDBMS in general. Since most of the queries had “… WHERE user-id = …” in their statement, it was clear they could benefit from the more scalable technology of NoSQL, using GET or PUT instead of SELECT.
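As a hedged sketch of that access pattern (not Amazon’s actual migration code), here is what key-based access looks like with DynamoDB through boto3; the table name users and its attributes are assumptions for illustration.

```python
import boto3

# Assumed table: "users", with partition key "user_id".
dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")

# PUT: write (or overwrite) one item by its key.
users.put_item(Item={"user_id": "u-123", "name": "Alice", "tier": "prime"})

# GET: fetch one item by its key -- the NoSQL equivalent of
# SELECT * FROM users WHERE user_id = 'u-123'
response = users.get_item(Key={"user_id": "u-123"})
item = response.get("Item")
```

Because every request names its key, the table can be partitioned and scaled horizontally in a way that a general-purpose SQL engine struggles to match.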
This might sound like a small change, but it required redesigning hundreds of systems, reeducating hundreds of DBAs, rewriting thousands of queries and lines of code, and migrating billions of records, all without a single second of downtime in the services. Since Amazon was Oracle’s largest customer, it also had the most extensive set of Oracle DBAs, one of the most highly paid professions in IT. Now you need to take the best experts in RDBMS technology, and in Oracle internals specifically, and ask them to learn a technology entirely different from what they spent so many years mastering. “What is the schema?”, “Where are the transactions?”, “How do you SELECT data?”, “How can you JOIN?” are only a few of the dramatic differences between the two technologies. Most of these systems were not new and were not designed with NoSQL in mind; they had a lot of existing functionality, code, and data that had to be modified in the process. Some of these DBAs decided not to change, and they moved to other companies that required deep Oracle skills and were ready to pay a lot for experts from Amazon. But many decided to learn the new technologies and become Data Engineers (DE) instead of Database Administrators (DBA).
This migration project was long and involved hundreds of experts from various disciplines. Many technologies and services were developed during this time to support it. For example, Amazon DynamoDB, the main NoSQL service from Amazon Web Services (AWS), added critical features such as backup and restore, point-in-time recovery, secondary indexes, a built-in cache (DAX), and encryption at rest. Amazon Aurora was also developed at AWS to allow some of the services to remain on an RDBMS and keep using SQL, but in a more cost-effective, scalable way.
The Third Crack: Structured vs. Unstructured data
A NoSQL transactional DB is not a mandatory requirement for building machine learning systems. Most businesses don’t have transaction-scale requirements that demand NoSQL databases; MySQL is still the most popular DB among small and medium businesses and is recommended for most systems. Nevertheless, the first NoSQL DB in the list above of favorite DB products is MongoDB, which points to the critical part of the data for machine learning, namely unstructured data.
Of course, you can store everything inside your Oracle DB in various forms, such as VARCHAR for short documents, BLOB (Binary Large Object), or CLOB (Character Large Object). But please don’t do that. Each type of data has its optimized tool: MongoDB can handle JSON objects and documents, Elasticsearch is optimized for textual documents, Redis handles various data structures (sorted sets, for example), and Parquet files in cloud storage are suited to scanning over significant amounts of data.
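For the “scan a significant amount of data” case, a minimal sketch (with invented file and column names) is to write the records as Parquet files destined for object storage instead of stuffing them into BLOB columns:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative batch of records; in practice these would come from your apps.
table = pa.table({
    "order_id": [1, 2, 3],
    "customer_id": ["c-1", "c-2", "c-1"],
    "total": [19.99, 5.00, 42.50],
})

# Written locally here; the real target would be S3/GCS/Azure Blob storage.
pq.write_table(table, "orders-2018-07.parquet")

# Analytical engines (Spark, Athena, pandas, ...) can later scan many such
# files in a single pass.
scanned = pq.read_table("orders-2018-07.parquet")
```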
Unstructured data takes many forms in a data mine: audio recordings from call centers, video streams from cameras in stores, GPS records from delivery trucks, pictures people take with their mobile devices, emails sent internally and externally, and streams of events such as purchase history, website click-streams, keystrokes, and many others. Each of these data resources can provide essential insights about the business, and once it is mined and used in machine learning models, it can boost business operations and revenue. In future chapters, we will discuss how to process each of these data resources, but first we need to capture the data, store it, secure it, and provide the data scientists with access to it.
At this point, you might give up and say that the added complexity of so many different data formats isn’t worth the effort, especially early in the machine learning journey, when you don’t yet know what to do with the data. To make it simpler to handle all these data resources, let’s make the following observation:
“Data is an immutable log.”
The raw format of all the data types above is a log. Some are easily identified as a log, such as a video or audio file: a video file is a sequence of frames, and an audio file is a sequence of audio samples. You control the quality and size of the log (frame rate, or frame size and resolution, for example) and the metadata you can add to each file (timestamp, camera location, caller ID, etc.). It is less evident that even a transactional database can be turned into a log. An Oracle DB can generate a changelog, which is used for backup and restore, data replication, point-in-time recovery, audits, and other administrative tasks. Moreover, complex transactions cannot be completed without such log functionality, as they often require retries or reconciliation. Some databases took this observation and made it the core of the DB system. One of these is an internal DB used at Amazon for such complex transactions. For example, launching a cluster of EC2 compute instances with EBS volumes, security groups, VPC subnets, and other configurations is a complex transaction that can’t be completed reliably using a standard RDBMS. In this DB, you commit into the log, and then you can “apply” the changes to standard DB records (in MySQL or Oracle, for example) for queries.
Update: AWS released this internal database and it is now publicly available as Amazon QLDB.
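As a toy illustration of the “commit to the log, then apply” idea (not the internal Amazon DB or QLDB itself), the sketch below keeps an append-only event log and derives a queryable view from it; the event fields are invented.

```python
import json

event_log = []       # immutable, append-only sequence of committed events
current_state = {}   # derived, queryable view (could live in MySQL/Oracle)

def commit(event: dict) -> None:
    """Append an event to the log; existing entries are never rewritten."""
    event_log.append(json.dumps(event))

def apply_log() -> None:
    """Replay the log to (re)build the derived view used for queries."""
    current_state.clear()
    for raw in event_log:
        event = json.loads(raw)
        current_state[event["resource_id"]] = event["status"]

commit({"resource_id": "vol-1", "status": "creating"})
commit({"resource_id": "vol-1", "status": "attached"})
apply_log()
print(current_state)   # {'vol-1': 'attached'}
```

Because the log is never mutated, retries and reconciliation become a matter of replaying it, which is what makes the pattern attractive for complex multi-step transactions.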
You don’t need to go to the extreme of treating all your data as a log, but once you understand this view, you can be more open to capturing, storing, securing, and then mining your neglected data in the form of unstructured logs. These logs can be system logs, web click-stream logs, video and audio streams and files, and other historical sequences of events throughout your business. Use any event-capturing system (Kafka, Kinesis, sidecar agents, loggers) and any low-cost cloud storage (Amazon S3, Google Cloud Storage, Microsoft Azure Storage…), and slowly add ways to analyze them. You can also capture and store the data without mining it yet; two years from now you will have much more data and much more experience mining it. Remember that without that data mine, you can’t do any advanced machine learning.
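A minimal sketch of “capture now, mine later,” assuming an S3 bucket you own (the bucket name and key layout here are made up): write each raw event as an object, partitioned by date so future jobs can scan selectively.

```python
import datetime
import json
import uuid

import boto3

s3 = boto3.client("s3")

def capture_event(event: dict, bucket: str = "my-raw-data-mine") -> None:
    # Date-partitioned keys let later analytics jobs scan only what they need.
    now = datetime.datetime.utcnow()
    key = f"clickstream/{now:%Y/%m/%d}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))

capture_event({"user_id": "u-123", "page": "/checkout", "ts": "2018-07-20T10:00:00Z"})
```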
The Fourth Crack: Monolithic vs. Micro-Services
Different scales require different structures. The transition between scales and between organizational structures is hard, both in identifying when to do it and in doing it smoothly.
- When you are starting small, it is better to have a single team, building a single system, using a single database.
- When you are starting to add functionality, it makes sense to add it to the same existing system and unified data schema.
- When the system becomes too complicated to handle as a single system, it is time to break down the system into micro-services, built by independent teams with separated data stores.
- When even the integration between the micro-services and their distributed data stores becomes too complicated to enable synergy among them, a new set of concepts (such as embeddings) and tools is needed to strengthen the overall power of the business.
“Everything becomes simple when you break it down.”
Amazon’s organization into small “two-pizza” teams is famous as one of its engines for implementing innovative ideas quickly. The “two-pizza” size of about ten people was found to be big enough to allow collaboration between different experts and small enough to allow efficient communication. “No excuses for dependencies” allows each team to run as quickly as it can without waiting on anyone around it; “I’m waiting for the other team to deliver their part…” kills most innovations. Even Amazon was not born this way. It started as a monolithic, Oracle-based system, and through many trials and errors the current structure emerged. I can also safely predict that this is not Amazon’s end state, and that more organizational evolution will happen to allow it to keep growing and to further integrate its various micro-services and two-pizza teams.
Every company is at a different phase of this journey, and the majority of the companies I meet are facing the hard decision of breaking down their databases and IT systems. They already feel unable to innovate and grow at the pace of other companies (mainly startups). As with every significant transition, it is easy to make a mistake and choose the wrong path. The two main errors (from my experience and point of view) are building a unified data warehouse (Teradata or Amazon Redshift) or upgrading to a new version of Oracle (or MS-SQL). The transition in this phase should be a combination of organizational change (“two-pizza” independent DevOps teams managing API services), moving IT systems to a micro-services structure (Docker, Lambda, Kubernetes…), and finally breaking down the database into data fragments, each optimized for its format.
The Fifth Crack: Static (CR) vs. Agile
The next barrier to the needed breakdown in a large organization is the risk of changing the unified monolithic system and its schema. Most such organizations try to protect themselves against the breakage that every change can introduce. I spent a year at HP, which at the time had one release a year for its product, and I saw from the inside how slowly a large organization moves with that release cycle, especially compared to the crazy pace I saw at Amazon and other cloud customers. It was probably the way to do software updates a decade ago, but today, when many competitors are breaking the yearly release into dozens of smaller releases every week or two, you can’t wait three months for approval of the DB schema change you need for a new feature. You also can’t wait that long for an emergency rollback or the other changes that the fast pace of today’s market often requires.
“Release fast and often.”
The breakdown of the unified schema discussed in the previous sections into smaller independent schemas now gets another level of breakdown: multiple versions. The data logs from a few months ago are different from the data logs you have now, and from those you will have a few months from now. But since the log files should be immutable, the complexity is in querying different schema versions in a single pass. The good news is that these are relatively easy technical problems; for example, you can run a single query over a large table with formats such as Apache Iceberg. When your goal is building machine learning models with the data, it is also manageable: most machine learning models can handle missing values, and some models even deliberately drop out parts of the data to avoid over-fitting and to generalize better beyond the training data.
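As a minimal sketch of reading log records written under two schema versions in a single pass (field names invented for illustration): fields missing from older records simply surface as missing values, which the downstream model or an imputation step can handle.

```python
import pandas as pd

# v1 records had no "channel" field; v2 records added it.
v1_records = [{"user_id": "u-1", "amount": 10.0}]
v2_records = [{"user_id": "u-2", "amount": 12.5, "channel": "mobile"}]

df = pd.DataFrame(v1_records + v2_records)

# "channel" is NaN for the v1 rows; fill it here or let the model's
# preprocessing/imputation step deal with it.
df["channel"] = df["channel"].fillna("unknown")
print(df)
```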
Conclusion
In this chapter, we discussed the leap many people need to make from thinking about data in relational databases (RDBMS) to thinking about data in mines. Data is no longer in a single location or a unified format, and it can no longer be retrieved with a single statement. The scale of the data, the availability of new tools, the progress of technology, and the expectations of customers are forcing a different way to model your data, and no Oracle DB or Teradata warehouse can support it; they can be small parts of your data solution, but only that. We discussed the significant cracks in the Oracle fortress that push many companies to break free: the new data analytics workloads, the emerging scalable NoSQL lookup DBs, the rise of unstructured data such as text, images, video, and speech, the micro-services, containers, and Lambda functions revolution, and the move to agile development in most IT organizations.