An Executive Introduction to AI/ML/DL

Every business executive and manager must understand the concepts of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). This disruptive and fast-evolving technology is already changing many businesses today and will change many more in the near future. Here is a brief introduction to the essentials you need to know to better harness its power for your business and direct its usage within your organization.

This post is based on part of the “Data and AI Ideation Workshop” that we (Aiola) deliver to many traditional companies to bootstrap significant AI transformations and improve business thinking about integrating AI.

Illustration of the evolution of AI, ML, and DL

Since the 1950s, efforts to use the newly invented mechanical computation technology for tasks more complex than arithmetic have pushed researchers, engineers, and businesses to try to realize the dream of artificial intelligence.

The dream didn’t start then.

“Any sufficiently advanced technology is indistinguishable from magic.” — Arthur C. Clarke’s third law

The original “mechanical Turk,” from Wikimedia Commons

The Mechanical Turk or Automaton Chess Player was a fake chess-playing machine constructed in the late 18th century. From 1770 until its destruction by fire in 1854, various owners exhibited it as an automaton, though it was eventually revealed to be an elaborate hoax. In fact, the Turk was a mechanical illusion that allowed a human chess master hiding inside to operate the machine. With a skilled operator, the Turk won most of the games played during its demonstrations around Europe and the Americas for nearly 84 years, playing and defeating many challengers, including statesmen such as Napoleon Bonaparte and Benjamin Franklin.

How the “magic” was done

The technology of the 18th century could not allow a machine to play chess successfully, let alone beat the world champion. We had to wait more than two hundred years to make it happen.

The quote above is Arthur C. Clarke’s third law. His first law was:

“When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.” — Arthur C. Clarke’s first law

This quote is also relevant to the development of AI over the years. In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a book about the new technology that mimics how neurons in the brain operate.

Perceptrons — Minsky and Papert 1969

The book proved mathematically some inherent limitations of the new neural network technology, which was supposed to be the foundation of AI systems.

The boom and bust cycle of AI research

This book, together with a negative UK government report (the Lighthill report) and the many limitations of computer technology and computer science at the time, led to the first AI winter. The second bubble, and its burst, came a decade later, after the complexity and cost of the dedicated hardware and programming language (LISP machines) built to create AI expert systems were judged impractical for commercial usage.

A powerful demonstration of AI’s capabilities today is the online game “Quick, Draw!” from Google.

Quick, Draw! start page

Before you continue reading, I recommend trying it out to see how quickly and accurately the AI powering this game identifies doodles.

While you were playing the game, you also contributed more examples to the massive dataset of doodles that the AI engine uses to learn to identify the 345 categories such as “Power outlet” or “Face.”

The Quick, Draw! dataset

The game’s main demonstration is the magical way the AI engine understands the unstructured data of random people doodling while they are drawing. The AI often knows what you are going to draw after the first couple of strokes. It can also distinguish between very similar drawings, such as “Power outlet” and “Face”:

Quick, Draw! example to distinguish between power outlet and face doodles

Another important lesson to learn from the “Quick, Draw!” game is the need for a lot of diverse data for the AI to learn from. The way people draw objects varies across countries, generations, and time.

Quick, Draw! example of merging different styles of drawing

For example, a “Power outlet” looks different in the US and in Europe. The game also records the user’s location and the time of drawing and uses this information in its classification decision. If the model sees only US drawings, it will find it hard to understand a doodle drawn by a European user. Think about how people drew a “Mask” before 2020 and how they draw it after the Covid-19 pandemic. AI models must be refreshed, similarly to how humans adjust to the changing world around us.

In the evolution diagram above, we saw that in the 1980s, machine learning started to pick up speed. Computer science was more advanced than the science and technology of the ’50s. The primary leap that machine learning offered in how computer systems are developed is the difference in “who develops the program?”

Traditional Programming compared to Machine Learning

In a traditional software development process, the domain expert defines the program’s specifications, and a software developer writes the code that takes the required input and applies the business logic to get the desired output. In machine learning, the domain expert gives examples of the input data and the program’s expected result and lets the machine learning system generate the program that transforms the input data into the desired output.
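The difference can be sketched in a few lines of Python. The temperature-conversion task below is a hypothetical illustration (not from the workshop): the traditional approach hand-codes the rule, while the machine learning approach recovers the same parameters from examples alone.

```python
import numpy as np

# Traditional programming: a developer hand-codes the business logic.
def handcoded_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

# Machine learning: we only supply input/output examples and let an
# algorithm discover the parameters (here with a simple least-squares fit).
celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
fahrenheit = np.array([32.0, 50.0, 68.0, 86.0, 104.0])

a, b = np.polyfit(celsius, fahrenheit, deg=1)  # learns a ≈ 1.8, b ≈ 32.0

print(round(a, 2), round(b, 2))
print(handcoded_fahrenheit(25), round(a * 25 + b, 1))  # both give 77.0
```

Both programs produce the same conversion, but the second one was never written by a human; it was fitted from the examples.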

Another essential part of the difference is the Quality Assurance (QA) of the program. In traditional software development, the program’s testing is limited to the manual work of the QA team and their imagination in coming up with input and output examples. In machine learning, the QA process is inherent in how the algorithm calculates the model. We will discuss the concept of minimizing the model’s error shortly.

The bottom line is that manual software development is hard, limited in scope, slow, and error-prone. In comparison, the machine learning way (with the right data and experience) of developing programs is scalable, fast, and accurate. However, it is hard to understand the program’s logic.

Differences between manual software development and machine learning

As you can probably guess, machine learning and deep learning are based on advanced mathematics and computer science concepts. Even if you didn’t like math in school or found it hard to apply in real life, we will describe the simple concepts that are important for building a better intuition of how it works and when:


y = a*x + b, where y is the output we want to calculate based on the input x. The parameters we need to find to transform the input into the output are a and b.

As simplistic as this function looks, it captures the essence of the math behind machine learning and, later, deep learning. The input x is not a single number, and the parameters a and b are also not limited to single values. They are often an array or vector of values, sometimes a matrix of values, or even higher-dimensional structures called, in general, tensors (Google’s TensorFlow framework is named after the flow of tensors from the input to the output). Nevertheless, the math, defined as part of the field of linear algebra, is very similar. More than that, computers are very good at this kind of calculation (matrix and vector multiplications such as a*x) and can solve these computational problems quickly.
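To make the jump from scalars to tensors concrete, here is a small NumPy sketch of the same y = a*x + b formula, first with single numbers and then with a matrix of parameters (the values are arbitrary):

```python
import numpy as np

# Scalar case: y = a*x + b
a, b, x = 3.0, 2.0, 5.0
y = a * x + b                      # 3*5 + 2 = 17.0

# Vector/matrix case: the very same formula, where A is now a matrix of
# parameters and x and b are vectors. NumPy's `@` is matrix multiplication.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x_vec = np.array([5.0, 6.0])
b_vec = np.array([0.5, -0.5])
y_vec = A @ x_vec + b_vec          # [1*5+2*6+0.5, 3*5+4*6-0.5] = [17.5, 38.5]
print(y, y_vec)
```

The formula looks identical in both cases; only the shapes of the values change, and computers execute the matrix version extremely fast.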

Let’s look at an example of a simple machine learning model, a decision tree, to illustrate the concepts. The example is illustrated nicely in an interactive site; please visit it to see the flow of generating the decision tree.

Decision tree illustration

In this example, the input x is a set of features (or attributes) that we have for each house in our dataset, and the expected output is the house’s location: NYC or SF. The parameters that the simple machine learning model calculates are the values of the features (such as the house’s elevation or its price) that split the houses between the two cities. The simple model in the diagram above is not accurate enough, and it makes a lot of mistakes in its classifications. This problem is called under-fitting.
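In its simplest form, such a tree reduces to a single question. The toy sketch below (the house values are made up for illustration, not taken from the original dataset) classifies houses with one split on the elevation feature:

```python
# Hypothetical toy data: (elevation in meters, price in $1000s, city label).
houses = [
    (73, 2300, "SF"), (52, 1800, "SF"), (60, 2100, "SF"),
    (4, 900, "NYC"), (10, 1100, "NYC"), (2, 950, "NYC"),
]

def classify(elevation, price, threshold=30):
    # A one-level decision "stump": a single split on the elevation feature.
    return "SF" if elevation > threshold else "NYC"

correct = sum(classify(e, p) == label for e, p, label in houses)
accuracy = correct / len(houses)
print(accuracy)  # 1.0 on this tiny, cleanly separable toy sample
```

On real data, where the feature values of the two cities overlap, a single split like this makes many mistakes, which is exactly the under-fitting problem described above.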

We can continue to add more features to the input and more layers to the decision tree.

Deeper decision tree illustration

This deeper decision tree looks much more accurate, as it does not make a single mistake on the training data. Nevertheless, this decision tree model is not as good as it seems: it suffers from over-fitting. It learned to fit the training data perfectly, and once it is presented with different data, it will not be as accurate as we would expect based on the perfect result above.

A good model of machine learning can generalize from the training examples. The way to do that is to control (or regularize) the parameters that it is using, not too few and not too many.

Another entertaining example is predicting the fate of the characters of the HBO series “Game of Thrones” (GoT).

“Game of Thrones” life-or-death prediction using machine learning

The machine learning model was developed by a computer science class at the University of Munich. It was able to predict, relatively accurately, the fate of the different characters that millions of people followed every week in the popular series, answering the question: “Will they die in the next episode?”

The input (x) for the model was the following (B = Boolean (0/1), C = Category, N = Number):
1. Did appear in “A Feast for Crows”: B
2. Part of which House: C
3. Part of which Culture: C
4. Did appear in “A Dance with Dragons”: B
5. Is noble: B
6. Gender: B
7. Title: C
8. Age: N
9. Did appear in “A Storm of Swords”: B
10. Is married: B
11. Is spouse alive: B
12. Did appear in “A Clash of Kings”: B
13. Related to dead: B
14. Did appear in “A Game of Thrones”: B
15. Popularity score: N
16. Is father alive: B
17. Major / Minor character: B
18. Is mother alive: B
19. Is heir alive: B
20. Number of dead relations: N
21. Spouse: C
22. Father: C
23. Mother: C
24. Heir: C

If you think about each of the features above and its effect on the chance of surviving in the series, it makes a lot of sense. For example, being female is a good indicator of a longer life span in GoT. However, a human developer would find it impossible to find the best combinations of these effects and write an accurate enough algorithm to make this prediction (y — dead or alive).
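The feature types above (B, C, N) can be sketched in code: booleans become 0/1, categories become indices (or one-hot vectors), and numbers stay as they are. The character values below are hypothetical placeholders, not the real dataset:

```python
# Hypothetical characters with a few of the feature types listed above.
characters = [
    {"house": "Stark", "gender": "female", "is_noble": True, "age": 14, "popularity": 0.9},
    {"house": "Lannister", "gender": "male", "is_noble": True, "age": 33, "popularity": 0.8},
]

house_index = {"Stark": 0, "Lannister": 1, "Targaryen": 2}  # category -> code

def encode(c):
    return [
        house_index[c["house"]],              # C: category as an index
        1 if c["gender"] == "female" else 0,  # B: boolean as 0/1
        1 if c["is_noble"] else 0,            # B
        c["age"],                             # N: numeric stays numeric
        c["popularity"],                      # N
    ]

X = [encode(c) for c in characters]
print(X)  # [[0, 1, 1, 14, 0.9], [1, 0, 1, 33, 0.8]]
```

Once every character is a row of numbers like this, the model can search for the combinations that best predict the dead-or-alive label.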

The machine learning process usually starts by guessing the model’s parameters (random values). It then uses the parameters to make predictions on the training examples and evaluates the difference between the model’s prediction and the real value (the model error). It repeats this process multiple times over the numerous examples, each time changing the parameters to minimize the error.

Let’s execute a simple simulation of the process.

y = a*x + b is the function we need to calculate, and for it we need to know the values of the parameters a and b.

We have a single example or a single data point: (x=5, y=30).

We will start with random values for a = 1, b = -1.

a*x + b = 1*5 - 1 = 4

First iteration with random values for the parameters a and b

We know that y=30 and not y=4, so we can calculate the error = |30 - 4| = 26.

Next, we want to minimize the error; therefore, we will increase the values of a and b to get closer to 30. We will use a = 3, b = 2.

a*x + b = 3*5 + 2 = 17

Second iteration with different values in the direction of smaller error

We can now calculate the new, smaller error = |30 - 17| = 13, half of the previous error.

We can repeat the process more times and add more examples until the error is small enough. After a few iterations we stop with a = 5 and b = 5 (5*5 + 5 = 30, error = 0).

Last iteration with parameters values that are giving a small enough error

This is a very simplified example that we could also solve analytically, as we know how to calculate a line’s parameters from a couple of points. In real-life machine learning models, the number of parameters and the number of data points are much larger. We can only approximate the parameters’ values to satisfy as many of the data points as possible. The iterative process above can be executed quickly and efficiently by a computer.
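The iterative loop described above can be sketched with simple gradient descent, which nudges each parameter in the direction that shrinks the error. The three data points below are invented so that a = 5, b = 5 fits them exactly:

```python
# Data points consistent with y = 5*x + 5 (chosen for illustration).
points = [(1.0, 10.0), (3.0, 20.0), (5.0, 30.0)]

a, b = 1.0, -1.0          # start from an arbitrary random-like guess
learning_rate = 0.02

for _ in range(5000):
    for x, y in points:
        prediction = a * x + b
        error = prediction - y
        # Nudge each parameter in the direction that reduces the error.
        a -= learning_rate * error * x
        b -= learning_rate * error

print(round(a, 2), round(b, 2))  # converges to a = 5.0, b = 5.0
```

Each pass shrinks the error a little, exactly like the manual iterations above, until the parameters settle on values that satisfy all the examples.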

So far, we have seen only how machine learning handles data that is numeric (age, for example), boolean (0 or 1, such as “is royalty”), or categorical (such as a house from GoT). It is easy to assign a numeric value to each of these features and use it to build machine learning models, as discussed above.

However, most of the physical reality around us does not fall into these simple values. We have a lot of “unstructured data,” such as images, videos, text, speech, events, locations, and many others. As humans, we know how to understand them and incorporate them into our decision-making process. We can see a pedestrian and stop our car, read a restaurant review and decide if we want to eat there, and listen to a joke and laugh at it, among many other human interactions with others and the world around us. We don’t see, hear, read, touch, or smell numbers.

In the last decade, we discovered many ways to encode this complex reality so AI can make decisions. There are many examples of processing images (Face ID in iOS, for example), speech (Alexa, Siri, or Google Assistant, for example), or text (Gmail Smart Compose, for example).

Let’s take a more in-depth example of a deep learning process to recommend movies to users. Ideally, if we knew each user’s personality and the characteristics of each movie, we could easily make the match. For example, if we knew how much a user likes French movies, about crime, by Luc Besson, with a strong female character and light humor, we could predict very accurately whether he would like the movie “La Femme Nikita.”

With brief mathematical notation, S = U*M: we want to create a vector of values describing each user (U) and a vector of values for each movie (M), and multiply them to calculate the score of that user for that movie (S). If there is a high match between these vectors, the score will be high; if not, the score will be low. We can take the highest scores and generate the recommendations (Max(S)).
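The vector-matching idea can be sketched directly. The four "taste" dimensions and all values below are hypothetical (in a trained model the dimensions carry no predefined meaning), but they show why matching vectors produce a high score:

```python
import numpy as np

# Hypothetical 4-value taste vectors: [likes French films, likes crime,
# likes strong female leads, likes light humor].
user = np.array([0.9, 0.8, 0.9, 0.7])        # our hypothetical viewer
nikita = np.array([0.95, 0.9, 0.9, 0.6])     # "La Femme Nikita"
home_alone = np.array([0.0, 0.1, 0.2, 0.9])  # a poor match for this viewer

# The score S = U*M is a dot product: aligned vectors give a high score.
s_nikita = user @ nikita
s_home = user @ home_alone
print(s_nikita > s_home)  # True: the better-matched movie scores higher
```

Recommending is then just sorting the movies by this score for each user and keeping the top ones.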

Let’s try to explain this concept with a nice example taken from an excellent free online course on deep learning. You can run this example yourself in Microsoft Excel using this Excel file: Download.

Random values for users and movies at the beginning of the model training

When you open the Excel file, you see the above screen, taking data from an extensive dataset of movie scores. For each user and movie in the dataset, it randomly creates a vector of numbers. These numerical representations of each movie and user are entirely random and have no meaning at the beginning of the process.

Next, each movie’s vector is multiplied by each user’s vector to generate a score. These “predicted scores” are still completely random and don’t match the actual scores the users gave the movies. Then, the average error of all the scores (the difference between the predicted scores and the actual scores) is calculated and shown at the sheet’s bottom right. The value of the error (2.81, in this case) is roughly the average error of guessing a number between 1 and 5 without any prior knowledge, similar to rolling a five-sided die.

Solver command in Excel toolbar

Now, we can use the Solver option in Excel to minimize the error by changing the values of the numerical representations of the users and the movies.

Calculated values at the end of the training process

After a few minutes of calculation, we can stop training the model when we see that the error is small enough. The average error is now less than 1, which means the model can now predict the score with errors such as 4 instead of 5, or 2 instead of 3, a much better position from which to use these predicted scores to make recommendations.

We applied the simple concept of minimizing the error to a naive assumption: that we can capture users’ preferences with one vector of numbers and the characteristics of movies with another, and predict the score each user will give each movie by multiplying these vectors.

We have no idea what each of the parameters or values in the vectors means. We didn’t train the first parameter to capture ‘Action’ movies or the second for ‘French’ movies. We allowed the algorithm to start with random parameters and change them to minimize the error.
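The whole Excel exercise can be reproduced in a few lines of NumPy: start both sets of vectors at random, then repeatedly adjust them to shrink the gap between the predicted and actual scores. The tiny ratings matrix below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
ratings = np.array([[5.0, 1.0],
                    [4.0, 2.0],
                    [1.0, 5.0]])          # 3 users x 2 movies (made-up scores)

k = 2                                     # length of each latent vector
U = rng.normal(0, 0.5, (3, k))            # random user vectors
M = rng.normal(0, 0.5, (2, k))            # random movie vectors

initial_error = np.abs(U @ M.T - ratings).mean()

lr = 0.05
for _ in range(5000):
    error = U @ M.T - ratings             # predicted minus actual scores
    # Gradient steps: move each set of vectors to reduce the squared error.
    U, M = U - lr * error @ M, M - lr * error.T @ U

final_error = np.abs(U @ M.T - ratings).mean()
print(final_error < initial_error)  # True: training shrinks the error
```

This is the same loop the Excel Solver performs: only the error is specified, and the "meaning" of each vector dimension emerges from the data.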

If we try to “understand” what the AI “sees,” we can look at the parameters and guess. For example, if we choose one of the parameters and check the movies with high values for it, we see the following list:

High value in one of the parameters:
[(tensor(1.1481), ‘Casablanca (1942)’),
(tensor(1.0816), ‘Chinatown (1974)’),
(tensor(1.0486), ‘Lawrence of Arabia (1962)’),
(tensor(1.0459), ‘Wrong Trousers, The (1993)’),
(tensor(1.0282), ‘Secrets & Lies (1996)’),
(tensor(1.0245), ’12 Angry Men (1957)’),
(tensor(1.0095), ‘Some Folks Call It a Sling Blade (1993)’),
(tensor(0.9874), ‘Close Shave, A (1995)’),
(tensor(0.9800), ‘Wallace & Gromit: The Best of Aardman Animation (1996)’),
(tensor(0.9791), ‘Citizen Kane (1941)’)]

If we look at the lower values, we can see the following list:

Low value in one of the parameters:
[(tensor(-1.2520), ‘Home Alone 3 (1997)’),
(tensor(-1.2118), ‘Jungle2Jungle (1997)’),
(tensor(-1.1282), ‘Stupids, The (1996)’),
(tensor(-1.1229), ‘Free Willy 3: The Rescue (1997)’),
(tensor(-1.1161), ‘Leave It to Beaver (1997)’),
(tensor(-1.0821), ‘Children of the Corn: The Gathering (1996)’),
(tensor(-1.0703), “McHale’s Navy (1997)”),
(tensor(-1.0695), ‘Bio-Dome (1996)’),
(tensor(-1.0652), ‘Batman & Robin (1997)’),
(tensor(-1.0627), ‘Cowboy Way, The (1994)’)]

Can we call this the “Classic” parameter? It can be a good way for us to “understand” the model that our deep learning algorithm created. However, it is still hard to use it beyond generating new recommendations. We can’t explain why the model calculated any specific vector or why it gave any specific recommendation.

We can also look at the map of movies that the model created by taking the 50-dimensional representation and projecting it onto two dimensions:

Projection of the movie vectors onto a 2D map

And we can see some movies where we understand why they are close to one another. For example, a few of the Star Wars movies are in proximity to one another in the top right corner, and “The Terminator” is close to “Return of the Jedi” and “Raiders of the Lost Ark” (all about humanity’s victory over the dark side). However, we can’t explain every similarity. Nevertheless, the model is very effective at generating movie recommendations, and similar models are at the core of the Netflix and Amazon recommendation engines.

The main leap from classical machine learning techniques to the newer methods of deep learning is this concept of encoding unstructured objects, such as the users and movies in the example above. This capability opened AI to handle many new types of data: vision, natural language, time-series events, etc.

Deep learning also simplified the need to define the input features of the models. For example, we don’t need to define gender as a feature for the GoT model; we can feed in a picture of the character (or the full episode video) and let the deep learning model find the features. We can try to understand how the model does that and guess that it can predict the gender and even the age of the characters, much as the data scientists defined them in the example above. We can also guess that the model will build an “attractiveness” feature based on these images. It is much harder for a human machine-learning model builder to do the same.

The differences between traditional machine learning and deep learning

In traditional machine learning model development, a domain expert must define the important features, a data scientist must develop the program that calculates these features from the input data, and neither can tell whether these are the best features for the model. In the newer deep learning model development, the features are calculated from the raw input during the model’s training.

Deep learning offers a more powerful and robust way to train AI models on more data types with fewer human mistakes. The domain expert can point to the data that is useful for the model without a detailed specification of the features in the data. The end-to-end model calculates the features as part of calculating the parameters, which makes it more tuned and focused on the actual task of the trained model.

The downsides of deep learning are the large amounts of data needed to calculate the deep learning models’ much larger number of parameters, and the fact that it is even harder to understand and explain the predictions of the trained models.

The advances of AI over the last 70 years have made it possible to say that AI is real and not based on magicians’ tricks. It is used extensively in many production systems, and it is entering many new systems all the time. However, we are only in the early days of AI, and the level we have reached is commonly known as Weak AI or Narrow AI. AI systems are usually excellent at a particular, narrow task. Unlike humans, who can perform multiple tasks, most AI-based models can perform only a single task. For example, a self-driving car model can be an expert in measuring the distance to objects in the path of the car and identifying them. However, it can’t recommend a good driving playlist to play when riding with the kids to school.

30-year cycles of AI development

In the future, we believe AI will be more similar to the human capability of making a wider range of decisions, which is called Strong AI. Some also believe that AI will be able to perform much better than humans, using the collective knowledge of AI models to create a Super AI system that can fix global problems such as global warming or other complex problems that require an accurate balance of billions of parameters and factors to achieve the smallest error possible.

Until we get to that future of stronger and even super AI, we can still build powerful systems by connecting multiple weak AI models. For example, let’s take a demo from AWS evangelist Boaz Ziniman, which connects a few of Amazon’s AI models into a nice demonstration.

Example of an input image for the face recognition demo from: Cabeca de Marmore via Shutterstock

After calling the script, the model’s output, which includes the highlighted faces and their descriptions (including perceived age and sentiment), is spoken out in English and Dutch.

Piping multiple narrow AI models to generate wider output

The multiple AI models in this pipeline each do a single task, and they do it very quickly and with excellent quality. The face detection, age estimation, expression or sentiment classification, machine translation, and text-to-speech models are all connected to create a powerful demonstration of what can be achieved today with a few lines of code and a few fractions of a cent (plus the millions of dollars that were spent on training the AI models used by the flow).
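The piping pattern itself is simple to sketch. In the real demo each step calls a managed cloud model; the functions below are hypothetical stand-ins (including the fixed face attributes they return), so only the composition of narrow models is illustrated:

```python
# Hypothetical stand-ins for the narrow models in the demo. A real system
# would call a managed service for each step (face detection, translation,
# text-to-speech); here each function fakes its single task.
def detect_faces(image_name):
    # Stand-in: pretend we found one face with an age range and an emotion.
    return [{"age_range": (25, 35), "emotion": "HAPPY"}]

def describe(face):
    low, high = face["age_range"]
    return f"A {face['emotion'].lower()} person, roughly {low} to {high} years old."

def translate_to_dutch(text):
    # Stand-in for a machine-translation model.
    return "(in Dutch) " + text

def speak(text):
    # Stand-in for a text-to-speech model; we just return the spoken text.
    return text

# The pipeline: each narrow model does one task, and the output of one
# becomes the input of the next.
for face in detect_faces("group_photo.jpg"):
    sentence = describe(face)
    print(speak(sentence))
    print(speak(translate_to_dutch(sentence)))
```

Each function is narrow and replaceable; the value comes from the composition, which is exactly what the demo shows with real cloud models.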

We started our story with a chess machine, the Mechanical Turk, that was discovered to be a hoax. Since then, we have built a machine that can win against the human world champion. Many people remember the iconic image of Grandmaster Garry Kasparov losing to IBM’s Deep Blue in 1997.

Kasparov against ‘Deep Blue’ in New York, in 1997. — REUTERS / PETER MORGAN

The immediate public reaction was that machines are now better than humans at chess, and that slowly they will beat us in every aspect of intelligence.

This is a false impression. When Garry Kasparov was asked how he lost to the AI, he answered that the software could learn from every game he had ever played, while he could not see a single game of the AI engine. People can learn a lot from data too, even if they learn differently from machines. Without data, even humans are less capable.

Garry Kasparov continued to think and write about his experience, and he published a few books on different topics.

Garry Kasparov’s books

Besides the chess books, he published a book with the scary title “Winter Is Coming.” Given his experience of losing to the machine, the immediate reaction is to think the book is about the rise of the machines, in line with “The Terminator” series. However, when you check the book’s subtitle, you can see that he sees the threat as coming from real people such as Putin, not from artificial intelligence. Moreover, his other book is “Deep Thinking,” with the subtitle “Where machine intelligence ends and human creativity begins.” His main conclusion is about the exciting opportunities in having AI’s amazing capabilities operate side by side with creative humans. He claims that a good AI and a good human operator together can beat any AI or human separately, in chess or any other intelligent activity.

In this post, we covered fundamental concepts of the evolution of artificial intelligence, from the days of the dreamers, through the years of the builders across two long winters, to the domain’s explosion in recent years, where AI is entering systems in almost every part of life.

We discussed the main concepts that make AI one of the most powerful technologies we have invented, with the ability to learn and exceed human capabilities, including in many tasks we used to think were unique to intelligent humans (such as evaluating the characteristics of movies and the preferences of viewers).

Lastly, I hope we left you with an optimistic view of a joint future alongside machines with AI, and the desire to introduce more AI-based services into your immediate environment to help you in your work and personal life.

Guy Ernest is the co-founder and CTO of @Aiola, an AI company serving large enterprise AI transformation in the cloud. Guy is an ML-Hero of AWS.
