Are you new to data science and machine learning? Don’t worry- we all were at some point. Here we want to help you understand what it is in its most basic, clear form.
If your background in data science isn’t super deep, this should still be easy for you to follow. We’ll explain what machine learning is, the different types of it, and some of its most common models. Don’t worry- there’ll be no math!
Machine learning happens when you feed data into a computer program, choose a model for that data, and the model then allows the computer to make predictions based on the data. Computers build these models using algorithms- these can be simple equations or more complex systems of math.

Machine learning is called “machine” learning because once you pick a model, the machine does the rest for you, learning the patterns in your data.
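If you’re curious what that “pick a model, let the machine learn” workflow looks like in practice, here’s a tiny sketch using Python’s scikit-learn library- our choice of tool here, not something the idea requires, and the data and model are just placeholders:

```python
# A minimal sketch of the machine learning workflow, assuming scikit-learn is
# installed. The numbers and the choice of model are placeholders.
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [4]]   # input data: one measurement per example
y = [0, 0, 1, 1]           # the known outcome for each example

model = LogisticRegression()   # you choose the model...
model.fit(X, y)                # ...the machine learns the patterns
print(model.predict([[2.5]]))  # ...and makes a prediction for new data
```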
There are two types of machine learning- supervised and unsupervised.
Supervised machine learning is when the data in your model is labeled. Labeled means that the outcome is known. Let’s say you want to predict what type of sandwich your friend will buy from the store. You’d have variables like day of the week, maybe hunger level, etc. With supervised learning, you’d also have the outcome- the type of sandwich.

Unsupervised machine learning, then, is when you don’t know the outcome of your data. The computer will find patterns in your data on its own, rather than being told the outcome.
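To make the difference concrete, here’s a rough sketch, again with scikit-learn. The sandwich data is made up, and the “day of the week” and “hunger level” features are just the hypothetical variables from the example above:

```python
# Supervised learning gets labels (y); unsupervised learning only gets X.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1, 2], [2, 5], [5, 1], [6, 4]]     # day of week, hunger level (made up)
y = ["turkey", "ham", "turkey", "ham"]   # the known outcomes

supervised = LogisticRegression().fit(X, y)        # learns from the labels
unsupervised = KMeans(n_clusters=2, n_init=10).fit(X)  # no labels: finds its own groups
print(unsupervised.labels_)
```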
Logistic regression is used when you’re dealing with classification problems. The variable you are trying to predict, your target variable, has categories like yes/no, often coded as a number like 1/0. A logistic regression model uses an equation to fit a curve to your data, and that curve then predicts the category of a new observation.
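Here’s what that can look like with scikit-learn- the yes/no data below is invented purely for illustration:

```python
# A minimal logistic regression sketch; the feature values are made up.
from sklearn.linear_model import LogisticRegression

X = [[10], [20], [35], [50], [65], [80]]  # a single numeric feature
y = [0, 0, 0, 1, 1, 1]                    # a yes/no target coded as 1/0

model = LogisticRegression().fit(X, y)
print(model.predict([[40]]))        # predicted class for a new observation
print(model.predict_proba([[40]]))  # the fitted curve gives a probability
```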
Many beginners start with linear regression, as its algorithm is simple to understand. In its simplest form you have one x variable, and a line of best fit then makes predictions about future data. While linear regression is similar to logistic regression, it is used when the target variable is continuous, i.e. it can be any number or value. An example of this is the selling price of a house. Linear regression’s model equation has a coefficient for every variable, and each coefficient tells you how much the target variable changes when that independent variable changes.
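A quick sketch of that house price example, again with scikit-learn- the sizes and prices are invented numbers:

```python
# A minimal linear regression sketch; the data is made up for illustration.
from sklearn.linear_model import LinearRegression

X = [[800], [1200], [1500], [2000]]        # house size in square feet
y = [150_000, 210_000, 255_000, 330_000]   # selling price

model = LinearRegression().fit(X, y)
print(model.coef_)              # how much the price changes per extra square foot
print(model.predict([[1700]]))  # predicted price for a new house
```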
K-nearest neighbors (KNN) can be used for classification or regression. First, the model plots the data. The “K” in this model’s title simply refers to the number of close data points that the model uses to determine what the prediction should be. You can choose K and change its value to see which K offers you the best predictions. Any data point inside the circle of the K nearest neighbors gets a vote on what the target value for a new point should be, and whichever value has the most votes is what KNN uses to predict new data points.
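Here’s a minimal KNN sketch with scikit-learn- the points and the choice of K = 3 are made up, and in practice you’d try several K values:

```python
# A minimal K-nearest neighbors sketch with invented two-class data.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]]
y = ["blue", "blue", "blue", "red", "red", "red"]

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # K = 3
print(model.predict([[2, 2]]))  # the 3 nearest neighbors vote on the class
```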
A support vector machine (SVM) creates a boundary between data points, separating the data into two classes. The model then looks for the boundary with the largest margin, the margin being the distance between the boundary and the closest point of each class. New data points then fall into classes depending on which side of the boundary they’re on. You can use this model for both classification and regression.
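A minimal SVM sketch with scikit-learn, using invented two-class data (SVC is the classification flavor; SVR exists for regression):

```python
# A minimal support vector machine sketch; the points are made up.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

model = SVC(kernel="linear").fit(X, y)  # finds the largest-margin boundary
print(model.predict([[2, 2], [8, 7]]))  # class depends on which side a point falls
```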
Some other supervised models include decision trees and random forests, which you can learn more about here.
Here, remember, we don’t know the outcome of our data. That makes unsupervised machine learning a bit more tricky.
With K-means clustering, you start off by assuming you have K clusters in your data. You won’t know how many groups there really are, since there are no outcome variables, so you try different K values to see which works best. K-means works best when clusters are roughly circular and similar in size. The algorithm picks K data points and makes these the centers of the clusters. It then assigns every data point to the closest cluster and recalculates each center as the mean of that cluster’s data points, repeating until the clusters settle.
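Here’s what that looks like with scikit-learn- the points are made up, and K = 2 is just a guess you would normally experiment with:

```python
# A minimal K-means sketch; the data and K value are made up.
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

model = KMeans(n_clusters=2, n_init=10).fit(X)
print(model.labels_)           # which cluster each point was assigned to
print(model.cluster_centers_)  # the mean of each cluster's points
```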
With DBSCAN you don’t have to input a K value, and clusters can be any shape. You input the minimum number of data points you want in a cluster and the radius you’re looking for, and DBSCAN does all the rest. You can change these values, of course, until you get the clusters that work best for your data set. Points that don’t fit in any cluster are classified as “noise” points, which is useful if you have some far-off outliers.
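A minimal DBSCAN sketch with scikit-learn- eps (the radius) and min_samples (the minimum points per cluster) are made-up values you would tune for your own data:

```python
# A minimal DBSCAN sketch; the data and parameters are made up.
from sklearn.cluster import DBSCAN

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [25, 25]]  # last point is far off

model = DBSCAN(eps=2, min_samples=2).fit(X)
print(model.labels_)  # cluster numbers; -1 marks "noise" points like the outlier
```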
These models are called “neural networks” after the complex interconnections of neurons in our brains. A neural network can find patterns that the human eye can’t, making it incredibly useful for data processing. These models are at their best with complex data such as images and audio- that’s why they’re used a lot for things like Facebook’s facial recognition and text classification. They can be used for both supervised and unsupervised machine learning.
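Just to show the idea, here’s a tiny neural network sketch using scikit-learn’s MLPClassifier on made-up data- real networks for images or audio are far larger and usually built with dedicated libraries:

```python
# A minimal neural network sketch; the data and settings are made up.
from sklearn.neural_network import MLPClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

# One hidden layer of 8 "neurons"; the solver tunes their connection weights,
# and lbfgs works well on tiny data sets like this one.
model = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                      max_iter=1000, random_state=0)
model.fit(X, y)
print(model.predict([[2, 2], [8, 7]]))  # predictions for two new points
```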
We hope this explanation of machine learning has helped you understand these models and systems. Computers are incredibly powerful, and we are only in the beginning stages of harnessing that power. We may not always know exactly how computers find these patterns in data, but their findings are nonetheless extremely valuable.