If you have googled machine learning, chances are you have come across projects developed in Python. Python is a versatile language whose applications span from automating boring stuff (pun intended) to deep learning, data mining, and analytics. The scope is nearly endless. You can think of Python as the base on which you build your project with the help of some relevant packages.
The advantage of Python is that there are thousands of packages you can explore to suit your needs, and you don’t have to spend too much time learning Python itself to get started.
In this blog post, let’s look at some packages to get you started with your machine learning project in Python.
Data scraping and readymade datasets for practice:
First, let’s get started with datasets. A good dataset is the starting point of a machine learning journey. In fact, it is the core of the project itself, shaping everything from the preprocessing techniques you use to the algorithms you choose for the best accuracy. If you already have a dataset for your project, you can skip this step. Datasets are what we work on, draw insights from, and build a solution around. Usually, practice datasets come with a problem statement, which is a godsend for beginners!
Packages like scikit-learn already include datasets that you can just load into your program.
To do that, start by installing scikit-learn:
pip install scikit-learn
Then, import datasets from sklearn and load one of the toy datasets listed in the scikit-learn documentation.
from sklearn import datasets
data = datasets.load_iris()
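To check what was actually loaded, it can help to inspect the returned object. The following is a minimal sketch, assuming the iris dataset and the variable name `data` used above; the attributes shown are the ones scikit-learn exposes on its toy datasets.

```python
from sklearn import datasets

# Load the iris toy dataset bundled with scikit-learn
data = datasets.load_iris()

# The returned object exposes the features, labels, and their names
print(data.data.shape)      # (number of rows, number of feature columns)
print(data.feature_names)   # names of the feature columns
print(data.target_names)    # names of the classes encoded in data.target
```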
In addition to this, there are more datasets that can be downloaded from Kaggle or the UCI Machine Learning Repository, which might fit your needs better.
You can also scrape data from the web for your projects, provided that you get the necessary permissions, using packages like scrapy, requests, and beautifulsoup.
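As a rough illustration of the parsing half of scraping, here is a small sketch with beautifulsoup (installed as `beautifulsoup4`, imported as `bs4`). The HTML string, the `title` class, and the post names are made up for the example, so no website is contacted.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for a downloaded one (hypothetical content)
html = """
<html><body>
  <h2 class="title">First post</h2>
  <h2 class="title">Second post</h2>
  <p>Not a title</p>
</body></html>
"""

# Parse the markup with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <h2 class="title"> element
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2", class_="title")]
print(titles)
```

In a real project, the `html` string would typically come from a download step such as `requests.get(url).text`, for a page you are permitted to scrape.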
Data manipulation, preprocessing, and simple processing:
The next step you will most probably concern yourself with is cleaning, preprocessing, or manipulating your dataset. Pandas is a great package to get started with for data manipulation, whether it is adding new columns or joining two datasets into one dataframe. Loading the dataset as a pandas dataframe makes manipulation and processing easy for the developer.
Install pandas on your system with the following command in your terminal,
pip install pandas
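The two operations mentioned above, adding a column and joining two datasets, might look like the sketch below. The tables, column names, and values are invented for illustration.

```python
import pandas as pd

# Two small made-up tables that share the "id" column
people = pd.DataFrame({"id": [1, 2, 3], "height_cm": [170, 182, 165]})
cities = pd.DataFrame({"id": [1, 2, 3], "city": ["Pune", "Oslo", "Lima"]})

# Add a new column derived from an existing one
people["height_m"] = people["height_cm"] / 100

# Join the two tables into one dataframe on the shared key
merged = people.merge(cities, on="id")
print(merged)
```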
Before training some of the machine learning algorithms on the dataset, you might have to modify it, for example by converting non-numerical data to numerical or normalizing some of the numerical columns. Scikit-learn’s preprocessing module is built just for that. It is a great module whose tools range from label encoders to min-max scalers. You can read more about them in the scikit-learn documentation.
A simple example:
from sklearn.preprocessing import MinMaxScaler
from sklearn import datasets

# Load the iris data as NumPy arrays
dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

# Scale every feature column to the [0, 1] range
minmax = MinMaxScaler()
minmax.fit(X)
X_t = minmax.transform(X)

# X is a NumPy array, so slice the first rows instead of calling .head()
print("raw: ", X[:5], "processed: ", X_t[:5])
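The other case mentioned above, turning non-numerical data into numbers, could be sketched with a label encoder as follows. The `colors` list is made-up sample data.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up non-numerical column
colors = ["red", "green", "blue", "green", "red"]

# Learn the categories and replace each value with an integer code
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

print(list(encoder.classes_))   # learned categories, sorted alphabetically
print(list(encoded))            # integer code for each original value

# Map the integer codes back to the original strings
decoded = list(encoder.inverse_transform(encoded))
print(decoded)
```

Note that the scikit-learn documentation describes LabelEncoder as intended for target labels; for input columns, OrdinalEncoder or OneHotEncoder from the same module may be a better fit.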
Plotting packages:
Visualizing your data is important for understanding your dataset and for knowing how far you have progressed with your project. There are many packages that create phenomenal graphs to picture your dataset. Matplotlib and seaborn are two packages with great examples to get started with.
You can install both packages with the following command:
pip install matplotlib seaborn
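As a starting point, here is a minimal matplotlib sketch that plots two iris measurements colored by class. It assumes scikit-learn is installed for the data; the choice of columns, the title, and the output file name `iris_scatter.png` are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

fig, ax = plt.subplots()
# Plot the first feature against the third, colored by class label
scatter = ax.scatter(X[:, 0], X[:, 2], c=y)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[2])
ax.set_title("Iris measurements by class")

# Write the figure to an image file in the current directory
fig.savefig("iris_scatter.png")
```

In an interactive session, you could drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving to a file.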
Check out some ~cool~ graphs we made with matplotlib and seaborn here.
Natural language processing:
Can computers understand human language? How do chatbots hold conversations with users? How do you analyze tweets or run a sentiment analysis on Reddit comments? Python has you covered, from scraping the data from Twitter or Reddit to running algorithms on your dataset. You can use Python packages like NLTK and textblob to carry out these tasks for you.
You can install NLTK and textblob with the following command,
pip install nltk textblob
Neural nets, deep learning, reinforcement learning, etc.:
Artificial neural networks and deep learning have been the talk of the town for some time now, and for good reason. Applications from optical character recognition and price prediction to classifying entities in chatbots all use neural nets. Python supports some of the best deep learning packages in existence, like keras, theano, chainer, and Google’s tensorflow. All of these packages come with great examples for beginners to get started with.
Another notable package for basic neural nets is scikit-learn, whose neural network module suits small applications.
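For a small application, scikit-learn’s neural network module could be used roughly as in the sketch below. The hidden layer size, iteration limit, split ratio, and random seeds are arbitrary choices for the example, not tuned values.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Load iris and hold out a quarter of the rows for testing
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize features using statistics from the training split only
scaler = StandardScaler().fit(X_train)

# A small network with one hidden layer of 16 units
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(scaler.transform(X_train), y_train)

# Fraction of held-out rows classified correctly
accuracy = clf.score(scaler.transform(X_test), y_test)
print(f"test accuracy: {accuracy:.2f}")
```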
Video and image processing:
Image classification, image processing, and video processing are some of the most interesting things Python is capable of. Python supports awesome packages such as OpenCV, scikit-image, pillow, and mahotas, which do a brilliant job at image preprocessing and video processing hand in hand with numpy.
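To show how an image package and numpy work hand in hand, here is a small sketch using pillow (imported as `PIL`). The image is synthetic, generated in code, so the sizes and gradient are assumptions for illustration only.

```python
import numpy as np
from PIL import Image

# Build a synthetic 64x48 RGB image: a horizontal red gradient (made-up data)
height, width = 48, 64
pixels = np.zeros((height, width, 3), dtype=np.uint8)
pixels[:, :, 0] = np.linspace(0, 255, width).astype(np.uint8)

image = Image.fromarray(pixels)   # NumPy array -> Pillow image
gray = image.convert("L")         # convert to single-channel grayscale
small = gray.resize((32, 24))     # resize takes (width, height)

small_array = np.asarray(small)   # Pillow image -> NumPy array
print(image.size, gray.mode, small_array.shape)
```

Note that pillow reports sizes as (width, height), while the numpy array shape is (height, width).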
So, now that you have learned about your toolkit, get out there and explore!