Come for the data science, stay for the memes.

Hello world, it’s Siraj, and data science is the hottest career to get into this year. Every industry is collecting customer data and using it to make smarter decisions, which leads to higher profits. The demand to fill data science positions is through the roof globally, and forecasts reveal that this demand will only increase in the coming years. So to help you take part in this rapidly growing field, I’ve created a three-month curriculum to take you from absolute beginner to proficient in the art of data science. This open source curriculum consists of purely free resources that I’ve compiled from across the web, and it has no prerequisites. You don’t even have to have coded before. I’ve designed it for anyone who wants to improve their skills and find paid work ASAP, either through a full-time position or contract work. You’ll be learning a host of tools like SQL, Python, Hadoop, and even data storytelling, all of which make up the complete data science pipeline.

Data science is the area of study involving extracting insights from data, and a data scientist sits at the intersection of math, software engineering, and data communication, or the ability to communicate insights from data. There are a lot of related positions in the field, ranging from machine learning engineer to data analyst to business analytics specialist. Usually a data scientist is expected to formulate the questions that will help a business and then proceed to solve them, while a data analyst is given questions by the business team and pursues a solution with that guidance. On the other hand, a machine learning engineer’s goal is to build and optimize predictive models. There’s a lot of intersection between data science roles, but the data scientist is usually the most senior role. For example, if we look at a data scientist job posting on the hiring page of one of the big four tech companies, like Google or Facebook, we’ll see that they expect several years of experience and a relevant
undergraduate or even graduate-level degree. That’s because they can afford to do that: everyone wants to work there, and they have more data than anyone else on the planet, so they set the bar very high. But don’t get discouraged by that. If you’re applying as a first-time data scientist, it’s best to avoid applying there and instead apply for a less demanding role like data analyst. Data science jobs at smaller companies are much more forgiving, and you can make up for both a lack of experience and any gaps in formal education by showcasing the depth of your skills. If you start your career there, you can work your way up to one of the bigger companies or, of course, start your own data science business.

I’ve divided this curriculum up into three months. The first month focuses on data analysis, month 2 is all about machine learning, and the last month will have us learn production-grade tools like Spark and Hadoop that data scientists use in the real world. Before I start describing the curriculum, keep in mind that we are practicing accelerated learning. Yes, each week of my curriculum consists of a full online course that’s supposed to take several weeks, but we’re concerned with efficiently downloading as much knowledge into our brains as fast as possible. To do this, watch course videos at 2x or 3x speed using a browser extension, dedicate 2 or 3 hours every day to studying, handwrite notes as you watch for increased memory retention (which has been proven), and complete just one project of your choice from each course at the end of the week to help synthesize the ideas you’ve learned. Also, while you’re learning, immerse yourself in the community by following this great list of data scientists.

For the first week, we’ll want to learn Python, perhaps the most important tool in the data science pipeline. It’s a highly versatile programming language that’s used across many different industries. edX has developed a great course made for absolute beginners to learn Python specifically for
data science. It takes us from Python language fundamentals up to creating plots using real data. Additionally, I’ve developed a fun Learn Python for Data Science playlist, so definitely check that out.

Once we have a basic grasp of Python, in the second week we’ll want to take the statistics and probability course at Khan Academy. It’s actually really fun. Khan Academy’s website has gotten better every year. The course has interactive content, and the mastery points system makes it feel like you’re playing a game. It covers topics like probability distributions, random variables, and hypothesis testing, all of which are supremely useful in the data science pipeline.

After we have a bit more of a mathematical foundation, we can start learning how to perform all sorts of exploratory data analysis techniques, some of which use probability and statistics. Exploratory data analysis is the process of summarizing the main characteristics of a data set. Georgia Tech released a course called Introduction to Computing for Data Analysis that demonstrates how to pre-process, analyze, and visualize a data set. The important thing about this course is that most of the focus is on data cleaning, and in the real world, data scientists will be quick to tell you that most of their time is spent cleaning data. Real-world data is messy. It’s not like Kaggle, where we get neatly packaged data sets. It’s unlabeled, it’s got missing values and irrelevant features, so learning how to carefully sculpt a data set so that it’s ready for further analysis is crucial.

Speaking of Kaggle, the website has become a phenomenal resource for data science enthusiasts. It’s become not only a place for data scientists to compete for prize money by solving problems for companies, but also an incredible learning resource. In fact, Kaggle has a Learn section that contains courses on a series of tools you’ll need to understand data science. Each course is a series of well-documented Kaggle kernels, which are their version of Jupyter notebooks. My only gripe is that
there’s no video content or assignments, but it’s an awesome resource nonetheless, definitely something to browse.

For week four, spend the week solving a Kaggle competition that you personally find interesting. That’s the best way to stay motivated. Pick a completed competition and briefly view one of the kernels to get some sense of what people have done before, then create your own repository and get to work. Document the project very well on your GitHub profile so that anyone who views it, including any future employers, can run the code if they follow your instructions. Remember, GitHub is the new resume.

Now that we know how to clean a data set and explore its different features and relationships, we can start diving into the art of machine learning. Machine learning models help us derive insights from data sets: correlations, classifications, clustering. There are a lot of possibilities. There are several mathematical disciplines that make up ML, and I’ve got a cheat sheet for each of them in the video description that lists the most relevant parts you’ll need to know. Columbia has an excellent course called Machine Learning for Data Science and Analytics on edX. It starts with concepts like search trees and linear programming, applied to a real-world personal genomic data set, to give us an algorithmic foundation. Then it moves into popular machine learning techniques, except for deep learning.

Deep learning is the subset of machine learning focused on just one type of model: neural networks. The online Deep Learning book, specifically parts 1 and 2, will get you up to speed on deep learning very fast, so spend week 7 reading that. Additionally, I’ve got a deep learning playlist on YouTube that’s very extensive. For week 8, it’s time for Kaggle project number 2, this time with the focus on different ways of using either machine learning or deep learning to solve a problem.

For our last month, we’ll focus on learning how the modern data science pipeline works. Data sets usually live in databases, so learning how
to work with databases is important. Udacity’s Intro to Relational Databases is a relatively short but detailed introduction to the basics of Structured Query Language, or SQL, and database design, as well as the Python API for connecting Python code to a database. We’ll also fit another short course into this week on the other type of database, NoSQL. The Intro to NoSQL Data Solutions course by Microsoft on edX is perfect for this. It leads us through the three V’s of NoSQL (variety, volume, and velocity) by demonstrating popular examples like MongoDB.

For week 10, we’ll move on to Hadoop and MapReduce. As Google grew, it had to index more and more data, over a billion pages of content, and in order to cope they invented a new style of data processing known as MapReduce. Hadoop was created to apply these concepts in an open source framework that anyone could use. Data scientists use MapReduce to process data frequently, and the Intro to Hadoop and MapReduce course by Cloudera on Udacity is the perfect way to get familiar with these concepts. There’s also another framework called Spark that is newer than Hadoop and is getting a lot of attention because it’s useful in different ways. Think of it like an extension of Hadoop. Stanford has a one-day workshop on Spark, and we can use the associated slide deck tutorial to learn more.

When you’re working on a team as a data scientist, you’re often tasked with communicating your results to people in different teams so important business decisions can be made. Microsoft has a course on edX called Analytics Storytelling for Impact that perfectly fits this use case. And for the last week, complete one more Kaggle project so you have three great demos to show the world.

Once you finish this course, you can start applying for jobs, do contract work, start your own data science consulting group, or just keep on learning. Remember to believe in your ability to learn. You can learn data science, you will learn data science, and if you stick to it
eventually you will master it. Oh, and find a study buddy to keep you motivated. I’ve created a data-science-in-three-months channel in our Slack group to help you find one. Good luck, I’m rooting for you. Please subscribe for more programming videos, and for now I’ve got to clean my data, so thanks for watching.
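
For anyone curious what the MapReduce style of processing from week 10 looks like before taking the course, here is a minimal sketch in plain Python. It is only an illustration of the map, shuffle, and reduce idea on a word count, running in a single process. It is not Hadoop's API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample sentences are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a large collection of pages.
    pages = ["clean your data", "data beats opinions", "clean data wins"]
    print(word_count(pages))
```

In a real framework the map and reduce steps would run in parallel across many machines, and the shuffle would move data between them. Here everything happens in memory, so only the shape of the computation carries over.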