Preparing for a Data Science Project

When starting a new data science project, there are a few things to consider and keep in mind, so I'm going to use this blog to map out the process and the steps I plan to take to get through this project.


The first thing you need to decide on is the reason behind doing the project. Just having an idea is not enough, because the purpose and objective keep you focused and play a part in the decisions you make each step of the way. Are you looking for trends? Is it going to be a classification or prediction problem? What's the real-life application? These questions will come up throughout the project, and your objective helps you answer them.

The project I plan to start working on is centered around real estate and housing prices. A lot of people make money by flipping houses: buying properties that may be a bit run-down, renovating them, and then selling them for a profit in a relatively short window of time. For this project, I will consider the short window to be selling the house within six months of buying it. Traditionally, you would look at various features such as school district, zip code, size of the property, closeness to public transportation, etc. Many experts rely on experience built up over time to get a feel for which houses will be good for flipping; my purpose is to use machine learning and modeling to predict which properties will be good candidates. If successful, it would take out the guesswork and make the field more open and accessible to those new to it.


Once you have your objective, the process can be broken down into five steps summarized in the acronym OSEMN: Obtain the data, Scrub or clean the data, Explore the data, Model the data (training and testing), and iNterpret the results.

The first step may be obvious, but you can't do data science without the data. If you're working at a company, you might be given the data or you might have to collect it yourself. If you're doing a personal project like mine, chances are you will have to collect your own data, although there are places you can get datasets from, such as Kaggle.

In my case, my plan is to get my data from a real estate website such as Zillow. They do have an API, so I will use that if I can get the data I need; if not, I will build a web scraper. Depending on the limitations of the API or the time needed to build the web scraper, it could take a week or so to get the data.
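Since I haven't seen the API's actual response format yet, here is only a minimal sketch of what the obtaining step might look like: the field names (`zpid`, `date_listed`, and so on) are placeholders I made up, not real Zillow fields, and the raw JSON stands in for whatever the API or scraper returns.

```python
import json

# Placeholder standing in for a raw API/scraper response; the real field
# names and structure will depend on what Zillow (or the scraper) returns.
SAMPLE_RESPONSE = json.dumps([
    {"zpid": "123", "price": 250000, "zipcode": "10001",
     "sqft": 1400, "date_listed": "2020-01-15", "date_sold": "2020-06-01"},
    {"zpid": "456", "price": 310000, "zipcode": "10002",
     "sqft": 1800, "date_listed": "2020-03-02", "date_sold": None},
])

def parse_listings(raw_json):
    """Flatten raw listing records into plain rows for later analysis."""
    rows = []
    for rec in json.loads(raw_json):
        rows.append({
            "id": rec.get("zpid"),
            "price": rec.get("price"),
            "zipcode": rec.get("zipcode"),
            "sqft": rec.get("sqft"),
        })
    return rows

listings = parse_listings(SAMPLE_RESPONSE)
```

Keeping the parsing in its own small function means the same downstream code works whether the rows come from the API or from a scraper.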

The next step, scrubbing, will be the most tedious and time-consuming part. I know this from previous projects, and pretty much anyone in data science will tell you the same. You need to make sure the data is in the correct format, account for any categorical variables, look into missing values and how to address them, and, if you have data from different sources, make sure it matches up correctly.
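As a sketch of what this looks like in pandas, on a toy table whose columns are placeholders for the real dataset: fill missing numeric values (here, with the column median), give missing categories an explicit label, and one-hot encode the categorical variable.

```python
import numpy as np
import pandas as pd

# Toy listing data standing in for the real scraped dataset.
df = pd.DataFrame({
    "price": [250000, 310000, np.nan, 420000],
    "sqft": [1400, np.nan, 1100, 2000],
    "school_district": ["A", "B", "A", None],
})

# Fill missing numeric values with the column median.
for col in ["price", "sqft"]:
    df[col] = df[col].fillna(df[col].median())

# Give missing categories an explicit label, then one-hot encode.
df["school_district"] = df["school_district"].fillna("unknown")
df = pd.get_dummies(df, columns=["school_district"])
```

The median is just one reasonable default; whether to impute, drop, or flag missing values depends on why the values are missing in the first place.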

After the data is cleaned, you can explore it and look for any initial trends or findings. You look at the distribution of your variables, check whether there is an imbalance in your target variable, and note anything else that may be interesting or that can guide you in the next step.
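For example, checking the balance of the target variable is a one-liner in pandas. The series below is made-up data for illustration: 1 for a property that resold within the six-month window, 0 otherwise.

```python
import pandas as pd

# Hypothetical target: did the property resell within six months (1) or not (0)?
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Class proportions: a heavy skew like this suggests the modeling step
# may need class weights, resampling, or a metric other than accuracy.
proportions = target.value_counts(normalize=True)
```

With an 80/20 split like this one, a model that always predicts "no flip" would already score 80% accuracy, which is exactly the kind of finding that should shape the modeling step.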

Modeling might be the most important part of the process, because the models you train and test are what drive the insights and conclusions of the project. However, the success of the modeling lies heavily in how well you did the previous steps. Is the data cleaned and formatted properly? Are you getting the correct results? Are you looking at the right features? One thing to remember, especially at this step, is that there may well be some back and forth between models: you run your models, do some feature engineering, tune some hyperparameters, and see what happens. You also have to decide on the metric you want to focus on, and this is largely dependent on the objective and real-world application of the project.
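Here is a minimal sketch of the train/test loop using scikit-learn on synthetic data (the real features, model, and metric will depend on the dataset). I use precision as an example metric because, for flipping, a false positive is expensive: it means buying a house that turns out to be a bad flip.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned housing features and flip/no-flip label.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out a test set so the metric reflects unseen data, not memorization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Precision: of the houses flagged as good flips, how many actually were?
precision = precision_score(y_test, model.predict(X_test))
```

Swapping in a different model or metric only changes a line or two, which makes the back-and-forth iteration described above cheap to do.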

Interpreting may be the final step, but that does not mean the project is over once you're here. The insights you derive from the results may well lead to more EDA, taking another look at your data, and then running more models and interpreting again. You may see a trend you want to focus on, or you might get really strange results and need to see whether you can explain why that's happening.

As my instructors have told me, it's an art: you have an outline of steps, but in actuality it's not that clear-cut. The actual process is more convoluted and may get confusing or tedious at times. The important thing is not to give up, but to persevere in making sense of the results and not be afraid to ask for help or another opinion when needed.

Data Science student at Flatiron School