How do you explore raw data to get useful information with Machine Learning?

Some useful data exploration tips and tricks...



First of all, collect raw data from various sources. Nowadays a huge amount of raw data is available on the internet, so you can simply download it and store it on your system.


Once the raw data is on your system, always analyze it before applying any operation. Try to picture the full end-to-end pipeline in your mind, and only then move ahead.


The first challenge that every data scientist faces is messy data. Messy data is data that is not formatted correctly or that contains missing values. So the first task is to handle these missing values, and there are various methods for doing so.

Method 1: We can simply remove the rows that contain missing values, but this is often not a good choice because throwing data away can affect our analysis.

Method 2: We can fill the missing values with a statistic such as the mean, median, or mode. Which one to use depends on the situation: for example, the median is less sensitive to extreme values than the mean, and the mode also works for categorical columns.
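The two methods above can be sketched with pandas. This is only a minimal illustration: the library choice, the toy DataFrame and its column names (age, city) are assumptions made up for the example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
df = pd.DataFrame({
    "age": [25.0, None, 31.0, 40.0, None],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Count the missing values per column before deciding what to do.
missing_per_column = df.isna().sum()

# Method 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Method 2: fill the gaps with a statistic computed from the observed values.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())      # numeric column
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # categorical column

print(missing_per_column)
print(dropped)
print(filled)
```

Note that Method 1 keeps only two of the five rows here, which shows how quickly dropping rows can shrink a small dataset.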




After handling missing data, the next challenge is outliers. Outliers are exceptional values that lie far from the rest of the data in our datasets, so we usually need to handle them as well, after checking that they are errors and not genuine rare observations. This brings us to a field called Feature Engineering.


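Feature Engineering -> It is the study of how we can make our datasets more useful for modeling, and dealing with outliers is commonly treated as part of it. There are various techniques for outlier removal, for example the mean and standard deviation method and the z-score method. The two are closely related: a z-score is simply the distance of a value from the mean, measured in standard deviations.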
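As a rough sketch of the z-score method, the snippet below uses NumPy. The sample values are invented for illustration, and the cutoff of 3 standard deviations is a common convention, not a fixed rule.

```python
import numpy as np

# Hypothetical measurements: nineteen ordinary values and one extreme value (95.0).
values = np.array([
    10.0, 11.0, 12.0, 10.5, 11.5, 9.5,
    10.0, 11.0, 12.0, 10.5, 11.5, 9.5,
    10.0, 11.0, 12.0, 10.5, 11.5, 9.5,
    11.0, 95.0,
])

# z-score: distance of each value from the mean, in units of standard deviation.
z_scores = (values - values.mean()) / values.std()

# Keep only the values whose absolute z-score is within the chosen threshold.
threshold = 3.0  # assumed cutoff; tune it for your own data
cleaned = values[np.abs(z_scores) <= threshold]

print(len(values), "values before,", len(cleaned), "values after")
```

Because the mean and standard deviation are themselves pulled by extreme values, this method works best when outliers are few compared to the size of the dataset.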


After completing all the previous steps our data will be much cleaner, although how clean it gets depends on the skill of the person doing the work.




Visualization -> It is one of the best ways to understand data graphically, and there are many libraries for it in Python, e.g. matplotlib and seaborn.
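A small matplotlib sketch of two common exploratory plots, a histogram and a box plot, might look like this. The height data is randomly generated for the example, and the output file name heights.png is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 200 simulated heights in centimetres.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=200)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# The histogram shows the overall shape of the distribution.
ax_hist.hist(heights, bins=20)
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("height (cm)")

# The box plot shows the median, the quartiles, and possible outliers.
ax_box.boxplot(heights)
ax_box.set_title("Box plot")

fig.tight_layout()
fig.savefig("heights.png")
```

In an interactive session you would normally skip the Agg backend line and call plt.show() in place of saving to a file.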



Now, after cleaning and analyzing the data, it is ready for Machine Learning.




Before using a machine learning algorithm, we first have to split our dataset into train and test sets, so that one part of the data is used for training the model and the remaining part is used for testing it. There are various ways to split datasets in Python.
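One common way to do the split is train_test_split from scikit-learn, sketched below. The built-in iris dataset, the 80/20 ratio, and the random_state value are assumptions picked for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Example dataset bundled with scikit-learn: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing.
# stratify=y keeps the class proportions the same in both parts;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```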


After splitting the data, you can use the scikit-learn (sklearn) module in Python to load machine learning algorithms. Before choosing an algorithm we need to understand our output: is it regression type (a continuous numeric value) or classification type (a discrete category)?


If the output is regression type, then use Linear Regression, etc.

If the output is classification type, then use Logistic Regression, Random Forest, etc.
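To make the two cases concrete, here is a hedged sketch that fits one model of each type with scikit-learn. The regression data is synthetic (generated from an assumed line y = 3x + 2 plus noise), and the classification example reuses the built-in iris dataset. Neither is meant as a recommended setup.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Regression: the target is a continuous number.
rng = np.random.default_rng(0)
X_reg = rng.uniform(0, 10, size=(100, 1))
y_reg = 3.0 * X_reg[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)  # synthetic target
reg = LinearRegression().fit(X_reg, y_reg)
slope = reg.coef_[0]  # learned slope, expected to land near the true value 3.0

# Classification: the target is a discrete class label.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions

print("learned slope:", slope)
print("test accuracy:", accuracy)
```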


After choosing your candidate algorithms, you can validate each model with a cross-validation technique and then decide on your final model.
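A minimal sketch of comparing candidate models with 5-fold cross-validation, using cross_val_score from scikit-learn, could look like the following. The dataset, the two candidate models, and the number of folds are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Candidate models to compare (both are classifiers, matching the iris task).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validation: each model is trained on 4 folds and scored on the
# remaining fold, 5 times over; we keep the average accuracy.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()

best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best candidate:", best_name)
```

In a full pipeline you would usually run the cross-validation on the training portion only and keep the test set untouched for one final check of the chosen model.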