In my last post I went over the differences and similarities between data engineers and data scientists. In this post I’ll go over the same in regard to data analysts and data scientists.
In short, data analysts sift through data and try to find trends. They try to extract stories from the numbers and come up with possible business decisions from the insights they derive. They also are more likely to create visual representations to showcase their findings and interpretations.
A data analyst can be thought of almost as a junior data scientist.
A typical data analyst job description usually…
The terms data science and data engineering get thrown around a lot, but what is the difference? What are the similarities? Both have to do with vast amounts of data, but where do they diverge, and where do they overlap?
Data engineering involves preparing the data infrastructure for analysis. This usually involves extract, transform, load (ETL) operations. Engineers are focused on building and maintaining the data pipeline systems and attributes such as formatting, scaling, and security.
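As a rough illustration of the extract, transform, load steps, here is a minimal sketch using only Python's standard library. The sample CSV, the `customers` table, and the cleaning rules (trim names, drop rows with no spend) are all made up for the example; a real pipeline would read from actual source systems and follow whatever rules the business needs.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for a source system export.
RAW_CSV = """name,signup_date,spend
 Alice ,2021-01-05,10.50
Bob,2021-02-11,
carol,2021-03-02,7.25
"""

def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with missing spend and normalize the fields."""
    cleaned = []
    for row in rows:
        if not row["spend"]:
            continue  # skip incomplete records
        name = row["name"].strip().title()
        cleaned.append((name, row["signup_date"], float(row["spend"])))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(name TEXT, signup_date TEXT, spend REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, spend FROM customers ORDER BY name").fetchall()
print(rows)
```

Keeping the three stages as separate functions makes each one easy to test and swap out, which is roughly what larger pipeline tools formalize.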
Data engineers are more likely to have a software engineering background or other engineering background. They may be proficient in other computer languages than…
Perhaps one of the best books on Python and data science is Géron’s Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. In my opinion, this book is especially good for beginners, and for self-taught data scientists it is an absolute must. It includes several guided projects, with code, to help aspiring data scientists along the way.
Knowing Scikit-Learn, Keras, and TensorFlow is also an absolute must for data scientists, as employers are keen on these libraries and their modules. A brief summary:
Scikit-Learn is a machine learning library for Python featuring various algorithms for statistical analysis, supervised, and unsupervised…
Python and R both have their strengths and weaknesses when it comes to data science. One language isn’t necessarily better than the other, but it comes down to the application and the solution to the questions you’re trying to answer. Data scientists should know both languages to some degree as they are the most used languages for data analysis and statistics. A very basic distinction is that Python is a general purpose language while R was developed for statistics specifically.
Diabetes is a chronic medical condition which is estimated to affect 415 million people in the world. 5 million deaths a year can be attributed to diabetes-related complications, and it is a comorbidity associated with COVID-19 deaths.
Early stages of diabetes are often non-symptomatic, which is problematic for early detection and diagnosis. Poor diet, lack of exercise, and excessive body weight are significant causes. These factors are linked to the onset of type 2 diabetes, which is the result of the body’s gradual resistance to insulin. …
Football is a wonderfully complicated game. It is akin to two armies lining up and moving up and down the battlefield. Coaches routinely read Sun Tzu’s The Art of War to hone their skills. Data analytics can help coaches get an advantage on their opponents; football is not merely a game of talent, but more a game of strategy like chess.
Football is an 11-on-11 matchup with many combinations of player-on-player interactions that influence the play of the game. An 11-on-11 matchup can be complicated to model or predict; however, many of the possible interactions rarely happen, e.g.: a…
Chapter 4 of Neural Network Projects with Python goes through a guided project for classifying cats and dogs from a dataset provided by Microsoft. The best way to classify images, in my opinion, is by using a convolutional neural network (CNN).
I used the VGG-16 CNN to classify gold deposits in Australia for my capstone project with the Flatiron School Immersive Data Science Bootcamp. CNNs can be useful for a variety of image classification and segmentation problems.
This scenario is a pretty basic binary classification, which doesn’t even require a GPU. To run more complicated image classification problems with…
I recently came across a great resource, Neural Network Projects with Python, by James Loy. I am fascinated with neural networks and their applications and am always looking for new projects. This book was a perfect fit for my skill level and interests, and it comes with a great GitHub repository complete with code and solutions.
The first chapter is fairly easy to follow as it goes over a basic neural network architecture with a straightforward prediction problem. …
Clustering is an important technique for unsupervised learning algorithms. It refers to grouping similar data points by their attributes. In this post I will go over Gaussian Mixture Models for clustering.
Gaussian Mixture Models (GMMs) differ from other clustering models in that they assume a certain number of Gaussian distributions, each of which represents a cluster. GMMs will group data points belonging to a single distribution together.
Gaussian distributions, also known as normal distributions, have certain parameters and qualities. Each one is defined by a mean and a standard deviation, which together determine its bell-shaped probability density function.
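To make the idea concrete, here is a minimal sketch of a one-dimensional, two-component GMM fitted with expectation-maximization (EM), written with only the standard library. The toy data, the starting means and standard deviations, and the function names are my own assumptions for illustration; in practice you would more likely reach for a library implementation such as scikit-learn's GaussianMixture.

```python
import math

def gaussian_pdf(x, mean, std):
    """Probability density of a normal distribution evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, means, stds, weights):
    """Probability that x belongs to each Gaussian component."""
    dens = [w * gaussian_pdf(x, m, s) for m, s, w in zip(means, stds, weights)]
    total = sum(dens)
    return [d / total for d in dens]

def fit_gmm_1d(data, means, stds, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given guesses."""
    means, stds = list(means), list(stds)  # copy so inputs are not mutated
    k = len(means)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: soft-assign every point to every component.
        resp = [responsibilities(x, means, stds, weights) for x in data]
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)  # floor avoids a zero-width Gaussian
    return means, stds, weights

# Hypothetical toy data: two groups of points, around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, stds, weights = fit_gmm_1d(data, means=[0.0, 6.0], stds=[1.0, 1.0])

# Hard cluster label = the component with the highest responsibility.
labels = [
    max(range(len(means)), key=lambda j: responsibilities(x, means, stds, weights)[j])
    for x in data
]
print(means, labels)
```

Unlike a hard assignment, the responsibilities give each point a probability of belonging to every cluster; the hard labels at the end are only a convenience.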
Perhaps one of the most widely used methods of unsupervised learning is k-Means clustering. In simple terms, the algorithm partitions the data into k distinct clusters, assigning each point to the cluster whose centroid is closest. This should not be confused with k-Nearest Neighbors, which is a supervised learning method and not a clustering algorithm.
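The assign-then-update loop behind k-Means can be sketched in a few lines of plain Python. This is a simplified illustration, not a production implementation: the toy points and the choice of starting centroids are assumptions I made for the example, and library versions such as scikit-learn's KMeans add smarter initialization and multiple restarts.

```python
import math

def kmeans(points, centroids, n_iter=100):
    """Basic k-Means: assign points to the nearest centroid, then update."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(n_iter):
        # Assignment step: label each point with its closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(centroids, labels)
```

The result depends on the starting centroids, which is why practical implementations usually run the algorithm several times from different initializations.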
Clusters must have certain properties. All points within a cluster must be similar to each other to a certain degree. Having similar properties within the same cluster helps the algorithm group them toward the same centroid.
Data points from different clusters must also be as different as possible in order for the…
I am a data scientist who recently graduated from the Flatiron School Immersive Data Science Bootcamp.