If you're a software developer curious about the life of a data scientist (or you are considering a career as one), this article will give you a glimpse into our world. While we work with code like you do, our challenges and workflows often differ. Let's dive into a typical data science project, the skills we need, and the tools we use.
It all begins with defining the project goals. This means understanding the business need, evaluating the difficulty, and setting realistic expectations with the customer. Here's the first major difference: in software development, you usually know that your goal (e.g., a website or app) is achievable. In data science, we often don't know if a solution is even possible.
Our typical brief is something like, "Here's the data - help us create a model to predict the future." But can the data actually predict anything? It's a mix of known and unknown unknowns. From day one, we explain that data science is iterative: we learn step-by-step and adjust as we go.
Next, we gather domain knowledge from subject matter experts. This helps us understand what the data points actually mean in real life. For example, numbers in a table are no longer anonymous once we know they represent "milk yield" or "machine pressure."
The data itself can be scattered. Sometimes it's an Excel file buried in someone's email, sometimes it's split between on-premise servers and an incomplete cloud environment. And sometimes the "cloud" turns out to be a mythical concept with no actual data. Getting everything in one place can take far longer than expected.
With the data in hand, the next step is to explore. Exploratory data analysis (EDA) is where we visualize the data, look for patterns, and identify issues like missing, duplicate, or inconsistent values. We work iteratively here too. If the data quality is poor, we go back to fill gaps, clean errors, and try again. This process can feel endless, but it's critical for building reliable models.
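To make those checks concrete, here is a minimal sketch of the first pass we might run on a freshly collected table. The rows, column names, and helper functions are all hypothetical and invented for illustration. It uses only the Python standard library; in practice a dataframe library such as pandas would usually do this work.

```python
from collections import Counter

# Hypothetical machine readings; None marks a missing value.
rows = [
    {"machine_id": "A", "pressure": 2.1, "unit": "bar"},
    {"machine_id": "B", "pressure": None, "unit": "bar"},
    {"machine_id": "A", "pressure": 2.1, "unit": "bar"},   # exact copy of the first row
    {"machine_id": "C", "pressure": 30.5, "unit": "psi"},  # unit differs from the rest
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def duplicate_count(rows):
    """Count rows that are exact copies of an earlier row."""
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in seen.values() if n > 1)


def distinct_values(rows, column):
    """List the distinct non-missing values in a column, to spot inconsistencies."""
    return sorted({row[column] for row in rows if row[column] is not None})


print(missing_counts(rows))           # {'pressure': 1}
print(duplicate_count(rows))          # 1
print(distinct_values(rows, "unit"))  # ['bar', 'psi']
```

Each result points at a different follow-up: a gap to fill, a duplicate to drop, and a unit mismatch to resolve with the subject matter experts.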
Once the data is usable, we begin experimenting. This could mean building predictive models or designing experiments to collect better data. In a project where we optimized the settings of some industrial machines, we worked with the customer to test how certain pressure settings on the machine affect the final product. These experiments provided the data we needed to refine the model.
Finally, we develop machine learning models. We need to pick the right modelling technique for the use case, optimize the training parameters, and train the model using our data. We evaluate the model's performance using metrics such as accuracy, precision, recall, and mean squared error.
But beware - metrics can be misleading. For example, if we try to predict a sickness and 90% of patients are healthy, a model that always predicts "healthy" looks great on paper with its 90% accuracy, but fails in predicting the actual sickness. We address these issues with techniques like resampling the data or using more nuanced performance metrics.
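The numbers in that example are easy to check. The sketch below builds a hypothetical set of 100 patients (90 healthy, 10 sick) and scores a "model" that always predicts healthy. The labels and helper functions are made up for illustration and use only plain Python.

```python
# Hypothetical labels: 0 = healthy, 1 = sick.
y_true = [0] * 90 + [1] * 10
# A "model" that always predicts healthy.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(accuracy(y_true, y_pred))  # 0.9
print(recall(y_true, y_pred))    # 0.0
```

Accuracy rewards the model for the 90 healthy patients, while recall on the sick class shows that it never finds a single sick one. This is why we look at more than one metric on imbalanced data.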
Data science is built on mathematical algorithms. A good data scientist needs to understand the fundamentals, including when a technique works, when it doesn't, and how to avoid common pitfalls. Some typical mistakes beginners tend to make are:
We use Python, Bash, Git, Docker, and other tools to process data and productionize models. Writing clean, maintainable code is as important for us as it is for software engineers.
Every project demands some understanding of the domain. Whether it's cow biology or pastry dough physics, we need to know what the data points represent to make meaningful models.
We set expectations with clients, gather insights from subject matter experts, and present findings in a simple, understandable way. We can't just throw equations at stakeholders; we need to make the results relatable. A data scientist therefore can't be just a number-crunching nerd: communication is key.
Not every project is a success. In fact, fewer than 50% of data science projects lead to actionable results. This isn't because we're bad at our jobs; it's the nature of research. Some projects reveal that no predictive relationship exists in the data, while others stall due to changing priorities.
The key is setting expectations early. We tell clients upfront that data science is exploratory. If we uncover opportunities, great. If not, they've still learned something valuable.
The life of a data scientist is a blend of math, coding, communication, and creativity. While we share tools and workflows with software developers, our focus on uncertainty and iteration sets us apart. If you enjoy solving puzzles, learning new things, and working across disciplines, you might find data science as exciting as we do!