If you're a software developer curious about the life of a data scientist (or you're considering a data science career yourself), this article will give you a glimpse into our world. While we work with code just like you do, our challenges and workflows often differ. Let's dive into a typical data science project, the skills we need, and the tools we use.

A typical data science project

Setting expectations

It all begins with defining the project goals. This means understanding the business need, evaluating the difficulty, and setting realistic expectations with the customer. Here's the first major difference: in software development, you usually know that your goal (e.g., a website or app) is achievable. In data science, we often don't know if a solution is even possible.

Our typical brief is something like, "Here's the data - help us create a model to predict the future." But can the data actually predict anything? It's a mix of known and unknown unknowns. From day one, we explain that data science is iterative: we learn step-by-step and adjust as we go.

Understanding the data

Next, we gather domain knowledge from subject matter experts. This helps us understand what the data points actually mean in real life. For example, numbers in a table are no longer anonymous once we know they represent "milk yield" or "machine pressure."

The data itself can be scattered. Sometimes it's an Excel file buried in someone's email, sometimes it's split between on-premise servers and an incomplete cloud environment. And sometimes the "cloud" turns out to be a mythical concept with no actual data. Getting everything in one place can take far longer than expected.

Exploratory data analysis (EDA)

With the data in hand, the next step is to explore. EDA is where we visualize the data, look for patterns, and identify issues like missing, duplicate, or inconsistent data. We work iteratively here too. If the data quality is poor, we go back to fill gaps, clean errors, and try again. This process can feel endless, but it's critical for building reliable models.
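As a minimal sketch of what those first quality checks can look like, here is a pandas example on a made-up table (all column names and values are hypothetical):

```python
import pandas as pd

# Toy sensor readings with deliberate quality issues (hypothetical data).
df = pd.DataFrame({
    "timestamp": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"],
    "pressure":  [101.3, None, 98.7, 98.7],
    "yield_kg":  [12.5, 11.9, 11.9, -3.0],  # -3.0 is physically impossible
})

missing = df["pressure"].isna().sum()               # missing values
dupes = df.duplicated(subset=["timestamp"]).sum()   # duplicate timestamps
impossible = (df["yield_kg"] < 0).sum()             # out-of-range values
print(missing, dupes, impossible)
```

Each of these checks typically triggers a conversation with the domain experts: is a negative yield a sensor glitch, a data-entry error, or something meaningful?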

Experiments and modeling

Once the data is usable, we begin experimenting. This could mean building predictive models or designing experiments to collect better data. In one project, where we optimized the settings of industrial machines, we worked with the customer to test how different pressure settings affect the final product. These experiments provided the data we needed to refine the model.

Finally, we develop machine learning models. We need to pick the right modeling technique for the use case, optimize the training parameters, and train the model on our data. We then evaluate the model's performance using metrics such as accuracy, precision, recall, or mean squared error.
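To make one of those metrics concrete: for a regression model, mean squared error is just the average squared gap between predictions and reality. A toy sketch (all numbers are made up):

```python
# Hypothetical actual vs. predicted values from a regression model.
actual    = [3.0, 5.0, 2.5, 7.0]
predicted = [2.8, 5.4, 2.9, 6.1]

# Mean squared error: average of the squared differences.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
print(mse)
```

Squared errors penalize large misses disproportionately, which is often (but not always) what the business actually cares about.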

But beware: metrics can be misleading. For example, if we try to predict a disease and 90% of patients are healthy, a model that always predicts "healthy" looks great on paper with its 90% accuracy, yet it never detects a single sick patient. We address such issues with techniques like resampling the data or using more nuanced performance metrics.
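That accuracy trap is easy to demonstrate in a few lines of Python (labels are made up; 0 = healthy, 1 = sick):

```python
# 100 hypothetical patients: 90 healthy (0), 10 sick (1).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a naive "model" that always predicts healthy

# Accuracy looks impressive...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ...but recall on the sick class shows we caught nobody.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(accuracy, recall)  # 0.9 0.0
```

This is why class-aware metrics like precision and recall, or a full confusion matrix, matter so much on imbalanced data.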

The skills you need

Mathematics and machine learning fundamentals

Data science is built on mathematical algorithms. A good data scientist needs to understand the fundamentals, including when a technique works, when it doesn't, and how to avoid common pitfalls. Some typical mistakes beginners tend to make:

  • Using too simple or too complex models: Cutting-edge techniques are not always the answer; it's better to start simple and add complexity only when needed.
  • Ignoring time dependencies: In time series data, past events usually influence the future. Treating individual observations in isolation leads to poor predictions.
  • High-dimensional spaces: In 10+ dimensions, proximity measured with the "good old" Euclidean distance becomes counter-intuitive, so alternative distance metrics are often a better choice.
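To see the last point in action, here is a small experiment (standard library only; the point counts and dimensions are arbitrary choices) showing how the gap between the nearest and farthest neighbour shrinks as dimensionality grows:

```python
import math
import random

random.seed(42)

def nearest_to_farthest_ratio(dim, n_points=200):
    """Ratio of the nearest to the farthest Euclidean distance
    from a random query point to a random point cloud."""
    points = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    query = [random.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return min(dists) / max(dists)

# In 2D the nearest neighbour is much closer than the farthest;
# in 1000D nearly all points sit at almost the same distance.
print(nearest_to_farthest_ratio(2))
print(nearest_to_farthest_ratio(1000))
```

When "nearest" and "farthest" become nearly indistinguishable, distance-based methods like k-nearest neighbours lose much of their discriminative power.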

Software engineering

We use Python, Bash, Git, Docker, and other tools to process data and productionize models. Writing clean, maintainable code is as important for us as it is for software engineers.

Domain knowledge

Every project demands some understanding of the domain. Whether it's cow biology or pastry dough physics, we need to know what the data points represent to make meaningful models.

Soft skills

We set expectations with clients, gather insights from subject matter experts, and present findings in a simple, understandable way. We can't just throw equations at stakeholders; we need to make the results relatable. A data scientist can't be just a number-crunching nerd; communication is key.

The tools we use

Programming languages

  • Python: Our go-to language, thanks to its rich ecosystem of libraries (e.g. Pandas, Scikit-learn, TensorFlow) and simplicity.
  • R: Great for statistical modeling but less common for production work.

Notebooks

  • Jupyter Notebooks: A favorite tool for interactive coding. It allows us to write, test, and iterate code step-by-step, making debugging and visualization much easier.

Version Control

  • Git: The gold standard for code versioning.
  • MLflow: For tracking machine learning models and their performance.
  • DVC: For versioning datasets without clogging Git repositories.

Navigating success and failure

Not every project is a success. In fact, less than 50% of data science projects lead to actionable results. This isn't because we're bad at our jobs; it's the nature of research. Some projects reveal that no predictive relationship exists in the data, while others stall due to changing priorities.

The key is setting expectations early. We tell clients upfront that data science is exploratory. If we uncover opportunities, great. If not, they've still learned something valuable.

Conclusion

The life of a data scientist is a blend of math, coding, communication, and creativity. While we share tools and workflows with software developers, our focus on uncertainty and iteration sets us apart. If you enjoy solving puzzles, learning new things, and working across disciplines, you might find data science as exciting as we do!

Author

Pavel Sůva
Data Scientist, Team Lead at Datamole