Pain in the Data

I have been working on a data analytics project for around 3 weeks. The project aims to visualize a database of employees and make it queryable by skills, industry, and specialty. It is an interesting and challenging project: it sounds fairly simple, yet it is taking a surprising amount of time. That is not a bad thing, as I am using the opportunity to verify a well-known claim in data science.

The challenge lies in the data itself. I received the data as a CSV (comma-separated values) file, with each row being a record of Name, Email, ID, Skills List, Skills Scores List, Region, Industry, and Specialty. I was working in Python and developing a web app running on IBM Cloud, so I went with the Pandas library to handle the data for me. Just uploading the data to a database on the cloud was painful. Converting the CSV file to JSON (JavaScript Object Notation) was a challenge because the data was organized so that each employee had a separate row for each region, skill, industry, or specialty. I essentially had to:

  1. Combine all rows for an employee into one row
  2. Clean the data types
  3. Convert to JSON
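
The three steps above can be sketched roughly as follows. This is only a minimal illustration with Pandas, not the code from the project: the sample rows, the column names (`Skill`, `Score`, and so on), and the helper names `distinct` and `combine_employee_rows` are all assumptions made for the example.

```python
import json

import pandas as pd

# Hypothetical sample in the "one row per region/skill" shape described above.
# Column names and values are illustrative, not taken from the real file.
raw = pd.DataFrame(
    {
        "ID": ["1", "1", "2"],
        "Name": ["Ann", "Ann", "Bob"],
        "Email": ["ann@example.com", "ann@example.com", "bob@example.com"],
        "Skill": ["Python", "SQL", "Python"],
        "Score": ["4", "3", "5"],  # scores arrive as text in this sample
        "Region": ["EMEA", "APAC", "EMEA"],
        "Industry": ["Banking", "Banking", "Retail"],
        "Specialty": ["Analytics", "Analytics", "Analytics"],
    }
)


def distinct(values):
    """Return the distinct non-null values of a column, in first-seen order."""
    return values.dropna().drop_duplicates().tolist()


def combine_employee_rows(df):
    """Collapse the per-skill/per-region rows into one dict per employee."""
    df = df.copy()
    # Step 2: clean the data types (unparseable scores become NaN).
    df["ID"] = df["ID"].astype(str).str.strip()
    df["Score"] = pd.to_numeric(df["Score"], errors="coerce")

    records = []
    # Step 1: combine all rows that share an ID into a single record.
    for emp_id, rows in df.groupby("ID", sort=False):
        scored = rows.dropna(subset=["Skill", "Score"]).drop_duplicates("Skill")
        records.append(
            {
                "id": emp_id,
                "name": rows["Name"].iloc[0],
                "email": rows["Email"].iloc[0],
                "skills": dict(zip(scored["Skill"], scored["Score"].tolist())),
                "regions": distinct(rows["Region"]),
                "industries": distinct(rows["Industry"]),
                "specialties": distinct(rows["Specialty"]),
            }
        )
    return records


records = combine_employee_rows(raw)
# Step 3: convert to JSON, one document per employee.
as_json = json.dumps(records, indent=2)
```

Keeping each employee as one nested document is what makes the result a natural fit for a document store, at the cost of doing the reshaping up front.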

It took me a week just to clean the data types, and this was only the first step in the project: uploading the data to a Cloudant NoSQL database. One might ask why I used JSON and NoSQL when I could have used a table format and an SQL database. There are two reasons: first, I am more comfortable working with NoSQL, and second, I was doing an experiment.

Then came the challenge of querying the data once I received a query specifying the requested combination of region, skills, industries, and specialty. Structuring the data correctly for a query was a challenge that took around 3 days to address; if it weren’t for the Pandas library, it might have taken a week or two. Funnily enough, building the structure of the web app, the login, and the user interface took only around 2 or 3 days in total.
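
To give a sense of what such a query looks like, here is a small sketch in Pandas. It assumes each employee has already been combined into one record with list-valued fields; the sample data, the field names, and the `find_employees` function with its `min_score` parameter are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical employee records, one per person, with list-valued fields.
employees = pd.DataFrame(
    [
        {"name": "Ann", "skills": {"Python": 4, "SQL": 3},
         "regions": ["EMEA", "APAC"], "industries": ["Banking"],
         "specialties": ["Analytics"]},
        {"name": "Bob", "skills": {"Python": 5},
         "regions": ["EMEA"], "industries": ["Retail"],
         "specialties": ["Analytics"]},
        {"name": "Cara", "skills": {"Java": 4},
         "regions": ["EMEA"], "industries": ["Banking"],
         "specialties": ["Security"]},
    ]
)


def find_employees(df, region=None, skills=(), industry=None,
                   specialty=None, min_score=0):
    """Return the rows matching every criterion that was supplied."""
    mask = pd.Series(True, index=df.index)
    if region is not None:
        mask &= df["regions"].apply(lambda values: region in values)
    if industry is not None:
        mask &= df["industries"].apply(lambda values: industry in values)
    if specialty is not None:
        mask &= df["specialties"].apply(lambda values: specialty in values)
    # Every requested skill must be present with at least min_score.
    for skill in skills:
        mask &= df["skills"].apply(
            lambda scores, skill=skill: skill in scores
            and scores[skill] >= min_score
        )
    return df[mask]


python_in_emea = find_employees(
    employees, region="EMEA", skills=["Python"], min_score=5
)
banking_people = find_employees(employees, industry="Banking")
```

Each criterion narrows a boolean mask, so any combination of region, skills, industry, and specialty can be requested without writing a separate code path for each.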

This little experiment of mine shows a very important fact about data science and analytics:

“80% of the time is spent cleaning the data”

I spent around 10 days cleaning and preparing the data, and just 4 days querying it and building the web app. Luckily, I was doing everything in Python, which provides a great set of tools and libraries for data science. My choice of database was not the best for this application, but real-life situations are rarely tidy: you almost always have to restructure, reformat, and reorganize the data.

Written by Aoun