Excel transposing adventures

Introduction

I deal with data all the time. One of my pet peeves is Excel spreadsheets in which the person responsible for creating it has built it in such a way that you cannot pivot the information without restructuring the content. So to remedy this situation I created a quick and dirty tool to perform a specific task (transposing CSV files).

The tool is named Hydra, as every time I see a spreadsheet like this, I immediately freeze while my mind drifts off to consider how much time I will need to waste reformatting the information. Honestly, this task is really dull and I wanted a simple tool to do this, so I did not have to do it over and over again.

The problem

To express the problem, I am commonly faced with excel files that have been created to record large volumes of data. The issue is that due to the format selected to store the data in, it often makes it difficult to pivot, without first performing some Jitsu (data wrangling) on the data. For example, consider the following dataset

Surname First Name Discipline 28-May-12 04-Jun-12 11-Jun-12
Allan Catherine Manufacturing 7 7 7
Babbcock Neil Design 2 2 2
Cook David Engineering 5 3 4
Dumas Sue IT 7 2 7
Marks Jan Manufacturing 1 3 2
Polly Paul Manufacturing 0 3 7
Solmon Christopher IT 5 0 0
Thatch Andy Engineering 2 5 0
Whedon Kelly Design 5 7 2

While the information presented make logical sense, in my case, this format is problematic. Ideally, the last three date columns should be consolidated to give just a single date, enabling me to run a quick pivot on results, with a new structure similar to the revised example table below.

Surname First Name Discipline Date Hours
Allan Catherine Manufacturing 28-May-12 7
Allan Catherine Manufacturing 04-Jun-12 7
Allan Catherine Manufacturing 11-Jun-12 7
Babbcock Neil Design 28-May-12 2

However to achieve this outcome is a lot of cutting and pasting effort. The key difficulty here is I want to keep some of the original data and make a new consolidated column. The new column reflects the original data but in a more accessible format.

The solution

Thinking about this problem and the frequency with which I am faced with this task, lead me to the conclusion I should invest some effort on writing some quick code that could automate this process. I selected Python as it is something that leads itself to quick programming and has routines that are very good at handling data.

The Python code essentially does a number of things, but here are the key factors.

  • Read information in from a CSV file
  • Indicate the number of columns in the data read in
  • Determine which columns can remain
  • Perform a quick calculation to establish how many columns will need to be transposed
  • Loop through the data, taking each column to transpose and appending this as a new record (complete with static columns)
  • Output a transposed CSV file

Lazy nirvana achieved. The code is very much beta, meaning at the moment I do not see any need to industrialise it. However, it works and means I no longer have to turn to stone when I see a large data set in Excel.

Hydra on Github

 

11 comments

  1. You’ve got to get on PowerQuery STAT! If you have Excel 2013 it’s a download and it’s fully built into the application in 2016. This transposing of data is now a built in function so no more use of external code. Quite amazing – though kudos on using Python with Excel I see some great advancements coming up as Microsoft starts allowing other languages outside of VBA.

    Liked by 1 person

    • Hi Alex, thanks for the comments.

      I actually have PowerQuery on my laptop and agree its a great tool. I did not know it was built into 2016, this is a positive move by Microsoft as these kinds of add-ins are incredibly useful.

      The reason I use Python in the example in preference to Excel is that I have tried to automate my work as much as possible (outside of Microsoft tools). What I tend to have are multiple inputs (PDF, CSV, etc) that need to be wrangled often in a variety of ways. Rather than keep on explaining an often convoluted process to the team, I thought why not just write a piece of software. I find Python is great for quick and dirty programming. I do not tend to fire up Excel as much to get things done these days. However it has not reduced my love of Excel.

      Like

      • Completely understand – working for a major corporation I always find I have to color within the lines due to others software restrictions – for instance we use 2013 and Power Query is an add-in that requires management approval, an IT packet for install… blah blah blah… so automation is my go to as well since I can’t even touch some of the great BI services that are now becoming available. Sadly we don’t have python due to such restrictions so a lot of VBA and VBScript to automate our work environment. I plan on learning more about pyXL here on my personal time though once my schedule slows down a bit more. Thanks for the great article and providing some insight into your work.

        Like

  2. Hi Alex, We do have a bit of latitude at work when it comes to software (although not that much). Python is actually a really fun language to learn, so I would definitely recommend it.

    Thanks for the kind words on the article. I am just working on the Android update for the Udacity Nanodegree and I hope to finish that sometime this week. This is something else that is really interesting and easy to get into.

    Like

  3. First off I would like to say great blog! I had a quick
    question in which I’d like to ask if you do not mind.
    I was curious to know how you center yourself and clear your head
    prior to writing. I’ve had a hard time clearing my mind in getting my thoughts out.
    I do enjoy writing but it just seems like the first 10 to 15 minutes are wasted simply just trying to figure out how to begin.
    Any recommendations or hints? Kudos!

    Like

Leave a comment