Thursday, March 9, 2023

What the ETL....

We have a problem...

As part of our HR dashboard development we are putting together a time-to-fill (TTF) dashboard.  Time to fill is a measure of how long it takes to fill a position.  For our purposes, we are breaking it down to help identify where there may be bottlenecks in our process.  Specifically we are looking at the following intervals:

  • Requisition Receipt in HR to Posting
    • This is the date that the requisition is received in HR.  This may have been through a formal approval process or is not required to go through the formal approval process.  Either flow requires a requisition to be completed.  The requisition provides HR with the pertinent information required to be able to start the recruiting process.  This is measuring the time in days between when HR receives the requisition in the ERP until it is posted in our applicant tracking system (ATS).
  • Posting to Candidate Selection
    • The starting point of this measurement is when the recruiter enters the data received in the previous step and ends when the recruiter sets the status to candidate selection to indicate that the requesting department is screening and interviewing candidates for the position.  Some positions may take longer due to factors such as availability of talent in the market or internal posting requirements.
  • Candidate Selection to Offer
    • Candidate selection includes any resume screening, phone screens, interviews, etc... this process continues until there is an accepted offer.  The offer must be accepted because if we stopped this stage when an offer is made, someone may not take the offer, resulting in days or weeks being added to this step in the process.
  • Offer to Start
    • This phase remains largely outside of the organization's control.  While the organization can establish time limits on tasks such as drug testing, much of the process is within the candidate's hands.  Some candidates might get their paperwork completed and documentation submitted quickly while others will move much slower.  Some candidate want to give their current employer a longer notice period than others who may not need to give notice.  If the orientation session is full, we may be waiting an additional week for the next orientation session.
  • Time-To-Fill (TTF) is calculated from when HR receives the requisition until the offer letter is signed.
  • Time-To-Start (TTS) is calculated from when the HR receives the requisition until the employee reports to orientation.

This is great but... what is the problem?

We have a few factors working against us here.

  • Data is an afterthought.  Not only was what we wanted to calculate an afterthought, but so was what we would consider in the calculation, where the data is located, how it is stored, and how do we get it someplace where we can work with it.  In talking with my colleagues in our Data Professionals group, this is a common scenario.
  • Pooling of positions helps save the recruiting team time, but is makes the data messy.  Pooling is the process of combining several jobs or requisitions under a single posting.  This is great for recruiting because it makes the process of posting a position much quicker as they don't need to do as much work for the pooled position.  It is also nicer for applicants because they don't have to see that we might have 20 positions posted for a single title.  It is a great process and I highly support it, but it means that we have to do some extra work with our data.
  • Lack of consistency in our input.  Just because we use a number, doesn't mean it is is a number.  In an of itself, this is fine.  Unless we need to calculate a value, it is best to consider using a string rather than a number.  Strings made of numbers are great identifiers.  It is okay if we consistently use a letter to identify that a job is posted to an agency, but people aren't always consistent resulting in that agency identifier to be missing and our posting number then looks like any other posting number that isn't posted to an agency.  Considering that the field isn't a numeric field, recruiters sometimes decide that it would be helpful to have a note in the field rather than the requisition number.  Oh people, I love them.
  • Some of our data is in this system and some of it is in that system and we didn't pay to have the systems talk.  Of course, the API documentation doesn't indicate that all the data that we need is available in the API, either.  The report writer in the ATS appears to have a lot of data available that isn't available in the API.

Time for the duct tape an bubble gum, what the ETL!

Not getting the data from one system to another is not an option.  We need to figure out how to Extract, Transform, and Load (ETL) the data.  The data between the two systems are not compatible in their basic forms and need to be massaged.  This massaging process includes extracting it from the ATS, transforming it to be compatible with the data in our ERP, and then loading it into the ERP so that we can access it in Tableau.  We're still working through this process but are nearly there and I love sharing the challenges and learnings that I have had in the process.

Extraction 

Before we could even extract the data, we had to define what it is that we want.  As good as the report writer is, the way that the data is organized isn't particularly intuitive and the ATS documentation doesn't dig very deep into what data is stored where or what the data means.  As we are extracting a report that we are building, there was a lot of back and forth looking at the postings and the available attributes in the report writer.  After a lot of moving fields, verifying they are what I think they are, and more verification, we arrived at a draft report for export.  Of course, after exporting it, I realize that there was way too much data to run it through PowerQuery in Excel - so back to the report to set up some filters to limit the data and then back to Excel.  After a few of these back and forths, I was able to generate a report that was small enough to be reasonably transformed in PowerQuery in Excel.

Transformation

What can I say, the data is messy - much more messy than I anticipated.  I know enough about PowerQuery to be dangerous and I set off to make the data useful.  The biggest problem that I face is that the requisition system in our ERP generates a requisition number that is not a direct match for what is in our ATS.  The ERP's number is system generated and consistent, the ATS's is entered with human hands.  As previously mentioned, sometimes we have a number, sometimes we don't and sometimes we have additional characters or notes where the requisition identifier should be.  Let's take these one at a time.

  1. When we don't have a number for our individual position and in some cases we have a 1, I know the number is the posting number.  While consistency in procedure would be nice, we live in the real world and people are not as consistent as computers.  A simple formula helps us with this.  Creating a new column called FinalReqNo, we check to see if position number is null or 1, if it is, we copy the posting number to the new column, if not, we copy the existing value to the new column.
  2. When we have a number, it is usually straight forward and we should be able to use that number.  It should be a five digit number greater than 60000.  I hate to drop too many lines, but we have a lot of requisitions to have what I hope to be a statistically representative sample size.  To make it easy, I split the column at five characters.  I want to keep this data so that we can go back and look at what is being cut off and inform the recruiters about the records that we need to fix.
  3. At this point, running the steps in this order, we should have stripped out any text.  This lets me continue with the process and, if we get errors loading the data into Excel, I go back to see what the errors are and either ask the recruiters to fix the source data or tweak the formula to handle the errors.

Beyond this, I have procedures to reorder the columns and set data types.  Fairly simple tasks that help me when staring for this data for hours at a time.

Load

Finally, we can load the data.  If all goes well, the data will load without errors.  If we have errors, as mentioned above, I examine the errors and request the source data to be fixed or adjust the transformation steps to accommodate the errors.  At this point, I repeat the process with out ERP data but the ERP data is our source of truth when it comes to the requisition numbers so the work not so intensive and the process is largely changing some formatting.  I still run it through PowerQuery to ensure consistency in the data as I am repeating this process manually for the time being.

For now, I am merging the data in Excel using Excel's Data Model feature and creating a pivot table.  So far this is working but this is not the end game.  The end game is getting the data into Tableau with automated processes.  This ETL process serves as a blueprint for the development team.  As an analyst, I try to make "code-ready" projects to turn over to our developers.  This seems to make them happy as it is not what they usually get, but if they can develop my stuff quickly, it is likely that I can get priority in the development queue because my projects are quicker to develop.

Next steps

Unfortunately, our ATS is lacking in some desired functionality so we are stuck with duct tape, bubble gum, and bailing wire to hold things together.  While we would love to have an API or SFTP to be able to transfer this data, we are not so lucky and have to rely on the ATS delivering via an email schedule.  Once we get this process rolling and are happy with our results, we may have the ATS provider write us a query but that costs quite a bit of money and we don't want to pull the trigger on such an expense until we are happy with the data in our dashboard.

The email delivery also posed a problem because I had a table calculation, so I am rewriting some of the process to happen in a Tableau data source vs the ATS.  I am happy, however, to get to the point that we are at.  This will allow us to finish the development of our first three scorecards that focus on staffing.



No comments:

Post a Comment

Followers