6 steps to build a predictive model
You may be farther along than you think
April 7, 2022, By James Cousins, Senior Strategic Leader, Data and Analytics
If you’re planning to use predictive analytics for admissions, enrollment, student success, or any other institutional initiatives, you’ve likely come across lengthy lists of what you need to do to prepare. But these rarely give you credit for what you’re already doing with your data. It may therefore come as good news that if you use data in any capacity at all—extracts from your SIS or LMS, internal or external reporting, or even ad hoc data requests—you’re already “accidentally” preparing to build a model.
If your institution wants to build predictive models, you only need to add a few extra steps to what you are already doing. Consider these common steps required for predictive modeling:
1. Collect data relevant to your target of analysis
2. Organize data into a single dataset
3. Clean your data to avoid a misleading model
4. Create new, useful variables to understand your records
5. Choose a methodology/algorithm
6. Build the model
In my direct experience across dozens of institutions, any institution with even a basic use of data is already accomplishing at least the first two steps. Furthermore, many institutions are doing the lion’s share of the third step (cleaning data) and the fourth (creating new variables). Most newcomers to predictive modeling only need to tackle steps 5 and 6, which are easier to achieve than you might think.
Steps you’re definitely already doing
1. Collect data relevant to your target of analysis
In today’s world, you almost can’t help but have data on your topic of interest. Student registration patterns? Collection happens when the student registers. Application data? It’s in the system that Admissions uses during admissions season. Even when considering fringe data, or newly introduced data systems (such as mobile apps and event registration platforms), you are collecting data. I could go on, but the point is that you don’t have to “do” it—it happens.
It’s worth acknowledging that the accessibility of the collected data may remain a challenge. For instance, local IT resources may gate your student database, or the vendor may host it on a remote, secure server. You can sometimes alleviate these barriers through partnerships, proof-of-concept projects, or technology. For now, rest assured, that’s one step of six (about 17% of the list) done!
2. Organize data into a single dataset
Organizing huge swaths of disparate data can be a complex, time-consuming element of the overall project. Therefore, it behooves you to focus on a core set of variables for initial passes. If you have the luxury of time, adding supplemental fields can help immensely—but you may be surprised to discover how much a model can tell you based on only the basic information already synthesized in your information systems.
Student information systems and data warehouses like EAB’s Edify frequently store information on prospective student enrollment, student success, or on-time completion in the same table or just a few different tables. What’s more—and this is the real a-ha moment—you’re probably already merging many of those tables and extracts to accomplish your required reporting and answer ad-hoc questions. In other words, you’re probably completing this step for other reasons anyway.
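To make the merging step concrete, here is a minimal sketch using only the Python standard library. The three extracts, their field names, and the `merge_on_key` helper are all hypothetical; at your institution this work is more likely done in SQL, a data warehouse, or a reporting tool. The idea is simply to end up with one row per student.

```python
# Hypothetical extracts: in practice these would come from SIS or LMS exports.
admissions = [
    {"student_id": "S001", "hs_gpa": 3.4, "applied_early": True},
    {"student_id": "S002", "hs_gpa": 2.9, "applied_early": False},
    {"student_id": "S003", "hs_gpa": 3.8, "applied_early": True},
]
registration = [
    {"student_id": "S001", "first_term_credits": 15},
    {"student_id": "S003", "first_term_credits": 12},
]
outcomes = [
    {"student_id": "S001", "retained": 1},
    {"student_id": "S002", "retained": 0},
    {"student_id": "S003", "retained": 1},
]


def merge_on_key(base, *others, key="student_id"):
    """Left-join each table in `others` onto `base`, keeping one row per key."""
    merged = {row[key]: dict(row) for row in base}
    for table in others:
        columns = {c for row in table for c in row if c != key}
        lookup = {row[key]: row for row in table}
        for k, row in merged.items():
            match = lookup.get(k, {})
            for c in columns:
                row[c] = match.get(c)  # None marks a value missing from that extract
    return list(merged.values())


dataset = merge_on_key(admissions, registration, outcomes)
for row in dataset:
    print(row)
```

Students who are missing from an extract keep their row and receive an empty value rather than being dropped, so the gaps stay visible for the cleaning step.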
Steps you’re probably already doing
3. Clean your data to avoid a misleading model
Your institution is likely cleaning data to some extent, but modeling may introduce a need for wider-reaching data cleaning. The fields directly included in reports and dashboards across campus are likely to be in good shape (which is to say there are no unexpected missing values, labels are correctly coded, and so forth). However, predictive modeling is generally a search for predictors of outcomes that you might not have known about. Thus, you might need to start exploring fields you don’t report on, and those might not already be part of an existing cleaning process.
Imagine you have a hypothesis that digital and campus transaction data (e.g., dining hall usage or bookstore purchases) is predictive of successful student outcomes. The data might need cleaning and formatting before it can be linked easily to students’ other records.
Newcomers to predictive modeling should still take heart, though. Data preparation may be 80% of the work in a modeling project, but much of the data you’ll be relying on is cleaned for other end-uses, and what remains is only a fraction of the total.
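As an illustration of the cleaning that a new field such as dining hall usage might need, here is a small sketch in plain Python. The records, the field names, and the `clean_swipes` helper are invented for this example, and the rules it applies (trim and uppercase IDs, treat non-numeric counts as missing, drop unlinkable and duplicate rows) are assumptions that you would adapt to your own data.

```python
# Hypothetical dining-hall swipe records with typical problems:
# inconsistent ID formatting, a missing ID, and an exact duplicate.
raw_swipes = [
    {"student_id": " s001 ", "swipes": "12"},
    {"student_id": "S002", "swipes": ""},
    {"student_id": "S002", "swipes": ""},
    {"student_id": None, "swipes": "4"},
    {"student_id": "s003", "swipes": "n/a"},
]


def clean_swipes(rows):
    """Standardize IDs, coerce counts to integers, and drop unusable rows."""
    cleaned, seen = [], set()
    for row in rows:
        raw_id = row.get("student_id")
        if not raw_id or not raw_id.strip():
            continue  # a record that cannot be linked to a student is unusable
        student_id = raw_id.strip().upper()
        text = (row.get("swipes") or "").strip()
        swipes = int(text) if text.isdigit() else None  # keep gaps explicit
        record = (student_id, swipes)
        if record in seen:
            continue  # skip exact duplicates
        seen.add(record)
        cleaned.append({"student_id": student_id, "swipes": swipes})
    return cleaned


clean = clean_swipes(raw_swipes)
missing = sum(1 for r in clean if r["swipes"] is None)
print(clean)
print(f"{missing} of {len(clean)} rows still have a missing swipe count")
```

Counting what is still missing after cleaning is a useful habit: it tells you whether a field is complete enough to be worth including in a first model.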
4. Create new, useful variables to understand your records
My claim that you’re probably already creating new variables is based on institutions’ perpetual efforts to create and refine useful reports. One example might be a flag for a student taking a lab-science course (or any other required course). I once worked at an institution that required all students to take at least two lab-science courses, which were historically the most constrained for capacity. To ensure that students weren’t working themselves into a bottleneck where they couldn’t find an open lab-science course, we created a new variable—a flag for any such course—and tracked it. That helped us to recognize which students still needed their lab-science courses and encourage them to fit them in.
Here’s the turn, though—that same flag is a stellar candidate for a predictive model targeting retention or on-time completion! The lab-science flag is very specific, but you (or someone else at your institution) may have your own custom creations to bring into a model.
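A flag like the lab-science one takes only a few lines to derive. The sketch below is hypothetical: the enrollment records, the course prefixes treated as lab sciences, and the `lab_science_features` helper are assumptions for illustration, and the two-course requirement mirrors the institution described above.

```python
# Hypothetical enrollment records; the course prefixes counted as lab
# sciences are an assumption for this sketch.
LAB_SCIENCE_PREFIXES = ("BIOL", "CHEM", "PHYS")
REQUIRED_LAB_COURSES = 2

enrollments = [
    {"student_id": "S001", "course": "BIOL101"},
    {"student_id": "S001", "course": "CHEM110"},
    {"student_id": "S001", "course": "ENGL105"},
    {"student_id": "S002", "course": "HIST201"},
    {"student_id": "S003", "course": "PHYS120"},
]


def lab_science_features(rows):
    """Derive per-student lab-science variables from course enrollments."""
    counts = {}
    for row in rows:
        is_lab = row["course"].startswith(LAB_SCIENCE_PREFIXES)
        sid = row["student_id"]
        counts[sid] = counts.get(sid, 0) + int(is_lab)
    return {
        sid: {
            "lab_courses_taken": n,
            "has_taken_lab": int(n > 0),
            "lab_requirement_met": int(n >= REQUIRED_LAB_COURSES),
        }
        for sid, n in counts.items()
    }


features = lab_science_features(enrollments)
for sid, values in features.items():
    print(sid, values)
```

The same derived columns serve both purposes: the report that identifies students who still need a lab course, and a candidate predictor for a retention or completion model.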
Another redeeming quality about this step? It can be genuinely fun. Thinking of new variables to predict a critical outcome is a creative process. Data analysis involves plenty of mechanical, objective tasks—subjective, creative, and contextual problems like “What else might help us understand this outcome?” are gems.
If you don’t have time for this step in your first pass at modeling, don’t worry: you can build a model without many new variables and revisit this step in later iterations.
Steps you’re not already doing
5. Choose a methodology/algorithm
This step is the most exciting to write about. There is a wide, wide world to explore once you start learning more about predictive model algorithms. At the same time, it can be surprisingly easy to enter this phase because there are droves of resources available.
While there’s no one best source for all people to learn any given concept, I recommend that you start by searching through publications, forums, and news sources specific to your use case (AIR Forums, NACAC, NASPA, and other consortia specific to your focus). It’s tempting to start with statistics-first, use-case-second-style sources, but that may leave you inundated with information irrelevant to your intended usage. When you find use cases and methodologies in peer-reviewed venues, you can be reasonably confident that the methodology or algorithm has already been vetted on a problem like yours.
From there, you can explore less use-specific sources for knowledge, like YouTube compilation videos, StackOverflow, and data science blogs.
6. Build the model
Everyone who builds predictive models today uses an application to do it, whether it’s open-source, licensed software, or a homegrown tool. So, when you hear about advanced algorithms or read blog posts that reference dozens of steps, don’t assume that you will need to perform them manually. Tools have been the single most influential enabler of predictive modeling in recent years. The rapid development of statistical software means there is now an application suited to nearly any user. Among the many options, time-tested solutions exist, and using one with a track record can alleviate concerns you may have about modeling without an extensive background.
For that matter, while a background in predictive modeling will naturally benefit you when you’re building a predictive model, data analysis and its professionals are distinctly collaborative. Accomplished modelers are everywhere across the internet sharing their stories, caveats, and best practices. Even cursory searches for “how to” resources return a surprising variety of use cases, so there’s a great chance that you’ll find a resource that runs parallel to your needs.
Predictive model building may be net-new work, but it is within reach. I’ve had the fortune to support the implementation of Rapid Insight software in offices that made it abundantly clear how unfamiliar the practice of predictive modeling was to them. All the same, in mathematics, we stand on the shoulders of giants. The statistical theory behind predictive modeling is now (in many ways) automated through software, leaving it more accessible than ever before.
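To give a sense of how little is left once the data is prepared, here is a toy sketch of model building. It fits a logistic regression, one common choice for yes/no outcomes such as retention, using plain Python and gradient descent. Everything in it is an assumption for illustration: the records are made up, the two predictors are hypothetical, and in practice a statistical package or modeling application would handle the fitting (and the validation this sketch omits) for you.

```python
import math

# Made-up training data: (high-school GPA, took a lab science in year one,
# retained to year two). Real models need far more records than this.
records = [
    (2.1, 0, 0), (2.4, 0, 0), (2.6, 1, 0), (2.8, 0, 1),
    (2.9, 1, 1), (3.1, 0, 0), (3.3, 1, 1), (3.5, 0, 1),
    (3.6, 1, 1), (3.9, 1, 1), (2.5, 0, 0), (3.0, 1, 1),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, learning_rate=0.1, epochs=5000):
    """Fit weights [intercept, gpa, lab_flag] by batch gradient descent."""
    weights = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(epochs):
        gradient = [0.0, 0.0, 0.0]
        for gpa, lab, retained in rows:
            features = (1.0, gpa, float(lab))
            predicted = sigmoid(sum(w * x for w, x in zip(weights, features)))
            error = predicted - retained
            for i, x in enumerate(features):
                gradient[i] += error * x / n
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


def predict(weights, gpa, lab):
    """Return the modeled probability of retention for one student."""
    return sigmoid(weights[0] + weights[1] * gpa + weights[2] * lab)


weights = fit_logistic(records)
print("weights (intercept, gpa, lab flag):", [round(w, 2) for w in weights])
print("P(retained | GPA 3.8, lab):", round(predict(weights, 3.8, 1), 2))
print("P(retained | GPA 2.3, lab):", round(predict(weights, 2.3, 1), 2))
```

The fitting loop is the part that software automates. The inputs it consumes are exactly the merged, cleaned, and augmented fields from steps 1 through 4.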
In summary
Yes, predictive modeling involves a few steps you aren’t taking yet. However, the idea that you need to start from square one is a misconception. Predictive modeling is not the process of collecting, cleaning, organizing, or augmenting data. Instead, it is the process of analyzing data. That means that the data you have on hand right now is more ready than you might think for predictive modeling.
You can always find improvements by refining your data cleaning process, or the variety of fields you create to enhance your data. However, I hope that you take this away: to “get started” with predictive modeling, you need only slightly expand on the work you’ve already done.