Episode 11: Dr. Steve Van Tuyl

KL: Katie Linder
SVT: Steve Van Tuyl

KL: You’re listening to Research in Action: episode eleven.

[intro music]

Segment 1:

KL: Welcome to Research in Action, a weekly podcast where you can hear about topics and issues related to research in higher education from experts across a range of disciplines. I’m your host, Dr. Katie Linder, director of research at Oregon State University Ecampus.


On today’s episode, I’m joined by Steve Van Tuyl, the Digital Repository Librarian at Oregon State University, where he manages the university’s institutional repository, ScholarsArchive@OSU, and participates in providing research data services to students and faculty. Prior to his work at OSU, Steve was a Data Services Librarian at Carnegie Mellon University and a Reference Librarian at the University of Pittsburgh. In a previous life, Steve was a Biologist with the USDA Forest Service, conducting research on disturbance impacts on forest carbon cycling.

Welcome to the podcast, Steve.

SVT: Thanks.

KL: I’m so glad you’re here to join me. I met you because I went to a workshop you gave on data management. First of all, I’d love to start by talking about data management and what it is, because it seems kind of like a mythical creature. We have so much data now, how do we manage it? Is it even possible? But I love this idea of data management. Tell me more. What does it mean?

SVT: Data management is kind of a broad set of activities researchers take on when they start doing research. Really, anybody who’s doing research is doing data management, because you have to do something with your data. Data management involves everything from identifying what data you have in a research project, to who’s responsible for the data at the various stages of the project, to how you document and share your data with other researchers, either in your research group or researchers who are interested in acquiring your data for other purposes. That’s kind of the broad definition of data management.

We usually try to tell people what we don’t mean by data management, also. So we’re not talking about things like analytics (how to analyze data), we’re not talking about database management (which is a very different creature), and we’re also not talking about how to design a research project. Really, those types of things are what researchers know how to do well for their own research domain, and data management, more broadly, is kind of a set of tools and activities you have to do no matter what project you’re working on.

KL: So it seems like data management, to some degree, has gotten a little more complicated because we have several people that could be involved in helping you try to think about the boundaries around your data management. I think immediately of IRB and how now you may have to include a data management section in your IRB proposal, whereas maybe that wasn’t something that was there before. Or if you have grant money, there may be some requirements around data management or data sharing. Is this something that someone in your position would help researchers navigate, all these different kinds of regulations that might be placed on data from a research project?

SVT: Yeah, typically the work that we do with researchers when we start talking to them about research data management is to try to cover all of those externalities first, so to try to understand where they’re seeking grant funding from, or where their grant funding already comes from, and whether they have IRB-related issues or other research-ethics-related issues that they need to be dealing with. Then, very quickly, we can either identify what a funding agency is requiring them to do or not do, or point them directly to IRB and say, “IRB knows way more about IRB stuff than we do. That’s why they’re the IRB and we are this other set of services.” But understanding that landscape of what a researcher is doing helps us provide them with better guidance than we might have been able to if we didn’t understand that landscape ahead of time.

In some ways, I don’t think it’s actually more complicated. I think that the data management landscape has become more regulated, but that has made things a little bit easier in some ways. Because at least you know, as a researcher, what’s expected of you, whereas 25-30 years ago, if you told someone to write a data management plan, it might have been hard at that time to just come up with something, because you wouldn’t know what the elements of that plan might be.

KL: One of the things I’d love to talk about is how researchers can set themselves up for effective data management. When I attended the workshop you gave here at Oregon State, you talked about granular things like how you label your files, which, I’ll admit, is the kind of thing that fascinates me. I’m the person who wants the organizational structure that is going to be the most efficient. But you also talked about things like effective data storage and backup, which I also think is crucial, especially for researchers who have datasets that cannot be replicated and that really need to be stored and backed up in a reliable way.

I’m wondering if we can just chat a little bit about, what are some of the most important ways that you think researchers can set themselves up to be really effective with data management?

SVT: One of the things that I would recommend, and you mentioned this a little bit earlier, is to give your storage and backup solution some real attention: spend some time thinking about what you have in place for storage and backup. That’s the kind of thing that can very easily get lost in the fray of doing your research. If, ahead of time or periodically, you step back and look at what you have in place for storage and backup, that can save you a lot of headaches down the road. We have a whole list of data loss disaster scenarios.

KL: I think everybody’s heard some horror stories from somebody about data that’s been lost.

SVT: And as soon as it happens to you, and it has happened to me a couple of times, it suddenly becomes real. It’s kind of like how you pay your insurance company for insurance and it doesn’t really make a lot of sense until suddenly it does. So I would say one concrete thing to do is to be very intentional about your storage and backup solution.
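To make that intentionality concrete, here is a minimal sketch of a scripted, date-stamped backup written in Python. The directory paths are hypothetical placeholders, and a real setup would follow whatever storage and backup guidance your institution or research group provides:

```python
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical paths: a working data directory and a backup location on a
# separate drive or network share. Both are illustrative assumptions.
DATA_DIR = Path("~/research/project_x/data").expanduser()
BACKUP_ROOT = Path("/Volumes/backup_drive/project_x")

def make_dated_backup(data_dir: Path, backup_root: Path) -> Path:
    """Copy the working data directory into a date-stamped backup folder."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    dest = backup_root / f"data_{stamp}"
    shutil.copytree(data_dir, dest)
    return dest

if __name__ == "__main__":
    print(f"Backed up to {make_dated_backup(DATA_DIR, BACKUP_ROOT)}")
```

The same idea applies whether you use a small script like this, a sync tool, or an institutional backup service; the point is to choose a solution deliberately rather than letting it get lost in the fray.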

A second thing that I would recommend is to think about documenting your data from the beginning, so thinking about… for somebody who may want to use your data in the future, that might be you. Actually, in most cases that’s the researcher themselves. Or a graduate student that you bring on, or a colleague that you’re sharing your data with. That documentation is going to be so critical to their ability to use the data and understand what it is. In more of a practical way, you spend less time explaining things to people over and over again when it comes to understanding your data.

KL: I’ve heard this term bandied about quite a bit: “metadata.” Is that what you’re referencing here? If so, could you tell us a little bit more about what it is?

SVT: Yeah, I guess it’s metadata. Metadata is structured information that helps you understand what a dataset is or what a “thing” is, and how it might be used, and where it comes from, and all these different types of relationships that digital objects have, essentially. Typically, I think, when people think of metadata, it’s this lurking monster. There are lots and lots of metadata standards out there that your research domain may or may not use or recommend. Metadata is oftentimes encoded in something like XML, so it looks very complicated on the surface to users.

So what I try to do is put that idea of metadata away and say that what we really mean when we say “metadata” is that you need to provide sufficient documentation for your data so that somebody can understand what it is and the context it came from. Even an abstract from your project proposal provides enough context that somebody knows why the data exists in the first place, and the metadata should provide some description of how the data elements in your dataset were collected or created, or whatever it is.

Many, many of us have used other people’s data in our research, and those of us who have done that have a pretty good understanding of how difficult it can be to figure out what the data is. Good documentation, while sometimes rare, can go a long way toward helping people know what the data is, and also toward keeping them from misusing the data because they didn’t understand the methods that were used to create or collect it in the first place.
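As a rough illustration of the kind of documentation Steve is describing, here is a minimal sketch that writes a small XML metadata record alongside a dataset using Python’s standard library. The field names and values are made up for illustration and do not follow any particular metadata standard:

```python
import xml.etree.ElementTree as ET

# Illustrative metadata fields; a real project would adapt these to whatever
# its research domain, funder, or repository expects.
record = ET.Element("dataset")
ET.SubElement(record, "title").text = "Example plot-level field measurements"
ET.SubElement(record, "creator").text = "Jane Researcher"
ET.SubElement(record, "abstract").text = (
    "Short project abstract giving the context for why the data exists."
)
ET.SubElement(record, "methods").text = (
    "Description of how each data element was collected or created, "
    "including instruments, units, and sampling design."
)
ET.SubElement(record, "contact").text = "jane.researcher@example.edu"

# Write the record next to the data so future users (including you) can
# understand what the dataset is and where it came from.
ET.ElementTree(record).write(
    "dataset_metadata.xml", encoding="utf-8", xml_declaration=True
)
```

A plain-text README with the same information serves the same purpose; the point is that the context and methods travel with the data.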

KL: Absolutely. Sometimes when I think about metadata, I think on a much smaller scale about things like citations, and this is why we care about citations being correct because you need to be able to go back and find out what this was about, if someone’s referencing it or using it as an example in an article or in a research study. I think metadata is clearly more than that. It’s more detailed, there’s a context involved, like you were saying. But I feel like it’s one of those areas of research that could be kind of a pain, like citations can be kind of a pain, but getting them right is still critical to communicating your research and making sure your research, or your data, gets used by other people in the ways that are going to be most effective to furthering the field.

Well, now that we have a nice strong foundation on what data management is, what it looks like, and what some of the components are, let’s turn to data management plans, which can be pretty complex and have lots of moving parts. We’re going to take a short break, and then come back and talk about those in segment 2.

Segment 2:

KL: So Steve, data management plans. I’m seeing them everywhere on various applications I’m working on. I am seeing them requested by places like IRB, but also clearly grants and some other kinds of things I’ve been looking at recently. Can we start by just saying, “What is it?” and maybe, “When did it start becoming a thing to have a data management plan?”

SVT: A data management plan is usually a broad description of what data is going to be collected and what you intend to do with the data to meet the requirements of some “asker” (an entity asking for a data management plan). That entity may be you asking yourself what you’re going to do with your data.

KL: One of the things that I’m curious about with data management plans is, “What is their origin?” Some researchers might say, “You know, I know how to handle my data. Why do I have to give this data management plan? Is it because there’s so much data?” Where is this coming from, these askers who are asking for these data management plans? What’s the motivation behind that?

SVT: One way to think about the motivation and where this is coming from is return on investment for grant dollars. The specific dates always escape me, but right around the turn of the 21st century, NIH started asking for data sharing plans, I believe they were called (or are called), for very large grants, so $500,000 or more. In some ways, that may be one of the origins of the data management plan. I think we were seeing some requests for data management planning before that, but for very specific research domains or for grant opportunities aimed at particular types of research.

Even going back to the 1960s we saw NASA talking about data management as an important thing because they realized they were collecting all of this data and started offering guidance on what needed to be done to make sure this data was usable into the future.

More recently, though, moving forward in time again, probably about five years ago NSF started requiring a data management plan for basically every proposal that came in for an NSF grant (although there are some exceptions, I think). What those data management plans look like is kind of what we’re starting to think of as a normal data management plan for a grant: a two-ish page document that lays out these different components of data management and says what you intend to do as far as data management is concerned.

In 2013, the White House Office of Science and Technology Policy took what NSF did and said, “We intend that every agency with over a certain amount of R&D money is going to require this type of data management plan, also.” There were some other elements of that mandate that came out around data sharing and publication sharing, but a data management plan was a big component of what they were asking for.

KL: So this seems like it’s… I mean, it’s definitely a thing now. It doesn’t seem like it’s going to go away anytime soon. Actually, I have to say, for things like this, I kind of appreciate it to some degree. It’s forcing a level of planning and intentionality with data that I welcome, because there are so many different components of research, and you don’t want to lose sight of the impact that long-term planning can have on things like data sharing. But let’s talk a little bit about, “What are the typical things that are included in a data management plan? What do people need to be thinking about as they’re preparing this document?”

SVT: When we think about a data management plan, and especially one of these short-form data management plans that funding agencies are asking for, there are a handful of things that we try to make sure people are including in those plans.

The first thing is, “What data are you going to actually produce for a project?” That kind of sets the groundwork for what needs to happen subsequently.

The second thing is about how those data are going to be handled during the project. Are they going to be backed up and stored securely? How are they going to be passed back and forth from site to site (if you’re in a multi-institutional research project)? Dealing with those operational elements of data handling is also important.