Video Recap: Data Management Landscape Full Transcript

[First slide of data management presentation]

ELLIE: My name is Ellie Dwarack. I am a faculty member in the library, and I’m the research data librarian here at Boise State. My background’s in the humanities and social sciences, and I have an English bachelor’s degree and a master’s in Library and Information Studies. For many years I was a reference librarian helping mostly undergraduate students with doing library research. I’m going to get a glass of water, and then I’ll tell you that thought. So I mainly do qualitative research and mixed methods research, but I get to work with professors and researchers across campus to write data management plans. Who here has heard of a data management plan? Just so I know how much to cover. Cool, okay, so it’s really fun. I’ve helped professors in world languages, education, physics, and chemistry. I get to learn about all sorts of interesting stuff going on.

[Attempting to go to the next slide]

Okay, how does this work… what do I do?

ELLIE: [Went to next slide]

So here’s today’s agenda; I’ll start with an overview of data management and what it is. Then we’ll take a look at the various campus units that have a stake in good data management and what their concerns are. I’m going to share some research I’ve been doing into data management policies at other universities and how they address some of the concerns that I’ll mention, and then I will end with a call to action. So

ELLIE: [Went to next slide]

a basic definition of data management is the organization, storage, preservation, and sharing of information. Who here, based on that definition, thinks they do data management? Yeah, all of you perform data management because it’s just the everyday tasks of doing any job in large part. So naming files and organizing them into folders, managing version control, which has gotten a lot easier with Google Docs, but you all know when you’re working on a project with a lot of people that it gets complicated.

ELLIE: [Went to next slide]

Okay, so before we get into what data management is in a research context, let’s talk about what research data is. So the Code of Federal Regulations defines research data as “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.” So this includes both the original data that a researcher produces and any metadata, such as experimental protocols or software code written for analysis, that are required in order to reproduce the study. So for NSF programs, there is often an educational component that they require for any project, even if it’s, you know, hard science; they want you to be working with graduate students or undergraduates, and anything that’s produced in the course of that educational component is research data. So that might include assessments of students or curriculum materials, that kind of thing, also slides and videos. What is research data not? It’s not preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, emails with other professors and colleagues, or physical objects such as lab specimens. It also doesn’t include trade secrets or personal information whose disclosure would be a clear invasion of privacy, such as information that would allow you to find somebody; it doesn’t take that much information to find somebody, as it turns out.

ELLIE: [Went to next slide]

Okay, so why does this matter? Well, there’s a Dear Colleague Letter that was written in 2022 by the NSF’s Directorate for Mathematical and Physical Sciences, and… the point here is the reason that the definition of research data is so important is that federal funders require data sharing and, to that end, require researchers to submit what’s called a data management plan with funding proposals. So the Dear Colleague Letter says data sets underpinning published research findings are expected to be shared with other researchers. They’re not supposed to charge you a bunch of money for it, and it should be shared within a reasonable time frame, and that’s a shift for researchers. Certain disciplines have been sharing data for a long time; for instance, computer scientists are really good about it. In other cases, it’s a new approach. So the letter goes on to talk about the benefits of data sharing: enabling broader research collaboration, facilitating transparency, solidifying confidence in scientific research, and providing increased resources for teaching and education purposes; and recent studies have found that research articles containing a link to the data have markedly higher usage and visibility.

ELLIE: [Went to next slide]

So here’s a definition of research data management then: “the organization, storage, preservation, and sharing of data collected and used in a research project.” So the organization piece is a lot of those things I just talked about, naming files in meaningful ways, such as including the date and/or a version and what the file is about. A lot of you are probably used to doing this kind of thing. Also, it’s creating documentation of your research, so for instance, protocols and README text files, necessary for other people to be able to understand what you were doing. So storage, as we’ve learned, is about keeping your data safe from unauthorized access and from hardware and software failures. I’ve had faculty members tell me they just keep their data on a flash drive, and that makes me nervous. What if you go through… I don’t know if TSA erases flash drives, but I would be worried about it. We could ask Mark, but he’s not here anymore; oh well. So preserving data isn’t just as easy as saving it. Technology changes quickly. So you have technological obsolescence, and that’s hardware and software. I once had an archivist tell me that when cockroaches rule the Earth, microforms will be the only thing that they can use, right, because they can’t use a computer. All you need to read a microform is a light source and a magnifying glass. Has anybody used a microform ever? Do you know what it is? It’s like a negative or something, kind of. They come in reels, and you have to read them on a machine, but theoretically, you could just use a light source and a magnifying glass. It’s considered the best kind of archiving. It’s horrible to use, and I just always picture cockroaches with a magnifying glass, you know, when I think about it. So for the reason of technological obsolescence, among other things, both the National Science Foundation and the National Institutes of Health require that data be submitted to a qualified data repository.
Data repositories will have staff who are experts in curating your data or your client’s data, and they should have plans to transition the data if they go out of business and a prudent reserve so that they can keep the data for a certain number of years while that transition happens. So preservation makes it possible to share data with other researchers and you know that’s also about the choice of file formats and software, so you know you don’t need to use open-source file formats or software when you’re doing the research, but when you store it and preserve it, curate it, you want it to be available to as many people as possible. Repositories can also assign a digital object identifier to data sets which makes it citable, and researchers get credit for their work. Okay, why does it matter? I should have probably led with that;

ELLIE: [Went to next slide]

my husband told me I should lead with that. It saves time and effort and improves the reproducibility and replicability of research; as we know, we’re having a reproducibility crisis in this country, especially in the social sciences. Yeah, data sharing increases collaboration, supports new discoveries, etc. Also, data management plans, which are documents outlining how you’re going to manage your data, are a required component of proposals to federal funding organizations and many private funders. So researchers need money to do their work, and you know research computing needs money in order to store the data, and so we have to do what the funders say, right? It can save researchers the embarrassment of having a paper retracted. Here are a couple of horror stories I found on Retraction Watch.

ELLIE: [Went to next slide]

The first one, the authors didn’t ask a patient for consent to do a case study, then they couldn’t reach the individual, so they went ahead and published the case study without consent, and the patient saw the article and asked for it to be retracted. So my guess here is that they didn’t put procedures for data management in place at the start of the project. They didn’t think it through. Next one…

ELLIE: [Went to next slide]

Okay, so this one is from the Journal of Psychiatric Research. There weren’t a lot of details, just that the authors requested a retraction because they analyzed the wrong data. Which is crazy, right? So my guess is that they didn’t give enough attention to file naming and version control or organization of their data sets.
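
As a hedged illustration of the file naming and versioning habit described above, the following Python sketch builds file names that carry the date, a short description, and a version number, and then picks the latest version from a list of names. The function names (`build_filename`, `latest_version`) and the `YYYY-MM-DD_project_description_vNN.ext` pattern are assumptions made for this example, not a convention prescribed in the talk.

```python
import re
from datetime import date


def _slug(text: str) -> str:
    """Lowercase the text and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(project: str, description: str, version: int,
                   when: date, ext: str) -> str:
    """Return a name such as 2024-03-15_survey_cleaned-responses_v02.csv.

    The pattern is a hypothetical convention: ISO date first so files sort
    chronologically, then what the file is about, then a zero-padded version.
    """
    return (f"{when.isoformat()}_{_slug(project)}_{_slug(description)}"
            f"_v{version:02d}.{ext.lstrip('.')}")


def latest_version(filenames):
    """Return the name with the highest _vNN suffix, or None if none match."""
    pattern = re.compile(r"_v(\d+)\.[^.]+$")
    versioned = [(int(m.group(1)), name)
                 for name in filenames
                 if (m := pattern.search(name))]
    return max(versioned)[1] if versioned else None


if __name__ == "__main__":
    name = build_filename("Survey", "Cleaned Responses", 2,
                          date(2024, 3, 15), "csv")
    print(name)
    print(latest_version([name, name.replace("_v02", "_v03")]))
```

A team could agree on any pattern; the point of the sketch is only that a consistent, sortable name makes it harder to analyze the wrong file.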

ELLIE: [Went to next slide]

So the CHIPS and Science Act has a section in it titled research reproducibility and replicability that is about funding for the NSF, NIH, and other federal organizations that fund research. The CHIPS and Science Act wants to make sure that research is reproducible and replicable. So they want a machine-actionable data management plan. So typically, in the past, data management plans were one- or two-page documents in paragraph form, like free-form text, right? Machine-actionable documents are structured with XML or JSON, and they allow data and information to be communicated and shared across stakeholders, with metadata linking repositories and institutions. So you won’t be able to read this next one,

ELLIE: [Went to next slide and skipped Data Horror Story slide]

but this is a network diagram showing the connections emanating from the center, the machine-actionable data management plan. So this allows researchers to find collaborators and generate statistics; it supports data management monitoring and helps to integrate workflows across multiple research tools and systems, among other things. Anything machine-actionable is going to be easier, of course, with AI. I don’t know how… well, it’s required by the CHIPS and Science Act, so it doesn’t really matter whether or not we need it.

ELLIE: [Went to next slide]

Okay, the Research Data Alliance has created a JSON standard for machine-actionable data management plans, and this image shows the top-level structure. You’ve got the data management plan there in the middle, and then there are properties within each of these, so for instance, the contributor… I don’t know what to call it, section, property, anyway; it includes properties such as a name, an email address, an ORCID iD, etc., and where you work; that kind of thing.
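
To make the idea of a contributor section concrete, here is a minimal Python sketch that builds a small plan as a dictionary and serializes it to JSON. It is an illustration only: the key names are loosely modeled on the Research Data Alliance standard and should be checked against the published schema, the `affiliation` key and the `make_contributor` helper are assumptions introduced for this example, and the name, email, and ORCID values are placeholders.

```python
import json


def make_contributor(name, email, orcid, affiliation):
    """Build one contributor entry (key names are illustrative assumptions)."""
    return {
        "name": name,
        "mbox": email,
        "contributor_id": {"identifier": orcid, "type": "orcid"},
        "affiliation": affiliation,  # hypothetical key for "where you work"
    }


# A tiny plan with one contributor and one dataset; all values are placeholders.
dmp = {
    "dmp": {
        "title": "Example data management plan",
        "contributor": [
            make_contributor(
                "Example Researcher",
                "researcher@example.edu",
                "https://orcid.org/0000-0000-0000-0000",
                "Example University",
            )
        ],
        "dataset": [{"title": "Example survey responses"}],
    }
}

# Serialize to JSON text, then read it back as a program would.
text = json.dumps(dmp, indent=2)
parsed = json.loads(text)

# Because the plan is structured, a tool can pull fields out directly.
names = [c["name"] for c in parsed["dmp"]["contributor"]]

if __name__ == "__main__":
    print(text)
    print(names)
```

The contrast with a free-form, paragraph-style plan is that software can read fields such as the contributor names without anyone parsing prose.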

ELLIE: [Went to next slide]

Here we have a snippet of what a machine-actionable data management plan would look like in JSON. Not all researchers are going to be able to do this, right? So we’ve got the DMP Tool. Has anyone heard of or used

ELLIE: [Went to next slide]

the DMP Tool? This is just a tool that allows researchers to write a machine-actionable data management plan. They could even share it and generate a DOI for the data management plan if they’d like, which is a service to other researchers because it’s great to have examples. So here’s an example of the simple interface with a template, and you could generate templates based on different funder requirements with the NSF’s different directorates. This is an example of just a piece of what this tool looks like using the NSF Division of Earth Sciences template.

ELLIE: [Went to next slide]

So we talk about data management mostly for the hard sciences, but it’s also important for social science, statistical data, ethnography, humanities text, and anything that’s collected in the course of academic research.

ELLIE: [Went to next slide]

Okay, stakeholders and concerns, so most institutions like ours have very distributed research support services, and above, I have a fairly comprehensive list of the entities on campus that have a stake in good data management. Let’s take a look at each of these, and as I go through the list, I’m going to ask if you know whether you are a member of each of these groups and what I’ve missed. I hope you play along.

ELLIE: [Went to next slide]

Okay, the university itself, which I suspect we’re all part of, is responsible for good stewardship of its assets, including investments in research… the human, material, and financial assets required to conduct research, and also universities are rightly concerned about funder and regulatory compliance. In addition, the university’s mission is to advance knowledge, and good data management and sharing of data are important to that public service mission. Is anybody here at the vice president level or above? Alright.

ELLIE: [Went to next slide]

So the Office of Research, or the Division of Research as it’s called here, is responsible for maintaining good relationships with funders and for compliance with funder requirements, including the data management plan. The office administers budgets for funded projects, and they can help you determine if you could include data management expenses, such as somebody to manage the project, in a funding proposal. And finally, the Office of Technology Transfer, which is part of the Division of Research, makes sure that the university and researcher claim their intellectual property rights, usually in the form of a patent. Does anyone here work for sponsored research or the Division of Research, tech transfer? No? Okay,

ELLIE: [Went to next slide]

research compliance is actually part of the Division of Research, but I’m talking about them separately because they are a big stakeholder. They’re responsible for making sure that researchers have a plan to meet their ethical obligations, especially to human subjects and animals. This office also keeps an eye on all the regulations that relate to research, which change quickly, and is responsible for being sure that we at the university are meeting them, that our policies are in place and are accurate. So in some large institutions, research compliance is also responsible for ensuring compliance with data management plans, and here, there’s nobody making sure that you comply with your data management plan. Although sponsored research knows that you need to report out, so there is that, but they’re not necessarily checking on you, right? Does anyone work for research compliance? All right…

ELLIE: [Went to next slide]

So a consideration here is that, as with research computing, with research compliance communication is key. So has anyone here ever written an IRB protocol? That’s what you write if you’re working with human subjects, and keep in mind that even in the hard sciences, for some of these NSF grants, you have an education component. So you may have never done human subjects research, but you are now, right? So work with the staff in research compliance to protect the privacy; you know, to put measures in place to protect the privacy of participants, because gone are the days when you can just say, in your data management plan, that you’re not going to share because of privacy concerns. You need to think about it ahead of time.

ELLIE: [Went to next slide]

The Office of IT, or OIT as we call it here, is responsible for data storage. I forgot to ask; oh no, I did ask, okay. Data storage, security, integration, mining, visualization, and other information processing services. So obviously, this is a fast-moving field, and there’s a lot of planning and hands-on management required for this. Some research requires a lot of server space or has special regulations, such as HIPAA data, so data retention schedules and planning for extra security are of concern, as are cost models. So here at Boise State, every faculty member gets a certain amount of space on the Research Computing high-performance computers, but what if the researcher leaves the institution, and the data needs to remain? So that’s where cost models are really important. Who here works for the Office of Information Technology? Did I miss anything? No? Okay.

ELLIE: [Went to next slide]

Your biggest stakeholders, perhaps, are researchers because they generate the data, and their career advancement often depends upon their research and their research output. So this group’s highly motivated to meet funder requirements, including a data management plan. They also want to use their own data, so that’s also a motivation, right? If you use the wrong data, you might have to retract a paper. So obviously, publication and other research outputs are an essential concern for researchers, and back to the idea that data sets can have DOIs so that they can be easily found and cited. Also, most federal grants do require that data be open, so you can’t sign away your rights to a publisher. Sometimes, like these Elsevier journals, we call them the European Bandits, they’re very expensive journals, and they want to retain the right to data, and Open Access isn’t really all that open because it usually costs money to publish Open Access, so I encourage researchers to write Open Access fees into budget requests. And obviously, ownership is important to researchers because they’re so close to the data and have such a high stake in it, and so if they leave the institution, who owns the data, technically? Well, we’ll look at that in a moment. Check on the time; okay, research teams, is anyone here a researcher? Yeah? Did I miss anything important? No? It’s early, okay.

ELLIE: [Went to next slide]

Okay, a consideration for researchers is their research teams. So obviously, the research teams, and the members of it are also researchers, but for principal investigators, PIs, or project managers, you got to make sure that they know the expectations for data management and that they have adequate documentation and training. So what to put where and when, and how to name it, for instance.

ELLIE: [Went to next slide]

Another consideration is outside vendors. So there are vendors that provide services like running a survey or transcribing videos or so forth, and obviously, these businesses are in business. So they want to make money, and they’re incentivized to cut costs wherever possible. So it’s important to write critical data management tasks into contract language and create validation measures to be sure that they’re followed. I learned when my house caught fire in 2013 that when you hire a contractor, you want to write in the date and how much money comes off the total amount that you pay them for each day past it. Learned that later, unfortunately. So

ELLIE: [Went to next slide]

academic units, such as departments and colleges, are also concerned with data management, and some of these entities have research support services such as proposal writing, administration budgets, and tracking workflows, and these academic support staff work with researchers closely. So this is a good group to work with if you need to communicate to researchers about data management requirements or really anything else. Is anyone here who’s a research support staff person? Yeah, did I miss anything? All right.

ELLIE: [Went to next slide]

That’s my favorite picture in the whole slide deck. Libraries at most research one and many research two institutions have a unit that provides support for data management planning, and in order to provide that support, we keep a close eye on the funder and program requirements. Which is kind of a task because there are so many divisions and directorates within the NSF, and there are many funders, the Department of Justice, etc. So institutional repositories, which are often run by libraries, provide support for curating and publishing data sets, and finally, librarians have a professional obligation to advocate for equitable open access to information. It’s what we do.

ELLIE: [Went to next slide with horizontal bar graph: y-axis provides values library and IT department. The x-axis gives three values of the average number of research data services. For R1, the library has a value of 2.4, and the IT department has a value of 1.2. For R2, both the library and the IT department have values of 0.6. The last value, selective liberal arts, has a value of 0.9 for the library and 0.6 for the IT Department.]

So lest you be surprised by the fact that libraries are in the data business, I thought I’d share this chart that I created based on a 2020 study by a research group called Ithaka S+R. Libraries offer the greatest number of research support services at institutions, and especially at research-intensive institutions, followed by the IT department. Questions before I step into the next section of the talk? All right, we might finish up a little early then.

ELLIE: [Went to next slide]

So I’m going to talk about data management policies at research institutions. I looked at 48 universities that are all classified as Carnegie very high research activity, a designation that’s below research extensive, I believe, which is where you’re talking about the research one institution, or what you would call a research two institution, as are these other groups, and I looked for a policy that addressed research data management directly. Now the reason I did this is because, you know, I was talking to people in other units that support researchers, and there’s a concern, especially with the new NIH guidance and the CHIPS and Science Act, that universities might be required to oversee data management plans. I don’t think that’s going to happen, at least not right away. So only about 42 percent of the institutions had such a policy, and of course, there were policies that touched on one aspect of data management, like data retention or the classification of data for the purposes of security, that kind of thing, and then oddly, nine institutions had a data management policy for administrative data and student data but not for research data, which I found interesting and odd,

ELLIE: [Went to next slide]

so I also looked at who or which office was responsible for the administration of the policy, and I could only look at 19 of the 20 because California Institute of Technology keeps their policies behind a firewall. For most universities, it was the division of research; that’s about 68%. The office of IT and the research compliance office tied for a distant second at 11% each. In one case, the library oversaw the data management policy, and in another, it was like a leadership team from the Offices of Research, Information Technology, and Clinical Affairs and the deans of the several campus libraries. I don’t know how they get anything done…

ELLIE: [Went to next slide]

I looked at the topics covered by these policies, and you could see what topics I found. Data retention and funder compliance were the most commonly addressed topics, followed by data security, regulatory compliance, and the rights of project team members, including students, to use the data. I hadn’t thought about that, but as a student, you should have protections so that you can use the data to publish if you worked on it, and that’s what these 11 institutions did. So only five of the policies talked about the importance of sharing data and the obligation to share research data. I find the lack of uniformity here to be really odd; like, the highest percentage was only 21. Of course, this is small and more of a survey rather than a study. This is a small number of institutions, and also, there might be another policy that covered the topics, but still, I found it interesting. You could kind of tell who had a seat at the table by looking at the topics that the data management policy covers, right? The Office of Information Technology and the Division of Research were there, and maybe not so much the library, because very few policies cover data sharing.

ELLIE: [Went to next slide]

So I was curious about what happens to research data when you leave a university because the university technically owns everything you do… in most cases, the university does own the data, but provisions are made to transfer it to the researcher’s new institution, or they allow the researcher to take copies. In a couple of cases, the researcher owns the data but is required to give the university access if necessary, and in one case, the policy was written so that they might allow researchers access to their data, but maybe not, especially not if there’s a patent that they’re making money on. So I’m going to talk about services available at Albertsons Library next, but I’d like to ask if you have questions or comments about this information I just shared. I’m planning to expand on it because I’m just really curious. Also, I would be surprised if these policies aren’t shifting and changing, maybe more being written, because the NIH recently came out with an updated Data Management and Sharing Policy that got a lot of institutions concerned, again, and also the CHIPS and Science Act. So

ELLIE: [Went to next slide]

what can the library do for researchers? We can help write data management plans and help researchers use the DMP Tool; we could help with costing for data management, in the case that a researcher wants to write that into their budget for a funding request. We run the institutional repository, which is called ScholarWorks, and we can curate data sets, we can mint DOIs, and your data will be preserved in many copies. Well, many backups are made, and we have a plan for… well, the vendor that we use for the interface has a plan for what happens if we go out of business… and with Leah’s descriptive metadata, it will create an item record and use keywords that will help get people to your data and will enhance discoverability. If researchers have questions about copyright and license agreements, we can help with that, and I can consult with faculty about research impact and especially altmetrics, and then for students or sometimes for researchers, I can help identify and access data sets.

ELLIE: [Went to next slide]

Okay, here are some of the materials I reviewed when preparing this presentation; if you’d like this slide deck, shoot me an email. I’m also happy to share my research into institutional policies with anyone who’s interested.

ELLIE: [Went to next slide]

So here is my call to action. I’d like everyone, if possible, to think about how the library could engage, and how the research data librarian and especially the research data management group can make a difference here on campus. What can we do? Who do we need to talk to? Perhaps it’s you, and I’d like to hear from you in the coming months about that, so that’s my really small call to action. Any questions? Was I boring? Okay, well, I really enjoyed creating this presentation, and I hope you enjoyed it as well. Thank you so much.