Should you share your data?

Seems the big scientific brouhaha at the moment is PLoS’s recent (clarifications to their) policy that data for a paper will be shared. In my field, the answer to the title of this post is, “OF COURSE!”

However, I get that there are different cultures, that vary by field, about what kind of data sharing is expected, and how much credit should be given to those who share data (citation, certainly, but what about authorship?). As has been discussed, there are also “lots” of corner-cases about where and how exactly the policy does or should apply. My guess is that PLoS actually left this intentionally vague, so that Editors can use their judgement (although hopefully they have been trained on what exactly the policy is meant to do; I can’t find the tweet that suggested this).

But someone might scoop me on the analysis

I asked my dad, who is a medical doctor spending just a fraction of his time on research, how he felt about this argument. There are lots of pressures on his time aside from doing analysis on data. His funding for his time spent on research does not come from taxpayers. If anyone should be sympathetic to this argument, it should be him. His response? “Well, you better do all your analysis the first time then!”

If you’re going to hoard all your data for the day, years down the line, when you might publish your analyses from it, then there’s a chance you could get hit by a bus, have the data get damaged, or otherwise just not ever get around to it, despite all your best intentions to publish it. I get that not everything people are doing directly translates to life-or-death decisions, but if there’s scientific insight that might be gained from your data, how is it not wrong to slow the progress of that insight?

Furthermore, if the tacit understanding in your field is that “when data are shared, along with some other substantial contributions, that’s standard grounds for authorship”, then it seems to me that it would be a breach of publication ethics not to include the source of the data as an author. The key point of “peer review” is that it’s done by your peers, and if it really is so unusual to have someone provide data without being a formal author, then you should trust your peers to catch that.

It’s too hard to put my data into a format that people will use

This may or may not be a corner-case, but a lot of the time, you could just submit the Excel file, data table, or whatever other form the data is in, zipped together as “Supplemental File 1. Data collected in 29 different files, separated by the fleezle criterion”.

My favorite source code license (under which I’ve released my processing/analysis code) is the Community Research and Academic Programming License. It’s obviously not designed for releasing data (it’s even unclear whether scientific data is subject to copyright protection and therefore licensable), but I think something very much like it could be useful for releasing data, especially in assuaging fears that it might be ugly and not in a totally pristine format. I might grumble when people’s released data tables are in a terrible format, but I actually curse them when it’s just plain not available.

What’s the “right” thing to do about data sharing?

There are two questions tied up in this one. First, what do we want the world to look like? And second, only once we know where we’re going, how do we get there?

For the first question, I think almost everyone will agree that, all else being equal, more science and better science will get done the more free the data is. You want more, better science, don’t you? Of course, that’s only true if people feel they can be appropriately compensated for the effort they go through to make and share the data. I think if we can assure people that their hard work will be properly recognized by funding bodies and hiring, tenure, and promotion committees (in almost all cases, composed of your scientific peers), then we can probably get them on board with freely sharing their data.

How we get to that kind of world is a whole ‘nother question. You might disagree, but I think PLoS is on the right track. As Gandhi probably never said, you must be the change you wish to see. PLoS is sticking itself out there, as it has done in improving other areas of science publication. Will this totally fix the apportioning of credit in every field? No, but I think the discussion about it will help bring the issue to the fore, so we can at least start moving towards that world. “Datasets generated” isn’t currently a standard section on a CV, but should it be? I think so (and may go ahead and update mine now).

Disclosure: My PhD supervisor, Michael Eisen, is a co-founder of PLoS and on its board of directors. I have not spoken with Mike about the content of this post or the “new” PLoS policy.

2 thoughts on “Should you share your data?”

  1. It seems you’ve missed the central point. What public archiving does is allow people to use your data without including you as an author! That’s what happens. You just get a citation, and no authorship. This happens now all the time.

    1. From my perspective, getting a citation but no authorship is as it should be. If you didn’t contribute to the design or analysis in the study, then you aren’t really an author. If I’m writing an essay on “The Old Man and the Sea”, I don’t make Hemingway an author on that paper. If your data is getting used heavily, I think it’s totally fair to point out that re-analyses are happening, and take some share of credit for that, since it’s fundamentally different from the more common type of citation, where “Jones et al found that snerks consume bleezles[Jones, 1929].” Just not as much credit as if you had contributed to that reanalysis.

      That said, I can point to examples (even in the field of genetics where you no longer “own” the data after publication) where key authors of the original data that gets reanalyzed are included as authors on the reanalysis, presumably because they contributed their insight in a meaningful way.
