The Background on Your Bytes

A large blue diagram fills the computer monitor in James Frew’s office at the University of California, Santa Barbara; it’s a graphical representation of the life history of data used to create a map of ocean color around the world. Several tightly spaced vertical lines run down the left side of the screen, illustrating the flow of information toward the end product. As he scrolls down, small boxes containing data source labels come into view on the right side. Horizontal lines lead from the boxes to the vertical lines on the left, mapping how each boxed element, or data source, feeds into the main flow.

“In a sense, what I’m doing is pointing to this ocean-color result and saying, ‘Where’d it come from?’” explains the professor of geoinformatics. “This prototype system is designed to apply a set of rules to the provenance so that it goes all the way back to the start. And at each point it asks, ‘Is this good?’”

Frew tracks down through the boxes, evaluating each as he goes: “In this case it’s good because the file name happens to match a pattern we trust, and the file hasn’t been modified since it was put there. This one lives in a standard place and is part of the operating system, so I trust it. I trust this next one transitively because it was created with verified data and a verified process. And I trust this because it’s under the control of a revision system, and whoever created the program hasn’t modified it since, so I trust it because I trust that control system.”

Finally, using his index finger to draw a circle in the air around all the boxes on the right side, he says, “Because I trust all this stuff, I trust the ocean color that came from it. We have the provenance and then we go back through it and apply automatic rules to decide, the idea being that if we trust all the antecedents, we will trust, by derivation, the end result. That’s provenance at work.”
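The derivation Frew describes can be sketched in a few lines of Python. This is only an illustration, not his prototype: the item names, the recorded facts and the three rules below are hypothetical stand-ins for the checks he walks through, and the sketch assumes the provenance graph has no cycles.

```python
# Hypothetical provenance graph: each item maps to the items it was derived from.
DERIVED_FROM = {
    "ocean_color_map": ["calibrated_scene", "render_program"],
    "calibrated_scene": ["raw_scene", "system_library"],
    "raw_scene": [],
    "system_library": [],
    "render_program": [],
}

# Hypothetical facts recorded about individual items.
FACTS = {
    "raw_scene": {"name_matches_trusted_pattern", "unmodified"},
    "system_library": {"in_standard_os_location"},
    "render_program": {"under_revision_control", "unmodified"},
}


def directly_trusted(item):
    """Apply the stand-alone rules: an item passes if any one rule accepts it."""
    facts = FACTS.get(item, set())
    return (
        {"name_matches_trusted_pattern", "unmodified"} <= facts
        or "in_standard_os_location" in facts
        or {"under_revision_control", "unmodified"} <= facts
    )


def trusted(item, graph=DERIVED_FROM):
    """Trust an item if a rule accepts it, or if every antecedent is trusted."""
    if directly_trusted(item):
        return True
    antecedents = graph.get(item, [])
    # An item with no antecedents and no passing rule stays untrusted.
    return bool(antecedents) and all(trusted(a, graph) for a in antecedents)
```

Calling `trusted("ocean_color_map")` succeeds here only because every box behind the map passes a rule or derives entirely from boxes that do; remove the facts about one source and the end result is no longer trusted.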

Provenance is an old concept, useful in examining valuable objects like fine wine or old books. In the art world, for instance, it refers to the past of a painting, a sculpture or other object — that is, the chain of ownership used mainly to determine authenticity.

“The minimal amount of provenance in art is, ‘I bought this from Sotheby’s, and they’re reputable. So I trust that it’s not a forgery,’” says Frew, a member of the provenance working group within the World Wide Web Consortium (W3C). “The ‘reputable’ part is important because, lacking that, you need a lot more detailed information.”

Tim Berners-Lee, inventor of the World Wide Web and leader of the W3C (the ever-evolving entity responsible for introducing such fundamental Web architecture as hyperlinks), is overseeing an international collaborative project to develop a universal language for creating, storing, reading, moving and sharing what might be thought of as data tracers that establish digital provenance.

Currently, this kind of valuable provenance information is essentially unavailable. But Frew has developed Earth System Science Server software, a “passive” system designed to generate provenance files for everything you do on your computer. You might activate it through a menu item after logging in, and it would then operate much as a database does, collecting the information off to the side as you work.

In the digital realm, some aspects of provenance are easier to track empirically. Frew provides the simple example of creating, saving and closing a Word document.

“Here’s a file called ‘article.doc,’ created by an instance of MS Word, which was started on your computer at a certain time on a certain date, and we infer from the fact that it was running with your identity and your privileges, that you started it.”
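As a rough sketch, a record like the one Frew describes might be modeled as follows. The class and field names are assumptions made for illustration and do not reflect the actual format his software produces.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProcessRun:
    """One run of a program (hypothetical fields)."""
    program: str
    host: str
    started_at: datetime
    run_as: str  # account whose identity and privileges the process used


@dataclass(frozen=True)
class FileRecord:
    """A file plus the process run that created it."""
    path: str
    created_by: ProcessRun


def inferred_author(record):
    """Infer who started the program from the identity it ran under."""
    return record.created_by.run_as
```

The inference is only as strong as its premise: the record shows which account the program ran under, not who was sitting at the keyboard.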

But that’s not necessarily the provenance that concerns policymakers. “The kind of information environmental managers care about starts with a sensor, a model or a field observation but winds up as a decision to, say, close a beach because the coliform count is too high in the water,” Frew says.

Provenance would be useful in the nonscientific world because we live in a time when information passes through ever fewer gatekeepers, or even none at all. “Information is coming at us raw and unmediated, and basically it’s up to us to make the judgments about whether to trust the source or not,” Frew says. “To do that, we are either going to have to drill down into the sources for every single article we read, or we can automate a lot of these kinds of judgments.

“If I have the provenance, I ought to be able to instruct a program to walk backward through it and tell me if there’s anything back there it doesn’t like, according to rules I set up. So I might be reading along and come to something that sounds bogus, and all of a sudden my bot pops up and says, ‘Warning, significant assumption in article came from Glenn Beck, broadcast on such and such a date,’ not because the program doesn’t trust Glenn Beck as a source, but because I don’t, so I’ve written a rule that tells the bot to flag anything that traces back to him.”
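A minimal sketch of such a bot, assuming the article's provenance is available as a simple mapping from each item to the items it draws on, could look like this. The graph, the source labels and the function name are all hypothetical.

```python
from collections import deque

# Hypothetical provenance: each item maps to the items it draws on.
CITES = {
    "article": ["think_tank_report", "agency_dataset"],
    "think_tank_report": ["broadcast_2010"],
    "agency_dataset": [],
    "broadcast_2010": [],
}

# Hypothetical attribution of items to the people or bodies behind them.
SOURCE_OF = {
    "broadcast_2010": "commentator_x",
    "agency_dataset": "state_agency",
}


def flag_distrusted(start, graph, source_of, distrusted):
    """Walk backward from start; report each distrusted source and the path to it."""
    warnings = []
    seen = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        current = path[-1]
        source = source_of.get(current)
        if source in distrusted:
            warnings.append((source, path))
        for parent in graph.get(current, []):
            if parent not in seen:  # visit each item once, even if cited twice
                seen.add(parent)
                queue.append(path + [parent])
    return warnings
```

The set of distrusted sources is supplied by the reader, not built into the program, which mirrors Frew's point that the rule reflects his judgment rather than the bot's.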

Those same concerns also play out in the scientific sphere.

“You also need some level of confidence in origins of the data and what was done to it,” Frew explains. “You might know where it came from but not trust someone in the chain, or there may have been assumptions made in selecting or interpreting information, or a sensor or a program might have gone haywire.”

He works extensively on remote sensing of snow in the American West and so gives an example drawn from California’s Sierra Nevada mountains. His team’s snow model showed a lot of snow in the middle of highly alkaline Mono Lake, which never freezes.

“That’s just not right,” he says. “So we go back, and we see that it isn’t bad satellite data; it’s our algorithm, a numerical problem. At least we knew that all the field spectral data that we went to a lot of trouble to measure, and all the satellite data we laboriously collected, were OK. We could fix the algorithm and rerun the model, as opposed to saying, ‘Oh, no! We have bad snow data and no clue where it went wrong.’”

Having the provenance also allows you to know if a glitch might affect only some kinds of analyses downstream and not others.

“A bad sensor providing bad information could ruin a bunch of stuff I calculated over here but not at all the stuff I have over here,” Frew explains. “Provenance tells me not only that there is a malfunction with the sensor but what the malfunction is and whether it matters or not.”
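The same records can be read in the forward direction to see which results a faulty input touches. The sketch below assumes a hypothetical mapping from each item to the products derived directly from it; the sensor and product names are invented for illustration.

```python
# Hypothetical forward links: each item maps to the products derived directly from it.
FEEDS = {
    "sensor_a": ["snow_cover"],
    "sensor_b": ["water_temp"],
    "snow_cover": ["runoff_forecast"],
    "water_temp": [],
    "runoff_forecast": [],
}


def affected_by(bad_item, feeds):
    """Return every product that depends, directly or indirectly, on bad_item."""
    affected = set()
    stack = [bad_item]
    while stack:
        current = stack.pop()
        for child in feeds.get(current, []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected
```

With these sample links, a fault in `sensor_a` reaches the snow cover and runoff products but leaves the water temperature product, which comes from `sensor_b`, untouched.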

While newspaper and blog readers are not likely to set up elaborate sets of rules to apply to articles they read, Frew suggests that, as provenance tags become common, they will spread, perhaps eventually becoming an accepted standard, like footnotes, for e-journalists and other distributors of digital reading material. At that point, he says, sources that don’t include them may be seen as less reliable.

The real challenge for provenance is making it an international system that works for every computer everywhere. Right now, Frew says, “There is no standard format for provenance files, no standard way of accumulating them, and no standard way of exposing them, shipping them around to other places or stitching them together.

“The big idea is to try to weave this language into the fabric of the Web so that, just as hyperlinks, html, news feeds, JavaScript and other elements are part of the Web architecture, and everybody agrees on how they ought to work, we’ll have a separate language that is a standard way of expressing provenance information and standard ways to request it and move it across the web.”

“Once there is general agreement, the rest of it involves people saying, ‘Gee, if I want to play in this pool I should do it that way.’”
