What's "Data Theft?"

Pedro A. D. Rezende
Department of Computer Science
University of Brasilia
  February 14, 2013

Attempts to introduce legal protection for "data", as if "data" could be some sort of appropriable symbolic good (such as a "brand" in trademark law, or as "works of the human spirit" in copyright law), are a recipe for disaster, as we shall argue. Those in favor may cunningly think the same, as suggested by the sly trajectory of such attempts in parliaments, exemplified in Brazil by the case of AI5 Digital.

No matter how such protection is introduced, as an extension of whatever code of law, its enactment will destabilize legal systems with such potential for abuse, and with such vast collateral effects in Computing and in related practical fields connected to legal theories of proof, as to put to shame the creativity of a Lewis Carroll in his most puerile fables.

For "data" is something that only exists as a function of some code or language, and it serves to encode information only in some cognitive context. That is to say, in a context where signals may transmit it through a channel in time or space, between a source and a destination such that at least one of them is cognitively apt: the one(s) whose state of knowledge will change as a function of the information so transferred.

So says the mathematical theory of information, proposed by Claude Shannon in 1948, which gave us the basis for computer science itself. If we take the notions of "data" and "information" from there, we are led to conclude that the legal concept of property cannot embrace their functions without harming the logical consistency of its practical applications in any minimally coherent legal system.

Inconsistencies would stem from the fact that in reality the same data or information can be in many places at the same time, and with no generative relation between its occurrences or instances. The dogmatic filter that seeks to justify the legal protection of "data" or "information" is woven by an implicit but fallacious assumption: that some generative connection between these occurrences or instances ought to exist. However, what holds for trademarks and copyrighted works does not necessarily hold for "data" or "information" in general.

Here's an example that exposes the fallacy of such an assumption: If the world has 1 billion Internet users who need to create passwords, and 90% of them create theirs with at most nine printable characters from an ASCII keyboard, then, by a counting argument of the birthday-paradox kind (all the more so because people do not choose passwords uniformly at random), there will be people with identical passwords. Who would have "stolen" the password from whom? Who would bear the burden of proof of "theft" -- or of "non-theft" -- of such a password, under a legal regime that typifies the crime of "data theft"?
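The counting in this example can be sketched numerically. The figures below are assumptions made for illustration and are not in the original text: an alphabet of 95 printable ASCII characters (codes 32 to 126), passwords of length 1 to 9, and, as a deliberately conservative simplification, users who choose uniformly at random. The names `keyspace` and `collision_probability` are hypothetical helpers.

```python
import math

ALPHABET_SIZE = 95        # assumption: printable ASCII, codes 32..126
MAX_LENGTH = 9            # "at most nine" characters
USERS = 900_000_000       # 90% of 1 billion users


def keyspace(alphabet_size: int, max_length: int) -> int:
    """Number of distinct passwords of length 1..max_length."""
    return sum(alphabet_size ** k for k in range(1, max_length + 1))


def collision_probability(users: int, space: int) -> float:
    """Birthday approximation 1 - exp(-n(n-1)/(2N)) for uniform choices."""
    exponent = users * (users - 1) / (2 * space)
    return -math.expm1(-exponent)


space = keyspace(ALPHABET_SIZE, MAX_LENGTH)

# A strict pigeonhole argument would need more users than passwords.
pigeonhole_forces_collision = USERS > space

# Even so, pairs of users are numerous enough for a coincidence to be likely.
p_uniform = collision_probability(USERS, space)

print(f"possible passwords: {space:.3e}")
print(f"pigeonhole forces a collision: {pigeonhole_forces_collision}")
print(f"P(at least one shared password), uniform choice: {p_uniform:.2f}")
```

Under these assumptions the space of passwords is larger than the number of users, so a strict pigeonhole argument does not apply, yet the probability of at least one coincidence already comes out close to one half. Real users favor short and memorable passwords, which pushes that probability towards certainty, and none of the coinciding users need ever have seen another's password.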

Or, in another, more interesting scenario: Who would be assigned ownership of the data "3.14159 ...", which represents, in the decimal positional number system, the geometric constant named "pi"? If this data is declared public, would you and I be prohibited from, or need a license for, "appropriating" it in calculations? How about ownership of the multiplication table in that number system, which is already in my mind? And the lexicon of a language? Who would control, and how, the ownership, possession of, and permissions for using data, whether public or private?

In practice, such an assumption can only serve to trivialize the criteria for probative efficacy of facts newly typifiable as illegal acts. It will yield proofs with a greater chance of success and lower cost, but with much less accuracy (showing possession would be enough). It will yield, say, proofs of crimes committed through illegal copying of bits, which laws already cover in the case of trademarks and works of authorship (albeit, in the digital realm, at a high cost for meeting established standards of accuracy), but it will do so at a social cost that seems unacceptable: namely, that of delegating the final definition of criminal offenses to the owners or controllers of digital channels and systems, in a society that is increasingly dependent on them as infrastructure.

This is a scenario fit for the reign of Kafkaesque and Orwellian fictions, as shown in the Aaron Swartz case. A reign we may call that of the "data-thing dogma": the belief that such an assumption is obvious, a trivial or incidental fact, and therefore implicitly acceptable by law. On the other hand, one can entertain the belief that this dogma sows the seeds of doublethink and doubletalk, carrying them from Orwell's "1984" into legal practice under laws typifying "data theft".

Thus, when the space for legislative debate on these attempts shows a bias towards the data-thing dogma from the start, or against the kind of discussion that this article tries to promote, the most fitting interpretation is that the main interest behind the spread of this dogma is aligned with the interest in controlling the cost of probative efficacy for criminal types that are to remain undefined, but fillable case by case by those same interests. All this in the guise of addressing a supposedly urgent and pressing theme (cybercrime), regardless of the strategy's social cost.

Some scholars commenting on criminal procedure codes, such as Nelson Hungria in Brazil, like to talk about intangible goods that can and should be legally protected. For instance, goods like "energy" as subject to theft or robbery, as in the case of electricity supplied as a utility. In this context, where one may be led to reason by analogy about "data" and "information" (since these entities may also be seen as intangible "goods"), we can point to the following conceptual flaws in this type of reasoning.

If we approach the possible consequences of such legal review with the help of Shannon's theory and from the semiotic perspective of Charles S. Peirce, boundaries and distinctions become clear: what energy can produce as a service are signals, not symbols. The electric power at the antenna of a wireless device, or in a copper wire of an Internet connection, creates signals for encoding bits (which in turn encode letters, pixels, program instructions, etc.), just as the ink of a pen laid on paper forms signals that encode letters (or figures, etc.).

Signals encode symbols with respect to some code, and therefore have no symbolic value in the absence of one. In the absence of a code, signals are just noise; whether in an electromagnetic wave or in a wire of a computer network, they are just like scribbles on paper. No matter the medium -- wave, wire or whatever -- data in digital form are just sequences of binary symbols carried by signals, which encode them in some previously agreed code to represent information in a cognitive context associated with such agreement and transmission.
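The dependence of data on an agreed code can be illustrated with a small sketch. The four byte values below are chosen for illustration and are not from the original text; the point is only that the same binary symbols stand for different things under different codes, and for nothing in particular without one.

```python
import struct

# The same four bytes, i.e. the same sequence of 32 binary symbols.
raw = bytes([0x40, 0x49, 0x0F, 0xDB])

# Read under the IEEE 754 single-precision code (big-endian):
# an approximation of the constant pi.
as_float = struct.unpack(">f", raw)[0]

# Read under the code "unsigned big-endian integer": a large whole number.
as_uint = int.from_bytes(raw, "big")

# Read under the Latin-1 character code: four characters, one of them
# a control character with no printable form.
as_text = raw.decode("latin-1")

print(as_float)
print(as_uint)
print(repr(as_text))
```

Nothing in the bytes themselves says which of these readings is "the" data; that is settled only by the agreement between whoever writes them and whoever reads them.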

Outside the cognitive space of a receiver and/or transmitter of signals where such an encoding agreement was or can be previously reached, "data" do not represent any information, nor do they have any informational value. This value comes from the cognitive context where data represent -- or may represent -- information for someone; it does not come from the signals themselves. Therefore, from a semiological perspective, stealing electricity from a utility service bears a useful analogy not with, say, the leakage of a password or of other "personal data", but with, say, the theft of a pen from a writer or of network bandwidth (transmission capacity) from a router.

The legal definition of theft presumes that a legitimate owner of something has been deprived of possession of that something, which happens in electricity theft but does not happen in "data theft". The term "data theft" is therefore an abuse of language, for all the examples ever given involve not the concept of theft as technically established in the legal field, but illicit copying or trespassing, which are types already covered in some established field of law, irrespective of whether the implicitly valued code is defined over another (digital) code or over a material medium (paper).

