The world creates terabytes of data every hour, and it duplicates data at an even higher rate. As I type these words into the document that will become this article, I am creating data. When I post the document file onto the web server that supports the site, I am duplicating the data. Editors review the words, making changes, converting the duplicated data into derivative data. As you read these words, does the data become information?
Let’s define a distinction among three different kinds of data: created data, derivative data, and duplicated data.
Created data is original, not found elsewhere. This original data appears in the work of statistical agencies and organizations whose mission is to create data. The US Bureau of Labor Statistics, US Department of Commerce, Statistics Canada, the New York Stock Exchange, and NASDAQ are just a few of many, many organizations that survey the world and report the data (the stock exchanges are simply timely surveys of the selling prices of stocks). Created data includes the financial reporting from companies, product catalogues, original news articles, and original writings.
Original writings are perhaps the most interesting data. When someone writes, they present a synthesis of data, collected and processed into what could be a new idea. Not all writing we find on the internet is original, however. Much of it is nothing more than a warmed-over reworking of original material, produced by a content generator.
Derivative data is a creative effort that blends original data from other sources into a new interpretation. The work of the content-generating writers I described above is just one example of derivative data.
Sometimes derivative data can be of greater value than created data. For over 15 years, people have turned to Yahoo Finance for financial data about publicly traded companies. Yahoo Finance blends created data from different sources into a single platform, making it easier for people to find information, the kind that answers the questions they are asking. Sites like this also provide a portal to other places users can potentially go to gather more information. (I say potentially because the data may not answer the question.)
Derivative data includes mashups, blogs, and other content on websites. What you are reading at this moment is both created data and derivative data. The created data is the words typed. The derivative data consists of the other elements on the page, like the images, which were found elsewhere and included here to provide greater context and understanding. I think that most of the data found on the internet is derivative data. “LOL Cats” is derivative data, as is every news-reporting site.
Duplicated data is simply that — duplicated. Duplicated data lives in the body of websites. The reposting on Facebook of photos found on the internet is one form of duplicated data. Reposted quotations, news stories, and videos are all forms of duplicated data. Our capacity to duplicate data is astounding. Photographs, images, and videos are perhaps the fastest growing classes of data on the web.
Some data falls under all three classifications. Consider an e-mail message string in which the writers include an extensive cc list and each participant chooses to include the past parts of the message in the new message. The first message is original data, and the replies are a combination of original (the reply text), derivative (the thoughts of the past message woven into the reply) and duplicated (the repetition of the original text).
Duplication is not a bad action that reduces the value of the data. Derivative actions do not automatically make data more valuable. Original content is not necessarily of highest value. Context determines the value of data.
At the instant data comes into existence, who or what defines the context of that data? Who validates the accuracy of the data? As the data is collected, processed, cataloged, and combined, how much of the data becomes useless? How much remains useful? How much was never useful in the first place? What gives us the meaning of the data? We must ask these important questions about any data we use to guide our decisions.
Asking the question, “How valid is the data?” is not the same as asking, “How accurate is the data?” Validity addresses the question of how appropriate the data is to answering our questions. Can we justify the use of the data to answer the questions? Does using a piece of data make logical sense in the context of our questions?
Accuracy is another key question. Even if you know the source of the data, things happen in the collection process or in the development process that can distort the accuracy of the data. Incomplete data sets can ruin an analysis project. However, incomplete data can also illuminate opportunities and unanswered questions.
Data can be dead accurate and still be invalid. One of the ways to attack research findings is to challenge the accuracy of the data used in the research. That is a poor tactic. It is far more effective to challenge the validity of the data.
An example should illustrate the point. In retail, there is a metric used to measure customer satisfaction, the walk-to-buy ratio. It measures how many customers walked out of the store without purchasing something relative to how many customers did make a purchase. Expressed as a fractional or decimal value, walk-to-buy is like a golf score; lower is better. The parallel to the walk-to-buy ratio is known as the conversion rate, a measurement of the percentage of people who entered the store and bought something.
In practice, walk-to-buy and conversion rates are not difficult to measure. Many retail stores set up traffic counters (think of electric eye beams and other sensors) at their entrances and record the number of people who enter the store. Web sites can do the same thing with traffic counters, measuring how many people visit a specific page. Then you count the number of sales transactions for the day. Divide the transactions by the traffic and you have the conversion rate. Subtract the transactions from the traffic and divide by the traffic and you have the walk-to-buy.
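The arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the function names `conversion_rate` and `walk_to_buy` and the sample counts are hypothetical, and the sketch assumes, as the description does, that each sales transaction corresponds to one counted visitor.

```python
def conversion_rate(traffic: int, transactions: int) -> float:
    """Transactions divided by traffic."""
    if traffic <= 0:
        raise ValueError("traffic must be positive")
    return transactions / traffic


def walk_to_buy(traffic: int, transactions: int) -> float:
    """Traffic minus transactions, divided by traffic."""
    if traffic <= 0:
        raise ValueError("traffic must be positive")
    return (traffic - transactions) / traffic


# Hypothetical day: 400 people counted at the door, 100 sales rung up.
print(conversion_rate(400, 100))  # 0.25
print(walk_to_buy(400, 100))      # 0.75
```

Computed this way, the two figures always add up to 1, so they carry the same information. Neither one says anything about why a visitor left without buying.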
Now comes the hard part. What is a good walk-to-buy ratio or conversion rate? Several factors affect the ratios, so in considering what good is, we have to say, “It depends.” In my book, anytime we have to answer, “It depends,” we are attempting to answer a poor question about the relative quality of some data.
Why did the customer who walked in the door or visited the website not make a purchase? The electric eye and the register receipt will not answer that question. The data from the electric eye will not tell us which of the countless possible reasons explains why the customer did not make a purchase. Without collecting and analyzing other data, we can’t even begin to understand whether a result is good or bad.
Collecting the data to determine why someone does not buy takes effort. You could station an employee at the exit door and have them interview the people leaving the store without a package. Did they even intend to buy something when they came into the store? That person walking out without a package could be a customer’s child or spouse.
INSUFFICIENT DATA FOR MEANINGFUL ANSWER. Perhaps our problem is more about the question we are asking, and not a lack of data.