Why ‘the tsunami of crap’ doesn’t matter

Andrew Updegrove offers some gloomy prognostications about the difficulty of finding books one wants to read, and the continuing necessity of gatekeepers: reblogged at The Passive Voice.

Actually, his fears are groundless, and his prescriptions wide of the mark. Chiefly for my own records, I reproduce here what I had to say about the matter, with slight additions:


 

Discoverability is not linear, but logarithmic.

That is to say: Finding what you want out of 100 different choices is not 10 times as hard as finding what you want out of 10 different choices. It is only twice as hard. The difficulty of choice increases not in proportion to n, but in proportion to log n. (This is why decimal notation was such a brilliant invention. One digit is enough to specify a number from 0 to 9, but two digits will specify a number all the way up to 99. With just six digits, you can choose one particular number out of a million.)

Even before the sea change in self-publishing, traditional publishers in the United States alone were putting out over 100,000 books a year. Log(100,000) = 5. The ‘tsunami of crap’ is putting out something over 1,000,000 books a year. Log(1,000,000) = 6. The difficulty of finding what you want now, compared to then, is therefore increased by a factor of 6/5. This is more than compensated for by the extra help we now have from things like book blogs and Amazon’s also-bot.
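For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The function name relative_difficulty is my own invention for illustration, and the book counts are the round figures quoted above, not exact statistics.

```python
import math

def relative_difficulty(old_count, new_count):
    """Ratio of the base-10 logarithms of two catalogue sizes.

    Assumes, as the argument above does, that search difficulty
    grows in proportion to log n rather than n.
    """
    return math.log10(new_count) / math.log10(old_count)

traditional = 100_000             # round figure for traditional publishing
with_self_publishing = 1_000_000  # round figure for the 'tsunami'

print(math.log10(traditional))
print(math.log10(with_self_publishing))
print(relative_difficulty(traditional, with_self_publishing))
```

A tenfold increase in the number of books thus adds one unit to the logarithm, which is where the factor of 6/5 comes from.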

The result: readers actually have an easier time than before of finding books that they want to read; and because there are more books to choose from, they are more likely to find something that suits their needs and interests exactly. No ‘curators’ required.


 

Another commenter asked why searching should be logarithmic rather than linear, and I explained:


It’s actually a general mathematical law, and applies to any kind of searching, whether addressing computer memory or extracting copper ore from the earth’s surface. The particular organizing principle that one uses, of course, depends on the exact conditions. In this case, as you suggest, much of it has to do with fine levels of categorization in online bookshops. But not all. A lot has to do with the fact that readers are free to speak and write about books, and recommend good ones to their friends (face-to-face or online).

Here is a short illustration of why it is not linear.

Linear search: You go to the supermarket to buy tomatoes, so that you can make sauce for your spaghetti. You start at the northeast corner of the store, and end at the southwest corner (so you can be sure to cover every inch of every shelf), and pick up each product one by one. You look at each product and ask: ‘Is this a tomato?’ Keep going until the answer is ‘yes’.

If there are n products in the store, you may, in the worst case, have to look at all n of them. The more products, the longer the search, in one-to-one, linear proportion.
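The shelf-by-shelf walk can be sketched in a few lines of Python. This is only an illustration: the store contents and the name linear_search are made up, and the tomato is deliberately placed last to show the worst case.

```python
def linear_search(shelves, wanted):
    """Inspect products one by one, in shelf order.

    Returns how many products were inspected before `wanted` turned up,
    or None if the store does not stock it.
    """
    for inspected, product in enumerate(shelves, start=1):
        if product == wanted:   # 'Is this a tomato?'
            return inspected
    return None

# A hypothetical store: 999 other products, with the tomato on the last shelf.
store = [f"product {i}" for i in range(999)] + ["tomato"]
print(linear_search(store, "tomato"))
```

Double the number of products ahead of the tomato, and the number of inspections doubles with it.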

Nobody actually does it this way. If you tried it, they would come and take you away before you got through the first aisle. Asking thousands of products, ‘Are you a tomato?’ is a good way to get yourself marked as (ahem) an unreliable person.

Here, instead, is what people actually do.

Logarithmic search: You go to the supermarket to buy tomatoes. Tomatoes are produce, so first you go to the produce department, ignoring all the rest of the store. One end of the produce is fruit; the other end is durable bulk vegetables, like potatoes and onions; so you go to the salad vegetables in the middle, ignoring the rest of the produce. In that section, you ignore the display case with the green leafy vegetables, and the one with the bell peppers, and so forth, and go straight for the one with the tomatoes. And once you are looking at the tomatoes, you ignore the beefsteak tomatoes and the little cherry tomatoes (if you know anything about making spaghetti sauce), and pick out a few choice items from the bin of ripe red Roma tomatoes.

At every step, you eliminate most of the remaining choices, and you do not even begin looking at individual items until you have narrowed your search down to a very specific category. If each category is divisible into b subcategories, the number of preliminary steps is at most log_b(n). In the famous game of Twenty Questions, the number of steps is 20, and each yes-or-no question divides the possible solution into exactly 2 subcategories. With 20 questions, assuming that you ask exactly the right questions, you can uniquely choose between 2^20 = 1,048,576 answers. (20 = log_2(1,048,576).) If you were allowed just twice as many questions, you could uniquely choose between 2^40, or more than a trillion possible answers. It will be a very long time indeed before the writers of the world publish a trillion books!
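The Twenty Questions arithmetic can be checked with a short Python sketch. It assumes the idealized case described above, in which every question splits the remaining candidates exactly in half; the function name questions_needed and the numbered candidates are my own stand-ins for real answers.

```python
def questions_needed(secret, n):
    """Count the yes/no questions needed to single out `secret`
    among the candidates 0 .. n-1, halving the field each time.
    """
    low, high = 0, n            # remaining candidates are low .. high-1
    questions = 0
    while high - low > 1:
        mid = (low + high) // 2
        questions += 1          # ask: 'Is it below mid?'
        if secret < mid:
            high = mid          # yes: discard the upper half
        else:
            low = mid           # no: discard the lower half
    return questions

print(questions_needed(123_456, 2**20))
print(questions_needed(123_456, 2**40))
```

Each pass through the loop is one question, and each question throws away half of what is left, so the count grows with the logarithm of n and not with n itself.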

The length of the search is proportional to the logarithm of the total number of items n in the store: Q.E.D.

Now, you will note that the linear search is the worst possible case (aside from running around at random and re-examining items that you have looked at before). Any method of systematizing your search will give you better than linear results; and the better the search method, the more quickly it will approach the logarithmic ideal.

With the aid of a good library catalogue (and while Amazon is not a library, its catalogue is the most efficient in the world), you can take a shortcut even on the logarithmic ideal: you can skip most of the preliminary steps and go straight to the subcategory you want, in as much detail as you care to specify.

For instance, I have an interest in ancient coins, and I want a book on the coins of Imperial Rome. Just this minute I typed into the Amazon search box: ‘books on Imperial Roman coins’. Out of all the millions of books for sale, the system narrowed it down in one step to just 84 titles, presenting the 16 best matches (‘best’ by a combination of sales rank and relevance, for all else being equal, the better-selling book has proved itself relevant to more people, and is a better guess) on the first page. The very first hit is David van Meter’s Handbook of Roman Imperial Coins, which is exactly what I would like; but it costs $59.99 in paperback, which is more than I want to spend. Scrolling down a bit, I see Roman History from Coins: Some uses of the Imperial Coinage to the Historian, by Michael Grant, for just $20.78. Michael Grant is a distinguished historian in the general field of Roman history; I know that name, and know how far to trust it; so I look at that one. There is, unfortunately, no ebook or ‘look inside’, but the description and the editorial and customer reviews show that it pretty exactly matches what I want. So, if I actually wanted to buy the book instead of just giving an example, it would be on its way to me already. Elapsed time: less than five minutes, and most of that time I spent typing this description of the process.

Sales rank of Grant’s book: #2,513,097. Through the old-fashioned media of physical bookshops, public and university libraries, and Inter-Library Loan, I might never have heard that there was any such book. (I might have stumbled upon it in looking at Grant’s other books, but not in a timely fashion, for his name would not have occurred to me in this particular context.) And yet I found that book in a minute or so, in preference to all the millions ahead of it in the sales rankings.
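How a catalogue can jump straight to the right subcategory may be easier to see in a toy example. I do not know how Amazon’s search actually works; what follows is a sketch of one common technique, a keyword index (an ‘inverted index’), written in Python. The first two titles are the ones mentioned above; the other two, and the names build_index and search, are invented for illustration.

```python
from collections import defaultdict

# Toy catalogue. The last two titles are made up for this example.
catalogue = [
    "Handbook of Roman Imperial Coins",
    "Roman History from Coins",
    "A Hypothetical Guide to Greek Pottery",
    "An Invented History of Medieval Coins",
]

def build_index(titles):
    """Map each lower-cased word to the set of title numbers containing it."""
    index = defaultdict(set)
    for number, title in enumerate(titles):
        for word in title.lower().split():
            index[word].add(number)
    return index

def search(index, titles, query):
    """Return the titles that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return [titles[number] for number in sorted(matches)]

index = build_index(catalogue)
print(search(index, catalogue, "roman coins"))
```

The point of the design is that the search never walks through the catalogue at all: it looks up each query word in the index and intersects the results, so the titles that do not match are never examined.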

No, finding what you want as a reader does not even begin to be a problem anymore. The problems are all on the writer’s side now, and even those problems are easier to solve than they used to be, when the only means of discoverability was to be one of the magic 1% who got a publishing contract (and then did not go promptly out of print).

Comments

  1. I would be interested to hear what those “problems on the writer’s side now” are, and what a writer can do to address them.
