Making In-World Search More Relevant

In several of my recent posts, I have mentioned ways to improve In-World Search that extend beyond what exists today. Basically that means adding features and functions to In-World Search that do not currently exist. While that’s a grand goal, it’s also pretty much a pipe dream. Why? Well, Linden Lab has a long-standing reputation of “Doing It Our Way”. Actually, it’s more like “Doing It Any Way EXCEPT Your Way”. Not to single them out as if they’re the only tech-based company to behave that way … because they aren’t. In fact, if anything, they are more the norm in that manner than the exception.

But with that despairingly blunt introduction out of the way, there are things that can be done without extending or adding to the existing services and infrastructure. Linden Lab needs to shift their perspective just a bit, gain a better understanding of what really is “Out There” (inside Second Life) and accept the well-proven truth that damage to In-World Search is traceable directly to damage to Linden Lab’s purse. Once they do that, internalizing and implementing some of the ideas I present below should be a simple matter of “Git ‘er done!”

Before I dive in, one thing I want to mention up front. I abhor the tendency of most highly specialized disciplines to speak only in their internal lingo. During this post, I will be presenting some concepts and calculations that could quite easily be presented in denso-lingo that would make most people’s heads explode. However, I will endeavor to keep the language “street level” so everyone can benefit. If you’re one of those wonks that knows all the neat terms for the stuff I’m describing, grab your magic marker and cross out my long-winded phrases, then write the correct short term on your screen. That oughta fix it for ya.

The Start – What Is A Parcel Listing?

Within Second Life, all Land is divided into units called Regions (or Simulators or Sims). Regions are further sub-divided into one or more Parcels. A Parcel is the smallest unit of control and measurement that matters for most functions, including In-World Search. In the controls for a Parcel, you must mark the Parcel to “Show In Search” before it will be indexed and displayed in Search Results. You have the option to mark any of the Objects located on the Parcel to “Show In Search” as well. (Linden Lab charges L$30 per week to list your Parcel in Search; however, you may mark as many Objects to appear in Search as you wish without additional cost.) So part of the Parcel Listing consists of an alphabetical list of all the Objects you have marked to appear in Search.

There are other details about the Parcel that are included in the Parcel Listing. These include:

  • Parcel Name and Description
  • Region Name
  • Location within the Region
  • Search Category
  • Owner Type (Individual Avatar or Group) and Owner Name
  • Parcel Area (also called Size)
  • Parcel Image

All in all, a Parcel Listing should (and for the most part does) tell you everything you need to know about the Parcel. It should also help you identify the main theme of the Parcel. As a Human Being reading through a Parcel Listing, it’s pretty easy for most of us not only to instantly recognize the theme (or themes) but also to get a general idea of the quantity and type of the available “Products for Sale” on that Parcel.

However, it’s not that easy for a computer to do the same “read through” and come to a similar conclusion. It’s not that computers are stupid (okay they mostly are) but it’s more because computers have a hard time measuring the nuance of quality and diversity, and they totally suck at reading your mind when you’re trying to find something in particular. However, computers are fast … really really fast. So anything we design as far as a competent “Relevance Algorithm” must take into account the opposing forces of “Fast but Stupid”. Fortunately, we can easily get there just by understanding the problem and the goal, then building an algorithm that is based on that understanding.

Understanding Is A Many Splintered Thing

There are two opposing directions included in that whole “Understanding” thing. From the Shopper’s perspective, they want to find products that meet their personal qualifications, fit their need and taste, are within their price range, and aren’t just more of the “same old thing” they’ve seen 100,000 times before. From the Merchant’s perspective, they want to easily express the diversity and quality of their products, to be able to word things in plain-spoken ways (without having to calculate a bunch of numbers and carefully construct their listings) and to appear first (or nearly so) whenever someone uses Search to find the stuff they’re selling.

You can easily see that the primary goal of both Shoppers and Merchants is essentially the same. They want Search to act as the bridge in the middle that links them together with a minimum of fuss and bother and without spending a lot of time wandering all over the Grid. If there were exactly one Merchant for every type of product, the goal of connecting the two would be ultra-simple. However, there isn’t, so the job isn’t that simple. The Shopper will have to do a bit more digging to find just the stuff they want.

Personal Observation: Anytime a computer algorithm is designed that depends on the User being “Educated” first, it turns out to be an algorithm that doesn’t work. It’s okay to depend on the User having a basic understanding, but expecting them to master or even be aware of complicated or unusual methods is doomed to failure. Therefore our Relevance Algorithm has to be as simple to operate as possible without demanding or expecting unusual expertise on the part of either the Shopper or the Merchant.

“I Want …”

From the Shopper’s perspective, they have formed a basic description of the object of their desires. That description may be as simple as “couch” all the way up to something as complex as “a couch with overstuffed arms, fold out recliner and contemporary design … and oh yeah, it’s gotta be tintable”. The more specific the description they’ve formulated, the more specific their results will be … kinda goes without saying. But all that aside, their primary goal is they want to just type in their description and get a list of places that sell what they want.

“I Sell …”

From the Merchant’s perspective, they have a selection of products they sell. They know they’re not the only one selling specific types of products, but they also know that diversity is how they make their products unique. They offer various colors, features and functions in unique combinations that they feel differentiate their stuff from the rest of the herd. Because they’ve spent a lot of time adding those features and making their products the best they can, they also want to be able to list those differences and have them “visible” to Shoppers that are looking for similar products. Of course, that goal is at the “far end” of specificity. At the “near end” is the goal that if a Shopper is looking for a “couch” and they sell couches, they want to be visible to those Shoppers as well.

Fuzzy vs. Specific

Now we come to the real thorny issue regarding Search Relevance … How specific is the description of the Goal/Product? When a Shopper types in a Search Phrase of just “couch”, immediately we can see that’s a pretty non-specific (or what we call “fuzzy”) description. Anyone searching for just “couch” has to be prepared to get back a jumble of listings that all have “couch” somewhere. How those bazillion results are ranked is something we’ll get into much later on. But when a Shopper types in a Search Phrase that is more specific, contains more qualifiers that can be used to exclude or include results, ranking those results gets a lot easier … as long as we understand that they are being very specific in their Search Query.

And here is where we come to the first important distinction in calculating Relevance in the Search Results … the specificity of the Search Query. What is returned as the set of Results changes radically when the Shopper is being specific as opposed to when they are being very fuzzy. The Search Relevance Algorithm must “shift gears” and use a completely different method of pulling up results when the Shopper is being specific. As it stands now, the same method is used no matter how specific the Search Query. That’s gotta change … and fortunately because Linden Lab is now using SOLR (wrapped around Lucene), it’s also fairly easy to create.

What Exactly Is Specific?

Pull on your hip-waders kids … from here on out it’s gonna get a bit deep. Look back a few paragraphs and find that long-ass phrase I used as an example for finding a specific type of couch. What you should notice upon re-reading it is that the Search Query has one Subject … “Couch”. All the rest of the words in it are just additional specifiers that help focus the search. Here is where that “Understanding” thing kicks in. When people are searching for something, they have a specific Object in mind to start. How they word the description of that Object, and how they word their Search Query are all based around that singular Object. As long as we can find that Object in their Search Query, we can also easily consider everything else as additional specifiers of that Object. So does that mean we have to teach the Search Engine the entire English language … and German, French, Italian, Japanese, Spanish, etc. etc.? Ummm .. nope. Not really.

Name = Object, Description = Specifiers

When a Merchant creates an item for sale or sets out a vendor for a product, they have two fields that they must fill out. One is called “Name” and the other is “Description”. (Did I just hear a quiet *ding*?) Yup, that’s right. The Name field is just a place to put the words for the Object, whereas the Description field is where they put the specifiers for that Object. Now I do realize that there is some crossover in the uses of the two fields. However when you have the entire Grid’s worth of Object Listings to chew through, it quickly becomes obvious that words appearing in the Name field most often are Object names, and words appearing most often in the Description field are specifiers for the Object names. The statistical distribution of words (after tossing out the usual Stop Words like “the”, “of”, etc.) will rapidly indicate which category each word belongs to. Once we have divided up all the words into the two categories then we can better understand the Search Query and how to find the Parcels that are most relevant.
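To make that word-sorting idea a bit more concrete, here is a rough sketch in Python. It is only an illustration under some loud assumptions: the function names, the tiny Stop Word list, and the rule “a word is an Object word if it shows up in Names at least as often as in Descriptions” are all mine, not anything Linden Lab has published. A real version would run over the whole Grid’s listings and probably use a smarter threshold.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny Stop Word list for the sketch.
STOP_WORDS = {"the", "of", "a", "an", "and", "with", "for", "in"}


def tokenize(text):
    """Lowercase the text and return its words, minus Stop Words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOP_WORDS]


def classify_words(listings):
    """Sort words into "object" or "specifier" by where they appear most.

    listings is an iterable of (name, description) pairs.
    Assumption: ties go to "object".
    """
    name_counts = Counter()
    desc_counts = Counter()
    for name, description in listings:
        name_counts.update(tokenize(name))
        desc_counts.update(tokenize(description))

    categories = {}
    for word in set(name_counts) | set(desc_counts):
        if name_counts[word] >= desc_counts[word]:
            categories[word] = "object"
        else:
            categories[word] = "specifier"
    return categories


listings = [
    ("Leather Couch", "tintable couch with overstuffed arms"),
    ("Modern Couch", "contemporary tintable design"),
    ("Couch Deluxe", "overstuffed recliner"),
]
categories = classify_words(listings)
```

With only three listings the split is crude (“leather” lands in the Object pile simply because it never shows up in a Description), which is exactly why the large sample matters.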

Revisiting the long-ass search query again, the Subject is “Couch” and the phrases “overstuffed arms”, “fold out recliner”, “contemporary design” and “tintable” all narrow down the type of Couch. So how do we use this knowledge to better the relevance? Quite simple. By the numbers, it goes something like this:

  1. Find all Parcel Listings that have Objects set to Show In Search with the Subject “Couch” in the Object Name.
  2. Filter those results by finding all Parcel Listings that have one or more of the Specifier Phrases in the Object Descriptions.
  3. Rank the results based on Parcels with the most Specifiers found.
  4. Rank the results based on Parcels with the Subject in the Land Description and, to a lesser extent, the Land Name.
  5. Rank the results based on the “Magic Formula” for breaking ties in rank.
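For those who think better in code, the five steps above can be sketched roughly like this. Everything here is hypothetical: the dictionary layout for a Parcel, the function name rank_parcels, the plain substring matching, and the 2-to-1 weighting of Land Description over Land Name are all assumptions I made for illustration. The tie_breaker argument is just a placeholder for the “Magic Formula”.

```python
def rank_parcels(parcels, subject, specifiers, tie_breaker=lambda parcel: 0.0):
    """Return Parcel names, best match first.

    Each parcel is a dict with "name", "description" and "objects",
    where "objects" is a list of (object_name, object_description) pairs.
    """
    subject = subject.lower()
    specifiers = [s.lower() for s in specifiers]
    scored = []
    for parcel in parcels:
        # Step 1: keep only Objects with the Subject in the Object Name.
        matches = [(n, d) for n, d in parcel["objects"] if subject in n.lower()]
        if not matches:
            continue
        # Step 2: which Specifier Phrases appear in those Object Descriptions?
        found = {s for s in specifiers for _, d in matches if s in d.lower()}
        if specifiers and not found:
            continue
        # Step 4: Subject in the Land Description counts more than in the
        # Land Name (the 2-vs-1 weighting is an assumption).
        land_score = (2 * (subject in parcel["description"].lower())
                      + (subject in parcel["name"].lower()))
        # Steps 3, 4 and 5 become successive sort keys.
        scored.append((len(found), land_score, tie_breaker(parcel), parcel["name"]))
    scored.sort(key=lambda item: item[:3], reverse=True)
    return [name for *_, name in scored]


parcels = [
    {"name": "Mega Mall", "description": "Everything for your home",
     "objects": [("Red Couch", "tintable"), ("Oak Table", "seats six")]},
    {"name": "Couch Corner", "description": "Nothing but couch designs",
     "objects": [("Plush Couch",
                  "tintable with overstuffed arms and fold out recliner")]},
    {"name": "Rug Hut", "description": "Rugs",
     "objects": [("Wool Rug", "tintable")]},
]
ranking = rank_parcels(
    parcels, "couch", ["overstuffed arms", "fold out recliner", "tintable"])
```

In this toy data, Rug Hut drops out at Step 1, and Couch Corner outranks Mega Mall because it matches all three Specifier Phrases instead of one.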

Downranking for Spam

One of the reasons for this new Search from Linden Lab is the age-old practice of “Spamming”. I won’t climb on my soapbox about it in this post, other than to say it has blossomed into full growth directly because of the methods they’ve implemented for Search. (And btw, the Search just released is just as bad at ensuring spam will exist. It will just be a slightly different form but will be just as annoying and counter-productive.)

However, using the above methodology, we can quickly see that the typical type of spam … repetitive stuffing of specific keywords … will have little effect. Stuffing the word “Couch” into the Object Name and Description a bunch of times won’t matter because we’re only looking in the Name for the Subject (so all the extra repetitions in the Description are ignored). Furthermore stuffing specific words (like “recliner”) won’t matter much because we’re looking for whole phrases and not just single word occurrences.

But even so, there will be folks that attempt to spam by using the same phrase over and over, and here’s where we apply a tiny bit of social engineering. When we find a Specifier more than once in the Description, we knock a tiny bit off the count rather than adding to it. For example, when we find the Specifier once, that counts for 1.0. But if we find it twice, that listing only counts for 0.9. Find it three times and it only counts for 0.7. Each additional occurrence knocks a bit more off the “weight”. Those that engage in keyword stuffing will quickly find their listings count for nothing. In short, the more they stuff, the less they get ranked.
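Here is one way that shrinking weight might be written down. The first three values (1.0, 0.9, 0.7) come straight from the paragraph above; the rest is my guess at the pattern, assuming each extra repetition costs 0.1 more than the one before it and the weight never drops below zero. The function name is made up for the sketch.

```python
def specifier_weight(occurrences):
    """Weight of a Specifier found `occurrences` times in one Description.

    Assumed pattern: the n-th repeat costs 0.1 * n, so the total penalty
    after k occurrences is 0.1 * k * (k - 1) / 2, floored at zero.
    """
    if occurrences <= 0:
        return 0.0
    penalty = 0.1 * occurrences * (occurrences - 1) / 2
    # Rounding hides floating-point noise such as 0.7000000000000001.
    return max(0.0, round(1.0 - penalty, 2))


weights = [specifier_weight(n) for n in range(1, 7)]
```

Under that assumed pattern, a phrase stuffed in five or more times counts for nothing at all.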

Tie Breaking Magic Formula

Step #5 above mentions a “Magic Formula” used to break ties in rank. This is a lot more complicated than it might seem because by the time we get to the end of steps 1 through 4, we’ve generated a pretty good subset of results. However, what we have NOT found are Parcels that have more of the desired item, have better items, or anything similar. Steps 1 through 4 are only designed to find Parcels that actually have the Subject item listed in Search. There will be a minimal amount of ranking applied by virtue of those Parcels with the most Specifiers found, but it’s pretty easy to see that there’s going to be a LOT of Parcels that have all of them present … especially if the Shopper only gives one Specifier in their initial Search Query.

So how do we go about determining what Parcels should outrank others? (Time to put on your Global Thinking Cap .. and spin that propeller too.) The Search Database created as each Parcel is indexed will contain data from each and every Parcel on the Grid (that is marked to Show In Search). The beauty of giant data sets is that they give us a very large sample from which to draw conclusions. Parcels will range from the ultra-tiny (128 sqm) on up through Sim-sized Parcels (a full 65,536 sqm). Obviously we’ll also get a similarly varied range of Objects listed on those Parcels that contain the Subject. Looking at a single Parcel, we can quickly calculate the ratio of Objects that contain the Subject to the total number of Objects. We’ll call that the “Subject Density” … a number that ranges from 0.0 to 1.0.
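The Subject Density calculation is simple enough to show in a few lines. This is a minimal sketch, assuming we already have the list of Object Names marked Show In Search for a Parcel and that a plain case-insensitive substring match is good enough to decide whether a Name “contains the Subject”. The function name and the sample data are made up.

```python
def subject_density(object_names, subject):
    """Fraction of a Parcel's listed Objects whose Name contains the Subject."""
    if not object_names:
        return 0.0
    subject = subject.lower()
    hits = sum(1 for name in object_names if subject in name.lower())
    return hits / len(object_names)


# A small couch-only shop versus a big everything-store (made-up data).
boutique = ["Red Couch", "Blue Couch", "Leather Couch"]
mall = ["Red Couch", "Plaid Couch", "Oak Table", "Floor Lamp", "Wool Rug",
        "Bar Stool", "Bookshelf", "Desk", "Armchair", "Ottoman"]

boutique_density = subject_density(boutique, "couch")
mall_density = subject_density(mall, "couch")
```

The little shop scores 1.0 and the big store scores 0.2, no matter how many square meters either one sits on.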

The Subject Density has a couple of very handy side-effects. First and foremost is that it allows smaller Parcels to compete head-to-head with Sim-sized Parcels. A 1024 sqm Parcel that sells nothing but couches will have an equal chance against a full Sim that sells every type of furniture imaginable. The second benefit is … well let’s look at how Linden Lab has changed the Viewer and the second benefit will become a bit more clear.

Parcel Listings and The Dodo

Starting with the V2 Viewer, Linden Lab signaled that they are done showing the actual Parcel Listings to the Shoppers/Searchers. We all belly-ached about it, but true to form they have not budged one inch toward bringing them back. Now that they’ve announced the imminent demise of V1 Search, and the new V2 Search has no place nor reason to display those Listings anymore, it is even more obvious they won’t be returning anytime soon. But as most of us know, those things were damn important … especially for Sim-sized Parcels. A Shopper could quickly wander down the list, find the particular Object they wanted, then teleport directly to it (using the “Go” button). But now that the Listings are hidden, a Shopper can only teleport to “somewhere” on the Sim. They then have to wander around the entire Sim looking for the specific Object they want. I don’t know about you, but when I’m looking for something specific, wandering a whole Sim is NOT real high on my “fast, easy and fun” list.

So the second benefit to the “Subject Density” ratio will be to force Merchants with Sim-sized Parcels to subdivide them into specific departments. Each sub-parcel will wind up having a high Subject Density: one for couches, one for chairs, one for rugs, one for … etc. etc. This subdividing will be a natural reaction to the new Search Ranking methodology and coincidentally will benefit the Shoppers by allowing them to teleport directly to the department (sub-parcel) that contains the stuff they’re looking for. (Okay, I know it’s a lot of work, but it’s a natural extension and it really does make it better for your prospective customers to find the things they want … not get frustrated and teleport away.)

Even More Ways To Rank

Another benefit to the SOLR/Lucene package is that the crew at Linden Lab can hook in several attributes that are not presented on the Parcel Listing page. Among those are such things as number of sales, sales per visitor, average price of Objects, etc. These qualifiers only become truly useful though when we are able to examine the entire Grid as a whole. Plot any one of them across all Parcels and the graph winds up looking like our old favorite from school … The Bell Curve. As we begin applying more and more of these qualifiers to help break a tie in initial ranking, we can calculate how far off the “Peak” a particular Parcel lies. That distance tells us, to a large extent, how far the Parcel sits from the typical Parcel on the Grid, and in which direction.
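One common way to measure “how far off the Peak” is the standard score (the wonks call it a z-score): subtract the Grid-wide average and divide by the standard deviation. I am swapping that in as an assumption; the post does not commit to a specific formula, and the function name and sample numbers below are invented.

```python
from statistics import mean, pstdev


def distance_from_peak(values):
    """Map each Parcel to its standard score for one qualifier.

    values is a dict of {parcel_name: measurement}, e.g. sales per visitor.
    0.0 means "right at the Peak"; the sign says which side of it.
    """
    numbers = list(values.values())
    mu = mean(numbers)
    sigma = pstdev(numbers)
    if sigma == 0:
        # Every Parcel is identical on this qualifier, so it breaks no ties.
        return {parcel: 0.0 for parcel in values}
    return {parcel: (v - mu) / sigma for parcel, v in values.items()}


sales_per_visitor = {"Couch Corner": 30, "Mega Mall": 20, "Rug Hut": 10}
scores = distance_from_peak(sales_per_visitor)
```

How the scores from several qualifiers get blended into one tie-breaker is a separate design decision that this sketch does not try to settle.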

End of Part One

I know you’ve reached that “ate too much” stage, so I’ll pull up short … for now. Very soon I will dive a bit deeper into the Ranking Factors, and work through a few more examples to help lock these ideas in place. I will also tackle the Search Relevance calculation when a Searcher/Shopper types in a very non-specific Query (such as only typing “couch”).

In the meantime, any ideas or suggestions that might spring to mind are best added below as comments. (Brickbats and WTF’s are accepted as well, just please keep it clean … ish. K?)

