In Search of a Better Algorithm

Try searching on “allen iverson email” in Google. The third result is a link to this blog.

Why?  Because I once posted the fact that I love the way Allen Iverson plays basketball, because the word “email” is high up on the front page of this blog, and because there are a ton of inbound and outbound links to this blog.  But is AVC one of the best results when you are searching for Allen Iverson’s email address?  Not likely.  But there are close to a couple dozen comments on that Allen Iverson post that suggest some people think it is.

And the same thing happens to lots of other bloggers.  Try searching on Google or Yahoo! for “oprah backlash james frey”.  You will be directed to Brad Feld’s blog for similar reasons.

Why am I telling you all this?  Because as great an experience as searching the Internet is on Yahoo!, Google, Microsoft, or Ask, Internet search is still a very primitive technology.  Rarely is the first result of a search the best result for my needs, regardless of what engine I use.

I’ve been thinking a lot about search lately.  I do a lot of searching on the Internet even though I have literally hundreds of sites bookmarked and at least fifty to a hundred sites that I visit on a regular basis and whose URLs I know by heart.

I tend to use Yahoo! for most of my searches as I have made it the default search in Firefox.  After that I use Google.  I rarely use Microsoft or Ask.

But last week, as a result of the relaunch of Ask.com, I did some searching on Ask and I got very different results on my standard test searches: fred wilson, vc, union square ventures, wilco, flaming lips, digital camera, and a few others. Ask does not appear to rely on link ranking nearly as much as Google and Yahoo! do.

When you search on “fred wilson” or “vc” on Yahoo! or Google, this blog is the first result on both for those search terms.  I have always thought that was because those keywords appear high up on the front page of my blog (vc is in the title) and because of the large number of inbound and outbound links that this blog has accumulated in the 2 ½ years that I have been blogging.

But when you search on “fred wilson” or “vc” on Ask.com, you get a whole bunch of other results. It’s basically what those search terms returned on Google or Yahoo! back in 2003 before I started blogging.  That suggests to me that Ask.com doesn’t care much about link rank.
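To show what I mean by link rank, here is a minimal sketch of a simplified PageRank-style calculation. It is an illustration under stated assumptions, not what Google, Yahoo!, or Ask actually run: the function name `link_rank`, the damping value, and the tiny link graph are all made up, and this version only rewards inbound links.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style score (illustrative only).

    links: {page: [pages it links to]} -- a hypothetical link graph.
    Returns {page: score}; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page in pages:
            # A page with no outbound links spreads its score over all pages.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: three blogs link to "avc", which links back to one.
graph = {
    "blog_a": ["avc"],
    "blog_b": ["avc"],
    "blog_c": ["avc", "artist_site"],
    "avc": ["blog_a"],
}
scores = link_rank(graph)
top_page = max(scores, key=scores.get)
```

In this toy graph the heavily linked page comes out on top regardless of whether it is the best answer for the query, which is exactly the behavior I am describing.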

I don’t know if that’s a good thing or a bad thing.  When you are searching for Fred Wilson, there is as good a chance that you want the artist, the chess master, or the rock n roll band as that you want this blog. When you search on “vc” on Google or Yahoo!, your first link is this blog.  Is that the best result for vc?  I doubt it.

Text search works well enough to be useful, but it doesn’t work well.

And before we get close to perfecting text search, we are off to new horizons with audio search, video search, etc.  Will Google, Yahoo!, Microsoft, Ask and others continue to invest in improving text search or move their efforts to searching other forms of media?  I suppose the answer is both but clearly there is an impression among many that text search has “been done” and that impression is wrong.

I believe there is opportunity in improving text search, but few investors want to back another search company and take on Google, Yahoo!, Microsoft, and the others, and for good reasons.

But the immaturity of search was one of the many reasons I loved our investment in delicious so much.  I thought, and still do to some degree now that Yahoo! owns delicious, that searching user generated tags instead of (or more likely in addition to) some computer generated index would generate a better result.  Of course, tagging must become a much more popular behavior before we can have a tag database that can deliver high quality results for all search terms.
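Here is a rough sketch of what searching tags "in addition to" a computer generated index could look like. Everything in it is an assumption for illustration: the function `blended_rank`, the `tag_weight` parameter, the scores, and the example URLs are hypothetical, and this is not how delicious or Yahoo! work.

```python
from collections import Counter


def blended_rank(query, text_scores, tag_db, tag_weight=0.5):
    """Rank URLs by a machine text score blended with human tag matches.

    text_scores: {url: float} from some existing index (assumed given).
    tag_db: {url: Counter of tag -> number of users who applied it}.
    """
    terms = query.lower().split()
    ranked = []
    for url in set(text_scores) | set(tag_db):
        tags = tag_db.get(url, Counter())
        total = sum(tags.values())
        # Fraction of all tag applications on this URL that match the query.
        tag_score = sum(tags[t] for t in terms) / total if total else 0.0
        score = (1 - tag_weight) * text_scores.get(url, 0.0) + tag_weight * tag_score
        ranked.append((score, url))
    # Highest score first; ties broken alphabetically by URL.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in ranked]


# Hypothetical data: the text index favors the blog, the taggers do not.
text_scores = {"example.org/vc-blog": 0.9, "example.org/iverson-fans": 0.6}
tag_db = {
    "example.org/vc-blog": Counter({"vc": 10}),
    "example.org/iverson-fans": Counter({"iverson": 8, "nba": 2}),
}
with_tags = blended_rank("iverson", text_scores, tag_db)
without_tags = blended_rank("iverson", text_scores, tag_db, tag_weight=0.0)
```

With the tag signal switched off the blog wins on text score alone; with it switched on, the page that people actually tagged "iverson" moves to the top. The sketch also shows the catch: a URL nobody has tagged gets no help at all.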

There is also the promise of shared searching which Yahoo! is promoting with MyWeb (why don’t they just merge delicious and MyWeb and let us choose either interface?).  David Hayden is chasing a similar vision of shared search with Jeteye.

And there are next generation tagging services coming like Plum that may offer some new ideas in this area of shared search and discovery.

I frankly think that an orthogonal attack on search via something that is seen as very different is the more intelligent way to approach this problem.  First, it’s more likely to obtain investment.  Second, most users aren’t going to start searching with a different engine unless they see the benefits first.  So you have to hook them on something where the initial value proposition is something else (tagging, social networking, looking at videos, etc).

And then there is Alexa to consider.  Amazon has supposedly opened up the Alexa search service so that others can use it.  I know of at least one company that has taken them up on that offer, although they found that the demand to use Alexa was larger than Amazon could initially support.  I haven’t checked back with that company to see how they are doing with Alexa.

What if we had an “open source” search engine that everyone working in and around the area of search could plug into?  The companies working in tagging, shared searching, audio and video search could offer their results/indexes to the open source search engine so that their meta data could be considered in preparing the best results?  Could this work? And what would the business models be for the companies supplying the meta data?  And would consumers adopt such an engine?
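One way such an engine might combine results from many suppliers is reciprocal rank fusion, a standard technique for merging ranked lists. I am swapping it in purely as an illustration; the supplier names, the result lists, and the function `fuse_rankings` below are hypothetical.

```python
from collections import defaultdict


def fuse_rankings(rankings, k=60):
    """Merge ranked lists from several suppliers with reciprocal rank fusion.

    rankings: {supplier_name: [urls, best first]}.
    Each supplier adds 1 / (k + position) to every URL it returns.
    """
    scores = defaultdict(float)
    for ranked_urls in rankings.values():
        for position, url in enumerate(ranked_urls, start=1):
            scores[url] += 1.0 / (k + position)
    # Highest fused score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical suppliers plugging into the shared engine.
rankings = {
    "text_index": ["a", "b", "c"],
    "tag_service": ["b", "c", "a"],
    "video_search": ["b", "a"],
}
fused = fuse_rankings(rankings)
```

The appeal of a rank-based merge is that suppliers only need to hand over an ordered list, not scores on a common scale. It does nothing to answer the business model question.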

I am not sure, but I am sure that we are in the first or second inning of the search ballgame, and nowhere near the seventh inning stretch.  So for all those entrepreneurs who are way smarter about this stuff than I am, let me encourage you to be thinking about search as much as I am these days.  It’s still a huge opportunity.

Comments

im waiting for the search engine who sees the internet the way it really is... a distributed network... traditional search engines like google see it as a centralized network.

Nutch is (was?) an open-source search engine that was backed in part by Overture/Yahoo and has been around a few years:
http://lucene.apache.org/nutch/docs/en/

To me, the big question with search relevancy is identifying user intent. Is a searcher for [fred wilson] looking for the VC, artist, chess master, or rock n roll band? As research has shown, most users are only going to enter 1-2 keywords in their search term.

So, I think the real areas for improvement with search are in the user interface. How can we help users refine their search and quickly get from thousands of "search results" to the "search answer" they were looking for...?

Clusty disambiguates "Fred Wilson" pretty well!

Your post got me thinking about Clusty.com which I hadn't tried in a while. The first four results for "Fred Wilson" are the band, the Canadian politician, the chess master, and the VC. The artist is a few links down but "Artist" is one of the clusters presented on the LHS. Not bad!

I am of the opinion that off page factors (inbound link text, number of links from 'on topic' sources of the given subject search context [search for 'Hilltop Algo'/or 'Latent Semantic Indexing (LSI)']) still matter just as much as on page factors for both G and Y!, along with Domain Name still being a key factor in specific keyword searches. Also I think that a lot of emphasis is placed on tenure of the domain itself, and I think if G really wants to cut out the spam then they are going to have to find new methods to verify originality of content (i.e. who posted it first). In the world of RSS, I can have 50 splogs with my entire post on them within 10 seconds of my post hitting the web, which is one of the main reasons that I only publish a partial feed. It sucks it has to work that way, but for now, that's just the way it works. As for Ask (and one of my favorite SEs, Teoma), I have been somewhat impressed by their results as of late. I still use G (even though I posted earlier in the year I was going to try to quit using it along with you), but I have been utilizing Y! a lot more. MSN is decent, but I think that their 'neural net' is going to have to learn a little bit quicker to keep up, and their results are all over the place. I can see some fluxes on G and Y! from day to day, but MSN is chaotic, although they still seem to offer good results when I search for something related to coding/software/hardware. Here is my breakdown of which Search Engines I use when I am doing specific queries: tech related - MSN/Google, searching on a specific product's information - Y!, searching for a specific product price - Froogle/Pricewatch, information/trivia searches - Teoma (basically same as ask.com), and one search engine that doesn’t get a lot of attention that does basic search fairly well is Gigablast (and has a good API for developers).
I completely agree with you that search accuracy is still in its infancy, and I do think it has a long way to go before people get exactly what they are looking for. I think phase 1 might have been the scanning/indexing part of search engine research, phase 2 was/is localization (which is getting fairly accurate), and phase 3 is personalization (RSS Readers from the SEs are helping with that along with the 'portalization' of some other SEs), which is the phase I think we are in right now. Phase 4+ will be interesting because I think that is where we are going to see the divergence in strategies between the different engines.

Nice post.

The problem with relying on computer algorithms for search is that computers don't know what the user is thinking about. Like you said, someone can search for "Fred Wilson".. but how is Google or Yahoo! or even a social bookmarking site supposed to know what the person is after? Maybe they want the chess dude. Maybe they want the rock and roll band. Maybe they want the Fred Wilson they went to school with back in 1986.

Computers don't know these things, and to have better search, it'd be helpful if they DID know. But I don't think we're going to be plugging our brains into a computer port anytime soon.

Alas, the next generation in search is not based on algorithms but on semantic rules. These sorts of engines have been around for a while, especially in Europe (because the USA dropped teaching linguistic courses x0 years ago). They target niche markets like intranet search, military intelligence, and so on. This is because they cost a load. And they are expensive because nobody knows how to build dictionaries automatically. It's all hand made. Any volunteer to fund a company that will build the universal semantic dictionary in order to search the web with more accuracy? I have not run the maths on the related investment but it ought to be in the billion dollars and hundred years range. Exit strategy might be the extinction of the human race and selling the behemoth to the next dominant species.

Wow, some of the comments on that Iverson post are rather pathetic. Talk about idolatry.

Haven't these people anything better to do?

I use del.icio.us for my own bookmarks, but I wasn't sure about it for search until this weekend. A friend wanted to find a very specific video. He searched through multiple pages of standard results, but couldn't find it. Out of curiosity, I placed the exact same two terms into del.icio.us and received one result: the right one.

It amazed me, yet it also seems obvious. We search with keywords, so of course tags aid in search.

If there isn't already an open source search engine like you mention, there should be.

WHY IS GOOGLE GETTING WHIPPED IN SOUTH KOREA BY NHN?
http://bernardmoon.blogspot.com/2006/02/why-is-google-getting-whipped-in-south_01.html

Korea is the future, right?

Search technology is approaching fundamental limitations as it is currently optimized for a text-centric, web-based, publishing environment. There is only so far that statistical machine learning techniques can go with respect to the analysis of HTML pages before the results they produce are no longer delivering significant value additions to end users. Can you ask Google to find you all the black shoes on the web under $30, or find all the reviews from your favorite blogger that are associated with wireless routers? No and no. The problem is that HTML lacks structure and context. That being said, however, as the web transforms from a publication mechanism into a platform for distributed, online services, there is an opportunity to leverage existing machine learning techniques on an increasingly structured and service-oriented web to not only dramatically improve the relevance of query results, but also to power a new class of functionality on the web.

I thought, and still do to some degree now that Yahoo! owns delicious, that searching user generated tags instead of (or more likely in addition to) some computer generated index would generate a better result.

Yep, agreed, except for the "tags" part. I believe something a bit more general is needed.

I think that search needs to move to the client... Something like Root Markets but with an open source client that sits on the desktop and logs your clickstream, not only in your browser, but across your applications. More on my bizarre thoughts here:

http://www.ottow.net/blog/2006/02/livin-on-edge.html

But we ARE "plugging our brains into a computer port" in a sense when we tag content and a computer algorithm tries to make sense of the tags.

For example, Flickr is getting somewhere with a combination of tagging and clustering. Tags are human-assigned, but Flickr starts noticing relationships between tags.


It’s a paradox…

On one hand, I directly hear quotes like this: “My customers are begging for accuracy! Please, please, please keep me informed about an Enigma-based (a new job search methodology called Themetics, based on combining thematics and genetics) board, as I will undoubtedly want to try it.”

On the other hand is this quote from The Search by John Battelle: “As long as we're 80 percent as good as our competitors, that's good enough. Our users don't really care about search."

So users and intermediaries want better search tools, but search providers don’t seem to see the market for improved search. In fact, my belief is that poor search quality is what allows Google to generate billions of dollars from ads, since users are more likely to click on ads when they appear more relevant than the organic results.

Without knowing the context of the user’s request, search engines return some good results mixed in with the poor, turning an impossible task into a manageable one … but not a satisfying one; and thus the door remains open for the next revolution.

In the job search space, Themetics improves search accuracy dramatically over the current best-of-class job search engine (CareerBuilder using FAST). For example, try searching for auto sales on any of the major job search engines – avoid using quotation marks since the job ad may be using words such as automobile, car, Toyota, salesperson, etc. – all of which would be missed in an exact string search.

Using Themetics, each search string is first interpreted – in this case, the most likely category is deemed to be retail sales; then jobs within the selected category are examined for variations of the auto sales theme. The result is a highly precise listing of auto sales jobs generated with lower user interaction requirements than current search models. Yet, despite demos for several job boards, none have expressed serious interest to date.
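A toy sketch of that general two-step idea (guess a category first, then match theme variations inside it) might look like the following. It is not the Themetics implementation; the keyword tables, job listings, and function names are all hypothetical and hand-made for illustration.

```python
# Hypothetical hand-made tables; a real system would need far richer ones.
CATEGORY_KEYWORDS = {
    "retail sales": {"sales", "salesperson", "retail"},
    "automotive repair": {"mechanic", "repair"},
}
THEME_SYNONYMS = {
    "auto": {"auto", "automobile", "car", "toyota"},
    "sales": {"sales", "salesperson"},
}


def guess_category(query):
    """Step 1: pick the category whose keywords overlap the query most."""
    terms = set(query.lower().split())
    return max(CATEGORY_KEYWORDS, key=lambda c: len(CATEGORY_KEYWORDS[c] & terms))


def search_jobs(query, jobs):
    """Step 2: inside that category, match any variation of each query term."""
    category = guess_category(query)
    terms = query.lower().split()
    matches = []
    for job in jobs:
        if job["category"] != category:
            continue
        words = set(job["title"].lower().split())
        # Every query term must appear, either directly or via a synonym.
        if all(words & THEME_SYNONYMS.get(term, {term}) for term in terms):
            matches.append(job["title"])
    return matches


jobs = [
    {"title": "Toyota salesperson", "category": "retail sales"},
    {"title": "Car sales associate", "category": "retail sales"},
    {"title": "Shoe sales associate", "category": "retail sales"},
    {"title": "Auto mechanic", "category": "automotive repair"},
]
found = search_jobs("auto sales", jobs)
```

Neither matching title contains the literal string "auto sales", which is the case an exact string search would miss.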

So I guess the question is this – is there really a market for improved search?

Bob

I think that it's impossible, for the time being, to develop something so that each individual person gets a single answer tailored to their needs...

What I do think is possible is this:

1) A system based on averages with a time variable

2) Tagging, and the option to browse results by grouped tags

3a) Bookmark service - allow users to easily access their bookmarks and add new bookmarks (a browser plugin?).

3b) Bookmarks can be tagged and categorized automatically if needed (by seeing how other users categorize that URL or similar sites)

3c) General search queries can be based on your bookmarked content. For example, if you have some chess bookmarks, then "Fred Wilson" will bring up the chess player as a first choice.

More, but I am working on them.
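Idea 3c can be sketched in a few lines. This is only an illustration under assumptions: it presumes each result already carries topic tags and that the user's bookmarks have been reduced to a set of tags, and the function name `personalize` and the sample data are hypothetical.

```python
def personalize(results, bookmark_tags):
    """Move results that share topics with the user's bookmarks to the top.

    results: list of (url, set_of_topic_tags) in the engine's original order.
    bookmark_tags: tags gathered from the user's bookmarks.
    """
    interests = set(bookmark_tags)
    # sorted() is stable, so results with equal overlap keep the engine's order.
    return sorted(results, key=lambda result: -len(result[1] & interests))


# Hypothetical results for "Fred Wilson" and a user with chess bookmarks.
results = [
    ("vc-blog", {"vc", "blog"}),
    ("artist", {"art"}),
    ("chess-master", {"chess", "games"}),
]
reranked = personalize(results, {"chess", "openings"})
order = [url for url, _ in reranked]
```

A user with no relevant bookmarks simply sees the engine's original order.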

Interesting, it is true! With search almost a second-nature activity for us every day, it is still probably the most infant of all technologies now. As you say, the core search functionality certainly needs to be tuned. I think tagging and clustering are offshoots of, or workarounds for, this core problem of 'search' algorithms.
By the way, I came across this search engine, MetaCarta (http://www.metacarta.com/), which relates the search to geographic spatials. Well, a search for 'Fred Wilson' did not bring me anywhere close to this blog.

sounds like maybe you're pining for the old search engine "index" function/interface? back in the early days, yahoo, lycos et al. assumed searchers preferred to be directed to their index first and raw search results second. why? because the index had been pre-screened and was organized and displayed by category and sub-category (as best as they could.) so one searched for "fred wilson" and was presented with index category choices: do you want fred wilson: venture capital, fred wilson: artist or fred wilson: chess, etc.?

but yahoo had to employ thousands of people just labelling sites for the index, as did ask jeeves, et al., and that's untenable.

and everybody now has spidering software that, as battelle says, is "good enough."

so maybe the next wave is all about interface design, not better algorithms? if we the public can relax our demand for instant gratification just a hair, maybe the next great search engine still has a great clean empty look like google, but doesn't return any search results at all when you type in "fred wilson" -- instead it has another simple clean screen which offers one or more multiple choice questions to help guide the searcher into the "fred wilson" universe he/she actually wants to rifle (while always allowing the individual to at any point abandon the Q&A trail and simply dive into spidered search results right then)...

steve's posted comment about the use of "index" function interface is spot on.
why not the offer to use it if desired on the first search page for those who need better than "good enough"?

You have a good point about the odd search results. You're number four on MSN search for "internet search algorithm research."

I personally love the idea of an open source search algorithm. I'm a CS student in my freshman year of college, and doing my first research paper on the above search topic. So far, I haven't had a whole lot of luck in my search - I keep getting blogs.
