Monday, 19 November 2007
As of 11/15, all indexes are back on a regular update schedule. News is updated daily, podcasts and blogs are updated weekly, and the full crawl of 24,000+ sites is now updated incrementally on a continuous basis.
The emphasis in the search index continues to be on deeply searching sites that are useful to software developers.
In the queue for the next round of updates are better support for podcast and blog search and performance improvements to the results clustering server.
You can access the search site directly HERE. If you want to bypass the Carrot2 Clustering Engine, go HERE.
Last Updated ( Wednesday, 02 January 2008 )
Saturday, 17 November 2007
A few months ago I did a fairly positive review of Windows Vista Ultimate, but after more than 9 months of use I’ve reconsidered. At the moment I’m thinking the apex of Microsoft’s desktop operating systems was probably Windows XP SP2. Here are a couple of my observations:
Performance
Because we do practically daily builds of our software, I run VMware so that I can install and look at the latest features and functionality and, when required, do some smoke testing. I recently installed an XP virtual machine hosted on Vista and have been using it not only for our builds, but also as the host for apps like Vongo, iTunes and Rhapsody that suffer from poor performance on Vista. This is obviously not a scientific test, but the XP virtual machine is fast and responsive, and it isn’t plagued by the performance glitches I’ve seen on Vista.
Stability
I mysteriously got a blue screen error when booting Vista this week after applying a software patch. I was unable to get the OS to boot in any mode, and after some tinkering discovered the registry was corrupted. In particular, the entries for keyboard support were damaged, and after playing with the system for an hour I gave up trying to get it running.
I went out and bought a new (larger) drive, did a fresh install, and connected the old drive to the machine via an external enclosure to copy my data files back onto the machine. The fresh install performs better than the one that had months of detritus on it, but I still get mysterious hangs when browsing between folders, and the performance of Outlook 2007 is really appalling.
Application Compatibility
Spotty. It’s hard to attribute this to Vista or to the applications themselves, but even with Excel and Access I’ve had catastrophic, data-losing failures.
Aesthetics
When you get right down to it, not much better than XP: certainly not gorgeous, and not to die for.
So my revised recommendation for Vista users with the RAM and disk space to spare is: go out and spring for a copy of VMware Workstation 6.2 (or your VM of choice) and install a guest instance of Windows XP. While you’re at it, maybe add a Linux distribution. That way, when you start yearning for the good ole days of running Windows XP, you can jump right into it again without doing a complete OS install. And when SP1 for Vista comes out, you can upgrade and toggle right back.
I’ve harped on the negatives, and Vista does have improvements. Too bad that, like many things Microsoft, it shows promise but also feels 85% complete. For the first time since 1987, I find myself recommending Macs to users without large legacy investments in Microsoft software.
Last Updated ( Wednesday, 02 January 2008 )
Monday, 15 October 2007
As I was building FindITAnswers, three software tools were critical to managing my spider indexes. Where spider exclusion rules act as a first line of defense for maintaining the quality of the index, a few simple utilities on the back-end are also immensely valuable:
Merge Utility: Merges multiple indexes into one. This was an invaluable utility since FIA’s spider crawl was divided into 125 index segments. The indexes were organized around key platform vendor sites, and sites with similarly structured content. Using multiple indexes has lots of obvious advantages, including:
- multiple crawls can be executed simultaneously;
- finer-grained path exclusion rules can be applied at the spider stage;
- post-crawl filtering rules can be applied to small sets of data — helping to increase its relevance;
- a catastrophic failure of one crawl segment impacts a small set of data.
This was the Lucene utility I developed and, while inelegantly coded due to my superficial Java knowledge, it works as anticipated.
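To make the idea concrete, here is a minimal sketch of what merging index segments amounts to. It is written in Python against plain in-memory lists rather than Lucene, so it is not the utility described above; the `merge_segments` function, the segment names, and the `url`, `title` and `segment` fields are all hypothetical.

```python
# Illustrative only: each "index segment" is a list of dicts, not a Lucene index.
# Function and field names are assumptions made for this sketch.

def merge_segments(segments):
    """Combine named index segments into one list, tagging each record
    with the segment it came from so later filtering can be targeted."""
    merged = []
    for name, docs in segments.items():
        for doc in docs:
            # Copy the record so the source segment is left untouched.
            merged.append({**doc, "segment": name})
    return merged


segments = {
    "vendor-a": [{"url": "http://a.example/docs", "title": "Docs"}],
    "forums": [
        {"url": "http://f.example/t/1", "title": "Thread 1"},
        {"url": "http://f.example/t/2", "title": "Thread 2"},
    ],
}
merged = merge_segments(segments)
print(len(merged), "records in merged index")
```

Tagging each record with its source segment is a design choice of the sketch; it keeps the per-segment organization visible after the merge.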
Kelvin Tan developed two small utilities for me that are also critical to increasing the relevance of search results. When you don’t have a team of astrophysicists building your search algorithms, tools that improve the quality of your indexes can help your search engine a lot:
De-duplication Utility: An unavoidable byproduct of using multiple spider crawls and index segments is duplication of some pages. Rather than checking for and suppressing dupes at search time, this simple utility looks at a merged index and deletes any duplicated pages.
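As a rough illustration of index-time de-duplication, the sketch below drops repeated pages from a merged list of records. How the real utility decides that two pages are duplicates is not stated here; comparing normalized URLs is an assumption of this example, and `normalize_url` and `deduplicate` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: two records are duplicates when their normalized URLs match.
# The actual utility may compare something else (e.g. page content).

def normalize_url(url):
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def deduplicate(merged):
    """Keep the first record seen for each normalized URL."""
    seen = set()
    unique = []
    for doc in merged:
        key = normalize_url(doc["url"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(doc)
    return unique


merged = [
    {"url": "http://Example.com/faq/", "title": "FAQ"},
    {"url": "http://example.com/faq#top", "title": "FAQ"},
    {"url": "http://example.com/download", "title": "Download"},
]
unique = deduplicate(merged)
print(len(merged) - len(unique), "duplicate(s) removed")
```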
Ad-Hoc Deletion Utility: This tool allows deletion of index records based on keywords, terms, wildcards and regular expressions, and allows for searching specific index fields. This is great for scrubbing pages that pollute search results, and for catching anything that got through the initial spider exclusion filters.
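The three matching styles can be sketched in a few lines. This is not the tool itself: it works on a plain Python list, and the `delete_matching` function, its `mode` argument, and the field names are assumptions made for illustration.

```python
import fnmatch
import re

# Hypothetical helper: builds a predicate for one of three matching modes.
def make_matcher(pattern, mode="keyword"):
    if mode == "keyword":
        needle = pattern.lower()
        return lambda text: needle in text.lower()
    if mode == "wildcard":
        # fnmatch.translate turns a shell-style wildcard into a regex.
        regex = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
        return lambda text: regex.match(text) is not None
    if mode == "regex":
        regex = re.compile(pattern)
        return lambda text: regex.search(text) is not None
    raise ValueError("unknown mode: " + mode)


def delete_matching(index, field, pattern, mode="keyword"):
    """Return (remaining records, number deleted) after removing every
    record whose given field matches the pattern."""
    matches = make_matcher(pattern, mode)
    kept = [doc for doc in index if not matches(doc.get(field, ""))]
    return kept, len(index) - len(kept)


index = [
    {"url": "http://example.com/login.jsp", "title": "Login"},
    {"url": "http://example.com/kb/42", "title": "Fixing a Tomcat startup error"},
    {"url": "http://example.com/kb/43?print=1", "title": "Printer friendly"},
]
index, by_wildcard = delete_matching(index, "url", "*print=*", mode="wildcard")
index, by_regex = delete_matching(index, "title", r"(?i)^login$", mode="regex")
print(by_wildcard + by_regex, "record(s) deleted")
```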
Combining the simple utilities above with a good database of URLs to crawl and well-planned spider exclusions can vastly improve the results your search engine delivers by feeding it higher-quality indexes.
In Part 4 of this series, I’ll discuss clustering search results and my experience having the Lingo3G Document Clustering Engine integrated with Lucene.
Last Updated ( Monday, 15 October 2007 )
Saturday, 18 August 2007
There’s a lot to like in Office 2007, but the learning curve for the new UI is steep. If you’ve been a casual user of the apps you can probably quickly find the few features you’ve become accustomed to using, but if you live in the apps (like I do) prepare yourself for days of hunting for the new locations of your most used features.
Over time, the new ribbon bars do become handy time savers, but in the meantime prepare yourself for a big hit to your productivity. After a few months, I still find myself periodically going into brain lock trying to remember the location of a simple menu item or button.
There is hope. I recently installed Classic Menus for Office from www.addintools.com. Essentially this $19.95 utility gives you back your Office 2003 menu system.
I’ve found this tool to be a handy timesaver, and certainly worth the price. The utility adds a new Office menu called "Menus."
Selecting "Menus" will give you a ribbon bar containing Office 2003 style menus and menu bars. If you prefer to work with the Office 2007 ribbons, they are still available to you in their standard locations.
There are only three drawbacks worth mentioning: 1) Load time for your Office apps will increase, on my machine by 2-3 seconds. 2) While this app will certainly ease the hits to your productivity during the first few weeks of upgrading to Office 2007, in the long run it might actually keep you from discovering some of the suite’s nifty new features if you never explore the new UI. 3) I can’t pin this entirely on Classic Menus for Office since I downloaded Microsoft’s Vista patches this week, but since installing the patches I have noticed some strange UI behaviors that seem to be related to patch/menu interactions.
If time is money to you, this is a pretty inexpensive solution to managing your migration to the new Office 2007 user experience.
Last Updated ( Sunday, 14 October 2007 )
Wednesday, 25 July 2007
Just when you’ve had about enough of every new-fangled website staking a claim to Web 2.0, you need to contend with Web 3.0.
The Wikipedia post on Web 3.0 talks about the “decomposition” of websites into discrete widgets, and the “data Web”, and the semantic web, but the far more interesting aspect of this is the seamless integration of Web and desktop computing experiences — and it’s happening right under our noses today. Perhaps the best examples I’ve seen are the desktop clients for Rhapsody and iTunes.
Both are desktop apps that are essentially wrappers around web-based content — and in case you’re wondering where the Web 3.0 is here — it’s in the enormous databases of music metadata, customer reviews, playlists, etc.
Both companies have done an outstanding job of creating user experiences that allow people to move seamlessly between local content and content in the cloud. While both apps have their flaws, they do an outstanding job of allowing users to navigate huge amounts of data to find the media they want. Consider for a minute how much easier it is to navigate terabytes of data on Rhapsody to find the music you’re looking for than it is to find out what you might want to watch on TV at any point in time.
Looking into the future, they both have an opportunity to build client platforms that expand beyond music (spoken word, and books) into other digital media — and ultimately create user interfaces for affinity groups and social networks. Music is a fascinating starting point for such an expansion since it’s extremely viral — and evocative in ways that movies and books are not. Music is intensely social, and Internet users have shown a proclivity for sharing it.
Rhapsody in particular has caused me to go out and buy lots of music from artists I probably never would have heard of had they not been recommended by Rhapsody’s profiler (two artists in heavy rotation in our house now are Keller Williams and Mike Doughty both via Rhapsody suggestions).
Desktop software isn’t dead, but Rhapsody and iTunes are great examples of the paradigm shift ISVs need to embrace to survive in a Web 3.0 world.
Last Updated ( Sunday, 14 October 2007 )
Written by admin
Saturday, 14 July 2007
We recently moved out of our Bay Area ranch, a house in which it was pretty easy to set up both wired and wireless networks, into a bigger house with lots of hard-to-wire rooms.
On a whim I bought a few Powerline Netgear XE102 Wall Plugged Ethernet Bridges. In the past it seemed unnatural to plug an ethernet cable into such a small device attached directly to an AC outlet, but I finally broke down and started testing this in our house. So far so good.
Powerline devices use your home’s electrical wiring to transmit data.
I have to admit I half expected my laptop to go up in smoke the first time I plugged it directly into one of these devices.
Set-up was a snap. One XE102 is plugged into an AC outlet and connected to an ethernet switch in my office, with the other devices plugged in throughout the house with a mix of PCs and wireless access points connected to them.
So far the devices have performed flawlessly, and have allowed me to forgo adding wired ethernet drops throughout our house.
Some reviews have noted that data transfer rates are low on these devices, but the rate is higher than the throughput I’m getting from our cable internet provider.
Last Updated ( Wednesday, 25 July 2007 )
Thursday, 05 July 2007
In addition to the software-based factors influencing search like the quality of the indexing and retrieval algorithms, vertical search has an advantage over broad-based search engines because you as the administrator can constrain the content you crawl — and thus use human QA to make up for the deficiencies in purely algorithmic search. If you weed out irrelevant content, you can go a long way towards improving the quality of results. Some of the factors that you can use to influence the quality of your index:
- Sites you crawl
- Paths you include and exclude
- Pages you include and exclude
- Utilities to arbitrarily delete documents from the index based on pattern matching
Let’s look at each of these.
Sites You Crawl
I developed my own seed database starting with a few thousand sites related to software development, IT and the environment. You have a number of options for spiders to build your crawl database. I ran my spider letting it do 5 hops from each initial URL. After the crawl was done, I took the crawl logs for the URLs in the first 5 hops and used them as the basis for a second crawl — also 5 hops deep.
The second crawl became the basis for the FIA database that has subsequently been enhanced with other sites added manually.
Every two or three months the spider database is updated by using the crawl logs to add the URLs found in hops 2-5 as new root URLs.
In practice each iteration of the crawl produces deeper results since the crawls are starting at progressively deeper root URLs. In each of these crawls the spider is allowed to harvest pages on sites external to the root URLs.
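The hop-limited crawl and the reseeding step can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: there is no real fetching, the link graph is a toy dictionary, and `crawl`, `next_seeds` and `LINKS` are hypothetical names rather than part of any spider mentioned here.

```python
from collections import deque

def crawl(seeds, get_links, max_hops=5):
    """Breadth-first crawl that records the hop count at which each URL
    was first reached and stops expanding pages at max_hops."""
    hops = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # at the hop limit: keep the page, do not follow its links
        for link in get_links(url):
            if link not in hops:
                hops[link] = hops[url] + 1
                queue.append(link)
    return hops


def next_seeds(hops, lo=2, hi=5):
    """Pick the URLs found between hops lo and hi as roots for the next crawl."""
    return sorted(url for url, depth in hops.items() if lo <= depth <= hi)


# Toy link graph standing in for real pages (an assumption of this sketch).
LINKS = {
    "http://seed.example/": ["http://seed.example/a", "http://other.example/"],
    "http://seed.example/a": ["http://seed.example/a/b"],
    "http://seed.example/a/b": ["http://seed.example/a/b/c"],
}
hops = crawl(["http://seed.example/"], lambda url: LINKS.get(url, []))
print(next_seeds(hops))
```

Because each round starts from URLs that were already two or more hops deep, successive crawls reach progressively deeper pages, which is the effect described above.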
Paths and Pages to Include and Exclude
As you analyze the results of your spider crawl it will become obvious fairly quickly which sites, paths and documents you’ll want to exclude from your crawl database and your indexes.
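A rough sketch of how site, path and page exclusions of this kind might be expressed follows. The rule lists are illustrative, the domain names are assumed spellings of commonly linked-to sites, and `is_excluded` is a hypothetical helper; a real search platform will have its own rule syntax.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Illustrative rule sets; a production list would be much longer.
EXCLUDED_HOSTS = {"digg.com", "technorati.com"}
EXCLUDED_PAGES = ["login*", "privacy*", "aboutus*"]  # matched against the page name
EXCLUDED_URLS = ["*print=*"]                         # matched against the whole URL


def is_excluded(url):
    """Return True when a URL should be skipped by the spider."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host in EXCLUDED_HOSTS:
        return True
    # The page name is the last path segment, e.g. "login.aspx".
    page = parts.path.rsplit("/", 1)[-1].lower()
    if any(fnmatchcase(page, pattern) for pattern in EXCLUDED_PAGES):
        return True
    return any(fnmatchcase(url.lower(), pattern) for pattern in EXCLUDED_URLS)


candidates = [
    "http://www.digg.com/story/1",
    "http://example.com/account/login.aspx",
    "http://example.com/article?id=7&print=1",
    "http://example.com/kb/tomcat-tuning",
]
to_crawl = [url for url in candidates if not is_excluded(url)]
print(to_crawl)
```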
Virtually every search platform has the capability to create rules excluding certain sites or documents. You’ll want to exclude commonly linked-to sites like digg, technorati, NY Times, Yahoo, etc. You’ll also want to add rules to your search engine spider to ignore prevalent documents like login*, privacy*, aboutus*, *print=*, etc. In practice this will become a long list, and it is one of the keys to increasing the quality of the results you return for queries. You’ll also want filter rules that exclude gambling, porn, hotels, travel and other common search engine spam.
Utilities to Arbitrarily Delete Indexed Documents
You’ll find that despite a good database of crawling rules, you still get undesirable results in your index. I had a tool developed that allows for SQL-style select queries against a Lucene index and allows deletions based on pattern matching and regular expressions. This is a handy way to delete docs that slip through the spider filters, or sites that use overly aggressive SEO. You’ll also probably want to filter sites with poorly constructed pages that, for example, use the same title on every page (alternatively, you could use document heading tags rather than the meta title as the basis for your index; more on that in a future post).
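One way to spot the same-title-on-every-page problem is to group indexed records by host and flag hosts whose pages all share a single title. The sketch below assumes records are plain dicts with `url` and `title` fields; `hosts_with_repeated_titles` and the `min_pages` threshold are hypothetical choices, not features of the tool mentioned above.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def hosts_with_repeated_titles(index, min_pages=3):
    """Return hosts that have at least min_pages indexed pages,
    all carrying the same title (compared case-insensitively)."""
    titles_by_host = defaultdict(list)
    for doc in index:
        host = urlsplit(doc["url"]).netloc.lower()
        titles_by_host[host].append(doc["title"].strip().lower())
    return sorted(
        host
        for host, titles in titles_by_host.items()
        if len(titles) >= min_pages and len(set(titles)) == 1
    )


index = [
    {"url": "http://bad.example/p/1", "title": "Welcome"},
    {"url": "http://bad.example/p/2", "title": "Welcome"},
    {"url": "http://bad.example/p/3", "title": "welcome "},
    {"url": "http://good.example/a", "title": "Installing Tomcat"},
    {"url": "http://good.example/b", "title": "Tuning Lucene"},
    {"url": "http://good.example/c", "title": "Crawling basics"},
]
print(hosts_with_repeated_titles(index))
```

The flagged hosts could then be reviewed by hand or removed from the index with a deletion tool.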
In the next post, I’ll give an overview of how you give users access to the index, and present results.
Last Updated ( Wednesday, 25 July 2007 )
Thursday, 05 July 2007
For the past few months I’ve been experimenting with information discovery and vertical search. Despite the power of Google and other search engines, it’s still much too difficult to find relevant information. This may change as Google and others begin using profiling information to enhance search results, but I wanted to test a less high-tech approach to increasing relevancy. Primarily, I wanted to test the extent to which controlling context and influencing user expectations would encourage the use of more relevant search terms and yield more relevant results.
For example, when you go to Google, you pretty much assume they’ve crawled every site on the web. This influences the way you form your search queries, and your expectations about the results you’ll get back. I wanted to test the extent to which calling a site a “search engine for developers and IT Pros” would influence how users formed their queries and what they expected from the results. In other words, could I provide a “good enough” search platform and rely on users to enhance their own search experience by structuring context and expectation, creating a superior user experience? Initial results were promising, and hopefully I’ll have a chance someday to take this to the next level. In the meantime, here’s what I learned.
There were a couple of initial ground-rules that were set for the site that eventually became FindITAnswers.com:
- The site had to be almost completely turnkey to maintain after initial development
- Search on the primary topic covered, software development, had to produce consistently superior results
- The entire infrastructure had to be built for less than $15,000 in hardware and software
- TTL for the site had to be less than 60 days from kickoff
I evaluated a range of search solutions both commercial and open source, and fairly quickly settled on Lucene as the search engine of choice. Autonomy would have been an excellent commercial solution for the type of app I was building, and Thunderstone’s Webinator was a close number two, but both were expensive, and Thunderstone’s license forbids its use as a search portal. I’ll talk about Autonomy in more detail in a future post.
I almost immediately ruled out developing a Lucene-based platform myself since there were several promising projects available, including Nutch and Solr. (Solr was open-sourced after the project kicked off, so Nutch was the only platform evaluated.) In the fall of 2006, Nutch lacked some of the features and functionality FIA required “out of the box.” That said, it’s a very powerful platform, and given more developer resources it would have been a good choice. A port to either Nutch or Solr is on the FIA roadmap.
Searchblox turned out to be the best choice for my proof of concept. It’s built on Lucene, and many of the tools, utilities and extensions I would have built are part of the product. There are some limitations to Searchblox’s scalability as a search portal — but it’s really not intended to be used this way (I’ll discuss in a future post). As an Intranet or enterprise search solution, it’s totally worthy of consideration.
For hardware, I had Central Computer in San Francisco build an AMD64 X2-based server with 8 gigs of RAM and a pair of RAID drives. The initial set-up was on Windows 2003 Server running Apache and Tomcat. I did have a Linux-based server running for evaluation, but didn’t see any performance edge, and the Windows server was easier to administer. As of June 2007, the site is running on a Unix system because I found an ISP, eapps, that offered attractive Tomcat hosting plans. The original server is now a mirror.
In my next post, I’ll talk a bit about how FIA was architected.
Last Updated ( Wednesday, 25 July 2007 )