New Robot Grabs Large Internet Stores: Tokenizer


Tokenizer is a new experimental web crawler and data mining engine. As initially planned, Tokenizer should provide real-time financial indicators for different economies.

Tokenizer is a new data mining search engine that crawls large Internet stores worldwide. Tokenizer is a robot: it will never accept RSS feeds from merchants or publish consumer reviews.

The idea of financial tools that help find and estimate heuristics of different economies was born in the summer of 2005 at the Red Piano Club (Toronto, ON), during an annual meeting of Moscow University alumni.

In September 2006, Tokenizer Inc. started an experimental crawler under the agent name Bambarbia. A free public service became available in January 2007, after heuristic analysis and generic detection of some Canadian online shops. Tokenizer currently crawls over 120 large Internet stores in Canada and the USA.

New domains will be added over the next few weeks. Within a few months, Tokenizer will finish its initial analysis of big Internet stores in Europe, America, New Zealand, the UK, and Australia.

Tokenizer is specifically designed for data-driven websites, including (but not limited to) big Internet stores selling computers, electronics, software, books, etc. It does not follow external links, publish consumer reviews, or calculate shipping and taxes. It uses specific statistics to mine Product Name, Category, Manufacturer, Price, and other related information from large e-commerce sites. It cannot mine small sites with static content.
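The "does not follow external links" rule can be illustrated with a minimal Python sketch. This is not Tokenizer's actual code: the function name `internal_links`, the example hosts, and the exact-host comparison are all assumptions made for illustration, and a real crawler might treat subdomains differently.

```python
from urllib.parse import urljoin, urlparse


def internal_links(page_url, hrefs):
    """Return absolute URLs from hrefs that stay on the host of page_url.

    Hypothetical helper: exact host matching is an assumption, not a
    documented Tokenizer rule.
    """
    base_host = urlparse(page_url).hostname
    kept = []
    for href in hrefs:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        # Keep only web links whose host matches the crawled store.
        if parts.scheme in ("http", "https") and parts.hostname == base_host:
            kept.append(absolute)
    return kept
```

For example, given a page at http://shop.example/cat/ with the links "item?id=1", "/about", and "http://other.example/x", only the first two would be kept, resolved to absolute URLs on shop.example.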

Currently, only price comparison services, including extremely fast search, are provided to the public. Site redirection and other kinds of affiliate tracking are not used, which greatly simplifies the user experience: you can go directly to the merchant's product page from a Tokenizer search results page.

The User-Agent signature of the robot is: Tokenizer/1.1.9 (Price Comparison Engine ...). Tokenizer honors robots.txt conventions. The core of the search engine is powered by Apache Lucene, and more than a hundred other excellent open source frameworks power the website and the crawler. The new rich client will be based on Adobe Flex, with the Adobe LiveCycle Data Services platform running on Apache Tomcat under SuSE ES.
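For site owners, the way a robots.txt file applies to a crawler identifying itself as Tokenizer can be sketched with Python's standard urllib.robotparser module. The robots.txt content, the paths, and the helper names below are hypothetical; only the product token "Tokenizer" comes from the signature above, and this sketch does not claim to reproduce how Tokenizer itself evaluates the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, for illustration only: Tokenizer may crawl
# everything except /checkout/, while all other robots are blocked.
SAMPLE_ROBOTS_TXT = """\
User-agent: Tokenizer
Disallow: /checkout/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url, agent="Tokenizer"):
    """Return True if the given agent is allowed to fetch the URL."""
    return parser.can_fetch(agent, url)
```

With this sample file, a product page is allowed for Tokenizer, a URL under /checkout/ is refused, and any other robot is refused everywhere.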

Multilanguage support is built in, and human intervention is minimal. Tokenizer is still a one-man, part-time hobby, so please be patient.

Tokenizer is unique: it is the only crawler-based service in the price comparison arena, and it does not depend on merchants' RSS feeds, countries, languages, or networks.

About Tokenizer Inc.:
Tokenizer Inc. is a private Canadian company providing independent consulting services in software engineering and project management for major banks.

Fuad Efendi, director
Tokenizer Inc.

# # #
