Ex-Crawler Server Changelog
Wednesday, 12 May 2010 19:22

0.1.6 Alpha (released on 2010-06-10)

# many performance improvements; should now run up to 60% faster
# new and stable link cleaning for clean sites
# rework of the complete database layout
# updates of plugin interfaces, to make it even easier to write plugins
# hundreds of other changes and updates
+ added PDF crawling; new dependencies: Apache PDFBox, FontBox
+ added many PDF functions to read and analyse PDF files
+ added language detection and better encoding detection
+ better mime type detection
+ implemented CrawledObject and HostObject
+ added some Ex-Crawler utils; DBCreate is now working (automatically generates the database, MySQL only at the moment)
+ added Filters, for example to filter out new sites or give them a worse relevancy rating (include/filters)
+ added static Filter tools
+ added new link cleaning and gathering algorithms
+ added optional Auto performance detection (PERFORMANCE_AUTODETECT, default: on)
+ added Commons Daemon; ex-crawler can now run as a Unix daemon or Windows service
+ added daemon service scripts for init.d (etc/init.d) to run it as a service (on Linux / Unix) with jsvc, supporting start | stop | status | restart | force-restart
+ started integrating jadf (jadif.sourceforge.net) into ex-crawler
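
The optional auto performance detection (PERFORMANCE_AUTODETECT) is not described further here. As a rough sketch of how such a detection might derive worker counts from the machine it runs on, the example below uses the JVM's reported core count and maximum heap. The class name, the factor of 4 downloads per core, and the 16 MB per download worker are assumptions chosen for illustration, not values taken from Ex-Crawler.

```java
/** Hypothetical sketch of performance autodetection; not Ex-Crawler's actual code. */
final class PerformanceAutodetectSketch {

    /** Suggested worker counts (field names are illustrative). */
    static final class Suggestion {
        final int downloadWorkers;
        final int webcrawlers;

        Suggestion(int downloadWorkers, int webcrawlers) {
            this.downloadWorkers = downloadWorkers;
            this.webcrawlers = webcrawlers;
        }
    }

    /** Derives worker counts from a core count and a heap limit in bytes. */
    static Suggestion suggest(int cores, long maxHeapBytes) {
        long heapMb = maxHeapBytes / (1024L * 1024L);
        // Downloads are I/O bound, so several per core are assumed to be fine.
        int byCpu = Math.max(1, cores) * 4;
        // Assumption: budget about 16 MB of heap per download worker.
        int byHeap = (int) Math.max(1L, Math.min(heapMb / 16L, Integer.MAX_VALUE));
        int downloads = Math.min(byCpu, byHeap);
        // Parsing is CPU bound, so use at most one webcrawler per core.
        int crawlers = Math.max(1, Math.min(cores, downloads));
        return new Suggestion(downloads, crawlers);
    }

    /** Reads the values from the running JVM. */
    static Suggestion detect() {
        Runtime rt = Runtime.getRuntime();
        return suggest(rt.availableProcessors(), rt.maxMemory());
    }
}
```

With 4 cores and a 256 MB heap this sketch suggests 16 download workers and 4 webcrawlers; a smaller heap lowers both numbers.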

0.1.5 Alpha (released on 2010-05-12)

# Milestone Release: Agent Brown!
# complete rework and reorganisation of Ex-Crawler
# many bugfixes and improvements
# security enhancements
# rework of the database schema (sql)
# rework of Initcrawler
# rework of AnalyzeWebsiteCore into static functions
# rework of AnalyzeHeader into static functions
# rework of UrlParser into static functions
# rework of Server security, file security and sql security
# moved many static methods into new packages / helper classes
# changes on database table hosts
+ new logging system based on log4j including new file logging system
+ added many (!) static helper classes and methods
+ added scheduler for plugins, system tasks and statistics
+ statistic generation (every minute, hourly, daily, weekly..)
+ many new plugin interfaces (including time scheduled ones, on new hosts etc.)
+ added image crawler and analysis
+ added automatic thumbnail generation for the image cache, and much more
+ image information is now saved together with the image alt and title attributes (if they exist)
+ improved and faster robots.txt detection
+ created new package Hosts, including one class with static methods to check for a robots.txt file
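
To illustrate what static robots.txt helpers of this kind can look like, here is a minimal sketch. It is not the code from the Hosts package: the class and method names are made up, and the parser is deliberately simplified (it only honours Disallow prefixes in a "User-agent: *" group and ignores Allow rules and wildcards).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical robots.txt helpers; names and behaviour are illustrative only. */
final class RobotsSketch {

    /** Builds the robots.txt location for the host of the given page URL. */
    static URI robotsUri(String pageUrl) {
        return URI.create(pageUrl).resolve("/robots.txt");
    }

    /** Collects the Disallow prefixes that apply to all agents ("User-agent: *"). */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** Returns true if no wildcard-group Disallow prefix matches the path. */
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would fetch the content at robotsUri(...) once per host, cache it, and call isAllowed(...) before queueing each link.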

0.1.4 Alpha (released on 2010-04-22)

# ThreadWorker rewritten for more performance
# many bugfixes and enhancements, parts of the code completely rewritten
# optimized and improved code for stability
# downloaded images and websites are now saved in tmp first
# after analysis they are moved to their cache directory
# Windows: should now run stably and handle files correctly
# Mac OS: improved performance and file handling
# updated logging levels for finer-grained log tuning (not completely finished)
# updated file directory structure for higher security and speed
+ added image downloader
+ added plugin interfaces:
+ for basic plugins (executed once at program start)
+ for webcrawler plugins (executed for every crawled site, to run your own tasks on it)
+ added basic plugin cleancache (cleans the website and image cache for non-caching websites; inactive by default)
+ added webcrawler plugin letterfrequency (inactive by default)
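
As an illustration of the webcrawler plugin idea, the sketch below shows what a letter-frequency plugin could look like. The WebCrawlerPlugin interface and its onSiteCrawled method are hypothetical stand-ins; Ex-Crawler's real plugin interface may use different names and parameters.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical plugin contract; Ex-Crawler's real interface may differ. */
interface WebCrawlerPlugin {
    /** Called once for every crawled site with its extracted text. */
    void onSiteCrawled(String url, String text);
}

/** Counts how often each letter occurs across all pages it is handed. */
final class LetterFrequencyPlugin implements WebCrawlerPlugin {
    private final Map<Character, Long> counts = new TreeMap<>();

    @Override
    public void onSiteCrawled(String url, String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            if (Character.isLetter(c)) {
                counts.merge(c, 1L, Long::sum); // case-insensitive tally
            }
        }
    }

    /** Returns the running totals, sorted by letter. */
    Map<Character, Long> counts() {
        return counts;
    }
}
```

The crawler core would keep a list of registered plugins and call onSiteCrawled(...) on each of them after a page has been analyzed.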

0.1.3 Alpha (released on 2010-03-28)

# optimized table and data structure
# optimized code for more speed
# you should run ex-crawler with at least 256 MB of heap memory (java -Xmx256m); more is always better
# complete rewrite of the thread system
# separated downloading of websites into its own threads
# separated website crawling into its own threads
# many other changes
* fixed a bug that caused crawler to run out of memory after some time
* many bugfixes
+ added link rating for better crawling (analyzes links and gives them a priority from 0 best to 30 worst)
+ added priority field in table crawllist
+ added config option to set number of Download Workers and Webcrawlers
+ added feature for optional country priority crawling (set your favorite top level domains, they will get a much better link rating)
+ added option to not crawl images
+ added option to not crawl websites (you will have to enter each url it should crawl manually!)
+ added basic distributed crawling features
+ added proxy support

Alpha (released 2010-03-10)

* Windows: fixed bug that caused program not to run on Windows
* fixed some smaller bugs
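
The link rating introduced in 0.1.3 (a priority from 0, best, to 30, worst, with favorite top-level domains rated much better) could look roughly like the sketch below. The class name, the neutral start value of 15, and the individual bonuses and penalties are assumptions made for illustration; they are not taken from the Ex-Crawler source.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical link rater: 0 is the best priority, 30 the worst. */
final class LinkRatingSketch {
    static final int BEST = 0;
    static final int WORST = 30;

    /** Rates a URL; all scoring rules here are illustrative assumptions. */
    static int rate(String url, Set<String> preferredTlds) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return WORST; // unparsable links are crawled last
        }
        String host = uri.getHost();
        if (host == null) {
            return WORST; // e.g. mailto: links or relative references
        }
        int rating = 15; // neutral starting point (assumption)
        String tld = host.substring(host.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        if (preferredTlds.contains(tld)) {
            rating -= 10; // favorite TLDs get a much better rating
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        long depth = path.chars().filter(c -> c == '/').count();
        rating += (int) Math.min(depth, 10L); // deeper pages rank lower
        if (uri.getQuery() != null) {
            rating += 5; // dynamic pages rank lower
        }
        return Math.max(BEST, Math.min(WORST, rating));
    }
}
```

The result would be stored in the priority field of the crawllist table so that the crawler can fetch the lowest numbers first.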

0.1.2 Alpha (released 2010-03-07)

# updated database tables
# switched getLinks to Jsoup parsing (providing better links and more information, such as the link text)
* fixed a bug that caused crawler to run out of memory
* fixed a bug that caused crawler to stop
* fixed many bugs
+ added link information: link text (alt and title) and the text the link is wrapped around
+ added custom http.agent and Referer settings (defaults: ex-crawler and http://www.ex-crawler.de)
+ added option for showing Ex-Crawler version (default on)
+ added new crawler.conf options
+ added new server states for distributed crawling
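
For readers unfamiliar with the http.agent and Referer settings mentioned above: in Java, http.agent is a standard networking system property that sets the default User-Agent, and both headers can also be set per connection. The sketch below shows the per-connection variant. The class and method names are illustrative and are not Ex-Crawler's own; only the default values are taken from this changelog.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper that applies agent and referer settings to a request. */
final class AgentSettingsSketch {
    // Defaults as listed in the 0.1.2 changelog entry.
    static final String DEFAULT_AGENT = "ex-crawler";
    static final String DEFAULT_REFERER = "http://www.ex-crawler.de";

    /** Prepares (but does not yet open) an HTTP connection with both headers set. */
    static HttpURLConnection open(String url, String agent, String referer) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", agent);
            conn.setRequestProperty("Referer", referer);
            return conn; // no network traffic happens until connect() or a read
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Setting the headers per connection keeps the values configurable at runtime; setting the system property instead affects every HTTP request made by the JVM.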

0.1.1 Alpha (released 2010-03-02)

* fixed MySQL connect bug
* fixed a bug that caused ex-crawler to crash on unclean (malformed) content
+ added option for log size in crawler.conf
+ added option for log directory in crawler.conf
+ added saving of the cleanly stripped website text (without HTML etc.) to the database

0.1.0 Alpha (released 2010-03-02)

This was the first Pre-Alpha Release

Last Updated on Thursday, 10 June 2010 09:09