Stand still and watch the patterns, which by pure chance have been generated: Stains on the wall, or the ashes in a fireplace or the clouds in the sky, or the gravel on the beach, or other things. If you look at them carefully you might discover miraculous inventions. (Leonardo da Vinci)
 

2 Examples of how OpenSource could improve your overall dev-team performance

February 23rd, 2009 | Development, Open Source

A special, complex, but independent functional subsystem can be built in two ways:

  1. Develop it yourself
  2. Use OpenSource frameworks, APIs, and SDKs

Of course, the choice depends on the type of software you want to create, the domain, the company you’re working for, and the basic conditions you’re facing. In my experience, however, the usual situation is this: a development team is confronted with a project that is too large for it to accomplish. The common reasons are a lack of manpower, time, and knowledge. In some cases the team will reject the project completely, in other cases it will recommend a lightweight version that covers only a subset of the requested features, and in a few cases it will simply begin to develop a solution on its own and then flatline under the weight of the requirements.

In most cases, the easiest way to solve complex but well-known problems with limited resources is to use OpenSource frameworks or APIs. There are solutions for almost every task one could face in programming. Using these, the team can focus on the important things and put first things first. I want to show two examples of how an OpenSource product helped me and my team to develop a solution which we couldn’t have accomplished on our own with the given resources (manpower, time).

Example 1: Implementing a flexible, high performance Enterprise Database Search for an existing PHP/MySQL System.

The problem: The client had a very large MySQL database which was very slow and unstructured: a composite of about 200 tables with up to 18 million records, each table having 50 or more fields. The tables were not normalized, nor were the fields optimized with proper field types and lengths. To speed up search requests, two hash values had been added, which allowed users to find information very fast as long as they were just looking for standard values. More sophisticated search requests, table analytics, and fuzzy searches were simply not usable: they needed about 20 to 30 minutes to finish and paralyzed the database server for other processes. This situation was not acceptable with regard to the future of the overall system. A complete redesign of the database was not possible as a first step because money and manpower were limited. We had to find a neat solution which was cheap but powerful, which could be integrated seamlessly into the PHP-based system, and which could handle the complexity of the data itself.

The solution: Read queries were to be separated from the master database. At the same time, fuzzy searches (e.g. phonetic ones) were to be enabled. Ideally, those fuzzy read queries would deliver results directly without disturbing the rest of the database. After several brainstorming sessions, weeks of thinking and planning, the rejection of other sound but unusable alternatives, and a lot of coffee, we discovered the solution:

Apache Lucene Solr


Apache Solr. This is an enterprise search server based on Lucene. We achieved all objectives by assigning two developers for three weeks to integrate Solr into our systems. Solr provides access to a structured, highly configurable full-text index over the standard HTTP protocol. It was not our job to implement all those funky search algorithms and index structures, but to design a proper index schema which matches both the complexity of our data and the flexibility of the search queries we wanted to have. We decided to use Solr as a tier right above the database which stores just the index and the id of each result record, but no data. Updates are handled by triggers within the MySQL database. Whenever something is updated or deleted in one of the database tables, a trigger writes to a special update table. This table is read frequently by a batch job which transfers the database updates to the index.
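The trigger and batch job idea can be sketched roughly as follows. This is a minimal illustration and not our production code: SQLite stands in for MySQL so that the sketch runs anywhere, all table, column, and trigger names are hypothetical, and the payloads follow the JSON update format of present-day Solr versions. The `send` callable is an assumption too; in practice it would POST each payload to the update handler of the Solr core.

```python
import json
import sqlite3

# SQLite stands in for MySQL here; every table, column, and trigger name
# below is hypothetical and only serves the illustration.
SCHEMA = """
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE index_queue (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    record_id INTEGER NOT NULL,
    action TEXT NOT NULL            -- 'upsert' or 'delete'
);
CREATE TRIGGER customer_ai AFTER INSERT ON customer BEGIN
    INSERT INTO index_queue (record_id, action) VALUES (NEW.id, 'upsert');
END;
CREATE TRIGGER customer_au AFTER UPDATE ON customer BEGIN
    INSERT INTO index_queue (record_id, action) VALUES (NEW.id, 'upsert');
END;
CREATE TRIGGER customer_ad AFTER DELETE ON customer BEGIN
    INSERT INTO index_queue (record_id, action) VALUES (OLD.id, 'delete');
END;
"""


def drain_queue(conn, send):
    """Transfer queued changes to the index and return the number of queue rows handled.

    `send` is a callable that receives one JSON payload (a string) per call.
    """
    rows = conn.execute(
        "SELECT seq, record_id, action FROM index_queue ORDER BY seq"
    ).fetchall()
    if not rows:
        return 0

    # Only the most recent action per record matters.
    latest = {}
    for _seq, record_id, action in rows:
        latest[record_id] = action

    upserts, deletes = [], []
    for record_id, action in latest.items():
        if action == "delete":
            deletes.append(str(record_id))
            continue
        row = conn.execute(
            "SELECT id, name, city FROM customer WHERE id = ?", (record_id,)
        ).fetchone()
        if row is not None:
            # The field values are sent for indexing; the index itself
            # only needs to return the id.
            upserts.append({"id": str(row[0]), "name": row[1], "city": row[2]})

    if upserts:
        send(json.dumps(upserts))
    if deletes:
        send(json.dumps({"delete": deletes}))

    # Remove exactly the rows that were read, not rows queued in the meantime.
    conn.execute("DELETE FROM index_queue WHERE seq <= ?", (rows[-1][0],))
    conn.commit()
    return len(rows)
```

Collapsing the queue to the latest action per record keeps the batch small when the same record changes several times between two runs.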

In the existing system we replaced all read queries with a lightweight search API which encapsulates a two-step retrieval: 1. a search query against Solr, 2. a database query with the result set (of ids). A search query which used to take around 20 minutes now needs no more than 0.1 seconds. We can create complex analyses and react flexibly to special search requests from our client which had been rejected until now “due to technical limitations”.
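A minimal sketch of such a two-step retrieval might look like this. It is written in Python rather than PHP, and the table, column, and function names are hypothetical. The query parameters `q`, `fl`, `rows`, and `wt` and the `response.docs` structure are those of Solr's standard select handler with JSON output. The `fetch` callable is an assumption: it takes a URL and returns the response body, so any HTTP client can be plugged in.

```python
import json
import sqlite3
from urllib.parse import urlencode


def build_solr_url(base_url, query, rows=100):
    """Step 1a: build a select URL that asks Solr for matching ids only."""
    params = {"q": query, "fl": "id", "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def ids_from_response(body):
    """Step 1b: pull the ids out of Solr's JSON response, in ranking order."""
    docs = json.loads(body)["response"]["docs"]
    return [doc["id"] for doc in docs]


def search(conn, fetch, base_url, query):
    """Run the Solr query, then load the full records from the database."""
    ids = ids_from_response(fetch(build_solr_url(base_url, query)))
    if not ids:
        return []

    # Step 2: one database query for all ids, using bound parameters.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name, city FROM customer WHERE id IN ({placeholders})",
        ids,
    ).fetchall()

    # Restore Solr's ranking, which the IN query does not preserve.
    by_id = {str(row[0]): row for row in rows}
    return [by_id[i] for i in ids if i in by_id]
```

Ids that Solr returns but the database no longer knows are dropped silently, which tolerates an index that lags slightly behind the database.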

One of the most important advantages of Solr is its independence. The system which was meant to use the new search was written in PHP, but there was no proper search solution for PHP itself. With Solr we could simply use a standard HTTP client to send requests to the Solr server (which runs on Tomcat).

Today we use the Solr index for several databases and in three different environments: integrated in PHP, directly in Java, and through XSLT as an HTML-based web search form.

Example 2: Implementation of a text analysis system to extract and transport structured data out of unstructured text to a relational database.

The problem: A client uses specific data to back up process-critical decisions. This data is embedded in texts and therefore cannot be processed automatically. The manual effort to structure the data of interest is terribly expensive, but implementing a text information retrieval system which could automate this task is simply too expensive, in terms of time and money, for a team of 10 developers. A simple, lightweight solution is almost impossible to imagine because of the high complexity of the data.

Gate - General Architecture for Text Engineering Applications

The solution: The University of Sheffield develops an OpenSource system which is perfectly suited to solving the problem: GATE. This is a framework to read and process textual data. In addition to a basic processing framework, GATE includes a bunch of plugins covering several capabilities from different domains. The most important plugins are consolidated as ANNIE, which stands for “A Nearly-New Information Extraction System”. Basically, GATE consists of Language Resources (LR) and Processing Resources (PR). The latter are orchestrated in pipelines and used to process language resources, e.g. documents or corpora. Processing in this context means that the contents are annotated throughout the process.

Our task especially required the use of two ANNIE processing resources: the JAPE Transducer and the Gazetteer. The Gazetteer uses several lookup tables to apply annotations for named entities. For this purpose we built a bunch of general and domain-specific tables: first names, last names, cities, zip codes, street names, legal forms, key words, top-level domains, etc. The JAPE Transducer in turn uses these annotations to identify patterns of higher-level, qualified information. Patterns are described in the JAPE language, which is based on regular expressions but applied to annotations and their features (properties).
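To illustrate the principle only, here is a toy sketch in Python. It does not use GATE, ANNIE, or the JAPE language; the lookup lists, the `Annotation` class, and both functions are made up for this example. The first step annotates tokens from lookup tables like a gazetteer would, the second step matches a pattern over those annotations instead of over the raw characters.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the annotated span begins
    end: int     # character offset just past the annotated span
    kind: str    # annotation type, e.g. "FirstName"
    text: str    # the covered text


# Hypothetical lookup tables; real ones would hold thousands of entries.
GAZETTEER = {
    "FirstName": {"Anna", "Peter"},
    "LastName": {"Schmidt", "Meyer"},
    "City": {"Berlin", "Sheffield"},
}


def lookup(text):
    """Gazetteer step: annotate every token that appears in a lookup table."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        for kind, entries in GAZETTEER.items():
            if match.group() in entries:
                annotations.append(
                    Annotation(match.start(), match.end(), kind, match.group())
                )
    return annotations


def find_persons(text, annotations):
    """Pattern step: a FirstName directly followed by a LastName is a Person."""
    ordered = sorted(annotations, key=lambda a: a.start)
    persons = []
    for first, last in zip(ordered, ordered[1:]):
        gap = text[first.end:last.start]
        if first.kind == "FirstName" and last.kind == "LastName" and gap.isspace():
            persons.append(
                Annotation(first.start, last.end, "Person", text[first.start:last.end])
            )
    return persons
```

For the sentence "Anna Schmidt moved from Berlin to Sheffield." the lookup step yields four annotations, and the pattern step combines the first two into one Person annotation covering "Anna Schmidt".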

The information identified by the JAPE Transducer is analyzed, structured, and normalized at the end of the process to prepare the transfer to the relational database. Our result: The system reads, processes, and stores about 5000 documents in 20 minutes, or roughly four documents per second. By adding a computer-aided manual process for all documents with ambiguous information, we reached a rate of almost 100 % and a quality that is much higher than that of the former manual reading of the documents.


My first entry

February 18th, 2009 | General

This is my first entry to this new blog at bastian-buch.de.

Just to introduce the wider topic and sharpen your brains, I suggest watching this video. It is a presentation by Scott Berkun at Google. He is the author of the book “The Myths of Innovation”, which is a nice piece of work.