 |
Title |
Original Content |
Ness Comment |
|
Introduction |
This document is intended to serve as a general `collection pool' for information about the research project by Steven O. Kimbrough and David Ness. The general subject of the work is text data bases and their relation to documents, information systems and the Web. |
Only DNN `comments' are being published at the moment because there are only DNN comments in the data base. However, I expect to expand this to experiment with other forms when more comments become available. |
|
Things about the Problem |
Corresponding with the precipitous decline in the cost of storing information is huge growth in the information being stored. This is no surprise. As the cost of storing information drops below the $1/gigabyte barrier, the fact that we are able to afford huge text stores becomes clearer every day. However, this doesn't necessarily mean that we are able to make more effective use of all of this text. Indeed, it seems possible that our ability to generate and store information now may well exceed our ability to make effective use of it. It is in an attempt to make this information more and more useful that we start the Drupelet Research Project. |
Is there other impetus for the project? If so, should it be discussed here or later? |
|
What is a Drupelet? |
Almost all of us have encountered drupelets in our lives without even knowing it. A (real) drupelet is, for example, one of the little tiny pieces of fruit that make up a raspberry. The dictionary says that it is one of the individual parts of an aggregate fruit. Of course, this paper---and this research---isn't about fruit. It's about text and information. However, the notion of a drupelet, taken as a metaphor, is a comfortable way of describing some of the core aspects of our project. That's why we have focussed on them. And---since no other term really seems quite right---we have decided to call the `atoms' of our text structures drupelets. |
I hope using drupelet does not get in the way. I wanted something not quite common, something without much particular connotation and something that was `correct' (at least in a metaphorical sense). |
|
|
Drupelets in Practice: At its lowest level, a drupelet can be thought of as a `labelled paragraph and associated meta-information'. Let us set aside, for the moment, some of the complexities that are introduced even at this very rudimentary level. We can conceive of, and admit, drupelets that happen to have no label. Or drupelets that happen to have more than one paragraph of text. But these are small, relatively rare, anomalous situations. For the most part we get along quite fine thinking of a drupelet as `a labelled paragraph with attributes'. |
Anything more complex that needs to be covered? |
|
|
Looking at a Drupelet: As the previous discussion makes clear, a drupelet contains a Label and a Body. However, there are some other elements which, even at this very early stage of our research efforts, we typically find in a drupelet. For example, we give a Doc with a consistent `name' (often it is a `filename' in our computer system) to each drupelet that is a part of a particular information field (usually what we think of as a `document' in the conventional sense). Then we have chosen to give a Local Drupelet Name to each drupelet. This LDN is guaranteed to be unique within a particular Doc. The particular paragraph you are reading now has been given the LDN of LOOKAT in the document drp. We also give every drupelet a Permanent Drupelet Identifier which is assigned when the drupelet is created, and which (presumably) lasts for the life of the drupelet, never being reassigned nor reused. For this particular drupelet it happens to be 20030514-182617-000001-DNess. |
Should any other `common' fields be described? Is it clear why the ones listed here are discussed? |
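As an illustration only, the fields described above could be held in a small Python record. The class name, the lower-case field names, the `extra' dictionary and the helper make_pdi are all assumptions made for this sketch, as is the reading of the identifier as date, time, sequence number and author. The project's own K code may represent these quite differently.

```python
from dataclasses import dataclass, field
from datetime import datetime


def make_pdi(when: datetime, sequence: int, author: str) -> str:
    """Build an identifier shaped like 20030514-182617-000001-DNess.

    Reading the parts as date, time, sequence number and author is an
    assumption made for this sketch.
    """
    return f"{when:%Y%m%d}-{when:%H%M%S}-{sequence:06d}-{author}"


@dataclass
class Drupelet:
    doc: str          # name of the information field, often a filename
    ldn: str          # Local Drupelet Name, unique within doc
    pdi: str          # Permanent Drupelet Identifier, never reused
    label: str = ""   # may be empty: unlabelled drupelets are admitted
    body: str = ""    # the content itself
    extra: dict = field(default_factory=dict)  # any further flat attributes


lookat = Drupelet(
    doc="drp",
    ldn="LOOKAT",
    pdi=make_pdi(datetime(2003, 5, 14, 18, 26, 17), 1, "DNess"),
    label="Looking at a Drupelet",
    body="(text of the paragraph)",
)
```

Keeping the identifier a plain string means it can be stored and compared in any of the ASCII storage forms without special handling.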
|
|
The Special Role of the Body Attribute: The body of a drupelet is, metaphorically, its nucleus. It carries the `content' of the drupelet, and the rest of the attributes in the drupelet generally contain meta-information that relates somehow to the drupelet. |
Are there categories of attributes? |
|
|
A Cohesive Atom: Notice that all of the auxiliary information coheres to the drupelet. This isn't so obvious at first. Consider, for a moment, the role of a paragraph of commentary written about a paragraph that is a part of the document we are building. The paragraph of commentary could be treated in two very different ways. First, it might be made a drupelet itself, and have a pointer placed in the document to mark it. The alternative is to place the information in the commentary paragraph within the drupelet itself---to make it a `part' of the drupelet. This second way is the appropriate one for our purposes. The paragraph of commentary has no real `life' outside of the paragraph on which it is a comment. If the original changes, we probably ought to at least consider changing the comment. This is more `natural' if the comment is nearby. So this cohesion is an implicit part of the mechanisms associated with drupelets. And while we could make just about everything into its own drupelet, and then point to it, this would often represent an `overengineered' choice---something we are better off treating more straightforwardly and directly. |
I hope this makes an effective case. So far, in practice, there has been no real temptation to complicate matters by having pointers to pointers to pointers ... The simple and direct approach has proven to generally be quite adequate. I'd be particularly interested in other opinions. |
|
|
One Level Only: Drupelets have attributes that are restricted to one level only. In other words, the attributes do not nest in any way. They don't have any hierarchy within them. While this might appear, at first, to be a severe restriction, some reflection on the nature of drupelets makes clear that it is not. Drupelets are made of bodies of text, along with descriptive information. The information which describes each body is inherently non-hierarchical. And while the text may well relate to other text items in hierarchical ways, this structuring is determined by the abstraction driver that is used to produce any particular document. |
This seems to sensibly represent the cases that have been encountered so far. |
|
|
No Multiple Occurrence: Attributes in drupelets appear at most once. Multiple occurrences of the same attribute are not allowed. So far this has not proven to be a restriction of any severity. |
If there are circumstances where we would like to have multiple occurrences of some attribute, we haven't encountered them yet. |
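The two restrictions above (one level only, no multiple occurrence) are easy to check mechanically. The following is a minimal sketch, assuming attributes arrive as (name, value) pairs. The function name build_attributes is hypothetical and not part of the existing software.

```python
def build_attributes(pairs):
    """Turn (name, value) pairs into a flat attribute dictionary.

    Rejects nested values and repeated names, mirroring the two rules:
    attributes do not nest, and each appears at most once.
    """
    attributes = {}
    for name, value in pairs:
        if not isinstance(value, str):
            raise ValueError(f"attribute {name!r} is not a flat string")
        if name in attributes:
            raise ValueError(f"attribute {name!r} occurs more than once")
        attributes[name] = value
    return attributes


flat = build_attributes([("Label", "One Level Only"), ("Level", "2")])
```

A dictionary of strings is enough precisely because of these two rules. If either were relaxed, a richer structure would be needed.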
|
What is an Abstraction Driver? |
An abstraction driver is a representation of a document which describes how drupelets are composed into the document. The driver consists of drupelet references---which may be local or global---and other information, for example things like `level', which help us to manage the composition process. |
This is a new, and not yet very widely tested, concept. |
|
|
The Default Abstraction: When a document is originally created we generally have a specific rendering of the document in mind. This isn't always the case, but most often we have just such a picture, particularly for documents that we are constructing by hand. In general, we write the document so that when it is rendered it follows the order of the storage form of the document. Indeed, in most cases (for most types of composition systems) we have little choice. The document is necessarily rendered in the order of its storage form. Thus the `default' form of an abstraction driver that relates a text data base to an output document is the very simple transform that consists of simply dumping the drupelets in the data base in order into the document. It is a sort of identity transform. |
Most documents, so far, are complete dumps of their corresponding data bases in strict order. |
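The idea of a driver, and of the default `identity' driver in particular, can be sketched in a few lines. This assumes a driver is simply a list of entries that each name a drupelet by its LDN. The entry key `ref' and the function names are assumptions made for this sketch, and a real driver would carry more (for example `level').

```python
def default_driver(info_field):
    """The identity driver: every drupelet, in storage order."""
    return [{"ref": drupelet["LDN"]} for drupelet in info_field]


def compose(info_field, driver):
    """Return drupelet bodies in the order the driver lists them."""
    by_name = {drupelet["LDN"]: drupelet for drupelet in info_field}
    return [by_name[entry["ref"]]["Body"] for entry in driver]


info_field = [
    {"LDN": "INTRO", "Body": "First paragraph."},
    {"LDN": "LOOKAT", "Body": "Second paragraph."},
]

full_dump = compose(info_field, default_driver(info_field))
excerpt = compose(info_field, [{"ref": "LOOKAT"}])
```

Here full_dump reproduces the data base in strict order, while excerpt shows that a different driver over the same information field yields a different document.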
|
|
The Problem of `Level': Many of the documents produced have a natural hierarchical structure. That is one of the reasons that Level is often one of the attributes in the drupelets that define documents. However, there is an interesting intellectual problem associated with level in any hierarchical organization of the information in a document. That is whether the level is inherent in the drupelets in their normal information field, or in the abstraction driver which makes use of the drupelets. Both answers make sense in different situations. |
Is this the only problem associated with level? Are there similar problems with other fields? |
|
|
Inherent in the Information Field: In one view, it would seem obvious that level is inherent in the drupelet. It would be an odd situation if the way we compose a document significantly modified the relative organization of information which is implicit in the structure of a field of information. |
Should we distinguish absolute level indication from relative level indication? |
|
|
Inherent in the Abstraction Driver: However, there are also situations where the hierarchy in which a particular drupelet is presented might more reasonably be specified in the abstraction driver that is used to create the document. An easy way to see this is to think about text which gets quoted from one context into another document where it may well appear at a different level of the hierarchy. In such circumstances we might well want the assignment of the level in the hierarchy to take place in the processes associated with applying the abstraction driver to the information field. |
Again, there is an `absolute' vs. `relative' consideration here. |
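One way to let both answers coexist is sketched below. It assumes (hypothetically) that a driver entry may carry either an absolute `level', which overrides the drupelet's own Level, or a relative `shift', which offsets it. Neither key is part of the current implementation.

```python
def effective_level(drupelet, entry):
    """Decide the level at which a drupelet is rendered.

    The drupelet's own Level is the default. A driver entry may replace
    it (absolute) or move it up or down (relative).
    """
    own = int(drupelet.get("Level", 1))
    if "level" in entry:                 # absolute: the driver decides
        return entry["level"]
    return own + entry.get("shift", 0)   # relative: the driver adjusts


quoted = {"LDN": "LOOKAT", "Level": "2"}
as_stored = effective_level(quoted, {})
one_deeper = effective_level(quoted, {"shift": 1})
forced_top = effective_level(quoted, {"level": 1})
```

The quoted-text case described above corresponds to the `shift' entry: the drupelet keeps its place relative to its neighbours but sits deeper in the new document.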
|
What is a Document? |
In our current view, a document is an ordered collection of drupelets. As time passes, and we learn more about how we typically manipulate and handle drupelets, we will become more specific about the relationship between the drupelets themselves and the `energy' that binds the drupelets together into a realized document. |
At the moment I think of a document as `an information field of drupelets composed into a realized document'. |
|
|
Realized Document: A realized document is information as it appears on a screen or a piece of paper or in some other medium ready for human consumption. A given collection of drupelets in an information field can be realized into many different documents. Correspondingly, there are some situations (typically meta-problems) where information from several fields all gets realized in a (typically status/summary report) document. |
Is `realized' an OK word to use for this concept? |
|
|
Information Field: An information field is a collection of drupelets that somehow relate to one another substantially more than they relate to something outside. In a simple world, where each document only had a single realization, there wouldn't be much trouble describing what I am talking about. But since we admit of the possibility that some subset of a particular collection of drupelets might be realized in several different documents, it all gets a bit more complex. To push our metaphor, hopefully not too far, the information field is the `fruit' that a collection of drupelets make up. Of course, we may have occasion to make reference in some particular document to a drupelet that is a part of an information field other than its own, but this is a (relatively) rare circumstance. For the most part documents are likely to be made up from information within their own field, with only occasional reference elsewhere. |
The way that I am treating this is sort of complicated and I may be overcomplicating things to no point. Any suggestions? |
|
|
Composing a Document: Composing is a term that comes to us from the 15th century, probably mostly via music and later associated with typesetting as well. When we use it here, we refer to the activity of taking a collection of drupelets and realizing a document from them. An alternative term for composing in the sense we use it here would be object publishing. We compose a realization of a document, only simplifying slightly, if we think of taking objects, formatting them into appropriate formats and then surrounding them with the proper kinds of `glue' to allow us to place them in the visual field in ways that the documents we are creating demand. |
Is this really correct? I'm not quite sure that composing doesn't connote more of the `how things are glued together' aspect of the process than does `composing'. |
|
|
The Updating Problem: Increasingly we have to deal with information which is updated on a frequent basis. Managing publications that contain this kind of information can truly be problematical. When the body text of a drupelet changes, theoretically we ought to check every place that refers to this element, to make sure that the change doesn't have an inappropriate rendering in each particular circumstance. To accomplish this task completely we would have to have some form of inverted index that allowed us to find each of the uses of the drupelet in question across all information fields and documents. For any organization responsible for a substantial quantity of information this would indeed be a daunting prospect. As a practical matter, however, we may not need or want to do this. Most of the impact of the change in the content of the body of a particular drupelet should be confined to the drupelet itself, and next to the information field that contains the drupelet. We probably don't need to go much beyond this except in the most rare of circumstances. |
Does this argument hold? Or am I skirting around what is really a more complicated issue? I really can't tell yet. |
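For a sense of what the inverted index mentioned above would involve, here is a minimal sketch. It assumes each document's driver has been reduced to a list of drupelet references. The document names and the function name are hypothetical.

```python
from collections import defaultdict


def build_usage_index(drivers):
    """Map each drupelet reference to the documents whose drivers use it."""
    index = defaultdict(set)
    for document, refs in drivers.items():
        for ref in refs:
            index[ref].add(document)
    return index


drivers = {
    "report": ["INTRO", "LOOKAT"],
    "summary": ["LOOKAT"],
}
usage = build_usage_index(drivers)
affected = sorted(usage["LOOKAT"])  # documents to re-check if LOOKAT changes
```

Building the index is cheap. The daunting part is gathering every driver across all information fields in the first place, which is why confining the check to the drupelet's own field may be the practical choice.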
|
Technology Choices |
There are lots of technological choices that have to be made to get this research project rolling. We have already made a few such decisions, but there remain some simple technological problems that still need to be solved, and some of the choices may well impact just how hard it will prove to be to handle these particular problems. |
We may well need to consider technology choices in other dimensions as well. For example, so far we have used a `grow our own' philosophy for scripting language, but perhaps should consider adopting some form of external standard for that as well. |
|
Technology Problems |
Before we dig into the technology choices, perhaps it would be worth some time looking at problems which we currently have and for which we don't yet have a clear path to a solution. |
Perhaps this shouldn't be here. Maybe earlier? |
|
|
Browsing Objects: The technology for browsing objects that contain drupelets seems to be inadequate to even the current rudimentary level of implementation. At the present time, straightforward ASCII editors are used to manipulate the text of items, using an occasional power boost from some hand-composed and ad-hoc K. It seems like this should not be a difficult matter, and yet it has so far proven to be elusive. |
Might PythonCard help? |
|
DataBase |
At our current level of experimentation, data is managed by programs which we have specifically written to do that job. This is quite adequate at the low volume of information that we need to be able to handle during our early stages of experimentation. The ability to adapt code to circumstances is very valuable when we are in the early stages of acquiring knowledge about a problem area. As time passes, however, and complexity grows it will be necessary to use some more sophisticated data base management technology to handle the problems. |
Does this seem like an adequate explanation for our interest in these languages? Is the list that we give the appropriate list? |
|
|
PostgreSQL: PostgreSQL is a candidate for use as a data base manager. It has a principal advantage in that it is open-sourced and available for use. It brings along a concomitant set of disadvantages generally associated with the difficulty of getting problems solved professionally even when sufficient funds are available to handle things that way. |
Since my knowledge of PostgreSQL is rudimentary, it would be wise to have someone who knows more check the details. |
|
|
MySQL: MySQL seems to be another candidate for consideration. While it is also a `free' (except for commercial purposes) product, it implements a somewhat more limited SQL than some of the other alternatives. A plus factor is that lots of the Web services depend on MySQL servers, so there is both reasonably rapid evolution of this product and there is a reasonable chance that problems with it get solved in a timely fashion. There's also lots of knowledge `floating around' about how to do things with MySQL precisely because there has been so much work that uses it. |
I've used MySQL a little, but don't really claim any real expertise with it. As a result these facts should be carefully checked. |
|
|
KDB: Since we are already using K there is a good argument that we ought to consider KDB as a data base management language. Up to some point this is a very good argument to make. The problem is that once one gets to some certain size of problem, it would be necessary to buy the commercial version of the product, and that is quite beyond the reach of any academic effort from a cost standpoint. |
Is the place where this cross-over takes place somewhere in the bound of scale where we are likely to encounter it, or is it unlikely we will ever get to such a scale? |
|
Programming |
Choice of programming language is complicated by the fact that, at least for the most part, we may choose to work with several languages. Opting in favor of one does not necessarily rule out consideration of others for other problems. In one sense this makes things easier but in another it actually makes them more difficult. Here we pay special attention to the problems we need to be able to solve. Various programming languages will make some of these tasks easier, and the ability to use the different languages will prove to be quite helpful. |
At the moment, K and perl are `in'. But we haven't faced up to all of the problems that we are likely to encounter yet. |
|
|
K: K is used as a basic programming language for lots of the data base functions. K proves to be particularly useful for managing data in both Object and Short form, and is used to perform conversions between those forms. In addition K is handy for performing all kinds of edit and manipulation functions on the data. To this point, we have made no use of KDB's facility for directly performing data base functions. K functions already exist to handle basic data manipulation instructions, and it has been used to implement a number of functions including at least one drupelet display function. |
Whether KDB has a special role in the kind of system being created here remains to be seen. |
|
|
Perl: Perl is a string manipulation language that proves to be very useful in manipulating the kinds of text that form a text data base. Perl is an effective complement to K in that it provides many of the facilities that are not provided by the other language. |
Should perl be described in any further detail? |
|
|
Java: Given the availability of libraries and class definitions, particularly those relating to XML, Java has to be another serious candidate for implementation of many of the facilities needed in the suite of programs that deal with drupelets. JavaScript is also a likely language to be used to deliver facilities that are needed in the browser, as there is considerable support for it in most browsers. Our experiments translating drupelets into Index Cards are one example where this facility has been used with some success. |
Should JavaScript be treated as a language separate from Java? |
|
|
REBOL: REBOL is an odd, but very powerful language invented by the implementor of what is still conceded to be a remarkable piece of software that many think made the Amiga a computer far ahead of its time. This system has an idiosyncratic look---extremely clean and `bright'. It takes a lot of work to get used to the odd programming style, but once you do, it is possible to get a lot of work done in a very short time. |
Is there more to say? |
|
|
Python/Ruby: Python and Ruby both have good reputations for dealing with the `structured data to screen display' problem. The two languages seem to share some common characteristics, but each also has its own set of---often very vocal---supporters. Some study might be necessary to suggest if these have a place in the arsenal of software tools. |
I'm leaning towards Ruby but PythonCard makes me hold back a little. |
|
|
Unicon: Unicon is a follow on to the Icon language. It is a little too early in our understanding of this software to be able to say much of value about it, but it does seem to be worth investigating as something that might be useful. |
I am reading the Unicon manuals at the moment. |
|
|
Lua: Lua seems to be a `language extender' that fits comfortably into a number of different language environments. The language seems quite straightforward, and doesn't look funny in the same way that some of the more idiosyncratic add-ons do. It is a small, compact implementation---which makes it attractive---and offers the prospect of handling some of the input capture tasks associated with drupelets well. |
The FLXA library is particularly attractive and small. The worst current problem is that Lua 5.0 has just been released, and there is not yet a 5.0 implementation of this library. However, there is some hope that one may not be far away. |
|
Storage Forms |
There are some particular data storage structures that have already proven to be useful in the very preliminary expository implementation that we are currently experimenting with. Some very preliminary documentation can be presented here. So far, all of the forms are essentially representations of a `flat' file structure. That has proven to be adequate for the current implementation, and we are operating under the `let's not complicate things until we need to' philosophy. |
The two essential forms, `object' and `short', are described. In addition the `Long' form has some use, and there are other particular forms that may prove to be useful in particular contexts. |
|
|
Object Form: The object form of storage can be described succinctly. Each object is described in a set of curly-braces surrounding a list of name, value pairs. The name, value pairs are names of attributes of the object and the values, which may be quite long, are strings of characters representing the value corresponding to the name. The one constraint on the value strings (at least at this point in time) is that they cannot contain any curly brace characters, as they would confuse the parsing. |
At later stages of the project this constraint might be removed. However as a practical matter it is generally quite possible to use some surrogate character in place of curly braces if they really prove to be required. Further, it is not unlikely that we will eventually want to implement some form of octal/hexadecimal escape. This might ease problems with all kinds of special characters. |
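A reader and writer for the object form can be quite small, precisely because values may not contain curly braces. The description above does not fix how the name, value pairs are written inside the braces, so the sketch below assumes one `Name: value' pair per line. That layout, and the function names, are assumptions for illustration and may not match the project's actual files.

```python
import re


def format_object(obj):
    """Write one object as a brace-delimited block of 'Name: value' lines."""
    for name, value in obj.items():
        if "{" in value or "}" in value:
            raise ValueError(f"value of {name!r} contains a curly brace")
    body = "\n".join(f"{name}: {value}" for name, value in obj.items())
    return "{\n" + body + "\n}\n"


def parse_objects(text):
    """Parse brace-delimited blocks into flat dictionaries."""
    objects = []
    # Values may not contain braces, so each brace-free run is one object.
    for block in re.findall(r"\{([^{}]*)\}", text):
        obj = {}
        for line in block.splitlines():
            if not line.strip():
                continue  # skip blank lines inside a block
            name, _, value = line.partition(":")
            obj[name.strip()] = value.strip()
        objects.append(obj)
    return objects


stored = format_object({"Label": "Object Form", "Body": "Pairs in braces."})
stored += format_object({"Label": "Short Form"})
loaded = parse_objects(stored)
```

The brace check in format_object is the single constraint described above. An escape mechanism of the kind mentioned in the comment would replace that check with an encoding step.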
|
|
Short Form: The short form of storage is simple and effective in lots of situations. Putting aside the issue of `comments' for the nonce, the first line of a short form file is marked by an asterisk (*) in the first column. The rest of that line consists of `Short Names' separated by a separator character (currently a vertical bar). This line is followed by any number of `data' lines, each of which has data separated by the separator character. In a well formed file, there are always the same number of separator characters in each line. In current implementation no field can contain a separator character as a data element. Since the significance of data elements is indicated by their relative position, all fields are `filled' if need be by a null. In this form we do not distinguish the case of a `null' value from that of a `value not present'. |
When it is appropriate to the problem, this particular form of data storage is very easy to manage. It is simple and straightforward, and both loads easily and translates well into many different specific problem domains. |
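The short form as described (a header line starting with an asterisk, a vertical bar as separator, and the same number of fields on every line) can be read with a sketch like the following. Comments are set aside here, as they are in the description, and the short names in the sample are hypothetical.

```python
def parse_short_form(text, sep="|"):
    """Read a short form file into a list of dictionaries.

    The first line starts with '*' and lists the short names. Every
    later line must have the same number of fields. Empty fields come
    back as empty strings, so `null' and `not present' look the same.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("*"):
        raise ValueError("first line must start with '*'")
    names = lines[0][1:].split(sep)
    records = []
    for number, line in enumerate(lines[1:], start=2):
        values = line.split(sep)
        if len(values) != len(names):
            raise ValueError(f"line {number}: expected {len(names)} fields")
        records.append(dict(zip(names, values)))
    return records


sample = "*ldn|lvl|lbl\nINTRO|1|Introduction\nLOOKAT|2|\n"
records = parse_short_form(sample)
```

The field count check is what `well formed' amounts to in practice. It also catches a separator character that has slipped into a data element.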
|
|
Long Form: The long form of data storage is an easy way to manage some kinds of data. It makes the data quite directly editable, and as long as appropriate `proofing' mechanisms are used, there isn't much opportunity for making mistakes. The format of long files is simple. An object is represented by any number of lines of the form: Name: Value. One object is separated from the next by a line full of hyphens: -------. Names cannot contain colons, but could---at least theoretically---contain any other characters. Value fields can pretty much contain anything. By convention, the first object in any long form file is a `descriptive item' that contains all possible Name keys and (again by convention) shows values that are the `short names' for the corresponding keys---should they prove to be useful in any context. Order of the lines within an object is inconsequential, but most system functions will write items out in the order implicit in the `descriptive item' that heads the file. It should be noted that often the long form field Names are not legal short form field Names; that is the reason for presenting the short form in the descriptive item. |
This hasn't been terribly useful, but it is the way that I find it most comfortable to manage phone lists, for example. |
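A long form file of the kind described can be read as follows. This is a sketch under the stated conventions only: `Name: Value' lines, objects separated by a line made up entirely of hyphens, and a first `descriptive item' giving the short names. The phone list entries are invented sample data.

```python
def parse_long_form(text):
    """Split a long form file into its descriptive item and its objects."""
    objects, current = [], {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:   # a line full of hyphens
            if current:
                objects.append(current)
            current = {}
        elif stripped:
            # Names cannot contain colons, so the first colon ends the name.
            name, _, value = line.partition(":")
            current[name.strip()] = value.strip()
    if current:
        objects.append(current)
    if not objects:
        return {}, []
    return objects[0], objects[1:]


sample = (
    "Full Name: name\n"
    "Phone Number: phone\n"
    "-------\n"
    "Full Name: Ada Example\n"
    "Phone Number: 555-0100\n"
)
descriptive, entries = parse_long_form(sample)
# Re-key each entry by the short names given in the descriptive item.
short_keyed = [{descriptive[k]: v for k, v in e.items()} for e in entries]
```

The last line shows why the descriptive item carries short names: `Full Name' is not a legal short form name, but `name' is.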
|
|
XML Form: There is an XML form that can be used to store data. It has some problems that need to be resolved: software already exists to map object form structures into XML structures, but we are encountering some difficulty in going the other direction. Work will proceed on this front, and we expect the problem to fall reasonably quickly. |
The problems have to do with the presence of HTML markup and with the potential occurrence of multiple copies of elements. |
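To make the two difficulties concrete, here is one possible mapping in each direction using Python's standard xml.etree.ElementTree module. It is not the project's existing software. The root element name `drupelet' is hypothetical, and attribute names are assumed to be valid XML element names.

```python
import xml.etree.ElementTree as ET


def object_to_xml(obj):
    """Write a flat object as one XML element with a child per attribute."""
    root = ET.Element("drupelet")  # hypothetical element name
    for name, value in obj.items():
        ET.SubElement(root, name).text = value  # markup in value is escaped
    return ET.tostring(root, encoding="unicode")


def xml_to_object(text):
    """Read the element back into a flat dictionary.

    If an element name is repeated, the last occurrence wins, which is
    exactly the `multiple copies' problem in miniature.
    """
    root = ET.fromstring(text)
    return {child.tag: child.text or "" for child in root}


original = {"Label": "XML Form", "Body": "Markup such as <b>bold</b> survives."}
xml_text = object_to_xml(original)
restored = xml_to_object(xml_text)
```

Escaping the HTML markup on the way out lets the body come back unchanged, at the price of the markup no longer being visible to XML tools as structure.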
|
|
Definitive vs. Derived Data: The `normal' mode of operation is to store original drupelets in object form, and then remap them into whatever other form might prove to be useful in a particular circumstance. How well this strategy will work as the scale of data being managed grows remains to be seen. |
Can we estimate when we might start to get in trouble using this strategy? |
|
|
Other Forms: There are some other forms of storage that have proven to be of some use in particular circumstances. Most of these are `experimental' in nature, and involve things like how to represent things that don't quite fit into the `flat-file' rubric. In addition there are some important issues about how forms that involve constrained physical presentations (computer code is often a good example) should be managed. These are not yet resolved, but need to be actively considered. |
Need examples be given? |
|
|
Scale Problems: Since we are at a very early stage of this project and since our implementation is largely experimental, we have not yet had to deal with any significant problems associated with `scale'. For example, if we want to change a data item in a data base we generally pick up the whole data base in an editor, change what we want to change and then file it all away again. It is not yet clear how rapidly the scale of the problem will grow once we start to deal with real data. |
So far problems seem to keep being `separable' and that means that scale hasn't been growing out of control even as we add new problem domains. How long things will continue to behave this way is not yet clear. |
|
Agreements |
In this section we will record some of the agreements that constitute the zoning ordinances of our problem space. We will introduce them here as they become the focus of our agreements. Only those that are accepted as rules will ultimately remain, but we may, from time to time, also include some items not completely agreed on so that they can be discussed explicitly. |
I intend `agreements' to be (repealable) zoning ordinances, also subject to variances. This means that they are `loose' standards, not to be enforced rigidly, but rather intended as guides to how development should proceed. |
|
ASCII Text |
The `deep' structure storage of the project should be ASCII text as much as it is sensibly possible to make it so. Of course, this does not apply to .PNG, .JPG or other kinds of files which are naturally in a non-ASCII format. |
How much of the pro-ASCII argument should we bother to get into? We have all kinds of other issues that could be raised, archiving, interpretation, processing, ... but I'm not sure it would add much to consider them all (again) here. |
|
Good for Many |
Another agreement is that solutions proposed should be good for many users, but that they need not be good for all. One will have to make a judgement call about what constitutes a reasonable discrimination between `few' and `many'. It probably generally depends on the specifics of the circumstances. |
There is probably more that can be said about this, and it would be useful to have some other clues as well. |
|
Core Support |
If external pieces of software are going to be used, we should only have to write the two pieces of software that allow us to map our data into the form of the external software and out of that form back into our drupelet format. This agreement implies at least two things. First, each piece of external software that is to be `supported' needs to be able to save its complete state in ascii, and then it needs to be able to reload that state from an ascii record. Unless it has this capability it may be difficult to read and write appropriate records in both directions. |
This strategy implements a view that is conservative of knowledge that users have acquired in the past, in that it allows them to continue to use software that they are familiar with, so long as that software allows appropriate export and import. |
|
|
Example: Core Support: An example might prove useful. There are some aspects of Microsoft Outlook that are very nice. In particular, I happen to like the way that information is displayed. The daily, weekly and monthly schedules are quite nice and convenient. It is also nice to be able to use Microsoft Outlook because there are many pieces of software that interface with it---and nothing else. While this is clearly a bad idea on the part of the designers of such software, if you are able to use Outlook it can be quite handy. However, the downside is that Microsoft is well known for its arcane and often undocumented storage and transaction structures. Here we can deal with it in an effective, albeit obscure, way. First, the `definitive' source of information is stored in a drupelet based file. When I want to work with Outlook, I start by purging all entries from my Outlook data file. Then I generate `transactions' from my data base that add the event records to the Outlook file. This guarantees that the Outlook file is absolutely up-to-date and in synchronization with the base data. Although I never enter new transactions directly into the Outlook data base, if I wanted to it would be straightforward to write a process that would take a dumped data file from Outlook, and would generate transactions appropriate to updating the ascii event data base as needed. |
While this seems cumbersome, it is for the most part quite automatable, and thus doesn't impose a huge burden on the user. |
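The `two pieces of software' that the Core Support agreement calls for can be sketched generically. The example below uses CSV text as a stand-in for whatever ASCII dump an external program offers. The column names are hypothetical, the event is invented sample data, and nothing here reflects Outlook's actual import or export formats.

```python
import csv
import io

FIELDS = ["Subject", "Start", "End"]  # hypothetical external columns


def to_external(events):
    """Write drupelet-style event records as CSV text for the external tool."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(events)
    return buffer.getvalue()


def from_external(text):
    """Read the external tool's CSV dump back into event records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


events = [
    {"Subject": "Project meeting", "Start": "2003-05-14 10:00",
     "End": "2003-05-14 11:00"},
]
dump = to_external(events)
round_trip = from_external(dump)
```

The `purge and regenerate' procedure described above uses only the first function. The second is the one that would be needed if entries were ever made directly in the external program.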
|
Paradigm Problems / Use Scenarios |
It seems to be quite useful to create use scenarios that point to the paradigm problems that describe a problem domain. A collection of these can be used to guide and test our thoughts as we come up with some attempts at the solution of parts of a problem. |
Did I get `use scenarios' right? I have the feeling you had other (better) words for what I am trying to say here. |
|
Paradigm Problems |
Isolating paradigm problems is one useful way of proceeding with a systems design problem. Paradigm problems have the property that they are exemplary of some particularly important aspect of a problem domain, and that they have as little extraneous character as is reasonable. Thus by focussing on paradigm problems we increase the likelihood that we will cover all of the important aspects of the problem domain, while at the same time wasting as little time as is reasonable worrying about things which are not essential. |
This has been a useful approach to solving problems. We have used it over and over again for decades. |
|
Use Scenarios |
Use scenarios are much like paradigm problems except that they come at the `problem' in a different way. In a use scenario, we present situations which expose how we expect our system to be used. This provides us, at a minimum, with some `pictures' that we can use to test our ideas. They also prove to be helpful in generating approaches to some of the problems that we are likely to encounter as we attempt to solve those problems. |
How is this as a description, and does it even use the right words? |
|
Other Sources |
This section lists other sources of information which are relevant to this problem. |
Subdivisions of information sources would be nice. |
|
Work in Process |
In this section we generally indicate what work is in process at the time of its last update. |
This should represent generally acknowledged work that we have agreed to perform on behalf of the project. Perhaps adding some `dates' would be useful as well. |
|
DN Work |
Ness has agreed to start work on EMail. At the same time he is researching some of the languages, particularly Ruby, Unicon and Lua. |
EMail is somewhat more complex than might have been noticed at first, and raises issues like `attachments' and `format' right at the get go. |
|
SOK Work |
This section lists work Kimbrough has agreed to do. |
Again, dates would be useful. |
|
Overall System Structure |
One of the interesting aspects of overall system structure is implicit in the observation that while no single piece of software or suite of existing programs proves to deal with this whole problem domain in an effective way, there is lots of good software, often specific to particular kinds of hardware (PC vs. Mac, for example). As a result there is little sense in trying to insist on specific pieces of software to handle specific problems. What makes more sense is to adopt an overall strategy that leaves people free to use software that they happen to like to handle problems the way they want. This also allows us not to orphan previous knowledge that may have been hard fought to acquire to handle some parts of the existing problems. |
IME one is never able to control either the existing environment or the evolving environment well enough to be able to adopt a rigid strategy in this regard. |
|
Server Issues |
The drupelet display technology raises the possibility of performing major display functions on either the server or client side. There are some advantages to each of these strategies, so spending a little time considering the tradeoffs might be useful. |
It is not quite clear how central this issue is to the discussion at the level of this presentation, but it is an important matter that has an impact on the design of the various possibilities that might be of interest. |
|
|
Server-Side Function: If we do work on the `server' side, then some problems become quite easy. For example, it is easy for a common server to manage a common data base. And code changes are easy to propagate as they only have to be propagated to the servers, not to all of the client machines. Unless we have a very large system, it is generally quite possible for us to update all of the servers under a more or less direct control. However, if we do our work on the server then the maintenance of any user state becomes a difficult problem---if we need to maintain it. Consider the problem of maintaining user data over several sessions as an example. |
There are probably more relevant reasons as well. |
|
|
Client-Side Function: When work is done on the client side, saving user state is not much of a problem in most systems. However, updating common data and providing fast and snappy response becomes more of a problem. If we work on the client side we still have to allow for the fact that more than one user may use a particular piece of hardware/software, but that isn't a very complex problem, and can be handled straightforwardly. Maintaining a user data base is also a problem, particularly if anything ever gets screwed up. With the complex systems that are available today, it is quite possible that a screwed-up client side can be an enormously difficult thing to debug. |
Other client-side problems? |
|
Nature of Display |
Aside from the issues associated with output / display devices, there is the question of what kind of `language' is used to drive the displays. For the most part, it would appear that some form of HTML that can be viewed from a browser is an appropriate choice. |
Are there completely off-the-wall alternatives that we haven't yet thought about? |
|
|
Static HTML: Most of our current experimentation has been with `static' HTML pages. These are pages which can be generated ahead of time, and which need only be regenerated when key elements of the data which is being displayed change. A typical example here might be the home page of a typical Web Site. Unless there is new information to be presented, there's no reason to re-generate this `front page', and correspondingly whenever something major changes, all we need to do is re-generate the page. |
This is a typical Home Page. |
|
|
Dynamic HTML: If we need style sheets, content positioning and/or downloadable fonts, then we move from static HTML into the realm of dynamic HTML. This requires a more intimate relationship between the server environment and the client machine. |
Is this a distinction worth bothering about? |
|
|
Interactive HTML: If we are going to interact with the pages that are presented---other than in the trivial button pushing sense, that is---then we need to get more directly involved in interacting with the pages being displayed. For example, giving the user the ability to move elements of the displayed page around to accomplish some user objective. This implies a support technology adequate to handling this kind of interaction. |
Is this a different class or a different dimension? |
|
|
XML: XML represents a more elaborate structure, where the language carries some semantic, as well as syntactical, information. |
Does XML really make any difference? |
|
|
Problems with HTML fields: While storing information in XML offers many advantages, there are some problems caused by taking this approach. Not the least of these problems is the fact that XML elements apparently have to be legal HTML in and of themselves. Since we are separating the composition and imposition process from the data, we have some contexts where this is not the case. For example, much of our driver technology places beginning and ending paragraph markers around the text that it generates. This means that the great majority of the drupelets, specifically those which consist of a simple paragraph, don't need to contain any markup whatsoever. This makes them simple to write and simple to manage. However, we run into difficulty if we have a paragraph break in our text. At the moment this is shown by an end-paragraph begin-paragraph pair. However, this makes the drupelet body, taken as a whole, an illegal construct. |
I'm struggling to find the right way around this whole mess. Perhaps some completely non-interfering markup has to be used. |
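One possible form of `non-interfering' markup can be sketched as follows. The assumption, which is ours and not an existing project convention, is that a blank line inside a drupelet body marks a paragraph break, and that the driver supplies every paragraph marker itself. The function name `render_body` is hypothetical.

```python
# Sketch under assumptions: a blank line is the paragraph break inside a
# drupelet body, so the stored body never contains unbalanced tags.

def render_body(body):
    """Wrap each blank-line-separated chunk of the body in paragraph tags."""
    chunks = [" ".join(chunk.split()) for chunk in body.split("\n\n")]
    return "\n".join("<p>" + chunk + "</p>" for chunk in chunks if chunk)


single = render_body("A simple paragraph needs no markup.")
double = render_body("First paragraph.\n\nSecond paragraph.")
```

With this choice a single-paragraph drupelet still contains no markup at all, and a multi-paragraph body is legal on its own because the end-paragraph begin-paragraph pair only appears in the generated output.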
|
|
Data Driven Display: Many useful displays are `data driven' in the sense that the user/reader specifies some context and then a data base is used to provide information that can be formatted into an appropriate display. This is already something well known to most Web users, as---just to choose one example---most search engines are just such a device. |
Is Google the best example, or is it a confusion to raise the search engine issue here, particularly because search engines are only one kind of example? |
|
|
Idiosyncratic Languages: It is also possible that we might want to conceive of generating our own specific languages for communication to display. Any such language would have to be designed after some experience. |
Perhaps we should outline the circumstances that we might encounter that would suggest this be done, and what such languages might look like. |
|
Document Types |
There are a reasonable number of different document types that we need to be able to deal with. Many of these documents are well-handled by specific pieces of existing software, but---at least as far as I can see---there is no single piece of software, or coordinated suite of software packages that allow us to deal with many of the types of information elements that we might find it useful to be able to handle. |
There is probably a `hierarchy' here that I am missing. |
|
Outlines |
Outlines are a natural data structure that often leads directly to a document when text is applied. Outlines can become a document by filling in a structure of text. Outlines are also a useful form for reorganizing information where moving around information can preserve the hierarchical relationships. |
While outlines are inadequate for some aspects of some problems, they are often a useful superstructure for handling information in a document. |
|
Wiki Structures |
Wikis are another structure for documents that is worth studying. First, this may be one of the most natural applications for drupelets which share text between Wikis and other documents. Second, they represent a form of document which, while structured, is not necessarily hierarchically structured. This may have interesting implications for implementation on several different dimensions. |
My experience with Wikis is mixed. I originally thought that they were `great', but over and over again I have seen them follow a brief flurry of interest which leads, in most cases, to abandonment except by a relatively small cadre of people who are `professionals' with Wikis. While this group of people is not identical to the people who are interested in Extreme Programming, there is a more than casual overlap. |
|
|
Rendering Wikis in Print: An example of the kind of difficulty encountered with Wikis is that while they generally present quite nicely on a screen---where they can make use of both the interactive character of computer displays and their rapid rendering---the very non-tree-like structure of the typical Wiki does not lend itself easily to rendering on paper. |
Either a Wiki doesn't need the elaborate linkage back and forth between elements, in which case it can be rendered on paper quite easily, or it does, in which case it can't. It is a sort of damned if you do, damned if you don't situation. |
|
|
Automatic Object Recognition in Wikis: One aspect of Wikis that requires special study is the fact that `Objects' are written in a particularly recognizable form which makes it easy to recognize which text elements represent Objects that are described in other Wiki pages. Whether this is a great idea or a bad idea remains to be seen, but some study of it should prove to be fruitful. |
There are some alternatives (explicit markers, for example) that might make this problem much easier, if it is an important one. |
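The two approaches can be compared in a small sketch. It assumes that the `recognizable form' is the CamelCase WikiWord used by many classic Wikis, and it uses double square brackets as one hypothetical explicit marker; neither choice is fixed by the project.

```python
import re

# Assumption: "Objects" are CamelCase WikiWords, i.e. two or more
# capitalized word parts run together.
WIKI_WORD = re.compile(r"\b(?:[A-Z][a-z]+){2,}\b")


def find_wiki_words(text):
    """Return the WikiWords in text, in order of first appearance."""
    seen = []
    for word in WIKI_WORD.findall(text):
        if word not in seen:
            seen.append(word)
    return seen


def find_explicit_links(text):
    """Alternative: recognize only explicitly marked names such as [[Name]]."""
    return re.findall(r"\[\[([^\]]+)\]\]", text)


words = find_wiki_words("See FrontPage and the DrupeletProject page; FrontPage is first.")
links = find_explicit_links("See [[front page]] and [[Drupelet Project]].")
```

Automatic recognition costs the writer nothing but constrains how Objects may be named; explicit markers allow any name at the price of a little extra typing.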
|
Typesetting with TeX |
TeX is a rendering engine for high-quality typesetting. It can be useful in taking drupelet documents out to some printed form. |
There is a question in my mind about whether TeX has a `special' role to play or not. |
|
Diaries |
A diary is perhaps the most common structure for event-oriented information about appointments, travel, and the passage of time through a day's worth of activities. |
Diary information can be a summary of information flows about a number of different aspects of daily life. |
|
Expense Summaries |
Expense summaries are an interesting example of semi-structured information. They are structured to the extent that they generally involve a date, time, and monetary amount. But they also sometimes arise in unusual ways that require some form of ad hoc adaptation. |
Forward commitments are one example. Are there better ones? |
|
Papers |
Academic `papers' are a form for presenting information which is quite common to those of us who live in an academic world. White Papers and Position Papers are similar things in other environments. In our experimental environment papers are being written using drupelets as the key elements in the composition process. This allows an evolution from, in this case, a two-up presentation of the information along with a comment to a one-up more conventional presentation as a rather straightforward part of the process. It also makes it easy to produce an `abstraction' document that shows the outline of the information without any difficulty. |
Academic papers are probably the most important single case in our environment. |
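The move from a two-up draft to a one-up paper, and on to an outline, might be sketched like this. The drupelet fields (`title`, `content`, `comment`) mirror the three columns of this document, but the dictionary layout, the function names and the sample text are assumptions made only for illustration.

```python
# Illustrative sketch: each drupelet is assumed to carry a title, the
# original content, and a comment.

drupelets = [
    {"title": "Introduction", "content": "What the project is.", "comment": "Expand later."},
    {"title": "Outlines", "content": "Outlines lead to documents.", "comment": "Check wording."},
]


def two_up(items):
    """Content and comment side by side, as in the working draft."""
    return [(d["title"], d["content"], d["comment"]) for d in items]


def one_up(items):
    """Conventional presentation: content only, comments dropped."""
    return "\n\n".join(d["title"] + "\n" + d["content"] for d in items)


def outline(items):
    """The `abstraction' document: titles only."""
    return [d["title"] for d in items]
```

All three presentations are produced from the same stored drupelets, which is why the evolution from one form to another is a straightforward part of the process.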
|
Parallel-ogues |
Parallel-ogues are a particular form of critique where each paragraph of a paper is commented on by a corresponding paragraph of critical text. This has already been exercised in several papers that have been released. So far the experiment seems to be successful, in that this seems to provide a technology that can be effectively used to solve the problem. |
It may be a little early to claim any victory for parallel-ogues. But at least through half a dozen or so exercises, the technology seems to be holding up well. And there's every prospect that the experiment will continue to produce favorable results. |
|
Medical Measures |
An increasing number of medical devices, of which prototypical examples might be blood glucose measurers or blood pressure meters, are now becoming capable of communicating with computers. As this happens, it becomes possible to integrate the measures taken with the other elements that might naturally fit into a schedule. |
I'd be particularly interested in further examples of the kind of instruments that might fit into this category. |
|
Audio/Video |
Increasingly, current technology allows us to incorporate audio / video `images' into documents. In addition, catalogues of this kind of information are an example of some of the documents that we might want to produce. |
The examples here should probably be broadened somewhat. |
|
The Desktop |
The desktop is another example of a `document type' that it is useful to be able to represent. While it might not seem useful, at the outset, to regard a desktop as a document type, it doesn't require much of a stretch to do so. A typical `desktop' has a phone, calendar, folders full of various kinds of paper, mail, calculator, account books, etc. Representing this can be quite useful. |
Does this description make sense? |
|
EMail |
EMail is one of the most interesting types of document that needs to be managed. At first glance the problems of handling it seem to be trivial, but there are lots of issues associated with managing attachments, in particular. |
Are there more complexities elsewhere lurking as well? |
|
Index Cards |
Index cards are one useful way of looking at things like schedule data and to-do lists. While some will use PDAs to handle this kind of function, index cards prove to be a time-honored and comfortable way of dealing with lots of different kinds of information. This `technology' actually has many advantages over newer implementations that involve more elaborate hardware. Index cards can be written on, re-organized, and they function well without access to power etc. |
I love index cards. They are easy, lightweight and highly portable. Their usefulness is often underestimated in the presence of more elaborate, newer, seemingly more glitzy hardware. |
|
Collaborations |
Collaborations are a special case of document because they more explicitly involve an important component of communications complexity than would be the case for most other forms of document. A collaboration, which involves two or more authors, can be a daunting thing to manage. And if one is not careful, work of one collaborator may get overwritten by the work of another collaborator. |
Are collaborations really different? Or is it just a focus because this document is an example? |
|
Travel Plans |
Travel plans are another useful example. Travel is a semi-structured activity. It is structured to the extent that dates and times of travel mechanisms are common to such plans. So are addresses, phone numbers and confirmation numbers. Travel also `activates' new parts of our data bases. Our `restaurant list' for Singapore is probably of only modest usefulness on most days. But if we happen to be in Singapore, then the numbers are much more relevant to our situation. |
Parsing data along a `location' dimension is a useful thing. Are there other dimensions that it might be worth mentioning? |
|
GPS Tracks |
On the surface GPS tracks are simple time-oriented position indications. However, if the time span is sufficiently short one can use a GPS as a hyperaccurate log generator. Having a GPS track of movements along with a waypoint list allows one to make a reasonable guess about an event oriented log, that would produce an accurate diary. |
A fair amount of technology to support this function already exists. For the past several years I have been generating logs of daily activities. It has become a straightforward and regularized process. |
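The step from a GPS track plus a waypoint list to a guessed event log might look roughly like this. It is a sketch, not the process actually in use: the tuple layouts, the 100 metre radius and the function names are all assumptions.

```python
import math

# Sketch under assumptions: track points are (timestamp, lat, lon) tuples,
# waypoints map a name to (lat, lon), and a point "belongs" to a waypoint
# when it lies within radius_m metres of it.

def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine formula)."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_waypoint(lat, lon, waypoints, radius_m):
    """Name of the closest waypoint within radius_m, or None."""
    best_name, best_dist = None, radius_m
    for name, (wlat, wlon) in waypoints.items():
        d = distance_m(lat, lon, wlat, wlon)
        if d <= best_dist:
            best_name, best_dist = name, d
    return best_name


def guess_log(track, waypoints, radius_m=100.0):
    """Collapse consecutive points near one waypoint into (place, start, end)."""
    log = []
    previous = None
    for stamp, lat, lon in track:
        place = nearest_waypoint(lat, lon, waypoints, radius_m)
        if place is not None:
            if place == previous:
                log[-1] = (place, log[-1][1], stamp)
            else:
                log.append((place, stamp, stamp))
        previous = place
    return log
```

Each entry in the result is a candidate diary line: a place together with the first and last time stamps at which the track was near it.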
|
Schedules |
Train and plane schedules are an important source of information that is increasingly available on the Web. In most cases these schedules need to be `scraped' back into some processable form, but the existence of the data in some processable form provides an opportunity for getting the information into a form where it can be integrated with other information resources into forms which might prove to be more useful than the old ways of managing this type of data. |
Some scheduling information is easily available, but other pieces of it have been put into forms---PDF might be an example---that are less useful than the old ways of just having the data in some convenient ASCII form. |
|
Ins and Outs |
In this discussion there has been an underlying assumption that we are dealing with increased quantities of information. This raises important questions about how this information can be effectively captured into our systems and how it can be rendered in ways which are particularly useful and effective. Several technologies are worth some special consideration on both the input and output side of the situation. |
As volume of information handled increases, it is natural that problems of input and output arise. |
|
Sources of Input |
Sources of input are particularly important. First, huge quantities of information are available at a very low cost on the net, but sources of input that require a lot of human input can be very expensive. They can also be slow. As more and more real-time information becomes available on the net, this is increasingly a hindrance. |
There are probably more input sources that are worth describing. |
|
|
Screen Scraping: Screen scraping is the activity of extracting information from pages that are being displayed on the screen. While this is certainly not `scraping' in the common sense, it is a reasonable metaphor for that activity. In practice one `scrapes' by analyzing the source that produces the screen image. Screen scraping appears to differ from lots of other parsing and categorizing in that there actually is (or at least should be) a two-dimensional approach to the problem. Screen information is not always just `linear', the vertical dimension often has a real significance that we need to be able to exploit. |
Note that conventional `recognition' technologies like Regular Expression parsing are generally quite linear in their structure. |
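The two-dimensional view can be illustrated with a small sketch that reads an HTML table into a grid, so that columns can be addressed as well as rows. It uses Python's standard `html.parser`; the class name `TableGrid`, the helper `column` and the sample schedule are invented for the example, and real pages would need more care (nested tables, spanning cells).

```python
from html.parser import HTMLParser

# Sketch: read an HTML table into a grid so that the vertical dimension
# is available. The sample page below is invented.

class TableGrid(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


def column(rows, header):
    """Return the cells under a given header, reading down the page."""
    index = rows[0].index(header)
    return [row[index] for row in rows[1:] if len(row) > index]


page = """
<table>
<tr><th>Train</th><th>Departs</th><th>Arrives</th></tr>
<tr><td>101</td><td>08:15</td><td>09:40</td></tr>
<tr><td>103</td><td>10:15</td><td>11:40</td></tr>
</table>
"""
grid = TableGrid()
grid.feed(page)
departures = column(grid.rows, "Departs")
```

A purely linear pattern match would see the three times on a row as neighbours; the grid makes it just as easy to ask for everything that sits under one heading.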
|
|
Scanbacks: Scanbacks represent a technology which is currently rare, but which holds a tantalizing prospect of being of some considerable interest to us, if it can be implemented at reasonable cost. The basic idea is straightforward. One very useful form of communication of input to the computer is to take a computer based document and write on it, and then scan back the image of the original document with the hand-written additions. Now, at the moment this is quite easy to do if we are willing to accept the fact that all we do is take a `picture' (fax image) of the document. But that's probably not what we want. It would be better to be able to scan back the image and then `subtract' the original computer generated component so that all that is left is the new addition, namely the hand-written marks. This is not only a much lower volume of data to store, it allows us to focus on what is `new' in the situation, and not be distracted by the `stuff we already know'. |
Is there a better word than scanback? |
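The `subtraction' step can be sketched in a few lines, under strong simplifying assumptions: both images are same-sized grids of 0 (blank) and 1 (ink), and the scan is perfectly aligned with the original. A real scanback would first need registration and noise handling, which are not shown; the function name is hypothetical.

```python
# Sketch under strong assumptions: aligned, same-sized, black-and-white
# images represented as lists of rows of 0/1 pixels.

def subtract_original(scanned, original):
    """Keep only the marks present in the scan but not in the original."""
    return [
        [1 if s and not o else 0 for s, o in zip(scan_row, orig_row)]
        for scan_row, orig_row in zip(scanned, original)
    ]


original = [
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
scanned = [
    [1, 1, 1, 0],
    [0, 1, 1, 0],   # hand-written mark
    [1, 1, 0, 1],   # hand-written mark
]
additions = subtract_original(scanned, original)
```

What remains is only the hand-written material, which is both smaller to store and easier to attend to than the full page image.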
|
|
Bar Codes: The technology of using bar codes---of such phenomenal importance in all of our grocery stores---is surprisingly underutilized in computer-based information systems. While most people don't have scanners as elaborate as those that are common in the grocery business, it certainly is possible to produce reasonable scanners at a very reasonable cost if we had any real use for them. Bar codes afford the particular opportunity for associating physical objects with computerized data bases. |
Would it be useful to describe the possibilities here in greater detail? If the point is easy to see, perhaps there's no need to. |
|
|
Phones: The cell phone offers the interesting prospect of becoming an important device in the process of managing information. First, and perhaps most important, it is a device we are quite likely to be carrying independent of other considerations with respect to information availability. It is a common, `ordinary' device generally necessary today to maintain normal life communications. However, as phones have become more sophisticated they have, first, become important repositories for our phone numbers. In addition, phones have become event reminders---with event information downloadable through conventional phone transmission. |
Need we go into more detail about the role---both actual and potential---of the phone? |
|
|
Machine Readable Documents: Today we can more and more prepare documents in a way that will make them machine readable. It is easy to print bar-codes and then scan them back. Further, recognition technology has improved to the point that blocks of information can be read back into a machine without much human intervention. And then there is a broad class of documents that are produced by machine, and are often available in some pre-scanned state (for example, the files which produced the document) where we can gain access to the document before it has gone through the composition process. This affords lots of opportunities for getting information back. The increasing volume of information that is available in some machine-based form is testimony to the fact that this problem is becoming more and more important. |
How good is character recognition technology these days? |
|
Destinations for Output |
An increasing number of devices are now available for displaying the output of the kinds of processes described here. This adds a complication to our lives. In former times we never needed to worry much about the display of information. How information was displayed was pretty much up to the creator of the information, and we had two choices: to accept his/her way of doing things, or to not use the information at all. Similarly, people producing the output could plan display with rather complete knowledge of the target display medium firmly in mind. Along with the new media comes both a new flexibility and the need to have controls that allow us to exercise it in ways that are useful to us. |
The issue of who is in control of output, the writer or the reader, is not only interesting but it is largely uninvestigated. |
|
|
Screens: Screens have the disadvantage of generally requiring some form of `live' power, but other than that they are increasingly high quality portals into an information space. While the fineness of screens does not yet approximate that of high quality printing, they are not only getting better and better, they also have other advantages. For example, it is possible to change the image on a screen fast enough to give the impression of motion. Thus animated presentations which would be essentially impossible on paper are quite plausible to distribute via screens, as long as the bandwidth into the screen is broad enough to be able to provide high quality images at a sufficiently fast rate. |
Might want to add some comments on screen size. |
|
|
Paper: Printing on paper is still very important, probably the most important display medium of all. Printers have now become cheap enough that they are within the reach of just about everyone, and high quality printing has become cheap enough so that it is widely available in most commercial installations, as well as to lots of private individuals. The availability of high quality printing at business and personal sites makes the integration of image information into text documents a realistic possibility. This is clearly revolutionizing the photography industry, and it is beginning to have an impact on the quality print industry as well. |
The availability of printing options makes layout language more important. Thus static/non-layout oriented technology is less useful, and technology which allows specification of information in two-dimensional displays is more useful. |
|
Sources of Both |
There are some devices which are both sources of input and vehicles on which output can be displayed. In addition there are devices which allow interaction between the input devices requesting various functions with respect to the information and the information as it is displayed. |
Is this a worthwhile class to distinguish? |
|
|
PDAs: Personal Digital Assistants are very important to some people, and of no use whatsoever to others. In any case, they have become an important destination for information---particularly telephone numbers and scheduling events. At the current time a battle is being fought between the cell phone and the PDA for control of this territory. It is unclear how this will all work out, but current indications are that phones seem to be developing adequate PDA functions at a faster rate than PDAs are handling the tasks associated with phones. The situation ought to shake clear over the next couple of years. |
IME, PDAs are dead, except for a narrow user community, unless some new problem domain turns out to be handled by them in future implementations. Their problem space is being squeezed by cell phones on the one side and small lightweight subnotebooks on the other. The most important remaining problems for small computers are `instant on' and battery life, and strides are being taken on both of these dimensions. |
|
Meetings |
It might be worth recording meetings just for the purpose of having a historical record of the project. In this case, however, we have the additional reason that this serves as an example of something that we might want to use drupelets to record. |
Dual purpose uses are a bit dangerous, as sometimes they lead to stretched cases. |
|
13 May 2003 |
The initial face-to-face encounter on the project took place on Tuesday, 13 May 2003. A broad range of topics was considered. Among these were scope, direction, technology choices and problem specification. |
Should there be a more elaborate record? |
|
Documents |
This is a list of documents that are in various stages of thought / preparation for this project. |
Are there output forms other than documents, or perhaps more categories here? |
|
Ideas |
This is a list of some documents that we might want to produce. |
Should this be said some other way? |
|
|
EMail Drupelets: Considerations: The EMail project has just been started. Already it is proving to be challenging, with some subtle issues that come up very early in the analysis process. |
I'm surprised at how much of my mailboxes is devoted to the storage of attachments. How these should be handled and whether they should be searchable for keys (obviously, if they are images---for example---they probably shouldn't) remains to be seen. |
|
|
Document Drupelets: Field Definitions: This document describes the fields that have so far been encountered in constructing documents. |
Covers issues of `necessary' fields, as well as field ordering. |
|
|
File Formats: Object Form: A description of the details of Object Form formats. To be written by Ness. |
Due Date? |
|
|
File Formats: Short Form: A description of the details of Short Form formats. To be written by Ness. |
Due Date? |
|
Drafts |
This is a list of documents that have been `accepted' to be prepared for publication. The documents are in various stages of draft. |
Are there degrees of completion? What about assignments? |
|
Published |
Here is a list of documents that have been published, along with the URLs which allow access to them. |
Should this list conform to some other `bibliography' structure? |
|
Summary: Gains and Losses |
Managing information and documents via drupelets has some distinct gains, but it also causes us, in some places, to make things harder. `There's no such thing as a free lunch' comes to mind. Let's look at some of each. |
Are there more? |
|
Loss: Need to Structure |
Using the drupelet structure may mean that we have to give a document rather more meta-structure than we might want to give it in particular circumstances. This may imply extra work, and if we don't even end up by taking advantage of the things that drupelets allow, then incurring this cost may prove to have little benefit. |
There may also be a positive aspect simply in that the structuring may help clarify the thoughts being captured in the document. |
|
Gain: Use Information more Widely |
Information stored in drupelets may have the advantage of being useful in situations other than the one for which it was originally intended. A trivial example of this might be a TV schedule which is prepared for presentation purposes, but which also may serve to create diary entries. |
Communications aspects may also be relevant. |
|
Gain: Separate Collection and Use |
Drupelets provide a particular way of separating the collection of information from places that it may end up being used. For some problems, this is a very natural structure and organization. By helping solve the `collection' problem, certain documents that would otherwise be difficult to create, can be created quite straightforwardly. |
This is both a `logical' and a `physical' point. |
|
Gain: Learning from Abstractions |
The existence of abstraction drivers also affords the opportunity to learn things both by discovering new abstractions and by allowing us to experiment by applying existing abstraction rules to new documents. |
This is a useful notion, and perhaps an example or two of use might be useful. |
|
Body Language |
The bodies of drupelets are obviously stored in some language. Just what language is appropriate remains a matter to be investigated. At the current time drupelet bodies are stored in HTML Markup. Other possibilities include at least Code Markup, and there are probably other considerations. |
Earlier experiments used a special form of markup that was translated into HTML. More recently, a `pure' HTML markup has been a more common experiment. |
|
HTML Markup |
HTML markup is quite straightforward. In general, the drivers that produce output surround the body of the drupelets with paragraph markers. As a result, paragraphs within a drupelet are marked by an end-paragraph, begin-paragraph pair (rather than the reverse). Font choices, etc. are marked with conventional HTML. |
This strategy has the advantage that it is simple. It has the disadvantage common to all HTML schemes, namely that the text has to be processed to make sense out of it in any context other than simple presentation in a browser. Whether this is the right choice or not remains to be seen. |
|
Handling Code |
We have not yet encountered any practical circumstances where we need to handle code. What is unusual about the general `code' problem is that white space may have important implications in the interpretation of the code, and therefore the presentation mechanisms may well not be free to play fast-and-loose by reformatting the text as it is stored. |
Nothing has been done on this front, except to give a very little consideration to how it might be handled. Much more serious effort should be devoted to this question. |
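One way the white-space problem might be handled is sketched below. It contrasts a prose renderer, which is free to reflow text, with a code renderer that escapes the body and wraps it in an HTML `pre` element so that line breaks and indentation survive. The function names are hypothetical and nothing like this has been built for the project.

```python
import html

# Sketch: prose bodies may be reflowed freely, but code bodies are escaped
# and wrapped in <pre> so that white space is kept exactly as stored.

def render_prose(body):
    """Collapse runs of white space; acceptable for ordinary text."""
    return "<p>" + " ".join(body.split()) + "</p>"


def render_code(body):
    """Preserve white space exactly and escape markup characters."""
    return "<pre>" + html.escape(body) + "</pre>"


snippet = "if x < 1:\n    y = 2"
as_code = render_code(snippet)
as_prose = render_prose(snippet)
```

Treating the same snippet as prose merges the two lines and loses the indentation, which is exactly the `fast-and-loose' reformatting that code cannot tolerate.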
|
Further Work Here |
This is a temporary section that suggests some future work that may, or may not, make it into the document. |
This section will be eliminated, out of necessity, when we decide that the document has completely defined its scope. |
|
Site of Composition / Imposition |
There may be something to be learned from thinking about the implications of where work gets done in the composition / imposition process. For example, there is some interesting software that used PostScript plus the computational capacity of modern printers to do significant computational work in the printer rather than in the computers that drive the printer. Whether this has any applicability or use in the particular situation here should be investigated. |
The nature of what has to be communicated to the printer is of a rather different magnitude if work is done in the printer, rather than in the computer. This may have a particularly significant role in the rendering of images and other high-density material. |
|
Some Bewares |
There is a beware about preparing documents that involve elaborate cross structures and linkages. A real problem can arise if some part of the structure is lost. If we are not careful, such a loss may imply the loss of nearly all of the useful information because of unresolvable cross references. |
This may, or may not, turn out to be an important problem. Deciding which it is, and how it should be handled is important. |
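A simple guard against this kind of loss is to check a collection for references that no longer resolve. The sketch below assumes each drupelet has an identifier and a list of the identifiers it refers to; both field names (`id`, `refs`) and the sample collection are hypothetical.

```python
# Sketch: report cross references that point at drupelets which are not
# (or are no longer) present in the collection.

def dangling_references(drupelets):
    """Map each drupelet id to the references that cannot be resolved."""
    known = {d["id"] for d in drupelets}
    problems = {}
    for d in drupelets:
        missing = [ref for ref in d["refs"] if ref not in known]
        if missing:
            problems[d["id"]] = missing
    return problems


# Invented sample collection with two broken links.
collection = [
    {"id": "intro", "refs": ["problem", "drupelet"]},
    {"id": "problem", "refs": []},
    {"id": "summary", "refs": ["intro", "gains"]},
]
report = dangling_references(collection)
```

Running such a check whenever a document is assembled would at least show how much of the structure depends on a missing piece before the loss spreads.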