Project Gilgamesh: Implementation Archives

June 30, 2003

#ifdef J2ME

Reading the source code for Simkin, I noticed some

//#ifdef
...
//#endif

tags wrapping things that aren't supported by MIDP. (Floating point, Reflection etc.) This seems to be a fairly reasonable use of #ifdef. Then I vaguely remembered (and confirmed) that Java doesn't have a preprocessor, and therefore doesn't support ifdef!

So... I wonder how Simkin is supporting this conditional compilation: is there an Ant task to preprocess versions of a source into appropriate directories and then build them?

And, if ifdefs are really considered such bad practise that the dubious miracle of "Write once run anywhere" was supposed to eliminate them, how should this be written?

From a vague recollection of (and sudden enlightenment on the purpose of) the Factory design pattern, I suppose that the code should be factored out into (taking the Interpreter class as an example)

AbstractInterpreter base class
J2SEInterpreter class. Includes methods which call objects via Reflection and others which add Float variables to other Floats and Integers.
J2MEInterpreter class. Includes only methods which call objects via a defined interface. Float objects are not used (substitute with some Fixed Decimal class).
The AbstractInterpreter class cannot be instantiated, but can act as a factory, returning the appropriate type of Interpreter class
Now it is easy for Ant to compile the classes without having to preprocess. One target can exclude source files ending with J2SE and the other J2ME.

OK, I starting this as a rant about why isn't #ifdef supported, and I think I may have convinced myself that I don't need it. Any other comments from anyone that's come across this in J2ME or otherwise very welcome!

(Googling "ant conditional compilation" finds various links such as this one which look relevant)
Update: Aha, it looks like Antenna, an Ant extension specifically for Wireless Java includes a preprocessor task among other goodies.

Posted by osfameron at 04:11 PM | Comments (1) | TrackBack

June 12, 2003

Dynamic grid data-structure: autovivifying cells

The coding is going well though, due to lack of time, much more slowly than I'd like. So I'm still working on the container model (Grid, Axis, Rows, Columns, Cells), which turns out to be more interesting than I'd thought.

I'd originally planned to have a fixed size grid, which could be expanded if necessary. Microcalc works this way: you specify a starting size (8 x 32 by default) but can insert/delete rows later to change the size.

Dynamic grids

Then I decided that I wanted the grid to be more dynamic: Rows and Columns will be appended when needed, and deleted when not. If I ask for a cell (E111) or a range (A1:Z100) then that object, plus any relevant structure like the Rows and Columns that contain it should also be returned. But only if they are required! To explain what I mean a bit better:

In the User Interface, user clicks on cell E13.
I want to be able to return the Formula of that cell, even if it doesn't exist.
If the user modifies the formula, I want to be able to write cell.setFormula(newFormula) and transparently handle the cases where in fact that cell, and the row and column it's in don't exist.
If however the user decides to move on and click on another cell, I don't want to have created a cell, rows and columns etc. that are now unused.

My solution for this is to return a proxy-object if the real object doesn't exist. Actually, I don't think that Java supports proxies as such (like Mark Overmeer's Perl module Object::Realize::Later) which dynamically change their class to the requested one when certain conditions are fulfilled. So I'm actually settling for these definitions:

Real: an object that is referenced by its container object. (A Cell in a Line, a Line in an Axis)
Proxy: an object which knows which container it should be in. However, its container does not have a reference to it.

Here's an example of the usage of these proxy objects:

We request cell at (3,3)
Grid gets Row 3 and Col 3
... but there are currently 0 rows. The Row LineAxis creates a proxy Line at order 3 and returns it.
... and the same for the Column LineAxis. NB: neither LineAxis actually keeps any reference to these proxy objects.
Grid checks if ProxyRow 3 contains a Cell intersecting with ProxyColumn 3.
... it doesn't of course. So it returns a Proxy Cell which claims to be in Row 3 and Col 3.
If we later discard the cell then the Proxy Row and Col objects have no references to them, so they will be discarded.
But: if we update the cell, it will now try to upgrade itself into a proper cell by calling its registerSelf() method.
This will ask both its Row and Col to register it. They will now maintain a reference to the cell.
As soon as the Proxy Lines register the cell, they will also have to register themselves (again by calling a registerSelf() method.
The respective LineAxis will note the current size (0), and the new order (3).
LineAxis first appends 3-0=3 new lines to the Axis, and then appends the proxy Line.
It doesn't now matter if we lose the reference to the Cell, as it is now properly registered in the Grid structure.
The next time we request cell at (3,3)
Grid gets Row 3 and Col 3
... this time these are real rows.
... and checks if Row 3 contains a cell intersecting with Col 3. This time it does and the cell is returned.

interfaces

I coded this rather ad-hoc, but will refactor into 2 interfaces:

public interface GridComponentCandidate {
	public boolean isInteresting();
	public void registerSelf();
	public void deRegisterSelf();
	public void addRegistrar(GridComponentRegistrar g);
	public void removeRegistrar(GridComponentRegistrar g);
}
public interface GridComponentRegistrar {
	public void register(GridComponentCandidate c);
	public void deRegister(GridComponentCandidate c);
}

Interesting?

The method isInteresting() returns the conditions under which a Component needs to be registered. For example, a Cell will be registered if it has (any one):

a value
a formula
a format
dependencies (one or more nodes refer to it)
a name/alias

A Line will be registered if it has (any one):

one or more registered cells
one or more registered ranges
a custom name (?)

Each GridComponentCandidate object will check whether it has become interesting (or vice versa) in any method that updates any of these criteria.

Subscribe relationship

This is basically a subscribe/callback relationship, so I guess addRegistrar(...) and removeRegistrar(...) are modeled on the Listener interface. IIRC, J2ME doesn't have a generic Listener interface to inherit from, so I'm keeping it light-weight by not coding one up (but this decision may be wrong and I'll change it if so). Originally I hardcoded the registerSelf() methods (Cells had to update their Row and Col; Lines had to update their LineAxis) but I think it may be more flexible (for example, for testing) to use this model.

LineAxis more complex

For LineAxis, when it is asked to register(Line l) or deRegister(Line l), the behaviour may be quite complex. For example:

Create line 3 proxy
Update line 3 and register
... Axis appends blank lines 0-2, line 3
Create line 5 proxy
Update line 5 and register
... Axis appends blank line 4, line 5
Line 3 becomes uninteresting and requests to deregister.
... however, Line 3 is not the last line, so no action is taken.
Line 3 become interesting again and requests to register.
... Axis already has a Line 3.
... so we check that it is the same as the registering Line 3.
... it is, so we do nothing. (If it had been different, we'd need to raise an exception: see the next section on Out Of Date Proxies)
Line 5 become uninteresting and requests to deregister.
... Line 5 is the last line so
... LOOP: The last line (5) is uninteresting, so deregister it.
... LOOP: The last line (4) is uninteresting, so deregister it.
... LOOP: The last line (3) is interesting -> BREAK

Out Of Date Proxies

If we keep a reference to a proxy after another proxy has autovivified, it may be out of date. For example:

Create line 3 proxy.
Create line 5 proxy.
Update line 5 and register
...Axis appends blank lines 0-4, and line 5.
Get and modify (real) line 3.
Update line 3 proxy and register
CONFLICT: line 3 already exists.

There may be better ways to deal with this, but I'm currently assuming that only these objects:

Current Cell
Selected Range
Destination Range (for copy-paste)

will ever contain proxy cells or ranges. As they will constantly select new objects, old copies of the proxies will not stick around. If a careless programmer (me) forgets this and keeps an out of data proxy, then an Exception will be raised at runtime. I thought that this kind of problem might be irritating to debug, as the proxy could have been assigned at any time; but on the other hand, if the proxy isn't in one of the documented objects above then it might actually be easy to track down.

There is an additional complexity for Ranges: the Head Line will need to be registered before the Tail Line. (Otherwise when the Head line tries to register it will be rejected as Out of Date!)

`Data::Table`

Thinking about representations of the Grid, I recalled the Perl module Data::Table. Looking at this, it appears that it maintains Cells in either Rows or Columns (but not both). However, if it would be useful to treat them in another way (for example to insert a Column in a Row-based table), then Data::Table internally rotates the table to the other representation before continuing. That's actually quite clever, and more efficient in terms of storage requirements. I don't think I'll follow suit, but something to think about.

Posted by osfameron at 11:21 AM | Comments (0) | TrackBack

May 16, 2003

Cell evaluation order and cyclical calculations - part 2

Simple ordering?

In Bob Frankston's notes on implementing Visicalc, he noted that they weren't tempted for long by 'natural order' updates.

Since the formulas did depend on each other the order of (re)calculation made a difference. The first idea was to follow the dependency chains but this would have involved keeping pointers and that would take up memory. We realized that normal spreadsheets were simple and could be calculated in either row or column order and errors would usually become obvious right away. Later spreadsheets touted "natural order" as a major feature but for the Apple ][ I think we made the right tradeoff.

Of course, the Visicalc team were implementing their spreadsheet in "a 16KB target, that included enough space to actually hold a useful spreadsheet" (!). This makes me think

The Symbian devices Gilgamesh will target are also highly constrained. I need to think about whether the fact that natural order is more elegant and powerful is worth the memory required!
On the other hand, a device like the Sony Ericsson P800 can easily cope with a much larger application of several megabytes. I've been working with a complete guesstimate of about 200K for the compiled application.

On balance, I think that we will stick with natural order. It just feels like the Right Thing.

When circular calculation is useful

Alan Eliasen's comment on cell calculation noted that there are reasons not to ban circular calculations completely: There is a certain class of problems that require iteration to gradually work towards a correct answer. The Spreadsheet just has to stop calculating when it gets the right answer ;->. Alans notes that 2 ways are:

Having a "convergence limit" for each cell. If the value of the cell (or all cells) changes less than a certain amount, then consider the solution "good enough". If it's not good enough, add all dependent cells to the dispatch list to be calculated again. This has many subtle problems.
Having a "iteration count" associated with each cell. If a cell tries to solve too many times, it's probably not converging and we aborted the run with an error.
Both of these have an obvious memory cost. Solution 1 can be reduced by having a single convergence limit for *all* cells.

Professional spreadsheets will tend to disallow cyclical calculations by default, but can be configured to allow them. A quick peek at Microsoft Excel 2000 shows that a global option can be set to allow iteration. You can select both a maximum iteration count, and a Maximum change (convergence limit).

I have to admit that I'm a little blinkered in that I've barely ever used iteration in a spreadsheet but my first thought is that Gilgamesh as a constrained J2ME application should not allow circular references at all. Sure, this means reduced power, but the Keep It Simple principle may be the more important consideration. Thoughts?

Posted by osfameron at 09:51 PM | Comments (0) | TrackBack

J2ME Floating Point Libraries

Michael Zemljanukha (the author of the MicroMath library and the MicroCalc spreadsheet) posted a number of helpful comments, including some pointers to floating point libraries

http://sourceforge.net/projects/jmfp
http://henson.newmail.ru/j2me/Float.htm

Alan Eliasen previously mentioned that he uses

http://www2.hursley.ibm.com/decimalj/

for Frink.

Posted by osfameron at 09:44 PM | Comments (0) | TrackBack

May 08, 2003

MIDP and Iterators...

Of course, checking over the J2ME documentation, the Java MID Profile (MIDP 1.0) doesn't have Iterators, which does affect the Graph DFS implementation I sketched out before.

I could create my own Iterator interface and classes of course, but I'll think a little about other ways to do it too.

Update: 2003-05-09 01:47 OK, here is a version using MIDP's basic Collection structures, Stack and Vector. This DFS class exposes an interface similar to J2SE's Iterator. Note that this version also checks for Circular paths.

import java.util.*;
public class DFS {
	private Stack stack;
	public DFS(Node n) {
		stack = new Stack();
		push(n);
	}
	public boolean hasNext() {
		if (stack.isEmpty()) {
			return false;
		} else {
			return true;
		}
	}
	private void push(Node next) {
		Vector seen = new Vector();
		push(seen, next);
	}
	private void push(Vector seen, Node next) {
		Vector vNext = new Vector();
		vNext.addElement(next);
		push(seen, vNext);
	}
	private void push(Vector seen, Vector next) {
		stack.push(seen);
		stack.push(next);
	}
	public Node next() throws Exception {
		Vector next    = (Vector) stack.lastElement();
		Node   nextval = (Node)   next.elementAt(0);
		Vector seen    = (Vector) stack.elementAt(stack.size() - 2);
		if (seen.contains(nextval)) {
			throw new Exception("Circularity error: " + nextval + " >>> " + seen);
		}
		next.removeElement(nextval);
		if (next.isEmpty()) {
			stack.pop(); stack.pop();
		}
		if (! nextval.edges().isEmpty()) {
			Vector newSeen = vecCopy(seen);
			newSeen.addElement(nextval);
			push(newSeen, vecCopy(nextval.edges()));
		}
		return nextval;
	}
	public static Vector topoSort(Node n) {
		DFS d = new DFS(n);
		Vector topo = new Vector();
		while (d.hasNext()) {
			try {
				Object o = d.next();
				topo.removeElement(o);
				topo.addElement   (o);
			} catch (Exception e) {
				e.printStackTrace();
				System.exit(0);
			}
		}
		return topo;
	}
	/////////////////////////
	private Vector vecCopy(Vector old) {
		Vector newVec = new Vector();
		for (int i=0, s=old.size(); i < s; i++) {
			newVec.addElement(old.elementAt(i));
		}
		return newVec;
	}
}

Posted by osfameron at 09:15 PM | Comments (0) | TrackBack

Cell evaluation order

I've given this some thought and I'd been planning to get to this shortly, but Alan Eliasen asked about it. Alan gave a good link to Bob Frankston's notes on the development of VisiCalc which also mentioned the evaluation strategy they used:

Row order

Every time a cell is changed, loop through all the other cells updating them once. (An obvious alternative strategy is column order).

I'd heard of this, considered it briefly, and rejected it because this brute force sounded inelegant, involving too much unnecessary work, especially for a larger sheet.

Cascading updates

Originally, I'd planned for every cell to keep a list of pointers to dependent cells (any cell that references it).

   .-->B--.
  /        \
 A          +-->D-->E
  \        /
   `-->C--'

When a cell is updated it will ask its dependent cells to update themselves. Those cells in turn may update other dependent cells: this means that there is a risk of circular referencing. So when I trigger the update I pass a reference to a Collection (an ArrayList, Vector, or HashTable) adding the original cell.

Of course, the same cell may need to update itself 2 or more times: this is OK, as long as each time is on a different evaluation path. Here, D is evaluated twice, but the cell-paths passed in are not circular:

 (A)
  .-->B--.(AB)
 /        \      (ABD)
A          +-->D------>E
 \        /      (ACD)
  `-->C--'(AC)
  (A)

Each subsequent cell checks if it is in that Collection before appending itself to it. If it's already there, then this calculation path was circular, and should generate an error. Here, cell C is asked to recalculate in a path that already includes C:

  (A)
   .-->B--.(AB)
  /        \      (ABD)
 A          +-->D------>E
  \        /      (ACD) |
   `-->C--'(AC)         |
   (A) ^               /
       |              /
        `------------'
           (ACDE)
             ^
      Circularity
      Error here!

Natural order

This would work, but means that the same cell may have to recalculate more than once, which could be inefficient, and might have some pathological cases.

It would be best to calculate in which order the cells have to be updated, and then calculate them all once only. This is called 'natural order'. And while you might think that calculating the order to evaluate cells in was a tricky task, in fact there is a simple way to do it.

If you have a Computer Science background, you might recognize the dependency diagrams above as a DAG, or Directed Acyclic Graph. And you might realize that all we have to do get the evaluation order is to run a topological sort on single-source DAG. If you don't understand that, then read on!

Graphs in computer science aren't the same as the nice bar-charts and diagrams that non-geeks might expect them to be. They are just

a collection of nodes (like our Cells)
connected by edges to other nodes. (the references to other cells)
The connections can be directed, which means they go in one direction only (Cell A1 is a dependency of B5. But not the other way round!)
The graph can be acyclic if there are no paths which come back on themselves (e.g. there are no circular references).
A graph can have one or more sources, which is the entry point into the graph. (In the diagram above, I showed a graph with a source at Cell A. But of course if we were updating Cell B, C, D... Z, we would use the part of the graph that started with that cell. That sub-graph might not go to every node in the full graph).

So, the diagrams on cell dependencies represent Directed Acyclic (hopefully) Graphs with an entry point at a Single Source which is the first Cell to be updated.

Topological sort is a sorting technique that involves visiting each node and the nodes it links to in turn (often called 'traversing' the graph) in such a way that all dependent nodes are listed last. You can walk the graph in 2 ways, "Breadth first" (BFS), or "Depth first" (DFS). This is actually rather intuitive:

Breadth-first means visiting all the nodes from the Source node. Then all the nodes pointed to from all of these nodes. And so on.
Depth-first means starting with the Source node, and diving all the way down first one branch, then the next.

An accurate & elegant description of the topological sort algorithm, then, is to do a depth first search and list the nodes in the order in which they are last seen. The easiest way to understand this is with an example:

   .-->B--.
  /        \
 A          +-->D-->E
  \        /
   `-->C--'

Starting with cell A we follow this branch to its end (depth first)

A
A B
A B D
A B D E

and then start on the other branch

A B D E
A B D E C
A B D E C D (now that D has appeared later, we scrap the earlier entry)
A B ~~D E~~ C D E

Leaving

A B C D E

If none of this makes sense, you're in good company. Certainly, I found a lot of this baffling, and I had better teachers than me. (Specifically, I read the Graphs section in Algorithms with Perl which is quite extensive and clear, and offers nice implementations - albeit not in J2ME). I've come across one book, Introduction to Data Structures and Algorithms with Java, which I didn't buy because it is old (in Computer book terms), and also looks like it spends a lot of time looking at non-graph related things like AWT, which makes it seem a bit incoherent). Maybe this book is good? I think it's also worth looking at language-generic introductory guides to Computer Science.

Beginning an implementation

It is fairly trivial to implement a minimal graph-walking algorithm doing a simple recursive tree walk. Of course it's often more elegant to do this iteratively. (For example Dominus will have some chapters on this in his forthcoming book). Using the Iterator interface is also Java best practise. (Though I think J2ME doesn't define the Iterator interface).

Here's some basic setup for a basic Node class.:

import java.util.*;
public class Node {
	private Vector  edges;
	private String  name;
	public Node(String name) {
		edges = new Vector();
		this.name  = name;
	}
	public Collection edges() {
		return edges;
	}
	public String toString() {
		return name;
	}
}

It is fairly easy to add a DFS Iterator class:

import java.util.*;
public class DFS implements Iterator {
	private Vector iterators;
	private Node       curr_node;
	private Iterator   curr_it;
	public DFS(Node source) {
		curr_node = source;
		curr_it   = curr_node.edges().iterator();
		iterators = new Vector();
		iterators.addElement(curr_it);
	}
	/** The class works by keeping a stack of Iterators
	  * for each fork in the Graph.  The hasNext() method
	  * checks if any of the Iterators in the stack have
	  * any more Nodes to return (e.g. if they return true
	  * on hasNext()).
	  */
	public boolean hasNext() {
		while (! iterators.isEmpty()) {
			curr_it = (Iterator) iterators.lastElement();
			if (curr_it.hasNext()) {
				return true;
			}
			iterators.remove(iterators.lastElement());
		}
		return false;
	}
	/** Every time a new object is selected, add the
	  * Iterator for its own edges to the stack.
	  */
	public Object next() {
		Node next = (Node) curr_it.next();
		Iterator new_it = next.edges().iterator();
		iterators.addElement(new_it);
		curr_it = new_it;
		return next;
	}	
	public void remove() {
		; // just to fulfil requirements of interace.
	}
}

(Adding checking for cyclic paths in the graph is left as an exercise to the reader...)

And of course, add a method to the Node class:

public Iterator DFSIterator() {
		return new DFS(this); 
	}

Now it's easy to create the Topologically sorted dependency list (add this method to Node class). Iterate through the Graph, adding the latest node to the end, first of all removing any previous occurrences:

public Vector dependencyList() { 
		Vector v = new Vector();
		v.add(this);
		Iterator it = this.DFSIterator();
		while (it.hasNext()) {
			Object o = it.next();
			if (v.contains(o)) {
				v.remove(o);
			}
			v.add(o);
		}
		return v;
	}

Coming up soon:

Maybe I'm wrong about optimization of row-based vs natural?
Why we might want Circularity

Posted by osfameron at 10:07 AM | Comments (0) | TrackBack

May 01, 2003

Expression evaluation: Simkin, Micromath, Frink

I've come across some other projects, and had to re-question whether Gilgamesh will have its own expression evaluator as I suggested above. All comments welcome!

Simkin

At the pub-meet I met Simon Whiteside, enthusiastically promoting his Open Sourced scripting language, Simkin. I'd read the blurb on it before, and thought "huh?" before bookmarking it and moving on. Simon says that Simkin sometimes gets this reaction, but has convinced me that it's worth looking at in more detail!

It's lightweight (50K jarfile), is implemented in both C++ and Java, and has been ported to J2ME.

it is also an expression evaluator which, if I understand rightly, can easily be customized to include custom objects. So, for the purpose of Gilgamesh, something like

=A5 * 4 + sum(B2:C5)

Could actually get parsed as a Simkin expression which refers to Gilgamesh objects for the Cell (A5) and Range (B2:C5).

Though I'm happy to steal, um, leverage existing code to parse formulas (as I wrote above, I was quite happy to leave this to someone else!), I guess that it will have some impact on Data representation, and on formatting of Cell results. (I'll try to write up my initial thoughts on these soon). So I have to look into any existing code and API before I commit. Some good documentation from the Simkin site itself, including links to off-site articles.

If I do use Simkin, I should also think about using it as an on-board extension language!: I hadn't thought about having an scripting language in Gilgamesh, because it seems like unnecessary bloat, but if I get it for free it's worth thinking about.

Micromath

Some time ago, I looked at the Micromath library, which powers Microcalc. Microcalc is one of the big influences on this project, in terms of inspiration, and a benchmark to try to compete with. Rafe has suggested it'd be worth getting in touch with the Microcalc development team (Michael Zemljanukha) to discuss the two projects: now on my todo list!

J2ME doesn't support Floating Point maths. Micromath gives Fixed Decimal maths support, and is a very compact Java library. However I'm leaning towards using the MathFP library (mainly because it has better documentation.)

Frink

Alan Eliasen has posted a comment mentioning his project Frink. From a first quick read of the (nicely detailed) docs, you could describe it as a mix between an expression evaluator that knows about the world, a miscellany of facts and figures, and a programming language that, like Perl, tries to DWIM ('Do What I Mean') in some really clever ways.

In one of the wonderfully whimsical examples in the docs, he calculating the rainfall in the Biblical Great Flood using the formula:

depth = 29030 feet + 15 biblicalcubits - (2.4 inches/ year 4000 years)

Frink knows about feet, and inches, and years, and um... biblical cubits. Not to mention currencies, current and historical!

It also tracks the unit type of a calculation, making sure that you do sensible things with them: e.g. you can add 1 mile + 4 metres but it wouldn't make sense to add 1 mile + 4 volts. (2 metres) * (2 metres) will actually know to return an area in square metres! I love this idea: surely this should be mainstream?

Alan noted that he'd thought about using this in a Spreadsheet: I would love to see that happen, and I have to admit I'd be seriously tempted to use it in Gilgamesh. On the minus side:

Apparently it can be a little heavyweight: the data file is already large, and when it is (very cleverly) parsed to get all the relationships between the different data types, the resulting data structure is quite complex. Update: See Alan's comment for some clarification.
Though it's been ported to constrained devices, including my target platform, the SonyEricsson P800, it's been ported to PersonalJava rather than MIDP (see above).
I don't think it's designed to be plugged in to object models, worth investigating though.
It isn't open source. See the FAQ for a reasonable reason why not.

Posted by osfameron at 01:42 AM | Comments (2) | TrackBack

April 26, 2003

Formulas

I hinted in the entry on the Grid that each cell would contain a list of cells that it refers to in formulas.

    A   
 +------+
1| 1    |
 +------+
2|A1*2  | <-- references:  A1
 +------+
3|A1+A2 | <-- references:  A1,A2
 +------+

But of course this is an oversimplification. We don't just need a list of cells that we reference: we need to know how they interact with the formula. The formula will need to be parsed into a tree of data relating

Functions
Cell references
Constant data (numbers, strings, dates)

Some examples

(Don't pay too much attention to the pseudo-syntax)

A1+A2
 - function(+)
   - ref(A1)
   - ref(A2)  
if(A1==23 then "Yes" else "No)
 - function(if)
   - function(equals)
     - ref(a1)
     - constant(23)
   - constant("Yes")
   - constant("No")

The last example shows a more complicated example, demonstrating that parsing the syntax could be quite intense. Of course, even a standard, commercial product like Excel simplifies the syntax by making it more functional:

if(equals(A1, 23), "Yes", "No") // Excel syntax

As you can see, this is much more easily translated into the parse-tree above.

S-expressions

The easiest (for the implementer) way of having formulas easily parseable would be to use the Lisp syntax of S-expressions. 99% of Lisp commands are written like this

(function arg1 arg2 arg3 ...)
Converting the formulae above:
  (+ A1 A2)
  (if (equals A1 A2) "Yes" "No")

It looks a bit odd if you're not used to it, but it's trivial to build the parser. My first design idea is this:

If you write a formula beginning with parenthesis "(", then it will be parsed as an S-expression.

If you write a formula beginning with an equals "=", just like Excel and co, then it will be parsed more cleverly.

  =1 + 2               // arithmetic operators come inline
  =function(arg, arg)  // functions as standard
  =A1                  // reference unchanged

The S-expression parsing will get implemented first: we can worry about the syntactic sugar later

Whichever way the formula is specified, it will end up being parsed as

Reference to a Function object
Zero or more argmuents

Of course, any of the arguments may also be Formula objects in turn, thus creating the parse tree.

(if (= A1, 23), "Yes", "No)
- Formula:
  - Function:  if
  - Formula------------.
  - Constant: "Yes"     \
  - Constant: "No"       \
                    - Function:  =
                    - Reference: A1
                    - Constant:  23

The Function class will have a simple interface something like:

public abstract class Function {
  public String display;
  public boolean CheckArgs(Value[] args) {
    // ... (filled in by subclass)
  } 
  public Value execute(Value[] args) {
    // ... (filled in by subclass)
  }
}

and is extended by subclasses like so:

public class EqualsFunction extends Function {
  public EqualsFunction() {
    display = "=";
  }
  public boolean CheckArgs(Value[] args) {
    if (args.size == 2) {
      return true;
    } else {
      return false;
    }
  }
  public Value execute(Value[] args) {
    if args[0].equals(args[1]) {
      return BooleanValue.TRUE;
    } else {
      return BooleanValue.FALSE;
    }
  }
}

Common framework for data

NB: I'm assuming that all the possible data-types (Booleans, Formulas, Arguments, Strings, Cell References) inherit from a parent class (here Value). That's because we need to be able to pass them around, store them, and use them consistently.

This might well be too heavyweight for being on a J2ME constrained device. But I think it will have its advantages: we'll see as we go on and try to optimize later if we need to. More on this topic later.

Posted by osfameron at 10:27 AM | Comments (0) | TrackBack