F08 CPSC 559 Proxy

Overview
This is a report for the F08 CPSC 559 term project about a distributed proxy server system by Shane Clements, Paul Kirby, and Neil Tallim.

Admittedly, building a proxy system is a reasonably simple task, which is why we added an indexing tier that supports a tree structure, resource distribution, and fault tolerance to make it more suitable for presentation as a research topic.

Through this project, we sought to learn the basics of building a distributed system with some practical significance: the design involves multiple components that all have the capacity to do different things, but the end user won't see any of the infrastructure at work, no matter how much the plug-and-play distributed components change while in use.

What we designed
Our goal was to take the concept of a proxy and apply it in a distributed manner to create a system that would transparently accept a request from a web browser and choose, from a pool of proxies, the best path to provide service as quickly as possible.

One option we saw for increasing performance was proxy-side content caching, so we incorporated it into our design. Because we wanted to emphasize our learning, however, we chose to treat it as an optional feature that proxies could implement if their developers felt there was something to be gained. This had the side-effect of making the concept much more interesting: efficiency could now be studied, and load-balancing algorithms could take caching into consideration when deciding where to direct requests.

We also felt that we should support fault-tolerance, so we devised two topologies that could be used to ensure that a live proxy was always chosen to service requests. One was a star topology, in which any number of proxies are tethered to an indexing resource. The other was a ring topology, in which proxies extend the star-tether with head-and-tail linking, allowing self-organization logic and making possible a bootstrap-and-repair algorithm to welcome new peers and bridge gaps when proxies go down.

Out of these ideas, we settled on the following keys for this project:
 * We would have two types of nodes: index servers and proxies
 * Index servers would be nodes that serve as hubs for any number of proxies
    * They would be the sole client-facing entities in the system, waiting for browser requests
    * They would implement load-balancing algorithms of their own design to control the flow of requests to the proxies they govern
    * They could be linked to other index servers to allow requests to be redirected to the proxy-cluster best-suited for addressing them
 * Proxies would be nodes that connect to one index server, using the star topology (because of time limitations)
    * They would be configurable by index servers upon connection
    * They would be free to implement caching
    * If caching were supported, they would need to respect any rules received from the index server
    * They would be responsible for computing their own suitability for retrieving arbitrary data identified by the index server

We also planned for a client-side daemon that would cause the client to change which index server it spoke to, depending on availability. This, of course, would have required a third type of node: an index server index, which would keep track of which index servers were up at any given time -- these nodes would also have to be mirrored on reliable hosts.

We also chose to parse the HTTP protocol upon receipt: index servers read the browser's message, then communicate only the relevant data to proxies over an internal protocol, simplifying the model and making it feasible in the time we had.
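As a rough illustration of this reduction step, the index server only needs the URL out of the browser's request. The function name below is hypothetical, not from the delivered code, and it handles only the HTTP/1.0 GET case the project supports:

```python
# Hypothetical sketch: reducing a browser's HTTP/1.0 request to the one
# piece of data the internal protocol needs -- the URL.
def extract_url(raw_request: bytes) -> str:
    """Pull the URL out of an HTTP/1.0 GET request line."""
    request_line = raw_request.split(b"\r\n", 1)[0].decode("ascii")
    method, url, version = request_line.split(" ")
    if method != "GET":
        raise ValueError("only HTTP/1.0 GET is supported")
    return url

print(extract_url(b"GET http://www.google.ca/ HTTP/1.0\r\nHost: www.google.ca\r\n\r\n"))
# -> http://www.google.ca/
```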

What we built
We first defined a minimal protocol that would allow us to demonstrate the functionality of our design to a proof-of-concept degree.

Our protocol, overview
Our system is built around a protocol communicated via TCP/IP. A sample message looks like "FITC!@ http://www.google.ca/\r\n ". Within the system, this means "Compute your fitness for providing the Google homepage in a timely manner, proxy. By the way, this is request #16353. Over." (Fitness scores are explained in the "Fitness algorithm" subsection)

On a technical level, messages are four-byte instruction codes followed by an instruction-specific payload. In this sample case, the request ID is a two-byte value, and it's followed by a URL.
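The framing described above can be sketched in a few lines. The helper names are illustrative, and big-endian byte order for the request ID is an assumption; the report does not specify endianness:

```python
import struct

# Illustrative sketch of the wire format: a four-byte ASCII instruction
# code, a two-byte request ID, then an instruction-specific payload.
def pack_message(code: bytes, request_id: int, payload: bytes) -> bytes:
    assert len(code) == 4
    return code + struct.pack(">H", request_id) + payload

def unpack_message(message: bytes):
    code = message[:4]
    (request_id,) = struct.unpack(">H", message[4:6])
    return code, request_id, message[6:]

msg = pack_message(b"FITC", 16353, b"http://www.google.ca/\r\n")
print(unpack_message(msg))
# -> (b'FITC', 16353, b'http://www.google.ca/\r\n')
```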

Handshaking
Server: HI2U&lt;data-port as a 2-byte int&gt;&lt;optional XML configuration data&gt;

This tells the proxy where it will need to send retrieved data (more on that later), and it may include an XML block that tells the proxy what variables it needs to set to behave properly.
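On the proxy side, reading the greeting amounts to splitting off the port and treating the remainder, if any, as configuration. This is a hedged sketch (the function name and big-endian byte order are assumptions):

```python
import struct

# Sketch: a proxy parsing the HI2U greeting -- the literal code, a
# two-byte data-port, then whatever remains as optional XML config.
def parse_hi2u(message: bytes):
    assert message[:4] == b"HI2U"
    (data_port,) = struct.unpack(">H", message[4:6])
    xml_config = message[6:].decode("utf-8") or None
    return data_port, xml_config

port, config = parse_hi2u(b"HI2U" + struct.pack(">H", 8080)
                          + b"<config><cache>on</cache></config>")
print(port)  # -> 8080
```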

Latency and stability
Sender: PING&lt;payload, normally a UNIX timestamp&gt;

Recipient: PONG&lt;same payload&gt;

This allows both servers and proxies to find out what their communication latency happens to be, and it enables them to discover when their connections have been interrupted.
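A minimal sketch of the round trip, assuming the timestamp payload mentioned above (helper names are illustrative):

```python
import time

# Sketch of the PING/PONG exchange with a UNIX-timestamp payload.
def make_ping() -> bytes:
    return b"PING" + str(time.time()).encode("ascii")

def handle_ping(message: bytes) -> bytes:
    # The recipient echoes the payload back unchanged.
    return b"PONG" + message[4:]

def latency_from_pong(pong: bytes) -> float:
    sent_at = float(pong[4:].decode("ascii"))
    return time.time() - sent_at

pong = handle_ping(make_ping())
assert latency_from_pong(pong) >= 0.0
```

A peer that stops answering PINGs within some timeout can be treated as disconnected, which is how interrupted connections are discovered.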

Fitness
Server: FITC&lt;request ID as a 2-byte int&gt;&lt;URL&gt;

Proxy: FITV&lt;request ID as a 2-byte int&gt;&lt;fitness from 0 through 99&gt;

This allows each proxy to let the index server know how well it thinks it can serve a request for the specified resource. The index server then chooses the best-suited proxy once a response has been recorded from each proxy. (Proxies that take too long to respond are considered to have a value of 0)
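The selection step itself is simple once the scores are in; a sketch, with illustrative proxy names:

```python
# Illustrative: an index server records one FITV score per proxy and
# picks the highest; proxies that timed out are recorded as 0.
def choose_proxy(scores: dict) -> str:
    """scores maps proxy identifiers to fitness values 0..99."""
    return max(scores, key=scores.get)

scores = {"proxy-a": 50, "proxy-b": 99, "proxy-c": 0}  # proxy-c timed out
print(choose_proxy(scores))  # -> proxy-b
```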

Request handling
Server: RTRV&lt;request ID as a 2-byte int&gt;&lt;URL&gt;

This is sent to a proxy once the index server has chosen the one that will service the request, based on fitness scores.

Proxy: BODY&lt;request ID as a 2-byte int&gt;&lt;MIME-type&gt;&lt;null&gt;&lt;data&gt;

This is sent along a new connection established to the data-port specified in HI2U. This is done for two reasons: first, the data returned may be binary, so this lets us avoid special delimiter-protecting encoding -- the socket is simply closed when the transfer is done; second, the data to be returned may be very large, and blocking the command-port for the duration of the transfer would be disastrous for performance.
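The close-to-delimit idea can be demonstrated with a connected socket pair; this is a sketch, not the delivered code, and the helper names are assumptions:

```python
import socket

# Sketch of the BODY exchange: the proxy sends the header fields,
# streams the payload, and closes the socket to mark the end -- no
# length field or escaping is needed, so the payload can be binary.
def send_body(sock, request_id, mime_type, data):
    sock.sendall(b"BODY" + request_id.to_bytes(2, "big")
                 + mime_type + b"\x00" + data)
    sock.close()  # the close delimits the payload

def recv_body(sock):
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # peer closed: transfer complete
            break
        chunks.append(chunk)
    message = b"".join(chunks)
    request_id = int.from_bytes(message[4:6], "big")
    mime_type, _, data = message[6:].partition(b"\x00")
    return request_id, mime_type, data

a, b = socket.socketpair()
send_body(a, 16353, b"text/html", b"<html>...</html>")
print(recv_body(b))
# -> (16353, b'text/html', b'<html>...</html>')
```

Note that only the first null byte acts as a separator, so null bytes inside the binary payload pass through untouched.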

Fitness algorithm
In this design, proxies have the option to implement a caching scheme. If they do, then they will respect the server-supplied rules governing freshness (whether a resource was read recently enough to be served to a client). These will, in turn, affect how fitness is calculated.

Only proxies that have a fresh copy of a resource start with a fitness value of 99; everything else works from an initial score of 50. Proxies that do not have a fresh copy of the specified content determine their retrieval penalty by taking their average bytes-per-second rate over their past N retrievals (where N is a value specified by the server) and comparing it to a server-supplied target. Then, based on PING/PONG latency, which is also compared against a server-supplied target, proxies apply a further modifier to their score before responding.

Only caching proxies may send scores of 99. This prevents high-performing proxies that do not have cached copies from taking priority. Proxies that take too long to respond are assigned a score of 0.

Weaknesses
First and foremost, the fitness algorithm is trusting. A malicious proxy could easily disrupt the system by reporting perfect scores.

The algorithm also allows for resources to go unused; this is addressed on the index server side by applying a nice-like algorithm that shores up the scores of proxies that would normally be passed over (possibly just because other proxies started reporting high responsiveness first), giving them a chance to prove that they are competent.
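The nice-like compensation can be sketched as a bonus that grows each time a proxy is passed over, so it eventually wins a request and gets a chance to prove itself. The increment and names below are assumptions for illustration:

```python
# Illustrative sketch of the server-side "nice-like" boost: proxies
# that keep losing the fitness race accrue a bonus (increment assumed).
def adjusted_scores(scores, passes_over):
    """passes_over counts consecutive losses per proxy."""
    return {proxy: min(99, score + 5 * passes_over.get(proxy, 0))
            for proxy, score in scores.items()}

raw = {"proxy-a": 70, "proxy-b": 55}
boosted = adjusted_scores(raw, {"proxy-b": 4})
print(boosted)
# -> {'proxy-a': 70, 'proxy-b': 75}
```

Once the boosted proxy wins a request and performs well, its raw scores rise and the bonus resets.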

Maturity
The delivered code is capable of demonstrating the functionality of everything implied by the preceding protocol specification.

Incomplete features
At the time of delivery, the implementation of the system lacked support for communication between index servers, which hampers the scalability of the concept.

Additionally, due again to time constraints, it lacks a standardized way of supporting any of the HTTP protocol beyond HTTP/1.0's GET.

Lastly, the XML configuration format has not been thoroughly defined, but it should be easy to extend without contradicting its design.

The value of this technology
This design is of purely academic interest. As envisioned right now, it is not particularly fast (unless clients typically have very slow access to the sites they are trying to reach, in which case funnelling access through controlled channels may actually be a better solution), and it is outclassed wherever specialized service is required by systems engineered with that specialization in mind. Effectively, our design tries to be a little bit of everything for the sake of learning and proof-of-concept, and while it succeeds, it is very much a jack of all trades: master of none.

The value of this concept
Proxies are very versatile things. From the foundations of this project, proxies that, for example, filter content could easily be substituted for those that cache.

Using the presented model, index servers could instruct each proxy to cache only certain domains or TLDs to better spread load.

And, of course, what we have learned by exploring this design will be of value in the future, when each of us comes across a situation where a distributed architecture will be of use. We can say this because what was prepared is a real system with plenty of material to reflect on and learn from.

The onion router (Tor)
While highly dissimilar in intent -- Tor is an anonymity-driven project, while ours is based solely on some of the basic properties of resource distribution -- some parallels can be drawn between our design and that of Tor.

The primary similarity is the use of more than one layer in the relay structure. However, while Tor has every node in the network act as both client and server, allowing a variable-length path to be drawn through the graph between browser and proxy, our model fixes the path at two nodes, each with a specific role, working in a fixed direction. Our model, on the other hand, allows the path to change with each request, while Tor periodically resolves a new path and reuses it for as long as it remains viable and the cryptography surrounding the channel is considered secure.

Squid
Squid is a caching web proxy -- probably one of the best-known of its kind -- that provides highly optimized and extensible caching and filtering services to clients. It is so versatile and useful that some of us run it on local systems (and network gateways) that already have direct connections to the Internet, simply because it does a great job of speeding up access to pages with dynamic text and lots of static layout elements, and because it is easy to configure it to filter out ad-spam.

Our technology differs from Squid in that it offers a load-balancing layer that supports fault-tolerance (the index servers). While Squid can indeed be configured so that multiple proxies share resources, without another managing process or clustering scheme, clients will see the system fail if the node they are using goes down.