how-the-web-works

What is the World Wide Web?

Perhaps the first thing to establish in our discussion of the web is what exactly it is. This chapter will look in brief overview at the core technologies that go together to make the World Wide Web.

We can start by differentiating the Internet and the World Wide Web. These are often confused because the Web is the main use that most people have for the Internet and a common web browser is called “Internet Explorer”. However, we can properly distinguish between them. The Internet is a collection of inter-connected computers using the TCP/IP protocol to exchange information. The World Wide Web is a particular use of the Internet to exchange HTML web pages (and other documents) using the Hypertext Transfer Protocol (HTTP).

Let’s look briefly at the four basic ingredients of the Web:

TCP/IP - is a low level message protocol that is used to transfer messages between computers on the Internet.
HTTP - is used by a Web Client to make a request to a Web Server and for the server to return the response.
URL - is a way of writing down the address of something on the Web so that a browser can work out where to get it from.
HTML - is a language for writing web pages containing text, images and other content.

Together, these four technologies allow a web client - the web browser on your computer - to fetch pages from a web server anywhere in the world that might contain links to other documents and so on. It’s the links between documents that make this a Web and the Internet that allows it to be the World Wide Web.

Let’s look at each of these technologies in a little more detail; although we’ll explore most of them a lot more throughout the rest of this book.

TCP/IP and DNS

TCP/IP, the Transmission Control Protocol/Internet Protocol, is the standard way of exchanging messages on the Internet - in fact it effectively defines what the Internet is: a network of computers communicating via TCP/IP. For our purposes, there’s no need to have a deep understanding of the low level details of TCP/IP, though many of you will learn more about it if you study any networking topics. However there are a few higher level things that touch on how we use the Internet in the context of the World Wide Web.

A Protocol in this sense is a formal standard for how two machines will talk to each other over a communications channel. It describes what messages are allowed and what they mean and how data is transferred over the network. Later, we’ll look at HTTP which is a higher level protocol for web requests. TCP/IP deals with the low level exchange of data and doesn’t really care what the content of that data is.

The Internet is a collection of computer networks joined by physical network channels. Within an organisation, there may be a physical network based on the Ethernet standard (wired or wireless) which effectively connects all computers on the network to each other. Organisations connect to each other via DSL or cable connections. These inter-connected networks make up the Internet. The role of TCP/IP is to allow a computer within one organisation (your laptop) to establish a connection to a computer in another (a web server). Importantly, this connection is a point-to-point connection - like a private channel between the two computers, even though the data is carried by this shared network of networks.

TCP/IP is a packet oriented protocol. To send a message, it is broken up into small chunks (packets) which are each addressed and sent over the network. The receiving computer intercepts these packets, notices that they are addressed to them, and re-assembles the original message. Packets can arrive out of order (or not at all) and TCP/IP defines what the two communicating computers should do in this case.

Each computer on the network is assigned a unique IP address which is a 32 bit number usually expressed as four 8 bit digits separated by dots. For example 192.168.1.2 or 137.111.158.22. These numbers are used as the addresses of the packets sent around the Internet. Within an organisation, all computers will share the first part of their address; for example, all Macquarie University computers will have addresses starting with 137.111. This means that any packet sent from within Macquarie to an address in the range 137.111.x.x will find it’s destination somewhere inside the organisation. However, if we send a packet to 143.119.160.16 (the NSW Government website) the network protocols need to know that this packet should be forwarded to the NSW Government network before it can be delivered. We can pretend that this all happens by magic (this isn’t a networking course) and rest assured that a clever network routing algorithm will get the packets to where they need to go. As long as we know the IP address of the destination computer, we can establish a point-to-point channel and send data back and forth.

Within your home you might have your own private network, often established by a wireless router that connects to your ADSL or cable service. While this router will have a proper IP address, the network it establishes in your home is a private network and will use one of two address ranges: 192.168.x.x or 10.10.x.x. Both of these are reserved for private use, so that my laptop and your printer might have the same IP address of 192.168.100.13. A trick called Network Address Translation (NAT) carried out by your router allows each of these devices to connect to the Internet through the router, even though they don’t have a full IP address. Again, we don’t need to worry about the details but sometimes it’s useful to know how to communicate directly with devices on your own network, in which case you might start finding out about these private IP addresses.

A significant issue with the success and ubiquity of the Internet is that we might run out of unique addresses. Since an IP address is a 32 bit number, that means there are only 4,294,967,296 unique addresses. If every computer, mobile phone, printer and electricity meter is to be connected to the Internet, the it’s clear that more addresses will be needed. There are two responses to this. The first is that many of these devices share a single IP address (using NAT) which multiplies the number available significantly. THe second is a new standard called IPv6 (rather than IPv4 which I’ve described here) where addresses are 128 bits long. Most modern devices are able to use IPv6 addressing so the crisis is unlikely to hit us catastrophically.

IP addresses give a unique identifier for each device but they aren’t very easy to remember. We’re used to giving the names of web-sites via names like www.nsw.gov.au or sales.example.com. These names are translated into numerical IP addresses via the Domain Name System (DNS) which works using a clever hierarchical algorithm. For example, to work out what sales.example.com means our local DNS server would look at the last part of the address and forward the query to a server that it knows is authoritative for all addresses ending in .com. The .com server may not know which IP address corresponds to sales.example.com so it sends the query on to the DNS server for example.com which will respond with the answer. As the result is passed back to the original DNS server, it is cached (remembered) so that it can be returned more quickly the next time it is requested.

DNS allows an organisation to set up whatever names it needs and link those names to its servers. It’s common to have the main web server called www.example.com but the same server could also be referred to as example.com, sales.example.com or test.example.com. We’ll see later how this arrangement can be used to provide a lot of flexibility when setting up web servers.