The audio for this podcast can be downloaded at http://highedweb.org/2009/presentations/tpr1.mp3
[Intro Music]
Announcer: You’re listening to one in a series of podcasts from the 2009 HighEdWeb Conference in Milwaukee, Wisconsin.
Jason Woodward: Good morning,
all.
Welcome to HighEdWeb, for those of you who weren't around yesterday.
So
I've put up a webpage that includes a couple of links from this
presentation. I'll be submitting the slide deck to the committee later
so you can get the rest later. But if you've got a laptop here, there's
a couple of interesting tools that you can play around with that are
linked off of that page. And you can also see the results of
yesterday's workshop on jQuery and AJAX if you go to the root URL
there.
I'm not going to cover everything about HTTP today because...two
reasons.
First of all, there's a whole lot of stuff to learn there. No way to
cover it in 45 minutes. And second, I don't know it all, but that's OK.
[Laughter]
Jason Woodward: I think there
is
only one person who knows it all. Or maybe five or six. Whoever is
listed on the RFC.
I might be asking questions periodically if I think of them. If you get
them right, you get a free Cornell mint. Anybody know where
they are
today?
Audience 1: The Walkway
Jason Woodward: All right.
Who got
that?
[Laughter]
Jason Woodward: All right. So
we're going to spend some time, probably about 20
minutes, 25 minutes, half an hour, something like that, talking through HTTP,
talking about how user agents--which is a technical term for your
browser--and servers communicate.
And then we're going to play around
with some live demos. I'm going to ask the audience for suggested
websites. We're going to show you a visual representation of the
communication between the browser and the server. All right?
Oh, can everybody hear me through the microphone? That's
important for the recording.
So, if you've got your computer here, or for future reference,
these are two great Firefox add-ons for looking at what is actually
being sent back and forth between the web server and the client.
They're linked to on the webpage I gave before. I'll be showing you
them a
little bit later today.
But before we get into that, we're going to establish a couple
of pieces of terminology that are used throughout the HTTP RFC. The RFC
is the document that defines how compliant HTTP clients and HTTP servers
communicate with one another.
And these are useful for communicating with techie people.
They are also useful for getting an established vocabulary throughout
the rest
of this section. It's very easy stuff. You've probably already heard
them.
So I'm going to walk through a single HTTP transaction and
talk about what happens when a user goes to their browser, types in a
URL, hits 'enter', and then hopefully a few minutes later, if the
network isn't down, they get a nice, pretty rendered webpage.
So, first thing. You have the user. That is me five or six
years ago.
[Laughter]
Jason Woodward: The user keys
into their--opens up their web browser. We're
going to assume you all know what a web browser is. You've all seen the
Google video of them asking on the streets of New York what's a web
browser. Most people don't know what a web browser is. I'm going to
assume you all know what a web browser is.
You key in the URL at the top, all right, and hit 'enter'.
Well, in HTTP parlance, your browser is called a
'user agent'. Your browser is not the only program that can be a user
agent. There are lots of different user agents out there.
There are client libraries that are used in programming
languages so that you can write a piece of software that goes and hits
websites. Google is a user agent. You've probably heard the term 'user
agent' if you've dealt with analytics or web log
processing. You can see what kind of browsers people are using when
they come to the site. But, as we'll see in a little bit, that isn't
necessarily completely accurate.
OK. So, user keys in www.hotelschool.cornell.edu, which is
where I used to work, and hits 'enter'.
The first thing that happens: the user agent decodes that URL,
sees if it is a well-formed URL. 'Well-formed' meaning
it's got a scheme at the front. That's the bit that says HTTP, a colon,
a couple of slashes, and then a host name. Assuming it's a name and not
an IP address, it will then conduct a DNS lookup to change that name,
www.hotelschool.cornell.edu, into what's called an IP address, which is
a 32-bit number that computers that are connected to the internet use
to identify themselves.
So for instance, www.hotelschool.cornell.edu resolves to 132.236.87.--I haven't worked there in a while--26. So the user agent will ask the operating system to initiate a TCP connection with 132.236 dadadadada. It opens a bi-directional channel.
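(For anyone following along at a keyboard, here is a minimal sketch of that lookup step in Python. The hostname is the one from the talk; it may no longer resolve, so substitute any name you like.)

    import socket

    # DNS lookup: turn a host name into an IPv4 address, as the user agent
    # does before it can open a connection.
    hostname = "www.hotelschool.cornell.edu"
    ip_address = socket.gethostbyname(hostname)
    print(ip_address)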
At that point, the browser sends a line of text. And we're
going to see this live in a little bit. But it looks just like that.
This is, in fact, the exact HTTP request. And if you're going to say,
"Where's the headers?" wait for it.
That is the exact HTTP request that that browser, that user
agent, was sending across that TCP channel. TCP is the name of the
protocol for connecting two machines over an IP network in a reliable,
ordered connection. The HTTP RFC--RFC 2616--and earlier
versions specify that all those letters are literally just text.
So no
funny encoding. You can actually look at that with a wire sniffer like
Wireshark. You can key that in with a program like Telnet, which lets you
send arbitrary text to another computer.
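(Here is a rough Telnet-style sketch of that in Python: open a TCP connection and send the request as literal text. HTTP/1.0 is used here so the server closes the connection when it has finished responding; the hostname is again just the one from the talk.)

    import socket

    # Open a bi-directional TCP channel to port 80 and send plain text.
    conn = socket.create_connection(("www.hotelschool.cornell.edu", 80))
    conn.sendall(b"GET / HTTP/1.0\r\nHost: www.hotelschool.cornell.edu\r\n\r\n")

    # Read until the server closes the connection.
    response = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        response += chunk
    conn.close()

    # The status line and headers come back as plain text, too.
    print(response.decode("latin-1")[:500])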
So that is literally called the 'request line'. It consists of
a method, which is the first bit. Anybody who has done HTML forms
probably is familiar with two terms that you'll see here a lot: 'get'
and
'post'. Those aren't the only two.
Another common one is 'head', which instructs the server to
not actually return the entire contents of the response but just to
return metadata about the response. There are a couple of other
ones that are used less frequently by end users but more frequently by
pieces of software that are programmatically interacting with the
server.
In addition to that request line, a bunch of headers are sent. I have highlighted these. They are not underlined, not bolded. It's just text. I've highlighted these here to illustrate a couple of important ones.
Headers consist of a string identifying the header name, a
colon, and then the value of that particular header. This is all on one
line. They are separated by new lines. They are bits of metadata that
the user agent is adding to the request that tells the server a little
bit more about what the user agent is looking for.
First bit: the Host header. That was introduced in HTTP 1.1. It allows the user agent to identify to the server what the host name was in the URL when the user keyed it in.
It probably wasn't completely obvious to anybody who's not
really familiar with this stuff already, but when I said the name is
resolved to an IP address that is used to connect to the other
computer, there's no bit in there that tells the other server on the
other end what host name the user agent thinks it's connecting to.
That's where that bit comes across.
If you've ever configured virtual hosts or anything like that
in your web server, that's how the client, the user agent, identifies to
the server which of one or more websites running on a particular IP
address it's trying to connect to.
Next line is something you've probably all seen as well. That
is where the user agent identifies to the server what kind of browser
it is. And that's completely free form. There is nothing in there that
says that the browser must actually honestly identify itself.
We'll come back to this a little bit later.
So, that is what a full HTTP request and headers look like.
If the request was a POST--in other words, submitting form data--it
would have a body as well with the contents of that submitted form data
or the file upload.
If it was, say, a form where you said 'method get'--in other
words, there was a query portion of the URL, a question mark, and
then some key-value pairs after it--those would show up here in the request line.
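(To make that concrete, here are two hypothetical form submissions written out as the literal text that would go over the wire; the host and field names are made up. With 'method get' the key-value pairs ride in the URL's query string; with 'method post' they travel in the request body.)

    # A GET: the form data is in the query portion of the URL.
    get_request = (
        "GET /search?q=hotels&page=2 HTTP/1.1\r\n"
        "Host: www.example.edu\r\n"
        "\r\n"
    )

    # A POST: the same key-value pairs, but carried in the body.
    post_request = (
        "POST /search HTTP/1.1\r\n"
        "Host: www.example.edu\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        "Content-Length: 15\r\n"
        "\r\n"
        "q=hotels&page=2"
    )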
So the request and the request headers get sent over the
TCP channel to perhaps a cluster of servers. At the hotel school, that
is what a cluster of servers looks like.
[Laughter]
Jason Woodward: The cluster
of
servers, with a very large smile and happy face, decides, based upon that
request and all those headers, what resource it wants to return. This
resource can be a bit of HTML. It can be a bit of Javascript. It can be
a bit of CSS. It can be an image.
The server can decide whatever it wants to send back to you
based upon the request it gets. The Web wouldn't really work if it
decided to send you arbitrary data for any arbitrary request you sent
it. So usually there's some useful mapping between what you asked for
and what it decides to give you back.
But, really, there is no rule that says, "Oh, there has to be
a file in the file system." Most of us, when we grew up building
websites--you know, we had a folder, and we'd put some files there, and
they'd map to URLs in a web browser.
That just happens to be a convenient way to construct
websites. One file corresponds to one resource, corresponds to one URL. Very
easy. The first web servers were really written that way.
These days, if
you look up at the URL when you go to Google or Facebook, you'll see
that
they very often don't seem to map to a particular file in the file
system. Or if you've done any applications with any of the modern MVC
frameworks, you'll see that, wow, those aren't really files in the file
system anymore.
So, the web server computes a representation of the particular
resource that you asked for. In this example, it was forward-slash. You
might say, "Well, that's a directory." Well, we all know
most web
servers say, "OK, if they ask for a directory, look for an index study
HTML." And that's pretty much what's happening here.
The server then, once it has computed the representation of
that resource, in other words the HTML, the Javascript, the PNG,
whatever it happened to be that they asked for, it computes a response. That
first line there in the response--the headers, again, are text and
look very much like the request headers--contains a status code that
we'll come back to in a little bit.
And these are various bits of metadata about the response.
We'll come back to them in a little bit, too. We'll talk a little bit
more about content type and date.
The Server header here, much like the user agent, is a completely
arbitrary string. The server happened to tell us all the Apache modules
that were installed on the machine. Who knows if that's right? It
probably is. But you can't trust it.
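(A hypothetical response, again as literal wire text: a status line with the code, header lines of the form 'Name: value', a blank line, then the body. All the values here are made up.)

    response = (
        "HTTP/1.1 200 OK\r\n"
        "Date: Mon, 05 Oct 2009 13:00:00 GMT\r\n"
        "Server: Apache\r\n"
        "Content-Type: text/html; charset=UTF-8\r\n"
        "Content-Length: 47\r\n"
        "\r\n"
        "<html><body>Hello from the server</body></html>"
    )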
For today, probably the most important part of that
response--we're going to look at these a little bit later--and this
matters to those of you who are not necessarily programmers but are
closer to the content creation and content management side: we're
going to talk in a little bit of depth about how that status code tells the
user agent what to do with the body of the response.
Now, a 200--by the way, there's a whole list of these inside
the RFC, which is in the first link I gave you. Very detailed. Tells
you exactly what a compliant user agent is supposed to do. It's great
reading. I advise you all to skip the next session and read it.
[Laughter]
Jason Woodward: OK, I'm just
kidding. So, 200 in this case means "Everything
was OK. Here's what you asked for."
The browser then takes that. So the first request asked for
forward-slash, which ended up being a piece of HTML. Your
browser takes that,
decodes the HTML into a DOM tree, which is an internal memory
representation of that HTML document, and decides to draw it
graphically because your browser knows what to do with HTML files.
The 'gotcha' there, though, is that we only issued one request
and we only got one HTML document. That HTML document does not contain
images. It does not contain CSS. Lucky me, there's some CSS in
there. It does not contain external CSS. It does not contain
external Javascript. It does not contain many of the pieces that
we're familiar with that put together a webpage.
So really what happens is that there are more request and
response pairs issued by the client, the user agent, to construct what
the user is intended to see on that webpage. In other words, it gets
that first bit of HTML, sees that there are 10 images it has to get,
sees that there are two CSS files it has to download, two Javascripts.
It
rinses, it repeats, it keeps on doing that until it's got everything
that it needs to render that page.
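(A sketch of that rinse-and-repeat loop in Python, using only the standard library. Real browsers do this concurrently and far more carefully; the page URL is a placeholder.)

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class SubresourceFinder(HTMLParser):
        # Collect the URLs of images, scripts, and stylesheets.
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ("img", "script") and "src" in attrs:
                self.urls.append(attrs["src"])
            elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
                self.urls.append(attrs["href"])

    page = "http://www.example.edu/"  # hypothetical site
    html = urlopen(page).read().decode("utf-8", "replace")

    finder = SubresourceFinder()
    finder.feed(html)
    for url in finder.urls:  # one more request-response pair per dependency
        print("fetching", urljoin(page, url))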
Sometimes, documents are not compound. For instance, if
you have a PDF, your browser would only issue one request. It would get
everything it needed back in that response to display the PDF file for you.
Are there any questions at this point? And I ask that now
because that's basically it, how HTTP transactions work.
There are lots of details in how a user agent can ask for
different types of things from the server, how caching works, and we're
going to get into that in a little bit. I alluded to extra status
codes--in other words, how the server can signal to the user agent a
different
way to deal with the content. We're going to get to that in a second.
But I wanted to ask if
there are any questions about that back and forth before we get there.
Yes?
Audience 2: [inaudible question about the headers on the slide]
Jason Woodward: Yup. Would you
mind if I
waited for that? Because that's kind of advanced. And I could tell you
exactly what those mean, but there's a little bit later I'm going to
get to the rest of the headers and we can touch on that then.
Audience 2: I just want to know your web address.
Jason Woodward: Sure! Here we
are.
Heweb09.jdwcornell.com/tpr1.html. Sure!
Audience 3: Is there an order for the HTML...?
Jason Woodward: That is
browser-dependent. And the short answer to that is, it's usually in the
order that the browser encounters it when parsing the initial HTML that
comes back.
There are all sorts of advanced tips and tricks for
optimizing that so it doesn't do it sequentially so they will download
them
at the same time. That's beyond the scope of this. So pretty much you
can think sequentially. But it's not necessarily the case. There's
nothing in the HTTP spec that dictates that that's what has to happen.
Everybody got it? All right.
So, talking about some of these status codes. We're going to
cover four, five of these today. A couple of really important ones for
those of you who have had to do any website redesign. How many folks
here have done a website redesign or moved URLs? See? Yeah, everybody.
And when you put up the new website, all of the sudden, all
your search results on Google don't work anymore, right? Well, it
depends. If you set up redirects from your old URLs to wherever that
new content lives, they will work.
But you also notice that sometimes the Google results won't
actually change over to the new URLs very quickly. And that is
because there is this distinction between two different types of
redirects in the HTTP protocol.
Remember we saw that status code, 200? In the case where the
server is saying, "No, this content isn't here. It's someplace else,"
the status code will be a 301 or a 302. Off the top of my head, I'm not
sure if there is any more. But we're just going to look at these two.
They mean very different things. A 302 means the server is
saying, "What you ask for is not here. It's someplace else. But that's
just temporary. Next time you ask for this resource, it might be
yet another place. Or it might be here again." This is usually used for
log-in pages. You ask for a deep link into a site. The server goes,
"Oh, you don't have cookies. You're not logged in." It will redirect
you to the log-in page, you go through that process, and then you come
back to the original one.
301, however, is very useful if you've redesigned your site
and you want to set up a mapping between old URLs and new URLs. You
want to tell folks like Google, "This old URL? Not there anymore. We've
moved it permanently to another one." If you configure your web server
to 301 redirect, Google will pick up on that a lot faster. It will
know that old URL is bad. "It's no longer where this particular content
is. Now I should look over here."
And when I say Google, I also mean all those other search
engines too that, I don't know, I guess some other people use. Nobody
from a search engine vendor here today, is there?
[Laughter]
Jason Woodward: OK. There are
pages on
Google.com in their help section and on Yahoo.com as well that explain
this. Usually if you look into their SEO pages or webmaster forums--I
did not include links to them on my page, but if somebody wants to
email me or DM me or whatever it is the kids are doing these days to
communicate, I'd be happy to point it out to them.
Let's skip this for a second and go on to that one, and then
we'll come back to the other one.
So, these are two other status codes that are useful for
communicating with... I say search engines. A better term might be
'non-human user agents'. So Googlebot is a user agent, but it's
not
a human. It's not intended to render pictures on some Google server
that people look at somewhere and key in maybe the contents of the
webpage. No, that's not how it works. Google's entirely a bunch of
computers that are going to take over the world.
So, this is a way, again, that the status code can signal from the
server--you as the server administrator or the website
administrator--"Hey, this
document isn't here anymore." There are two different ways to signal that.
404 is the one that we are all used to. If you read the spec,
it actually means something very specific that is not necessarily what
you intended. 404 means the server doesn't even know what you're asking
for. Never seen it, don't know what you're talking about.
A 410 is a more specific "This document is not here." It is
"Yup, that was there before. Maybe before the redesign. Maybe before
you removed some content. But it's not there anymore. So don't bother
coming asking me for it." Or, perhaps more usefully, "Don't have this
show up in
your search results anymore for my site."
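(Continuing the toy server sketch, the distinction looks like this; the retired path is made up.)

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class GoneHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/pre-redesign/contact.html":
                self.send_response(410)  # Gone: was here once, removed on purpose
            else:
                self.send_response(404)  # Not Found: never heard of it
            self.end_headers()

    HTTPServer(("", 8001), GoneHandler).serve_forever()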
I would also imagine those of you who use things like Google,
the GSA, the Google Mini search appliance, or other
local-to-your-university campus search engines--they also read the
status codes the same way that the major search engines do.
So these are ways that you can flush old URLs out of the index
faster after one of your units' redesigns. Although most of us--I know at Cornell,
whenever we do a redesign, we call up the GSA administrator and say,
"Could you just delete that index?" Then it's gone for a day or two
and then it re-indexes.
This is a less destructive way of doing that. And more useful
for the bigger search engines out there because you cannot call up
Google and say, "Hey, could you flush my website?"
[Laughter]
Jason Woodward: Did your server go down?
[Laughter]
Jason Woodward: You better
get on
that. It is 9 am on a Monday morning. What's
that? Exactly!
Audience 4: I can think of four or five things.
Jason Woodward: I
would've said when the boss shows up, but, you know, I'm the boss now.
I was a software engineer for 10 years, now I've been a manager for a
year. So I'm not at work. So the server can't go down, then.
So, with the exception of one joke status code, those are the
only status codes I'm going to cover today. Now I'm going to cover some
of those headers. Remember we saw the HTTP request that consisted of
the request line, and then metadata about that request?
And the response, which consisted of the response line and
metadata about that response? All those extra bits of metadata
mean something to the server, to the user agent, and to proxies in the
middle.
So, coming back to giving you content people something useful
that you might be able to take away from this--because I know you're
not going to go onto the Telnet command line and look up IP addresses
and talk to
the servers directly. It's just useful to know that stuff.
Here are some bits that control how user agents and
intermediate
proxies decide whether or not to go back to the server each time they
request a document. There's an expires header on the HTTP response.
Let's
see if we can go back to that.
Now notice there isn't one here. These are not mandatory
headers. In fact, if I'm not mistaken, there are no mandatory response
headers. Just the response line. Yeah, which is not a response header, if we
want to get very technical. That is literally the response line. And the
rest are response headers.
So expires might show up in here. This one does not have it. Oh, there we are. Oh, you saw the joke slide. Oh, well. So, the expires header says... The server says, "Hey, I know this document is going to be good for a week or a day or six months." But, say, it's an image, like a spacer GIF that never changes. You could put an expires on that of 20 years from now. And hopefully, every compliant intermediate cache and end user--or end user agent--will see the fact that this content was tagged as 'expires in 20 years', and it can sit in their cache forever.
A compliant user agent would then, next time the end user
decided to load that webpage, would look into the cache and say, "Oh, I
need that bit of information, but the server told me that it expires in
20
years. I'm not even going to bother asking the server for it."
So if you configure expires headers correctly on your content,
you can reduce server load and increase the speed at which the end user
loads your website--the speed at which your website renders for the end user.
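(A minimal sketch of computing such a far-future value in the HTTP date format; twenty years is the figure from the talk.)

    from datetime import datetime, timedelta, timezone
    from email.utils import format_datetime

    # "Good for 20 years": compliant caches need not ask again until then.
    far_future = datetime.now(timezone.utc) + timedelta(days=20 * 365)
    print("Expires: " + format_datetime(far_future, usegmt=True))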
'Last modified' is a little bit different from that. This is something that might be read off the file system. A 'last modified' timestamp, if that's what your web server is doing. If you are on a modern CMS, it will probably look at the last time you modified some of the text fields, let's say, on all the different editable areas on that particular page. Or maybe the last time you updated the template for it.
The server in the response will say, "This was last modified
last Monday." How is that useful? The next time the user agent decides
to go to
this webpage and requests this resource, it can send a request header
called 'if modified since'. And it will say, "If modified since last
Monday."
And then the server in its response, instead of sending a 200,
which remember was the "Oh, everything's OK. Here's the body of the
content," it could send a 304. Not modified. So the server sends a very
small response saying, "Hey, this hasn't been modified since the time
you asked me about."
So let's say it was a 5 meg PDF. Here you've saved downloading
another 5 meg PDF, so you get a faster response time for
rendering your page. A PDF is a bad example for that; large images on pages are a
much better example.
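(The conditional round trip, sketched with Python's http.client; the host, path, and date are placeholders.)

    import http.client

    conn = http.client.HTTPConnection("www.example.edu")
    conn.request("GET", "/report.pdf",
                 headers={"If-Modified-Since": "Mon, 05 Oct 2009 00:00:00 GMT"})
    resp = conn.getresponse()
    if resp.status == 304:
        print("Not modified -- serve the copy already in the cache")
    else:
        body = resp.read()  # 200: the full body comes down again
    conn.close()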
Cache control headers show up both ways. I won't go into the
details of what goes into a cache control header, but those are used
for either the server telling the user agent whether or not this
content should be cached. A lot of times, if you're building internet
style sites, you will want to say 'not cache' your SSL pages that
include
Social Security Numbers. Something like that. So you can configure your
server to say, "Don't
cache this stuff."
Sending from the client to the server, the user agent can say, "It's OK if you serve this to me out of your cache." So let's say you're using a CMS that takes 5 seconds to render a page. You might say to the server, "Yeah, it's OK to pull it out of your cache," instead of re-rendering it.
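(A few hypothetical Cache-Control values, written as the literal header lines, to show the header traveling in both directions.)

    # Server to user agent and proxies: never store this sensitive page.
    server_sensitive = "Cache-Control: no-store"

    # Server to user agent and proxies: fine to cache this for an hour.
    server_cacheable = "Cache-Control: max-age=3600, public"

    # User agent to server and proxies: a cached copy no older than five
    # minutes is acceptable instead of a freshly re-rendered page.
    client_request = "Cache-Control: max-age=300"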
So, HTTP 101. That's what we all came here to learn, right?
Last time I gave this at HighEdWeb, it was a great punchline for the
end because there is actually an HTTP status code 101. And there is a
cut-and-paste from the RFC. It says, "Switching protocols."
At that
time, I said, "I've never seen this used. Nobody uses this." It
essentially allows the client and the server to negotiate a different
version of the HTTP protocol or maybe even a different protocol
completely. For instance, a media streaming protocol.
Well, it turns out that there's one major piece of software
out there that uses this now. They're calling it Reverse HTTP.
You can go to--there's a link on the links page. The Second
Life client uses this. So, you've got to know a lot about network topology to know why this is super useful, but essentially the
Second Life client will connect to the Second Life servers. They will
switch roles on that existing TCP channel.
All of the sudden, the Second Life server can issue a request
to the client and say, "Get me some information." So now the client,
your client running on your desktop, is now a server. And the server
on the other end is now the client. So they sort of switch directions.
So I got flashed the 10-minute sign, and at this point, I'd
like to pop up Firefox, go to a demo webpage, show you how you can use
it to inspect the HTTP headers and show you how you can use it to get a
little neat graphical representation of the multiple HTTP requests that
come down when one particular webpage is rendered.
And then if we have some more time after that, we'll come back
and talk about more of the esoteric HTTP headers such as the keep-alive.
So, who's got a webpage they want us to demo here?
Audience 5:
Needforfeed.com.
Jason Woodward: Everybody OK
with
that? All right.
Yes, I
know. I know it mostly because I did a workshop on jQuery and AJAX,
and I was going to build something like that in that workshop. And I
decided as soon as I heard that it came out that I couldn't copy that.
I had to do something different. So we did something different.
So, I'm going to pop up the little Firebug tool down here. I'm
going to click on the net tab. Click on 'all'. Then we're going to hit
'enter'. Oh, fail.
Well, here's what happened. I hit Needforfeed.com. You saw
that the Firebug pop-up is here? It did get a response, which was a 302
re-direct to this other website, which does not have Firebug activated
on it. So, we're going to pop this up here, and we're going to refresh
this page so we can see something a little bit more entertaining.
Lots of requests going on here. If we ran YSlow on this, it
would have a field day.
[Laughter]
Jason Woodward: No,
seriously, it's
a great site. It's an awesome idea. What's that? Oh, OK. Well, then,
it's awful.
[Laughter]
Jason Woodward: No, I'm just
kidding. So, initial request. Firebug is telling us we've requested
/needforfeed/--or actually /informatic/needforfeed. Firebug sort
of shortens this here. I'll show you another tool in a second that
gives you the whole thing.
Response, 200. OK. Here's the HTML. This
little bar over here represents the time from the start of the request to the
time it was done rendering the... I think it was actually done parsing
the DOM,
which is the very technical name for the in-memory representation of the HTML behind this document.
This represents the time at which it was done downloading. And
these other bars represent when these particular dependent documents,
such as the CSS and Javascript, when the browser began downloading them
and when it finished. So you can kind of see here visually that these
four documents are probably referenced inside needforfeed.js because
the browser doesn't start getting them until it's done interpreting
that particular file.
If you're really big into site speed optimization, there are
techniques that you can use to parallelize all of those downloads a lot
earlier so the page will render faster.
So, you can also see here that the response for this particular file was '304 not modified'. That meant that the response was already in my cache. The headers that it issued on that request--cookies, referrers, these are all things you've heard about before--if modified since... It says, "If modified since September 10th," because my browser had it in its cache because I've already been to this site.
The server responds... Oops. Oh, that's right, Firebug puts
it
up here. The server responds '304 not modified', with metadata about
it. But really it's saying you don't actually have to download the
entire
content anymore. You can just read it out of your cache.
If I hit 'shift refresh' on this page--this is
browser-dependent--shift-refresh on Firefox says, "Ignore your local cache.
Get everything." And you can see down here, everything is becoming a
200 because
the browser is not issuing 'if modified since' requests.
This is a pretty handy tool. If you're doing any front-end web
development, this is indispensable. If you are a content person, it's
still pretty handy for some of the visual representation of what your
dependencies are in here. So I recommend getting it if you don't
already have it.
Last thing we're going to do, before we open it up for more
questions, is...just show you another plug-in called Live HTTP Headers.
And we're going to clear this and we're going to go to Google because
they've got less stuff.
So we've hit the Google front page here. Here you see what
actually goes over the wire with the exception of the response bodies.
So there is a GET of a slash. My browser issues all kinds of
different headers including a cookie, a user agent. Google's web server
comes back and says, "OK, expires negative one." Huh.
Probably translates
into December 31st, 1969. You can ask me later why, if you
don't know that. It identifies itself with the Server header, gives you a
content
length, which identifies how long the body is--in this case it's the
HTML for the Google front page, and it is 3500 bytes.
This tool doesn't show you the HTML content of the response
body. It also identifies the content type, which is metadata saying,
"Here's how you're supposed to interpret what's in this response." The
browser does not look at file name endings, completely ignores that,
except for IE. IE will do that sometimes.
It actually is supposed to be looking at the content type of the response. In this case, it says it's HTML. In this case, it's an image. It's got a 'last modified' of June 2006. It's got an 'expires' of, looks like, roughly a year and a day from now.
So this happens every request. Always back and forth. It's
really fast, isn't it?
Two key reminders. Each HTTP transaction is stateless,
meaning
when a browser issues a request, the server must only use the contents
of that request and any of its internal programming to compute the
value of the response. In other words, it doesn't remember that you
made this request two days ago as well or you just made another request
for these other documents. It looks at each of these requests
independently from one another.
You may say, "Well, how do sites remember I've been logged
in?" Well, they use cookies. Cookies are part of the request. But the
point there is that these HTTP transactions are completely independent
of one another.
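(A sketch of that cookie mechanism with Python's http.client: the client, not the server, carries the state and replays it on each independent request. The host, paths, and credentials are all placeholders.)

    import http.client

    conn = http.client.HTTPConnection("www.example.edu")

    # First transaction: log in; the server hands back a Set-Cookie header.
    conn.request("POST", "/login", body="user=jdw&pass=secret",
                 headers={"Content-Type": "application/x-www-form-urlencoded"})
    resp = conn.getresponse()
    resp.read()  # drain the body before reusing the connection
    cookie = resp.getheader("Set-Cookie", "").split(";")[0]

    # Second transaction: completely independent; only the replayed Cookie
    # header lets the server associate it with the first one.
    conn.request("GET", "/account", headers={"Cookie": cookie})
    print(conn.getresponse().status)
    conn.close()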
Second, the RFC defines how compliant HTTP user agents and
servers communicate with one another. There's no governing body that
says you can't make a web browser or a server that doesn't
actually speak HTTP correctly. In practice, though, the internet would
not work if everybody did not.
And with that--I've already done the demos, asked for the URL,
asked for questions--my time is up. I'm happy to answer any more
questions you might have.
[Applause]
Jason Woodward: Thank you.