The audio for this podcast can be downloaded at http://highedweb.org/2009/presentations/tpr1.mp3
[Intro Music]
Announcer: You’re listening to one in a series of podcasts from the 2009 HighEdWeb Conference in Milwaukee, Wisconsin.
Jason Woodward: Good morning,
all.
Welcome to HighEdWeb, for those of you who weren't around yesterday.
So
I've put up a webpage that includes a couple of links from this
presentation. I'll be submitting the slide deck to the committee later
so you can get the rest later. But if you've got a laptop here, there's
a couple of interesting tools that you can play around with that are
linked off of that page. And you can also see the results of
yesterday's workshop on jQuery and AJAX if you go to the root URL
there.
I'm not going to cover everything about HTTP today because...two
reasons.
First of all, there's a whole lot of stuff to learn there. No way to
cover it in 45 minutes. And second, I don't know it all, but that's OK.
[Laughter]
Jason Woodward: I think there
is
only one person who knows it all. Or maybe five or six. Whoever is
listed on the RFC.
I might be asking questions periodically if I think of them. If you get
them right, you get a free Cornell mint. Anybody know where
they are
today?
Audience 1: The Walkway
Jason Woodward: All right.
Who got
that?
[Laughter]
Jason Woodward: All right. So
we're going to spend some time, probably about 20
minutes, 25 minutes, half an hour, something like that, talking through HTTP,
talking about how user agents--which is a technical term for your
browser--and servers communicate.
And then we're going to play around
with some live demos. I'm going to ask the audience for suggested
websites. We're going to show you a visual representation of the
communication between the browser and the server. All right?
Oh, can everybody hear me through the microphone? That's
important for the recording.
So, if you've got your computer here, or for future reference,
these are two great Firefox add-ons for looking at what is actually
being sent back and forth between the web server and the client.
They're linked to on the webpage I gave before. I'll be showing you
them a
little bit later today.
But before we get into that, we're going to establish a couple
of pieces of terminology that are used throughout the HTTP RFC. The RFC
is the document that defines how compliant HTTP clients and HTTP servers
communicate with one another.
And these are useful for communicating with techie people.
They are also useful for getting an established vocabulary throughout
the rest
of this section. It's very easy stuff. You've probably already heard
them.
So I'm going to walk through a single HTTP transaction and
talk about what happens when a user goes to their browser, types in a
URL, hits 'enter', and then hopefully a few minutes later, if the
network isn't down, they get a nice, pretty rendered webpage.
So, first thing. You have the user. That is me five or six
years ago.
[Laughter]
Jason Woodward: The user keys
into their--opens up their web browser. We're
going to assume you all know what a web browser is. You've all seen the
Google video of them asking on the streets of New York what's a web
browser. Most people don't know what a web browser is. I'm going to
assume you all know what a web browser is.
You key in the URL at the top, all right, and hit 'enter'.
Well, in HTTP parlance, your browser is called a
'user agent'. Your browser is not the only program that can be a user
agent. There are lots of different user agents out there.
There are client libraries that are used in programming
languages so that you can write a piece of software that goes and hits
websites. Google is a user agent. You've probably heard the term 'user
agent' if you've dealt with analytics or web log
processing. You can see what kind of browsers people are using when
they come to the site. But, as we'll see in a little bit, that isn't
necessarily completely accurate.
OK. So, user keys in www.hotelschool.cornell.edu, which is
where I used to work, and hits 'enter'.
The first thing that happens: the user agent decodes that URL,
sees if it is a well-formed URL. 'Well-formed' meaning
it's got a scheme at the front. That's the bit that says HTTP, a colon,
a couple of slashes, and then a host name. Assuming it's a name and not
an IP address, it will then conduct a DNS lookup to change that name,
www.hotelschool.cornell.edu, into what's called an IP address, which is
a 32-bit number that computers that are connected to the internet use
to identify themselves.
So for instance, www.hotelschool.cornell.edu resolves to 132.236.87.--I haven't worked there in a while--26. So the user agent will ask the operating system to initiate a TCP connection with 132.236 dadadadada. It opens a bi-directional channel.
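(For anyone following along at a keyboard, here is a minimal sketch of that lookup step in Python. The hostname is the one from the talk; it may no longer resolve, so substitute any name you like.)

    import socket

    # DNS lookup: turn a host name into an IPv4 address, as the user agent
    # does before it can open a connection.
    hostname = "www.hotelschool.cornell.edu"
    ip_address = socket.gethostbyname(hostname)
    print(ip_address)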
At that point, the browser sends a line of text. And we're
going to see this live in a little bit. But it looks just like that.
This is, in fact, the exact HTTP request. And if you're going to say,
"Where's the headers?" wait for it.
That is the exact HTTP request that that browser, that user
agent, was sending across that TCP channel. TCP is the name of the
protocol for connecting two machines over an IP network in a reliable,
ordered connection. The HTTP RFC--RFC 2616--and earlier
versions specify that all those letters are literally just text.
So no
funny encoding. You can actually look at that with a wire sniffer like
Wireshark. You can key that in with a program like Telnet, which lets you
send arbitrary text to another computer.
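(Here is a rough Telnet-style sketch of that in Python: open a TCP connection and send the request as literal text. HTTP/1.0 is used here so the server closes the connection when it has finished responding; the hostname is again just the one from the talk.)

    import socket

    # Open a bi-directional TCP channel to port 80 and send plain text.
    conn = socket.create_connection(("www.hotelschool.cornell.edu", 80))
    conn.sendall(b"GET / HTTP/1.0\r\nHost: www.hotelschool.cornell.edu\r\n\r\n")

    # Read until the server closes the connection.
    response = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        response += chunk
    conn.close()

    # The status line and headers come back as plain text, too.
    print(response.decode("latin-1")[:500])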
So that is literally called the 'request line'. It consists of
a method, which is the first bit. Anybody who has done HTML forms
probably is familiar with two terms that you'll see here a lot: 'get'
and
'post'. Those aren't the only two.
Another common one is 'head', which instructs the server to
not actually return the entire contents of the response but just to
return metadata about the response. There are a couple of other
ones that are used less frequently by end users but more frequently by
pieces of software that are programmatically interacting with the
server.
In addition to that request line, a bunch of headers are sent. I have highlighted these. They are not underlined, not bolded. It's just text. I've highlighted these here to illustrate a couple of important ones.
Headers consist of a string identifying the header name, a
colon, and then the value of that particular header. This is all on one
line. They are separated by new lines. They are bits of metadata that
the user agent is adding to the request that tells the server a little
bit more about what the user agent is looking for.
First bit: the Host header. That was introduced in HTTP 1.1. It allows the user agent to identify to the server what the host name was in the URL when the user keyed it in.
It probably wasn't completely obvious to anybody who's not
really familiar with this stuff already, but when I said the name is
resolved to an IP address that is used to connect to the other
computer, there's no bit in there that tells the other server on the
other end what host name the user agent thinks it's connecting to.
That's where that bit comes across.
If you've ever configured virtual hosts or anything like that
in your web server, that's how the client, the user agent, identifies to
the server which of one or more websites running on a particular IP
address it's trying to connect to.
Next line is something you've probably all seen as well. That
is where the user agent identifies to the server what kind of browser
it is. And that's completely free form. There is nothing in there that
says that the browser must actually honestly identify itself.
We'll come back to this a little bit later.
So, that is what a full HTTP request and headers look like.
If the request was a POST--in other words, submitting form data--it
would have a body as well with the contents of that submitted form data
or the file upload.
If it was, say, a form where you said 'method get'--in other
words, there was a query portion of the URL, a question mark, and
then some key-value pairs after it--those would show up here in the request line.
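(To make that concrete, here are two hypothetical form submissions written out as the literal text that would go over the wire; the host and field names are made up. With 'method get' the key-value pairs ride in the URL's query string; with 'method post' they travel in the request body.)

    # A GET: the form data is in the query portion of the URL.
    get_request = (
        "GET /search?q=hotels&page=2 HTTP/1.1\r\n"
        "Host: www.example.edu\r\n"
        "\r\n"
    )

    # A POST: the same key-value pairs, but carried in the body.
    post_request = (
        "POST /search HTTP/1.1\r\n"
        "Host: www.example.edu\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        "Content-Length: 15\r\n"
        "\r\n"
        "q=hotels&page=2"
    )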
So the request and the request headers get sent over the
TCP channel to perhaps a cluster of servers. At the hotel school, that
is what a cluster of servers looks like.
[Laughter]
Jason Woodward: The cluster
of
servers, with a very large smile and happy face, decides, based upon that
request and all those headers, what resource it wants to return. This
resource can be a bit of HTML. It can be a bit of Javascript. It can be
a bit of CSS. It can be an image.
The server can decide whatever it wants to send back to you
based upon the request it gets. The Web wouldn't really work if it
decided to send you arbitrary data for any arbitrary request you sent
it. So usually there's some useful mapping between what you asked for
and what it decides to give you back.
But, really, there is no rule that says, "Oh, there has to be
a file in the file system." Most of us, when we grew up building
websites--you know, we had a folder, and we'd put some files there, and
they'd map to URLs in a web browser.
That just happens to be a convenient way to construct
websites. One file corresponds to one resource, corresponds to one URL. Very
easy. The first web servers were really written that way.
These days, if
you look up at the URL when you go to Google or Facebook, you'll see
that
they very often don't seem to map to a particular file in the file
system. Or if you've done any applications with any of the modern MVC
frameworks, you'll see that, wow, those aren't really files in the file
system anymore.
So, the web server computes a representation of the particular
resource that you asked for. In this example, it was forward-slash. You
might say, "Well, that's a directory." Well, we all know
most web
servers say, "OK, if they ask for a directory, look for an index study
HTML." And that's pretty much what's happening here.
The server then, once it has computed the representation of
that resource, in other words the HTML, the Javascript, the PNG,
whatever it happened to be that they asked for, it computes a response. That
first line there in the response--the headers, again, are text and
look very much like the request headers--contains a status code that
we'll come back to in a little bit.
And these are various bits of metadata about the response.
We'll come back to them in a little bit, too. We'll talk a little bit
more about content type and date.
The Server header here, much like the user agent, is a completely
arbitrary string. The server happened to tell us all the Apache modules
that were installed on the machine. Who knows if that's right? It
probably is. But you can't trust it.
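(A hypothetical response, again as literal wire text: a status line with the code, header lines of the form 'Name: value', a blank line, then the body. All the values here are made up.)

    response = (
        "HTTP/1.1 200 OK\r\n"
        "Date: Mon, 05 Oct 2009 13:00:00 GMT\r\n"
        "Server: Apache\r\n"
        "Content-Type: text/html; charset=UTF-8\r\n"
        "Content-Length: 47\r\n"
        "\r\n"
        "<html><body>Hello from the server</body></html>"
    )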
For today, probably the most important part of that
response--we're going to look at these a little bit later--and this
matters to those of you who are not necessarily programmers but are
closer to the content creation and content management side: we're
going to talk in a little bit of depth about how that status code tells the
user agent what to do with the body of the response.
Now, a 200--by the way, there's a whole list of these inside
the RFC, which is in the first link I gave you. Very detailed. Tells
you exactly what a compliant user agent is supposed to do. It's great
reading. I advise you all to skip the next session and read it.
[Laughter]
Jason Woodward: OK, I'm just
kidding. So, 200 in this case means "Everything
was OK. Here's what you asked for."
The browser then takes that. So the first request asked for
forward-slash, which ended up being a piece of HTML. Your
browser takes that,
decodes the HTML into a DOM tree, which is an internal memory
representation of that HTML document, and decides to draw it
graphically because your browser knows what to do with HTML files.
The 'gotcha' there, though, is that we only issued one request
and we only got one HTML document. That HTML document does not contain
images. It does not contain CSS. Lucky me, there's some CSS in
there. It does not contain external CSS. It does not contain
external Javascript. It does not contain many of the pieces that
we're familiar with that put together a webpage.
So really what happens is that there are more request and
response pairs issued by the client, the user agent, to construct what
the user is intended to see on that webpage. In other words, it gets
that first bit of HTML, sees that there are 10 images it has to get,
sees that there are two CSS files it has to download, two Javascripts.
It
rinses, it repeats, it keeps on doing that until it's got everything
that it needs to render that page.
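(A sketch of that rinse-and-repeat loop in Python, using only the standard library. Real browsers do this concurrently and far more carefully; the page URL is a placeholder.)

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class SubresourceFinder(HTMLParser):
        # Collect the URLs of images, scripts, and stylesheets.
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ("img", "script") and "src" in attrs:
                self.urls.append(attrs["src"])
            elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
                self.urls.append(attrs["href"])

    page = "http://www.example.edu/"  # hypothetical site
    html = urlopen(page).read().decode("utf-8", "replace")

    finder = SubresourceFinder()
    finder.feed(html)
    for url in finder.urls:  # one more request-response pair per dependency
        print("fetching", urljoin(page, url))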
Sometimes, documents are not compound. For instance, if
you have a PDF, your browser would only issue one request. It would get
everything it needed back in that response to display the PDF file for you.
Are there any questions at this point? And I ask that now
because that's basically it, how HTTP transactions work.
There are lots of details in how a user agent can ask for
different types of things from the server, how caching works, and we're
going to get into that in a little bit. I alluded to extra status
codes--in other words, how the server can signal to the user agent a
different
way to deal with the content. We're going to get to that in a second.
But I wanted to ask if
there are any questions about that back and forth before we get there.
Yes?
Audience 2: [inaudible question about the headers on the slide]
Jason Woodward: Yup. Would you
mind if I
waited for that? Because that's kind of advanced. And I could tell you
exactly what those mean, but there's a little bit later I'm going to
get to the rest of the headers and we can touch on that then.
Audience 2: I just want to know your web address.
Jason Woodward: Sure! Here we
are.
Heweb09.jdwcornell.com/tpr1.html. Sure!
Audience 3: Is there an order for the HTML...?
Jason Woodward: That is
browser-dependent. And the short answer to that is, it's usually in the
order that the browser encounters it when parsing the initial HTML that
comes back.
There are all sorts of advanced tips and tricks for
optimizing that so it doesn't do it sequentially so they will download
them
at the same time. That's beyond the scope of this. So pretty much you
can think sequentially. But it's not necessarily the case. There's
nothing in the HTTP spec that dictates that that's what has to happen.
Everybody got it? All right.
So, talking about some of these status codes. We're going to
cover four, five of these today. A couple of really important ones for
those of you who have had to do any website redesign. How many folks
here have done a website redesign or moved URLs? See? Yeah, everybody.
And when you put up the new website, all of the sudden, all
your search results on Google don't work anymore, right? Well, it
depends. If you set up redirects from your old URLs to wherever that
new content lives, they will work.
But you also notice that sometimes the Google results won't
actually change over to the new URLs very quickly. And that is
because there is this distinction between two different types of
redirects in the HTTP protocol.
Remember we saw that status code, 200? In the case where the
server is saying, "No, this content isn't here. It's someplace else,"
the status code will be a 301 or a 302. Off the top of my head, I'm not
sure if there is any more. But we're just going to look at these two.
They mean very different things. A 302 means the server is
saying, "What you ask for is not here. It's someplace else. But that's
just temporary. Next time you ask for this resource, it might be
yet another place. Or it might be here again." This is usually used for
log-in pages. You ask for a deep link into a site. The server goes,
"Oh, you don't have cookies. You're not logged in." It will redirect
you to the log-in page, you go through that process, and then you come
back to the original one.
301, however, is very useful if you've redesigned your site
and you want to set up a mapping between old URLs and new URLs. You
want to tell folks like Google, "This old URL? Not there anymore. We've
moved it permanently to another one." If you configure your web server
to 301 redirect, Google will pick up on that a lot faster. It will
know that old URL is bad. "It's no longer where this particular content
is. Now I should look over here."
And when I say Google, I also mean all those other search
engines too that, I don't know, I guess some other people use. Nobody
from a search engine vendor here today, is there?
[Laughter]
Jason Woodward: OK. There are
pages on
Google.com in their help section and on Yahoo.com as well that explain
this. Usually if you look into their SEO pages or webmaster forums--I
did not include links to them on my page, but if somebody wants to
email me or DM me or whatever it is the kids are doing these days to
communicate, I'd be happy to point it out to them.
Let's skip this for a second and go on to that one, and then
we'll come back to the other one.
So, these are two other status codes that are useful for
communicating with... I say search engines. A better term might be
'non-human user agents'. So Googlebot is a user agent, but it's
not
a human. It's not intended to render pictures on some Google server
that people look at somewhere and key in maybe the contents of the
webpage. No, that's not how it works. Google's entirely a bunch of
computers that are going to take over the world.
So, this is a way, again, that the status code can signal from the
server--you as the server administrator or the website
administrator--"Hey, this
document isn't here anymore." There are two different ways to signal that.
404 is the one that we are all used to. If you read the spec,
it actually means something very specific that is not necessarily what
you intended. 404 means the server doesn't even know what you're asking
for. Never seen it, don't know what you're talking about.
A 410 is a more specific "This document is not here." It is
"Yup, that was there before. Maybe before the redesign. Maybe before
you removed some content. But it's not there anymore. So don't bother
coming asking me for it." Or, perhaps more usefully, "Don't have this
show up in
your search results anymore for my site."
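(Continuing the toy server sketch, the distinction looks like this; the retired path is made up.)

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class GoneHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/pre-redesign/contact.html":
                self.send_response(410)  # Gone: was here once, removed on purpose
            else:
                self.send_response(404)  # Not Found: never heard of it
            self.end_headers()

    HTTPServer(("", 8001), GoneHandler).serve_forever()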
I would also imagine those of you who use things like Google,
the GSA, the Google Mini search appliance, or other
local-to-your-university campus search engines--they also read the
status codes the same way that the major search engines do.
So these are ways that you can flush old URLs out of the index
faster after one of your units' redesigns. Although most of us--I know at Cornell,
whenever we do a redesign, we call up the GSA administrator and say,
"Could you just delete that index?" Then it's gone for a day or two
and then it re-indexes.
This is a less destructive way of doing that. And more useful
for the bigger search engines out there because you cannot call up
Google and say, "Hey, could you flush my website?"
[Laughter]
Jason Woodward: Did your server go down?
[Laughter]
Jason Woodward: You better
get on
that. It is 9 am on a Monday morning. What's
that? Exactly!
Audience 4: I can think of four or five things.
Jason Woodward: I
would've said when the boss shows up, but, you know, I'm the boss now.
I was a software engineer for 10 years, now I've been a manager for a
year. So I'm not at work. So the server can't go down, then.
So, with the exception of one joke status code, those are the
only status codes I'm going to cover today. Now I'm going to cover some
of those headers. Remember we saw the HTTP request that consisted of
the request line, and then metadata about that request?
And the response, which consisted of the response line and
metadata about that response? All those extra bits of metadata
mean something to the server, to the user agent, and to proxies in the
middle.
So, coming back to giving you content people something useful
that you might be able to take away from this--because I know you're
not going to go onto the Telnet command line and look up IP addresses
and talk to
the servers directly. It's just useful to know that stuff.
Here are some bits that control how user agents and
intermediate
proxies decide whether or not to go back to the server each time they
request a document. There's an expires header on the HTTP response.
Let's
see if we can go back to that.
Now notice there isn't one here. These are not mandatory
headers. In fact, if I'm not mistaken, there are no mandatory response
headers. Just the response line. Yeah, which is not a response header, if we
want to get very technical. That is literally the response line. And the
rest are response headers.
So expires might show up in here. This one does not have it. Oh, there we are. Oh, you saw the joke slide. Oh, well. So, the expires header says... The server says, "Hey, I know this document is going to be good for a week or a day or six months." But, say, it's an image, like a spacer GIF that never changes. You could put an expires on that of 20 years from now. And hopefully, every compliant intermediate cache and end user--or end user agent--will see the fact that this content was tagged as 'expires in 20 years', and it can sit in their cache forever.
A compliant user agent would then, next time the end user
decided to load that webpage, would look into the cache and say, "Oh, I
need that bit of information, but the server told me that it expires in
20
years. I'm not even going to bother asking the server for it."
So if you configure expires headers correctly on your content,
you can reduce server load and increase the speed at which the end user
loads your website--the speed at which your website renders for the end user.
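(A minimal sketch of computing such a far-future value in the HTTP date format; twenty years is the figure from the talk.)

    from datetime import datetime, timedelta, timezone
    from email.utils import format_datetime

    # "Good for 20 years": compliant caches need not ask again until then.
    far_future = datetime.now(timezone.utc) + timedelta(days=20 * 365)
    print("Expires: " + format_datetime(far_future, usegmt=True))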
'Last modified' is a little bit different from that. This is something that might be read off the file system. A 'last modified' timestamp, if that's what your web server is doing. If you are on a modern CMS, it will probably look at the last time you modified some of the text fields, let's say, on all the different editable areas on that particular page. Or maybe the last time you updated the template for it.
The server in the response will say, "This was last modified
last Monday." How is that useful? The next time the user agent decides
to go to
this webpage and requests this resource, it can send a request header
called 'if modified since'. And it will say, "If modified since last
Monday."
And then the server in its response, instead of sending a 200,
which remember was the "Oh, everything's OK. Here's the body of the
content," it could send a 304. Not modified. So the server sends a very
small response saying, "Hey, this hasn't been modified since the time
you asked me about."
So let's say it was a 5 meg PDF. Here you've saved downloading
another 5 meg PDF, so you get a faster response time for
rendering your page. A PDF is a bad example for that; large images on pages are a
much better example.
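(The conditional round trip, sketched with Python's http.client; the host, path, and date are placeholders.)

    import http.client

    conn = http.client.HTTPConnection("www.example.edu")
    conn.request("GET", "/report.pdf",
                 headers={"If-Modified-Since": "Mon, 05 Oct 2009 00:00:00 GMT"})
    resp = conn.getresponse()
    if resp.status == 304:
        print("Not modified -- serve the copy already in the cache")
    else:
        body = resp.read()  # 200: the full body comes down again
    conn.close()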
Cache control headers show up both ways. I won't go into the
details of what goes into a cache control header, but those are used
for either the server telling the user agent whether or not this
content should be cached. A lot of times, if you're building internet
style sites, you will want to say 'not cache' your SSL pages that
include
Social Security Numbers. Something like that. So you can configure your
server to say, "Don't
cache this stuff."
Sending from the client to the server, the user agent can say, "It's OK if you serve this to me out of your cache." So let's say you're using a CMS that takes 5 seconds to render a page. You might say to the server, "Yeah, it's OK to pull it out of your cache," instead of re-rendering it.
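(A few hypothetical Cache-Control values, written as the literal header lines, to show the header traveling in both directions.)

    # Server to user agent and proxies: never store this sensitive page.
    server_sensitive = "Cache-Control: no-store"

    # Server to user agent and proxies: fine to cache this for an hour.
    server_cacheable = "Cache-Control: max-age=3600, public"

    # User agent to server and proxies: a cached copy no older than five
    # minutes is acceptable instead of a freshly re-rendered page.
    client_request = "Cache-Control: max-age=300"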
So, HTTP 101. That's what we all came here to learn, right?
Last time I gave this at HighEdWeb, it was a great punchline for the
end because there is actually an HTTP status code 101. And there is a
cut-and-paste from the RFC. It says, "Switching protocols."
At that
time, I said, "I've never seen this used. Nobody uses this." It
essentially allows the client and the server to negotiate a different
version of the HTTP protocol or maybe even a different protocol
completely. For instance, a media streaming protocol.
Well, it turns out that there's one major piece of software
out there that uses this now. They're calling it Reverse HTTP.
You can go to--there's a link on the links page. The Second
Life client uses this. So, you've got to know a lot about network topology to know why this is super useful, but essentially the
Second Life client will connect to the Second Life servers. They will
switch roles on that existing TCP channel.
All of the sudden, the Second Life server can issue a request
to the client and say, "Get me some information." So now the client,
your client running on your desktop, is now a server. And the server
on the other end is now the client. So they sort of switch directions.
So I got flashed the 10-minute sign, and at this point, I'd
like to pop up Firefox, go to a demo webpage, show you how you can use
it to inspect the HTTP headers and show you how you can use it to get a
little neat graphical representation of the multiple HTTP requests that
come down when one particular webpage is rendered.
And then if we have some more time after that, we'll come back
and talk about more of the esoteric HTTP headers such as the keep-alive.
So, who's got a webpage they want us to demo here?
Audience 5:
Needforfeed.com.
Jason Woodward: Everybody OK
with
that? All right.
Yes, I
know. I know it mostly because I did a workshop on jQuery and AJAX,
and I was going to build something like that in that workshop. And I
decided as soon as I heard that it came out that I couldn't copy that.
I had to do something different. So we did something different.
So, I'm going to pop up the little Firebug tool down here. I'm
going to click on the net tab. Click on 'all'. Then we're going to hit
'enter'. Oh, fail.
Well, here's what happened. I hit Needforfeed.com. You saw
that the Firebug pop-up is here? It did get a response, which was a 302
re-direct to this other website, which does not have Firebug activated
on it. So, we're going to pop this up here, and we're going to refresh
this page so we can see something a little bit more entertaining.
Lots of requests going on here. If we ran YSlow on this, it
would have a field day.
[Laughter]
Jason Woodward: No,
seriously, it's
a great site. It's an awesome idea. What's that? Oh, OK. Well, then,
it's awful.
[Laughter]
Jason Woodward: No, I'm just
kidding. So, initial request. Firebug is telling us we've requested
/needforfeed/--or actually /informatic/needforfeed. Firebug sort
of shortens this here. I'll show you another tool in a second that
gives you the whole thing.
Response, 200. OK. Here's the HTML. This
little bar over here represents the time from the start of the request to the
time it was done rendering the... I think it was actually done parsing
the DOM,
which is the very technical name for the in-memory representation of the HTML behind this document.
This represents the time at which it was done downloading. And
these other bars represent when these particular dependent documents,
such as the CSS and Javascript, when the browser began downloading them
and when it finished. So you can kind of see here visually that these
four documents are probably referenced inside needforfeed.js because
the browser doesn't start getting them until it's done interpreting
that particular file.
If you're really big into site speed optimization, there are
techniques that you can use to parallelize all of those downloads a lot
earlier so the page will render faster.
So, you can also see here that the response for this particular file was '304 not modified'. That meant that the response was already in my cache. The headers that it issued on that request--cookies, referrers, these are all things you've heard about before--if modified since... It says, "If modified since September 10th," because my browser had it in its cache because I've already been to this site.
The server responds... Oops. Oh, that's right, Firebug puts
it
up here. The server responds '304 not modified', with metadata about
it. But really it's saying you don't actually have to download the
entire
content anymore. You can just read it out of your cache.
If I hit 'shift refresh' on this page--this is
browser-dependent--shift-refresh on Firefox says, "Ignore your local cache.
Get everything." And you can see down here, everything is becoming a
200 because
the browser is not issuing 'if modified since' requests.
This is a pretty handy tool. If you're doing any front-end web
development, this is indispensable. If you are a content person, it's
still pretty handy for some of the visual representation of what your
dependencies are in here. So I recommend getting it if you don't
already have it.
Last thing we're going to do, before we open it up for more
questions, is...just show you another plug-in called Live HTTP Headers.
And we're going to clear this and we're going to go to Google because
they've got less stuff.
So we've hit the Google front page here. Here you see what
actually goes over the wire with the exception of the response bodies.
So there is a GET of a slash. My browser issues all kinds of
different headers including a cookie, a user agent. Google's web server
comes back and says, "OK, expires negative one." Huh.
Probably translates
into December 31st, 1969. You can ask me later why, if you
don't know that. It identifies itself with the Server header, gives you a
content
length, which identifies how long the body is--in this case it's the
HTML for the Google front page, and it is 3500 bytes.
This tool doesn't show you the HTML content of the response
body. It also identifies the content type, which is metadata saying,
"Here's how you're supposed to interpret what's in this response." The
browser does not look at file name endings, completely ignores that,
except for IE. IE will do that sometimes.
It actually is supposed to be looking at the content type of the response. In this case, it says it's HTML. In this case, it's an image. It's got a 'last modified' of June 2006. It's got an 'expires' of, looks like, roughly a year and a day from now.
So this happens every request. Always back and forth. It's
really fast, isn't it?
Two key reminders. Each HTTP transaction is stateless,
meaning
when a browser issues a request, the server must only use the contents
of that request and any of its internal programming to compute the
value of the response. In other words, it doesn't remember that you
made this request two days ago as well or you just made another request
for these other documents. It looks at each of these requests
independently from one another.
You may say, "Well, how do sites remember I've been logged
in?" Well, they use cookies. Cookies are part of the request. But the
point there is that these HTTP transactions are completely independent
of one another.
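(A sketch of that cookie mechanism with Python's http.client: the client, not the server, carries the state and replays it on each independent request. The host, paths, and credentials are all placeholders.)

    import http.client

    conn = http.client.HTTPConnection("www.example.edu")

    # First transaction: log in; the server hands back a Set-Cookie header.
    conn.request("POST", "/login", body="user=jdw&pass=secret",
                 headers={"Content-Type": "application/x-www-form-urlencoded"})
    resp = conn.getresponse()
    resp.read()  # drain the body before reusing the connection
    cookie = resp.getheader("Set-Cookie", "").split(";")[0]

    # Second transaction: completely independent; only the replayed Cookie
    # header lets the server associate it with the first one.
    conn.request("GET", "/account", headers={"Cookie": cookie})
    print(conn.getresponse().status)
    conn.close()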
Second, the RFC defines how compliant HTTP user agents and
servers communicate with one another. There's no governing body that
says you can't make a web browser or a server that doesn't
actually speak HTTP correctly. In practice, though, the internet would
not work if everybody did not.
And with that--I've already done the demos, asked for the URL,
asked for questions--my time is up. I'm happy to answer any more
questions you might have.
[Applause]
Jason Woodward: Thank you.