Using Last-Modified and ETag in Python to avoid downloading duplicate content

Posted on Wed 09 November 2011 in it

HTTP/1.1 headers for avoiding repeated downloads

HTTP/1.1 already defines headers for avoiding repeated downloads. See Section 14, Header Field Definitions, of the HTTP/1.1 specification: 14.19 ETag, 14.24 If-Match, 14.25 If-Modified-Since, 14.26 If-None-Match, and 14.29 Last-Modified.

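By using the Last-Modified and ETag response headers together with the matching conditional request headers (If-Modified-Since and If-None-Match), a developer can make effective use of the local cache and cut down on needless repeated downloads.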
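As a rough illustration of how the headers pair up, the sketch below maps the validators of an earlier response to the conditional headers of the next request. The helper name `next_request_headers` and the plain-dict representation of headers are assumptions made for this example only.

```python
def next_request_headers(response_headers):
    """Map validators from a previous response to conditional request headers.

    Hypothetical helper: response_headers is a plain dict of name -> value.
    """
    # Header names are case-insensitive, so normalise before looking them up.
    previous = {name.lower(): value for name, value in response_headers.items()}
    conditional = {}
    if previous.get("etag"):
        # ETag from the response goes back as If-None-Match.
        conditional["If-None-Match"] = previous["etag"]
    if previous.get("last-modified"):
        # Last-Modified from the response goes back as If-Modified-Since.
        conditional["If-Modified-Since"] = previous["last-modified"]
    return conditional
```

A response that carried neither validator yields an empty dict, so the next request is an ordinary unconditional GET.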

Logic of the sample code

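  1. The client downloads a URL (the Sample);
  2. The server returns the content, and the Sample records the Last-Modified/ETag values from the response;
  3. The client downloads the same URL again, sending the Last-Modified/ETag values from the previous response back to the server;
  4. The server checks the Last-Modified or ETag value, determines that the page has not been modified since the client's last request, and returns a 304 response with an empty body.

Dive Into Python has quite detailed example code for this. Python programmers who have not read the book are strongly encouraged to study it; it will improve both object-oriented programming and network programming skills.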
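The server-side check in step 4 might look roughly like the sketch below. It is a simplification under stated assumptions: `conditional_status` is a hypothetical helper, headers are passed as a plain dict, and ETags are compared as exact strings (weak validators, comma-separated lists and `*` are ignored).

```python
from email.utils import parsedate_to_datetime


def conditional_status(request_headers, current_etag, current_last_modified):
    """Return 304 if the client's cached copy is still valid, else 200.

    Hypothetical helper: the two "current" arguments are the validators
    of the resource as it exists on the server right now.
    """
    client_etag = request_headers.get("If-None-Match")
    if client_etag is not None:
        # When an ETag is supplied, it decides the outcome on its own.
        return 304 if client_etag == current_etag else 200

    since = request_headers.get("If-Modified-Since")
    if since is not None and current_last_modified is not None:
        try:
            unchanged = (parsedate_to_datetime(current_last_modified)
                         <= parsedate_to_datetime(since))
        except (TypeError, ValueError):
            return 200  # unparseable date: fall back to a full response
        return 304 if unchanged else 200

    return 200  # no validators sent: always a full response
```

A real server would send an empty body along with the 304, which is what the client code in this post relies on.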

Sample code

#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on Nov 9, 2011

@author: li3huo
'''

import urllib2


class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
    """Return the HTTPError instead of raising it, so that the HTTP status
    (for example 304) is recorded on the result and reaches the caller.
    """
    def http_error_default(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(
            req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

class Sample(object):
    """A Sample is a URL to download, plus the validators from its last response.
    """
    url = None
    contentLength = 0
    etag = None
    lastModified = None
    data = None
    path = None

    def __init__(self, url, contentLength=0, etag=None, lastModified=None):
        self.url = url
        self.contentLength = contentLength
        self.etag = etag
        self.lastModified = lastModified
        self.status = 200

    def __repr__(self):
        return repr("Http Status=%d; Length=%d; Last Modified Time=%s; eTag=%s" % (self.status, self.contentLength, self.lastModified, self.etag))

    def downloadSample(self):
        request = urllib2.Request(self.url)
        request.add_header('User-Agent', "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en)")
        if self.lastModified:
            request.add_header('If-Modified-Since', self.lastModified)
        if self.etag:
            request.add_header('If-None-Match', self.etag)
        conn = urllib2.build_opener(DefaultErrorHandler()).open(request)


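        if hasattr(conn, 'headers'):
            # save ETag, if the server sent one; otherwise keep the previous value
            self.etag = conn.headers.get('ETag', self.etag)
            # save Last-Modified header, if the server sent one
            self.lastModified = conn.headers.get('Last-Modified', self.lastModified)

            # Content-Length arrives as a string; convert it so %d formatting works
            length = conn.headers.get('Content-Length')
            self.contentLength = int(length) if length else 0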

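        if hasattr(conn, 'status'):
            # set by DefaultErrorHandler for non-2xx responses such as 304
            self.status = conn.status
            print "status=%d" % self.status
        else:
            # a normal response has no .status attribute; reset a stale 304
            self.status = 200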

        self.data = conn.read()

        if self.status == 304:
            print "the content is same, so return nothing!"

        if not self.contentLength:
            self.contentLength = len(self.data)

        conn.close()


if __name__ == '__main__':
    url = 'http://www.sina.com.cn'
    sample = Sample(url)
    sample.downloadSample()
    print sample
    sample.downloadSample()
    print sample

Output

'Http Status=200; Length=589988; Last Modified Time=Wed, 09 Nov 2011 10:45:55 GMT; eTag=None'
status=304
the content is same, so return nothing!
'Http Status=304; Length=0; Last Modified Time=Wed, 09 Nov 2011 10:45:55 GMT; eTag=None'
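The script above targets Python 2 (`urllib2` and the `print` statement). For readers on Python 3, a rough equivalent might look like the sketch below. It is not the original author's code: `build_request` and `fetch` are hypothetical names, and the sketch relies on the fact that `urllib.request.urlopen` raises `HTTPError` for a 304 instead of returning it, so no custom error handler is needed.

```python
import urllib.error
import urllib.request


def build_request(url, etag=None, last_modified=None):
    """Create a GET request that carries whichever validators we have."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch(url, etag=None, last_modified=None):
    """Return (status, body, etag, last_modified) for a conditional GET."""
    request = build_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            headers = response.headers
            return (response.status, response.read(),
                    headers.get("ETag"), headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code != 304:
            raise
        # 304: the body is empty and the cached copy is still valid,
        # so keep the old validators if the server did not repeat them.
        return (304, b"", err.headers.get("ETag", etag),
                err.headers.get("Last-Modified", last_modified))
```

A caller would keep the returned `etag` and `last_modified` and pass them to the next `fetch` call for the same URL. On a 304 it would reuse the body it saved earlier.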