
news 2025/12/2 6:18:35

With the groundwork from the previous lessons behind us, we can finally get our hands dirty with something real. Before the hands-on part, though, we still need to cover some practical methods and the extension syntax of regular expressions, so bear with me a little longer. Let's start by going through the documentation.

The first example is the method we have talked about most: search(). It exists at the module level, where you call re.search() directly, and compiled regular expression pattern objects have a search() method as well. Let me ask you: is there any difference between the two? If your only answer is "the module-level search() takes one more argument, the regular expression, than the pattern-level search()", then you certainly haven't read the documentation.

    re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

This is the module-level search(). Pay attention to its parameters. It has a flags parameter, which takes the compilation flags we covered in the last lesson: a module-level call has no separate compile step, so you pass the flags right here. pattern is the regular expression pattern, and string is the string to search. Now let's see which parameters the search() method of a compiled pattern object takes:

    regex.search(string[, pos[, endpos]])

Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).
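To make the pos and endpos behavior in the quoted documentation concrete, here is a minimal sketch. The sample text and the patterns are made up purely for illustration; only re.search(), re.compile() and the pattern object's search() come from the text above.

```python
import re

# Hypothetical sample text and patterns, chosen only for illustration.
text = "spam eggs spam"
rx = re.compile(r"spam")

# Module-level call: the pattern is an argument, and flags are passed here.
first = re.search(r"SPAM", text, flags=re.IGNORECASE)

# Pattern-object call: pos=1 skips the occurrence that starts at index 0.
second = rx.search(text, 1)

# endpos=13 treats the string as 13 characters long, cutting off the last "m".
none_found = rx.search(text, 1, 13)

# pos is not the same as slicing: ^ still refers to the real start of the string.
anchored = re.compile(r"^eggs")
with_pos = anchored.search(text, 5)     # index 5 is not the real beginning
with_slice = anchored.search(text[5:])  # the slice really does start with "eggs"

print(first.span(), second.span(), none_found, with_pos, with_slice.span())
```

The last two calls show why the documentation says pos is "not completely equivalent to slicing": the same pattern fails with pos=5 but succeeds on the sliced string.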
The leading pattern argument is no longer needed, because the pattern is the object itself. The first argument, string, is the string to search. After it come two optional parameters that the module-level search() does not have: the start position (pos) and the end position (endpos) of the search. So you can write rx.search(string, 0, 50), or equivalently rx.search(string[:50], 0), to restrict where the search looks.

One more point that is easy to overlook: search() does not hand you the string you are after. It returns a match object instead. Here is an example:

    >>> import re
    >>> result = re.search(r" (\w+) (\w+)", "I love Python.com")
    >>> result
    <_sre.SRE_Match object; span=(1, 13), match=' love Python'>

As you can see, result is a match object, not a string. A match object has methods, and you use those methods to get the matched string you need, for example group():

    >>> result.group()
    ' love Python'

That prints the matched content. The pattern is a space, then \w+ (one or more word characters: letters, digits or underscore), which here is love, then another space, then \w+ again, which here is Python.

Speaking of group(), it is worth mentioning that if the regular expression contains subgroups, each subgroup captures what it matched. Passing an index to group() retrieves the string captured by the corresponding subgroup (indices start at 1). For example:

    >>> result.group(1)
    'love'
    >>> result.group(2)
    'Python'

Besides group(), a match object also has start(), end() and span(), which return the start position, the end position and the range of the match.

    match.start([group])
    match.end([group])

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

    m.string[m.start(g):m.end(g)]

Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.

An example that will remove remove_this from email addresses:

    >>> email = "tony@tiremove_thisger.net"
    >>> m = re.search("remove_this", email)
    >>> email[:m.start()] + email[m.end():]
    'tony@tiger.net'

    match.span([group])

For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

    >>> result.start()
    1
    >>> result.end()
    13
    >>> result.span()
    (1, 13)

Next, let's talk about the findall() method:

    re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings.
The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Some of you may think findall() is easy: it just finds everything that matches and returns it as a list, right? That is correct as long as the regular expression contains no subgroups. Once it does contain subgroups, findall() gets "clever".

Let's take an example: scraping images from Tieba. Say we want to download every image on a Tieba page. We scout the page first, and once we have seen what the image tags look like we can write code right away. We start with the following to collect the image addresses:

    import re

    # html holds the decoded page source
    p = r'<img class="BDE_Image" src="[^"]+\.jpg"'
    imglist = re.findall(p, html)
    for each in imglist:
        print(each)

The printed result is:

    RESTART: C:\Users\XiangyangDai\Desktop\tieba.py
    <img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=65ac7c3d9e0a304e5222a0f2e1c9a7c3/4056053b5bb5c9ea8d7d0bdadc39b6003bf3b34e.jpg"
    <img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=d887aa03394e251fe2f7e4f09787c9c2/77f65db5c9ea15ceaf60e830bf003af33b87b24e.jpg"
    <img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=0db90d472c1f95caa6f592bef9167fc5/2f78cfea15ce36d34f8a8b0933f33a87e850b14e.jpg"
    <img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=abfd18169ccad1c8d0bbfc2f4f3f67c4/bd2713ce36d3d5392db307fa3387e950342ab04e.jpg"

Clearly these are not the addresses we need; we only want the part after src. So the next problem is how to pull the address out of each match. Many of you are probably itching to start coding already, but hold on, there is a better way. Just wrap the image address in parentheses, that is, change

    p = r'<img class="BDE_Image" src="[^"]+\.jpg"'

to

    p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'

Now look at the output again:

    RESTART: C:\Users\XiangyangDai\Desktop\tieba.py
    https://imgsa.baidu.com/forum/w%3D580/sign=65ac7c3d9e0a304e5222a0f2e1c9a7c3/4056053b5bb5c9ea8d7d0bdadc39b6003bf3b34e.jpg
    https://imgsa.baidu.com/forum/w%3D580/sign=d887aa03394e251fe2f7e4f09787c9c2/77f65db5c9ea15ceaf60e830bf003af33b87b24e.jpg
    https://imgsa.baidu.com/forum/w%3D580/sign=0db90d472c1f95caa6f592bef9167fc5/2f78cfea15ce36d34f8a8b0933f33a87e850b14e.jpg
    https://imgsa.baidu.com/forum/w%3D580/sign=abfd18169ccad1c8d0bbfc2f4f3f67c4/bd2713ce36d3d5392db307fa3387e950342ab04e.jpg

Excited? Surprised? Hold that thought: let me finish typing the code first, then I'll explain.

    import urllib.request
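    import re

    def open_url(url):
        req = urllib.request.Request(url)
        req.add_header('User-Agent',
                       'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
        # Open the Request object (not the bare url) so the header is actually sent
        response = urllib.request.urlopen(req)
        html = response.read()
        return html

    def get_img(url):
        html = open_url(url).decode('utf-8')
        p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'
        imglist = re.findall(p, html)
        for each in imglist:
            print(each)
        for each in imglist:
            # Use the last path segment of the address as the local file name
            filename = each.split("/")[-1]
            urllib.request.urlretrieve(each, filename, None)

    if __name__ == '__main__':
        url = "https://tieba.baidu.com/p/4863860271"
        get_img(url)

Run it and a pile of pictures appears on the desktop (assuming the program is run from the desktop; the images are saved into the folder the program runs in).

Now to clear up the confusion: why is adding a pair of parentheses so convenient? Because in findall(), if the regular expression contains a subgroup, only the content of the subgroup is returned. And if there are several subgroups, the captured contents are combined into a tuple for each match. Let's look at another example, because findall() used carelessly leaves a lot of students puzzled and lost.

Take the regular expression for matching IP addresses from earlier. We use findall() to try to collect IP addresses automatically from https://www.xicidaili.com/wt/. The first version of the code is:

    import urllib.request
    import re

    def open_url(url):
        req = urllib.request.Request(url)
        req.add_header('User-Agent',
                       'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
        response = urllib.request.urlopen(req)
        html = response.read()
        return html

    def get_ip(url):
        html = open_url(url).decode('utf-8')
        # [01] is a character class: an optional leading 0 or 1
        p = r'(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])'
        iplist = re.findall(p, html)
        for each in iplist:
            print(each)

    if __name__ == '__main__':
        url = "https://www.xicidaili.com/wt/"
        get_ip(url)

The output is:

    RESTART: C:\Users\XiangyangDai\Desktop\getIP.py
    ('180.', '180', '122')
    ('248.', '248', '79')
    ('129.', '129', '198')
    ('217.', '217', '7')
    ('40.', '40', '35')
    ('128.', '128', '21')
    ('118.', '118', '106')
    ('101.', '101', '46')
    ('3.', '3', '4')

This result is baffling. Why does it look like this? It is obviously not what we wanted. The reason is that the regular expression uses 3 subgroups, so findall() tries to be smart, sorts our result into groups and returns each match as a tuple. Because the first subgroup is repeated three times, it only keeps its last repetition, which is why each tuple shows just the third and fourth parts of an address.

Is there a fix? Yes: we can tell the subgroups not to capture. Checking "Python3 正则表达式特殊符号及用法详细列表" (the detailed list of Python 3 regular expression special symbols and their usage) for the extension syntax, the one that stops a subgroup from capturing is the non-capturing group, (?:...). So the first version of the code becomes:

    import urllib.request
    import re

    def open_url(url):
        req = urllib.request.Request(url)
        req.add_header('User-Agent',
                       'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
        response = urllib.request.urlopen(req)
        html = response.read()
        return html

    def get_ip(url):
        html = open_url(url).decode('utf-8')
        # Every group is now non-capturing, so findall() returns whole matches
        p = r'(?:(?:[01]?\d?\d|2[0-4]\d|25[0-5])\.){3}(?:[01]?\d?\d|2[0-4]\d|25[0-5])'
        iplist = re.findall(p, html)
        for each in iplist:
            print(each)

    if __name__ == '__main__':
        url = "https://www.xicidaili.com/wt/"
        get_ip(url)

This time the output is the IP addresses we wanted:

    RESTART: C:\Users\XiangyangDai\Desktop\getIP.py
    183.47.40.35
    61.135.217.7
    221.214.180.122
    101.76.248.79
    182.88.129.198
    175.165.128.21
    42.48.118.106
    60.216.101.46
    219.245.3.4
    117.85.221.45

Back to the documentation once more. There are some other useful methods as well: finditer() returns the results as an iterator, which is convenient when you want to walk through the matches one at a time, and sub() performs substitution. "Python3 正则表达式特殊符号及用法详细列表" also lists some further special syntax, for example:

    (?=...)   positive lookahead assertion
    (?!...)   negative lookahead assertion
    (?<=...)  positive lookbehind assertion
    (?<!...)  negative lookbehind assertion

All of these are very useful, but that is a lot of material. If we spent the whole course on regular expressions, the sideshow would take over the main act, and our main topic is web crawlers. So please keep studying on your own: read more, learn more, practice more.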
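If you would like to try the findall() group behavior and the methods mentioned in closing without hitting a live site, the following is a minimal offline sketch. The sample strings are hypothetical stand-ins for downloaded page text, and the variable names are only for illustration; the calls themselves (re.findall, re.finditer, re.sub and the four lookaround forms) are standard parts of the re module.

```python
import re

# Hypothetical sample text standing in for a downloaded proxy page.
html = "proxy 183.47.40.35:8080, proxy 61.135.217.7:80"

octet = r"(?:[01]?\d?\d|2[0-4]\d|25[0-5])"
capturing = r"(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])"
non_capturing = r"(?:" + octet + r"\.){3}" + octet

# With capturing groups, findall() returns one tuple of group contents per match.
tuples = re.findall(capturing, html)

# With non-capturing groups, findall() returns the whole matched strings.
ips = re.findall(non_capturing, html)

# finditer() yields match objects one at a time, so positions are available too.
found = [(m.group(), m.start()) for m in re.finditer(non_capturing, html)]

# sub() replaces every match with the given replacement string.
masked = re.sub(non_capturing, "x.x.x.x", html)

# Lookaround assertions check the context without consuming it.
names = "fish.py fish.txt cat.py"  # hypothetical file names
py_stems = re.findall(r"\w+(?=\.py)", names)         # followed by ".py"
not_py = re.search(r"fish(?!\.py)", names).start()   # "fish" not followed by ".py"
extensions = re.findall(r"(?<=\.)\w+", names)        # preceded by "."
stems = re.findall(r"(?<!\.)\b\w+", names)           # words not preceded by "."

print(tuples)
print(ips)
print(found)
print(masked)
print(py_stems, not_py, extensions, stems)
```

Comparing tuples with ips shows the same effect as the two getIP.py runs: the pattern matches the same text either way, and only the grouping decides what findall() hands back.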
