# Scrapy-Xpath

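### `tag[1]` vs. `extract_first()`

A positional predicate such as `a[1]` is applied inside each parent, so it can match many nodes, whereas `extract_first()` returns only the first item of the whole result list. For example: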

```python
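response.xpath('//div[@class="title"]/a[1]')               # the first <a> inside every matching div
response.xpath('//div[@class="title"]/a').extract_first()  # the first element of the list of all <a> under those divs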
```
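The difference can be checked without a running spider. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors (it supports only a subset of XPath, but positional predicates behave the same way here). The markup is hypothetical, and indexing the result list with `[0]` plays the role of `extract_first()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two div.title blocks, each holding two links.
doc = """
<root>
  <div class="title"><a>a1</a><a>a2</a></div>
  <div class="title"><a>b1</a><a>b2</a></div>
</root>
"""
root = ET.fromstring(doc)

# a[1] is evaluated per parent: the first <a> inside each matching div.
first_in_each_div = [a.text for a in root.findall(".//div[@class='title']/a[1]")]

# Taking the first item of the full result list yields a single node,
# which is what extract_first() does in Scrapy.
all_links = root.findall(".//div[@class='title']/a")
first_overall = all_links[0].text

print(first_in_each_div)  # ['a1', 'b1']
print(first_overall)      # a1
```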

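### `nodes = response.xpath('//*/div[@class="guide"]')`

`xpath()` returns a `SelectorList`. The repr of each `Selector` in it has three parts: the `Selector` type, the `xpath` expression, and the matched `data`. (`as` is a reserved word in Python, so the result is stored in `nodes` here.)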

```python
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
```

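* `extract()` pulls the `data` part out of every selector and returns a list of strings;
* `extract_first()` returns the first matching item, or `None` when nothing matches;
* `nodes = response.xpath('//*/div[@class="guide"]/text()').extract()`: appending `/text()` returns only the text and drops the element's tags;
* `//` selects descendants at any depth;
* `/` selects direct children only;
* `*` matches every element node, and all matches go into the returned list.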

### Filtering the matched nodes with a for loop

```python
# Further filtering is only possible before calling extract()
nodes = response.xpath('//*/div[@class="guide"]')
for node in nodes:
    node.xpath("p/text()")
```

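### Substitution with the re module

`re.sub(pattern, replacement, string)`

Quantifiers such as `*` are greedy by default and match the longest possible string. Appending `?` switches to the shortest match, so the pattern below replaces each nearest `<`...`>` pair:

`a = re.sub(r'<.*?>', '', a)`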
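A small comparison makes the greedy and non-greedy behaviour concrete. The input string is a made-up example.

```python
import re

html = "<p>Hello <b>world</b></p>"

# Greedy: .* runs from the first "<" to the last ">", so everything is removed.
greedy = re.sub(r"<.*>", "", html)

# Non-greedy: .*? stops at the nearest ">", so only the tags are removed.
lazy = re.sub(r"<.*?>", "", html)

print(repr(greedy))  # ''
print(repr(lazy))    # 'Hello world'
```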

### Selecting an attribute value with XPath

```python
answer_xpath.xpath("center/img/@src").extract()
```

### Opening a JSON file: keep it in the same directory or give the full path

```python
import json

# An absolute path works regardless of the current working directory
with open("C:\\Users\\daiyifan\\pclady_wiki\\pclady_wiki\\wiki_url.json", "r", encoding="utf-8") as js:
    wiki_url = json.load(js)
```
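A relative filename is resolved against the current working directory, which is why a bare name only works when the script runs from the file's own directory. One way around this is to build the path explicitly. The sketch below is illustrative only: `load_json` is a hypothetical helper, and a temporary directory stands in for the project directory.

```python
import json
import tempfile
from pathlib import Path


def load_json(path):
    """Load a UTF-8 encoded JSON file from an explicit path."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# A temporary directory stands in for the project directory here.
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "wiki_url.json"
    json_path.write_text(json.dumps(["https://example.com/a"]), encoding="utf-8")
    wiki_url = load_json(json_path)

print(wiki_url)  # ['https://example.com/a']
```

Inside a spider module, `Path(__file__).resolve().parent / "wiki_url.json"` anchors the path to the module's own directory in the same way.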

### Selecting the last book element that is a child of bookstore

`/bookstore/book[last()]`
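The `last()` predicate can be tried with the standard library's `xml.etree.ElementTree`, which supports it as part of its XPath subset. The XML below is a made-up example. Because `ET.fromstring` returns the `<bookstore>` root itself, the path is written relative to it as `book[last()]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookstore document with three books.
xml_doc = """
<bookstore>
  <book><title>First</title></book>
  <book><title>Second</title></book>
  <book><title>Third</title></book>
</bookstore>
"""
root = ET.fromstring(xml_doc)  # root is the <bookstore> element

# book[last()] picks the final <book> child of bookstore.
last_book = root.find("book[last()]")
last_title = last_book.find("title").text

print(last_title)  # Third
```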

### Selecting div nodes that have at least one attribute

`/div[@*]`

### Filtering nodes level by level

```python
nodes = response.xpath("//*/div/p")
for node in nodes:
    i = node.xpath("span").extract()   # child <span> elements of this <p>
    j = node.xpath("@href").extract()  # href attribute of this <p>, if present
```

### Sibling nodes of an element

`following-sibling::p  # the <p> siblings that come after the current node; following-sibling is an XPath axis`

