Using BeautifulSoup to find an HTML tag that contains certain text

Nice programing

nicepro 2020. 12. 1. 19:43

I'm trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched using:

soup('h2',text=re.compile(r' #\S{11}'))

The results look something like this:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

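I can get all of the matching text (see the output above). But I want the parent element of the text to be what matches, so that I can use it as a starting point for traversing the document tree. In this case, I'd want all of the h2 elements returned, not the text matches.

Any ideas?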
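One detail of the pattern is worth checking with the standard library alone. As far as I know, BeautifulSoup applies a compiled pattern with search, so " #\S{11}" accepts any run of at least 11 non-space characters after the "#", which would explain why 12-character IDs show up in the results. The sketch below compares it with a stricter variant; the names at_least_11 and exactly_11 are made up for illustration.

```python
import re

# Pattern from the question: " #" followed by 11 non-space characters.
# With search(), longer runs match too, since only the first 11 are consumed.
at_least_11 = re.compile(r" #\S{11}")

# Hypothetical stricter variant: the 11 characters must not be followed
# by another non-space character.
exactly_11 = re.compile(r" #\S{11}(?!\S)")

samples = [
    "this is cool #12345678901",   # 11 characters after '#'
    "blahblah #223409823523",      # 12 characters after '#'
    "this is nothing",             # no '#' at all
]

for s in samples:
    print(s, bool(at_least_11.search(s)), bool(exactly_11.search(s)))
```

If only IDs of exactly 11 characters should count, the stricter variant could be passed to BeautifulSoup in place of the original pattern.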

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


# Each match is a NavigableString; .parent is the tag that contains it.
for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
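For readers on Python 3, the same idea can be sketched with the bs4 package. This is a port made under assumptions, not the answer's original code: it assumes bs4 is installed and uses the built-in html.parser, and parents_of_matches is a hypothetical helper name. Since the pattern is tested against every text node, the h1 in the sample also matches, so the helper can optionally keep only parents with a given tag name.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""


def parents_of_matches(soup, pattern, tag_name=None):
    """Return the parent tag of every text node that matches pattern.

    Hypothetical helper, not part of bs4. If tag_name is given,
    only parents with that tag name are kept.
    """
    parents = []
    for text_node in soup.find_all(string=pattern):
        parent = text_node.parent
        if tag_name is None or parent.name == tag_name:
            parents.append(parent)
    return parents


soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Restrict the result to h2 elements, as the question asks.
for tag in parents_of_matches(soup, pattern, tag_name="h2"):
    print(tag)
```

Leaving tag_name out returns the parent of every matching text node, whatever its tag.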

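BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as the criterion, as opposed to BeautifulSoup.Tag objects in other cases. Inspect an object's __dict__ to see which attributes are available to you. Of these attributes, parent is preferred over previous because of changes in BS4.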

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
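The NavigableString versus Tag distinction can also be checked under bs4 on Python 3. The following is a minimal sketch, assuming bs4 is installed and using the built-in html.parser; the one-line document is made up for illustration. It uses string=, the bs4 spelling of the text= argument.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r"cool")

text_node = soup.find(string=pattern)  # searching by text only
heading = soup.find("h2")              # searching by tag name only

# Show what kind of object each search returned.
print(type(text_node).__name__)
print(type(heading).__name__)

# The text node's parent is the very tag object found by name.
print(text_node.parent is heading)
```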

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
# Newer bs4 releases call this argument string=; text= is the older name.
soup('h2',text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
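Since the goal in the question was to use the matched h2 elements as a starting point for walking the tree, here is a small sketch of that next step. It assumes bs4 with the built-in html.parser, and the sample document (headings followed by paragraphs) is invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document: each heading is followed by a paragraph.
html_text = """
<h2>this is cool #12345678901</h2>
<p>first paragraph</p>
<h2>this is nothing</h2>
<p>second paragraph</p>
"""

soup = BeautifulSoup(html_text, "html.parser")

# string= is the newer spelling of text= in bs4 (added in 4.4.0).
headings = soup.find_all("h2", string=re.compile(r" #\S{11}"))

for heading in headings:
    # find_next_sibling("p") skips the whitespace text node between tags
    # and returns the next sibling tag named p.
    following = heading.find_next_sibling("p")
    print(heading.get_text(), "->", following.get_text())
```

Because find_all returns Tag objects here, no .parent hop is needed before navigating.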

Reference URL: https://stackoverflow.com/questions/866000/using-beautifulsoup-to-find-a-html-tag-that-contains-certain-text
