read content in a text file in python


** read text file in python
capability: reading =text= from a text file

Some key information is crawled from a website. The website is shown below:

[Image 1]

  1. Open the IDLE text editor from a shell
    $ idle3
  2. Declare a *string* variable that holds *the path to the text file*, here =text.dat=
    >>> strPath="/home/kaiming/Documents/Python/text/text.dat"
  3. Open the file using the =open()= function
    >>> f=open(strPath)
  4. Read the contents of the file using the =read()= function
    >>> StrText=f.read()
  5. Print out the contents of the file
    >>> print(StrText)
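
The interactive steps above can also be written as a small script. This is a minimal sketch, not the author's code: the helper name =read_text=, the temporary file and its two lines of content are assumptions made so that the example runs on any machine.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the whole content of a text file as one string."""
    # The with-statement closes the file automatically after reading.
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Create a throwaway text.dat so the example does not depend on a real path
# (hypothetical content, for illustration only).
tmp_dir = tempfile.mkdtemp()
str_path = os.path.join(tmp_dir, "text.dat")
with open(str_path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

str_text = read_text(str_path)
print(str_text)
```

Passing =encoding= explicitly avoids surprises when the platform default differs from the encoding of the file.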

The information we need is "item title", "image", "review number", "price", and "star". The result is shown here:

[Image 2]


The general process of web crawling can be described as follows (from the course website):

1) The HTML file can be opened for reading ('r') or writing ('w') with the open() function. There are two ways to read it:

(1) file = open('absolute or relative file path','r');    print(file.read());    file.close()

(2) with open('absolute or relative file path','r') as file:   print(file.read())
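
The difference between the two ways can be checked with a short sketch. The file name sample.html and its content are hypothetical; the point is only that both ways read the same text, and that the "with" form closes the file without an explicit close() call.

```python
import os
import tempfile

# Hypothetical sample file, created here so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
html_path = os.path.join(tmp_dir, "sample.html")
with open(html_path, "w", encoding="utf-8") as out:
    out.write("<html><body><p>hello</p></body></html>")

# Way (1): explicit open() and close(); try/finally closes the file
# even if read() raises an exception.
file1 = open(html_path, "r", encoding="utf-8")
try:
    content1 = file1.read()
finally:
    file1.close()

# Way (2): the with-statement closes the file when the block ends.
with open(html_path, "r", encoding="utf-8") as file2:
    content2 = file2.read()

print(content1 == content2)
print(file1.closed, file2.closed)
```

Way (2) is usually preferred because the file cannot be left open by accident.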

2) A unique identifier for the target element (i.e., its CSS path) should be located in the HTML file. In the browser, the relevant commands are "Inspect" and "Copy selector".

One example of a CSS path looks like:

"body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)"

    Note: "nth-child" in the copied selector should be replaced with "nth-of-type(n)" for BeautifulSoup.

3) The CSS path should be passed to soup.select('css path') to get the result list:

"stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')"

Here "stars" is a list.
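
A minimal sketch of select() on a small inline HTML fragment follows. The fragment is an assumption made for illustration (the real page is not reproduced here), and the selector is shortened to match it. The example assumes the bs4 package is installed and uses Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two rating blocks; the second <p> of each
# block holds the star icons.
html = """
<div class="ratings">
  <p class="pull-right">15 reviews</p>
  <p><span class="glyphicon glyphicon-star"></span></p>
</div>
<div class="ratings">
  <p class="pull-right">12 reviews</p>
  <p><span class="glyphicon glyphicon-star"></span><span class="glyphicon glyphicon-star"></span></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# p:nth-of-type(2) picks the second <p> inside each div.ratings.
stars = soup.select("div.ratings > p:nth-of-type(2)")

print(isinstance(stars, list))
print(len(stars))
```

The result behaves like a normal Python list, with one tag per matching element, so it can be indexed or iterated over.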

4) To get individual results from the lists, we can use the zip() function with a "for ... in" loop to iterate through the "zipped" lists:

"for title,image,review,price,star in zip(titles,images,reviews,prices,stars):"
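
How zip() pairs up the lists can be seen with plain Python values. The three lists below are made-up sample data standing in for the tag lists returned by select().

```python
# Hypothetical sample data: one entry per item, in the same order.
titles = ["Item A", "Item B", "Item C"]
prices = ["$10.00", "$20.00", "$30.00"]
stars = [3, 5, 4]

rows = []
# zip() yields one tuple per position: (titles[0], prices[0], stars[0]), ...
for title, price, star in zip(titles, prices, stars):
    rows.append({"title": title, "price": price, "star": star})

for row in rows:
    print(row)
```

Note that zip() stops at the shortest input, so if one selector matches fewer elements than the others, the extra entries of the longer lists are silently dropped.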

5) Use the get_text(), get('src'), or get("href") methods to retrieve the desired content from each tag.

data = {

'title': title.get_text(),               # use get_text() to extract the text

'image': image.get('src'),         # use get() to extract the image link from the src attribute

'review': review.get_text(),

'price': price.get_text(),

'star': len(star.find_all("span", class_='glyphicon glyphicon-star')) * '★'

                                               # use find_all() to find the tags styled as ★

                                               # since find_all() returns a list, len() counts its elements, i.e. the number of stars

}
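
Steps 3) to 5) can be put together in one runnable sketch. The HTML fragment and the shortened selectors are assumptions made for illustration, not the real page or the course's exact code. The example again assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical product card with 3 filled stars and 1 empty star.
html = """
<div class="thumbnail">
  <img src="img/item1.jpg" alt="">
  <div class="caption">
    <h4 class="pull-right">$24.99</h4>
    <h4><a href="#">First item</a></h4>
  </div>
  <div class="ratings">
    <p class="pull-right">15 reviews</p>
    <p>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star-empty"></span>
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each select() call returns a list of tags (selectors are illustrative).
titles = soup.select("div.caption > h4 > a")
images = soup.select("div.thumbnail > img")
reviews = soup.select("div.ratings > p.pull-right")
prices = soup.select("div.caption > h4.pull-right")
stars = soup.select("div.ratings > p:nth-of-type(2)")

results = []
for title, image, review, price, star in zip(titles, images, reviews, prices, stars):
    results.append({
        "title": title.get_text(),    # text inside the tag
        "image": image.get("src"),    # value of the src attribute
        "review": review.get_text(),
        "price": price.get_text(),
        # A multi-class string in class_ matches the exact class attribute,
        # so "glyphicon glyphicon-star-empty" is not counted.
        "star": len(star.find_all("span", class_="glyphicon glyphicon-star")) * "★",
    })

print(results)
```

Multiplying the string "★" by the count gives a compact visual rating, for example three stars for three matching span tags.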

