miercuri, 22 martie 2017

Scraping the web with Python

http://bit.do/diWGs Hidden discounts at HeliumScraper.
Learn how to scrape the web with Python: https://www.datacamp.com/courses/importing-data-in-python-part-2 Wow! you have just scraped HTML data from the web and you've done so using two different packages: urllib and requests. You also saw that requests provided a higher-level interface in that you needed to write less lines of to retrieve the relevant HTML as a string. You've got the HTML of your page of interest but, generally HTML is a humble-jumble mix of both unstructured and structured data. A word on these terms: structured data is data that has a pre-defined data model or that is organized in a defined manner. unstructured data is data that does not possess either of these properties. HTML is interesting because, although much of it is unstructured text, it does contain tags that determine where, for examples, headings can be found, and hyperlinks. In general, to turn HTML that you have scraped from the world wide web into useful data, you'll need to parse it and extract structured data from it. In this video and the next few interactive exercises, we'll provide a brief introduction to how you can perform such tasks using the Python package Beautiful Soup. Lets check out the package's website: The first words at the top are: You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects. Firstly, a word on the name of the package: Beautiful Soup? In web development, the term "tag soup" refers to structurally or syntactically incorrect HTML code written for a web page. What Beautiful Soup does best is to make tag soup beautiful again and to extract information from it with ease! In fact, the main object created and queried when using this package is called BeautifulSoup and it has a very important associated method called prettify! Lets now see Beautiful Soup in Beautiful Action! Once again, you use requests to scrape the HTML from the web. Then you create a BeautifulSoup object from the resulting HTML and prettify. Printing the prettified Soup and the original HTML, you can see that for, example, the prettified Soup is indented in the way you would expect properly written HTML to be. You'll explore a few of the methods that you can apply to your soupified HTML in the following exercises, such as title and get_text, which extract the title and text, respectively. You'll also work with the Soupy method find_all in order to extract the URLs of all of the hyperlinks in the HTML. These are merely a few of many methods existing in BeautifulSoup to extract data from HTML. If, after completing these exercises, you find yourself thirsting for more BeautifulSoup, there are plenty of great resources on their website. OK: now it's your turn to jump into the deep end of the proverbial soup bowl! Happy hacking!