From: Bill Sconce
Subject: [GNHLUG] PySIG this Thursday - Beautiful Soup - HTML scraping; actual code developed before your eyes
Date: Monday, October 22, 2007 11:02:25 PM

PySIG                    Manchester, NH                  25 October 2007
------------------------------------------------------------------------
    Kent Johnson: Beautiful Soup
    Us Ourselves: Python in Action
------------------------------------------------------------------------
____________________________________________________________________

PySIG -- New Hampshire Python Special Interest Group
Amoskeag Business Incubator, Manchester, NH
25 October 2007 (4th Thursday)    7:00PM

The monthly meeting of PySIG, the NH Python Special Interest Group,
takes place on the fourth Thursday of the month, starting at 7:00 PM.
Beginners' session precedes at 6:30 PM. (Bring a Python question!)

--------------------------------------------------------------------
Kent's Korner - Kent Johnson: Beautiful Soup
--------------------------------------------------------------------
"Beautiful Soup is a Python HTML/XML parser designed for quick
turnaround projects like screen-scraping. Three features make it
powerful:

  1. Beautiful Soup won't choke if you give it bad markup. It yields
     a parse tree that makes approximately as much sense as your
     original document. This is usually good enough to collect the
     data you need and run away.

  2. Beautiful Soup provides a few simple methods and Pythonic idioms
     for navigating, searching, and modifying a parse tree: a toolkit
     for dissecting a document and extracting what you need. You
     don't have to create a custom parser for each application.

  3. Beautiful Soup automatically converts incoming documents to
     Unicode and outgoing documents to UTF-8. You don't have to think
     about encodings, unless the document doesn't specify an encoding
     and Beautiful Soup can't autodetect one. Then you just have to
     specify the original encoding.

"Beautiful Soup parses anything you give it, and does the tree
traversal stuff for you.
You can tell it 'Find all the links', or 'Find all the links of class
externalLink', or 'Find all the links whose urls match "foo.com"', or
'Find the table heading that's got bold text, then give me that text.'

"Valuable data that was once locked up in poorly-designed websites is
now within your reach. Projects that would have taken hours take only
minutes with Beautiful Soup."

--------------------------------------------------------------------
1st-ever PySIG development sprint: we try to write Actual Code
--------------------------------------------------------------------
Per a Challenge from Ted Roche. Viz,

"Recently, I started messing with some of the data we have stored on
GNHLUG.org, specifically, the Past Events page:

    http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents

"Just like an end user, I had some questions, simple to ask, tough to
answer. We have attendance data for most meetings since September of
2005. Given those dates to present,

  1. What's the average monthly attendance at a GNHLUG event?
  2. What's the average attendance over that period per
     group/chapter/SIG?
  3. What's the most popular meeting? Most popular for each group?
  4. What's the trends in attendance? Up, down?

"It seems like an interesting real-world problem: scrape a web page of
questionable HTML, interpret dirty data (not all groups are groups,
not all attendance numbers are numbers), dump it into a database (or
perhaps a spreadsheet?) and do the calculations. Pretty graphs get
extra points.

"I think the results would be interesting, and the process of getting
to the results interesting, too: presenting how you take on the
problem, what tools you use, how much code is needed, would make a fun
meeting not only inside the SIG, but to LUG meetings as well."
    --posted to PySIG mailing list 26 Sept 07

And the group thought so too... so here we go!
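
For a taste of what the challenge involves, here is a minimal sketch of
the scrape-clean-average steps Ted describes. It is an illustration
only, not the code the sprint will produce. It uses the standard
library's html.parser instead of Beautiful Soup so that it runs with
nothing installed. The sample table, its three-column layout and the
group name "ExampleLUG" are invented for the example; the real
PastEvents page will look different.

```python
from collections import defaultdict
from html.parser import HTMLParser
from statistics import mean

# Invented sample data: NOT the real PastEvents page. One row has an
# unclosed <td>/<tr>, one has a non-numeric count, one has no count.
SAMPLE_HTML = """
<table>
<tr><th>Date</th><th>Group</th><th>Attendance</th></tr>
<tr><td>2007-08-23</td><td>PySIG</td><td>12</td></tr>
<tr><td>2007-09-27</td><td>PySIG</td><td> 16 </td></tr>
<tr><td>2007-09-05</td><td>ExampleLUG</td><td>20
<tr><td>2007-09-19</td><td>ExampleLUG</td><td>n/a</td></tr>
<tr><td>2007-09-12</td><td>(no meeting)</td><td></td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every table cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row also ends any cell left open by sloppy markup.
            self.rows.append([])
            self._in_cell = False
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def average_attendance(html):
    """Return {group: mean attendance}, skipping rows without a count."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    counts = defaultdict(list)
    for row in parser.rows:
        if len(row) != 3:
            continue  # not a date/group/attendance row
        _date, group, attendance = (cell.strip() for cell in row)
        if attendance.isdigit():  # drops the header, "n/a" and blanks
            counts[group].append(int(attendance))
    return {group: mean(values) for group, values in counts.items()}


averages = average_attendance(SAMPLE_HTML)
for group, value in averages.items():
    print(group, value)
```

Dropping every row whose count is not a plain number is the simplest
possible answer to "not all attendance numbers are numbers"; a real
solution might prefer to log those rows and look at them by hand.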
We'll throw code up on the screen, develop the screenscraper Ted
envisions (or as much of it as we can get done in the available time)
and publish the results. A real world test of RAD, as empowered by
Python (and its "batteries-included" libraries -- in this case
Beautiful Soup).

All are welcome. Come to help us code. Or come to laugh :)

Plus:
-------------------------------------------------------------------
  o Kent's Korner           The Real Stuff! - Kent Johnson
      This month:
        Beautiful Soup (see above)
      Upcoming Kent's Korner topics:
        XML parsing (ElementTree)
        Profiling (timeit, prof)
  o Our usual roundtable of introductions, happenings, announcements
  o Gotcha contest - Got a favorite "gotcha"? Bring it and share...

And of course, milk & cookies. Cookies are assured, thanks to Janet.
Milk also, thanks to Alex.
-------------------------------------------------------------------
  6:30   Beginners' Q&A
  7:00   Welcome, Announcements - Bill & Ted & Alex
  7:10   Milk & Cookies - Alex & Janet
  7:10   Favorite-gotcha contest
  7:15   Kent's Korner (Python Module of the Month) - Beautiful Soup
  7:45   Development Sprint! Web-Scraping; the Ted Roche Challenge
  9:00~  Adjourn
___________________________________________________________________
About PySIG:

PySIG meetings are typically 10-20 people, around a large table
equipped with a projector and Internet hookups (wired and wireless).
We encourage laptops and a hands-on seminar style. The main meeting
starts at 7 PM; officially we finish circa 9 PM.

Everyone is welcome. ("Membership" is anyone who has an interest in
the Python programming language, whether on Microsoft systems or Linux
or OS X; or cell phones, mainframes, or space stations. We have
everyone from object-oriented gurus to recovering COBOL programmers.)
Tell your friends!

Beginners' session:

The half hour before the formal meeting (i.e., starting at 6:30PM) we
have a beginners' session. Any Python question is welcome -- whoever
asks the first question gets the half hour!
Questions are equally welcome by mail beforehand (in which case we can
announce them) or at the meeting. (As are all Python questions,
anytime.)

Mailing list:
    http://www.dlslug.org/mailman/listinfo/python-talk

About Python:

"Python is a dynamic object-oriented programming language that can be
used for many kinds of software development. It offers strong support
for integration with other languages and tools, comes with extensive
standard libraries, and can be learned in a few days. Many Python
programmers report substantial productivity gains and feel the
language encourages the development of higher quality, more
maintainable code."

"NASA uses Python...so does Rackspace, Industrial Light&Magic,
AstraZeneca, Honeywell, and many others."

Google: "Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves."
    -Peter Norvig

    http://www.python.org

About Amoskeag Business Incubator:

Our gracious hosts are the Amoskeag Business Incubator, an
organization providing a supportive entrepreneurial environment that
stimulates the growth of businesses to ensure economic vitality and
encourage job creation, by providing affordable office space and
technical assistance to early stage companies. PySIG thanks the ABI
for their generous hospitality.

    http://www.abi-nh.com
_______________________________________________________________________
Directions (thanks to Ted Roche for improvements to "from the north"):

PySIG NH meetings are held at the Amoskeag Business Incubator,
33 South Commercial Street, Manchester, NH.

Coming in to Manchester using I-293, from the north:
  o Use Exit 6 from I-293. Stay to the right on the ramp, yield twice
    to traffic incoming from the left, cross back over I-293 and
    accept one merge coming in from your right.
  o Then get in the right lane, and stay there, over the river, and
    onto the Canal Street exit ramp.
  o Take the first right off Canal Street onto North Commercial Street.
    Enjoy the scenic mill buildings as the street turns into
    Commercial Street.
  o Coming to the traffic light get in the middle lane. South
    Commercial Street starts on the other side of the light. You go
    straight through (and join the folks coming from the south at
    step * below).

Coming in to Manchester using I-293, from the south:
  o Use the Granite Street exit. Turn right (east). Go under I-293
    and cross the bridge over the Merrimack River.
  o Turn right (south) at the first light after crossing the bridge.

  * This is South Commercial Street. Go past one parking-lot entrance,
    turn right into the second one. 33 Commercial Street will be right
    in front of you. You may go in via either the ramp or the door and
    three steps inside.
  o Inside. Up the stairs if via the door. Go through the glass
    doors - follow the diamonds on the floor. Go left at the last
    diamond. (Under a sign which says
    "<- Amoskeag Small Bus. Incubator").
  o More diamonds, another sign... much glass and office space for
    SNHU; turn left there, 4 more diamonds and you're at the glass
    doors for the Incubator. An "abi" sign is above.
  o Through the doors, straight down the hall. The ABI Conference
    Room is on the left.
________________________________________________________________________
$URL: svn://svn.in-spec-inc.com/isi/trunk/isi/opages/pysig.announcement $
$Id: pysig.announcement 1570 2007-10-23 02:44:57Z sconce $
$Rev: 1570 $
_______________________________________________
gnhlug-announce mailing list
gnhlug-announce@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-announce/