For personal reasons, I needed to export some of my DMs (tens of thousands of lines) from Sina Weibo (AKA the Chinese Twitter). Weibo, the major NON-international micro-blog platform, has more than 250M users (including DUMMY/ROBOT/BRAIN-EATING users) now. Chinese web companies like this, relying mostly on marketing effort and a good relationship with the government, generally don’t put much heart into the user experience. It’s even worse for power users, and opting out is always much harder than opting in, so obviously there is no official export button to expect. Google only turned up a few related links, none of which deal with DMs, so I had to fight it out myself.
The first thing that came to mind was the official API. It does have some DM-related calls, but they only return the “latest” message list, and I’m not sure whether it’s possible to call them repeatedly to pull everything out. I just gave up, since to use the API you have to register as a developer (with your real name!) and wait for Sina’s authorization. Too hard to trust Sina. Next.
Then the brute-force solution: use a macro recorder. A quick search shows that web scraping is so popular that there is a whole software category called internet macro recorders. I tried one of them, but the trial version was too stupid to know how to click the next-page button and then save the HTML (which would still need to be parsed later). Too risky to buy the full version, in case it’s still just as stupid. Next.
Then Python’s urllib, with which I’d written something to fetch the singer list from Baidu Music some time ago. However, even after stealing the login dark magic from a robot script and successfully downloading the message history HTML, I ended up with a bunch of even darker magic inside <script>…</script> blocks. Oh my, I should have known that the good old days of static HTML are long gone with the wind. Sadly, next.
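A quick way to confirm that a downloaded page is script-rendered, before wasting time trying to parse it, is to measure how much of its text sits inside `<script>` tags. The sketch below uses only the standard-library `html.parser`. The helper name `script_share` and both sample pages are made up for illustration and are not real Weibo markup.

```python
from html.parser import HTMLParser


class ScriptShareParser(HTMLParser):
    """Tally characters found inside <script> tags versus everywhere else."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_chars = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_script:
            self.script_chars += len(text)
        else:
            self.visible_chars += len(text)


def script_share(html):
    """Return the fraction of text characters that sit inside <script> tags."""
    parser = ScriptShareParser()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.visible_chars
    return parser.script_chars / total if total else 0.0


# Hypothetical sample pages, invented for this example.
STATIC_PAGE = "<html><body><p class='txt'>hello there</p></body></html>"
SCRIPTED_PAGE = (
    "<html><body><script>render({\"html\": \"hello there\"});</script>"
    "</body></html>"
)

if __name__ == "__main__":
    print(script_share(STATIC_PAGE))    # all text is plain markup
    print(script_share(SCRIPTED_PAGE))  # all text is hidden in a script
```

A share close to 1 suggests the content only shows up after a browser runs the scripts, which is the situation that makes a plain urllib download a dead end.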
Here comes the final shot. I asked the omnipotent Stack Overflow and got several suggestions. One is to use the Python bindings for Qt’s WebKit module (QtWebKit), which look pretty promising but come with a steep learning curve: hundreds of Qt APIs plus their Python versions. Another is the JavaScript-based PhantomJS, but I’m not familiar with JavaScript and know nothing about its environment.
Finally I decided to try Selenium, a “browser automation framework” according to its website. It looks like these tools came out of the furious browser war stirred up by Google; Chrome uses lots of automated tests to torture the browser against various websites. Back to Selenium: it’s a really powerful tool that can easily find all kinds of DOM elements and manipulate them. In my case, it just goes to the message history page and clicks next, next, next, until there is no next (or previous, if you want chronological order). The script should be pretty straightforward to understand with the comments.
```python
# -*- coding: utf-8 -*-
# A sina weibo DM export tool

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
import time
import codecs

f = codecs.open('weibo.txt', encoding='utf-8', mode='w+')

# Create a new instance of the Chrome
driver = webdriver.Chrome()
# Set 15 sec as default timeout (maximum waiting time if something can't be found)
driver.implicitly_wait(15)

# go to the direct message history page (for DM with one user)
driver.get("http://weibo.com/message/history?uid=xxxxxxxxxx")

# Find loginname input box
loginnameInput = driver.find_element_by_id("loginname")
loginnameInput.send_keys("me@mydomain.com")

# Find password input box
passwdInput = driver.find_element_by_id("password")
passwdInput.send_keys("mypasswd")

# Find the submit button
submitButton = driver.find_element_by_id("login_submit_btn")
# Submit
submitButton.click()

n = 1
more = 1
while more:
    # Find message box
    messages = driver.find_elements_by_class_name("txt")
    # Find time tag box
    ts = driver.find_elements_by_css_selector("em.W_textb.date")
    f.write(ts[-1].text + "\n")
    for msg in reversed(messages):
        if (msg.text != ''):
            f.write(msg.text + "\n")
    f.flush()

    buttons = driver.find_elements_by_class_name("W_btn_a")
    more = 0
    for button in buttons:
        # Next page or previous page
        if button.text == u'上一页':
            more = 1
            break
    if more:
        n += 1
        print 'Page %d' % n
        button.click()
        time.sleep(2)

f.close()
print 'All Done!'
driver.quit()
```
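The script above is written for Python 2 and the old `find_element_by_*` helpers, which later Selenium 4 releases removed in favour of `find_elements(by, value)`. As a rough sketch only, the paging loop could be rewritten for Python 3 along these lines. The function name `export_history` and its `pause` parameter are my own inventions, the class names and the button label are copied from the script above, and the locator strings are the values behind Selenium's `By.CLASS_NAME` and `By.CSS_SELECTOR` constants, spelled out so the function itself has no import-time dependency on Selenium.

```python
import time

# Locator strategy strings (the values of By.CLASS_NAME / By.CSS_SELECTOR).
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
PREVIOUS_PAGE = "上一页"  # button label on the page, Chinese for "previous page"


def export_history(driver, out, pause=2.0):
    """Write every page of the open message history to `out`.

    `driver` is assumed to be already logged in and sitting on the
    message history page. Returns the number of pages written.
    """
    pages = 0
    while True:
        pages += 1
        # One timestamp per page, as in the original script. Unlike the
        # original, an empty result is skipped instead of raising.
        stamps = driver.find_elements(CSS_SELECTOR, "em.W_textb.date")
        if stamps:
            out.write(stamps[-1].text + "\n")
        # Messages are written in reverse page order, skipping empty boxes.
        for msg in reversed(driver.find_elements(CLASS_NAME, "txt")):
            if msg.text:
                out.write(msg.text + "\n")
        # Look for the paging button; stop when it is gone.
        target = None
        for button in driver.find_elements(CLASS_NAME, "W_btn_a"):
            if button.text == PREVIOUS_PAGE:
                target = button
                break
        if target is None:
            return pages
        target.click()
        time.sleep(pause)  # crude wait for the next page to load
```

With a real browser, the idea would be to create the driver with `webdriver.Chrome()`, log in as in the original script, open `weibo.txt` with `encoding='utf-8'`, and pass both to `export_history`. Whether the page still uses these class names is something I have not checked.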
I’m using Chrome as the webdriver so I can just hit F12 or “inspect element” to fire up the developer console and reveal anything with ease. The script could be polished to support incremental export and to auto-detect users. But so far so good, so I’ll just claim victory and leave it as it is.
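For the incremental export idea, one very simple approach (a sketch, not something the script above does) would be to append only the lines that are not already in the output file. The helper name `append_new_lines` is hypothetical. Matching on text alone is a real limitation: two genuinely identical messages would collapse into one, so a proper version would need to key on something like a per-message timestamp.

```python
import os


def append_new_lines(path, lines):
    """Append only the lines not already present in `path`.

    Returns how many lines were added. Lines are compared by exact text,
    so repeated identical messages are stored only once.
    """
    seen = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            seen = set(line.rstrip("\n") for line in f)
    added = 0
    with open(path, mode="a", encoding="utf-8") as f:
        for line in lines:
            if line not in seen:
                f.write(line + "\n")
                seen.add(line)
                added += 1
    return added
```

Running the export a second time would then add only the messages that arrived since the last run, at the cost of re-reading the whole file each time.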
