February 4th, 2012 | Tags:

For personal reasons, I needed to export some of my DMs (tens of thousands of lines) from Sina Weibo (AKA the Chinese Twitter). Weibo, the major NON-international micro-blog platform, has more than 250M users (including DUMMY/ROBOT/BRAIN-EATING users) now. Chinese web companies like this, relying mostly on marketing and a good relationship with the government, generally don’t put much heart into user experience. It’s even worse for power users, and opting out is much harder than opting in, so obviously there is no official export button to expect. Google only turned up a few related links, none of which deal with DMs, so I had to fight on my own.

The first thing that came to mind was the official API. It does have some DM-related calls, but they can only get the “latest” message list, and I’m not sure whether it’s possible to call them multiple times to pull everything out. I just gave up, since to use the API you have to register as a developer (with your real name!) and wait for Sina’s authorization. Too hard to trust Sina. Next.

Then the brute-force solution: use a macro recorder. A quick Google search showed that web scraping is so popular that there is a whole software category called Internet macro recorders. I tried one of them, but the trial version was too stupid to know how to click the next-page button and then save the HTML (which would still need to be parsed later). Too risky to buy the full version, in case it was still just as stupid. Next.

Then Python’s urllib, with which I had written something to fetch the singer list from Baidu Music some time ago. However, even after stealing the login dark magic from a robot script and successfully downloading the message history HTML, I ended up with a bunch of more dark magic: <script>…</script>. Oh my, I should have known that the good old days of static HTML are long gone with the wind. Sadly, next.
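To make the problem concrete, here is a minimal sketch (Python 3, standard library only) of how one could check whether a downloaded page carries its content as visible text or only inside <script> tags. The SAMPLE string and the names TextVsScript and summarize are made up for illustration; this is not Weibo's real markup.

```python
from html.parser import HTMLParser


class TextVsScript(HTMLParser):
    """Count visible text characters versus characters inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.visible_chars = 0
        self.script_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Whitespace-only chunks between tags count as zero after strip().
        text = data.strip()
        if self.in_script:
            self.script_chars += len(text)
        else:
            self.visible_chars += len(text)


def summarize(html):
    """Return (visible_chars, script_chars) for an HTML string."""
    parser = TextVsScript()
    parser.feed(html)
    parser.close()
    return parser.visible_chars, parser.script_chars


# Hypothetical stand-in for a page that builds its content in JavaScript.
SAMPLE = """<html><body>
<div id="messages"></div>
<script>render({"msg": "hello"});</script>
</body></html>"""

visible, scripted = summarize(SAMPLE)
print("visible text chars:", visible)
print("chars inside <script>:", scripted)
```

If the visible count is near zero while the script count is large, a plain HTTP fetch will not be enough and something has to actually run the JavaScript.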

Here comes the final shot. I asked the omnipotent Stack Overflow and got several suggestions. One was to use the Python binding of Qt’s WebKit module (QtWebKit), which looks pretty promising but has a steep learning curve: hundreds of Qt APIs plus their Python versions. Another was the JavaScript-based PhantomJS, but I’m not familiar with JavaScript and know nothing about its environment.

Finally I decided to try Selenium, a “browser automation framework” according to its website. It looks like these tools came out of the furious browser war stirred up by Google; Chrome uses lots of automated tests to torture the browser with various websites. Back to Selenium: it’s a really powerful tool which can easily find all kinds of DOM elements and manipulate them. In my case, it just goes to the message history page and clicks next, next, next, until there is no next (or previous, if you want chronological order). The script should be pretty straightforward to understand with the comments.

# -*- coding: utf-8 -*-
# A sina weibo DM export tool
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
import time
import codecs
 
f = codecs.open('weibo.txt', encoding='utf-8', mode='w+')
 
# Create a new instance of the Chrome
driver = webdriver.Chrome()
# Set 15 sec as default timeout (maximum waiting time if something can't be found)
driver.implicitly_wait(15)
# go to the direct message history page (for DM with one user)
driver.get("http://weibo.com/message/history?uid=xxxxxxxxxx")
 
# Find loginname input box
loginnameInput = driver.find_element_by_id("loginname")
loginnameInput.send_keys("me@mydomain.com")
# Find password input box
passwdInput = driver.find_element_by_id("password")
passwdInput.send_keys("mypasswd")
# Find the submit button
submitButton = driver.find_element_by_id("login_submit_btn")
# Submit
submitButton.click()
 
n = 1
more = 1
while more:
    # Find message box
    messages = driver.find_elements_by_class_name("txt")
    # Find time tag box
    ts = driver.find_elements_by_css_selector("em.W_textb.date")
    f.write(ts[-1].text + "\n")
    for msg in reversed(messages):
        if (msg.text != ''):
            f.write(msg.text + "\n")
    f.flush()
    buttons = driver.find_elements_by_class_name("W_btn_a")
    more = 0
    for button in buttons:
        # Next page or previous page
        if button.text == u'上一页':
            more = 1
            break
    if more:
        n += 1
        print 'Page %d' % n
        button.click()
        time.sleep(2)
f.close()
print 'All Done!'
driver.quit()

I’m using Chrome as the webdriver so I can easily hit F12 or “inspect element” to fire up the debug console and reveal anything on the page. The script could be polished to support incremental export and to auto-detect users. But so far so good, so I just claim victory and leave it as it is.
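One possible polish: the script imports WebDriverWait but never uses it, and pauses with a fixed time.sleep(2) after each click. The idea behind an explicit wait is to poll a condition until it holds or a timeout expires. Below is a minimal sketch of that idea in plain Python 3; wait_until and page_loaded are hypothetical names for illustration, not Selenium calls. In Selenium itself the equivalent would be WebDriverWait(driver, timeout).until(condition).

```python
import time


def wait_until(predicate, timeout=15.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError if time runs out.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Hypothetical stand-in for "the next page has finished loading":
# it becomes true on the third poll.
state = {"polls": 0}


def page_loaded():
    state["polls"] += 1
    return state["polls"] >= 3


wait_until(page_loaded, timeout=2.0, interval=0.01)
print("ready after", state["polls"], "polls")
```

Compared with a fixed sleep, this returns as soon as the condition holds and fails loudly when the page never shows up, instead of silently writing an incomplete export.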

March 14th, 2010 | Tags:

I used to have some elisp code, stolen from somewhere, which copies the current line. Then I found that most of the time I don’t want the leading whitespace copied, so I modified it a bit as a tiny attempt to cure my lisp-parenthesis-horror.

(defun copy-line (&optional arg)
       "Save the current line, from its first non-whitespace character, into the kill ring without marking the line."
       (interactive "P")
       ;; save-excursion restores point afterwards; back-to-indentation alone would move the cursor
       (let ((beg (save-excursion (back-to-indentation) (point)))
             (end (line-end-position)))
         (copy-region-as-kill beg end))
       )
(global-set-key (kbd "C-c l") (quote copy-line))

It seems that lisp is not as daunting as haskell…

June 12th, 2008 | Tags:
May 3rd, 2008 | Tags:

I have to admit that when I first saw this quote on Ted’s blog, in the post on Organic vs. Non-organic Open Source, I was quite confused (another GRE reading comprehension?). So I followed the link and read the original text. It’s a long email from 1996 from Bryan Cantrill, a Solaris engineer, to David Miller, a core developer working on the SPARC-related parts of the Linux kernel. Most of the email is quoted text in which David talks in detail about some kernel mechanisms of Solaris that are inefficient compared with Linux. Then, at the end of the email, comes the most amazing part, unquoted, from Sun-based Bryan:

Have you ever kissed a girl?

What the heck!

Anyway, it’s an old story that happened 12 years ago. Two(?) years ago, Sun published OpenSolaris. Four months ago, they acquired MySQL, one of the most successful community-driven open source projects. However, as a commercial company, Sun doesn’t seem to truly embrace the open source camp; it’s more like they just want to take some advantage of it. OpenSolaris is slow in development, and there are complaints. For MySQL, they announced that some of the new features will not be available in the open source version. Open source people are suspicious about Sun’s intent. Personally I think Sun will not get what they want unless they change their long-closed mindset.

These days, open source projects like the Linux kernel, Apache, Firefox and MySQL are everywhere. People use them all the time (say, when browsing the internet) even if they are not aware of their existence. Of all these great projects, I like Qt a lot, powered by Trolltech and of course the KDE community. I see Qt as a successful business model that makes everybody happy: a company that employs people to improve the code constantly, a bunch of developers from the open source community who contribute and give feedback and suggestions, and end users who get the most out of the combination of the two. There have always been suspicions about Qt’s licensing (that’s why there is Gnome ;) ), especially after Trolltech was acquired by Nokia. But it seems they handle these quite well. As a result of the effort from both the commercial company and the open source community, KDE 4.1 is coming very soon with bleeding-edge technology. And thanks to it, I got a slot in this year’s Google Summer of Code to improve KGet, an open source downloader based on KDE/Qt.

Economically, open source projects actually introduce more competition into the market, because once the source code is available to everybody, companies that want to make money have to concentrate on service. End users immediately have a bunch of choices instead of being bound to the sole company owning the source (M$?). They can even hire their own people to work on it if there is a need! More competition may mean lower cost, which is a good thing.

There are also people who disagree with the open source trend. The professor who taught me advanced C++ once gave his opinion on it in class. He saw open source as something directly opposed to the get-the-most-money-you-can motto held by commercial companies. He even had a rather conspiracy-theory-like opinion that industrial giants like IBM and Novell are backing open source projects merely as a weapon to undermine Microsoft. In the end he concluded that, since so many smart people are contributing their brain power to open source, it’s hard to foresee what will happen.

Having said all this, as a non-native speaker of English, I would still like to know the meaning between the lines of the sentence ‘Have you ever kissed a girl?’ in that context. Is it something like ‘You nerd, have you ever had a life?’

Comments are most welcome.

April 17th, 2008 | Tags:

This will be my second blog, which is in English.
