There are NOT millions of Twitter users in China. Here's proof
Originally posted by Jason Q. Ng at Blocked on Weibo, republished with permission.
The question of how many Chinese Twitter users there are made headlines a few months back when the market research company GlobalWebIndex published results from a survey which claimed that 35 million people in China used Twitter. Media outlets ran with the story of a huge secret upswell in “free” netizens in China who climbed the Great Firewall to access blocked sites like Twitter, with the seeming implication that revolución! was just around the corner. Social/human rights progress may still indeed take place in China in the near future, but most smart social media watchers agree it won’t be because of Twitter: Chinese folks just aren’t on the service in the same numbers that they are on local social media sites like Sina Weibo and RenRen, or even upstart mobile apps like WeChat/Weixin. People (and even companies in advertisements) don’t pass around their Twitter handles with the same frequency as they share their Weibo contact info.
Even if our eyes told us that Twitter seemed to have attracted an active but small group of activists in China (but not many others in the country), was there a possibility that we were all missing something? Was there really a secret group of Chinese Twitter users being overlooked? Fortunately, after this week, I hope we can finally dismiss GWI’s 35 million number once and for all. Inspired by an SCMP story detailing the findings of the Chinese Twitter user @ooof (h/t Steven Millward of Tech In Asia), who cleverly used data on the website Twiyia.com to conclude that roughly 18,000 people who posted a tweet in Chinese selected Beijing as their home timezone, this weekend I performed a similar test on publicly available tweets, utilizing Twitter’s API. According to the data I extracted, there are most likely tens of thousands of Twitter users in China, not millions as claimed by GWI, a result that confirms @ooof’s finding. The exact numbers @ooof and I come up with may differ, and only Twitter itself could reveal how many Chinese Twitter users there actually are, but our independent results are likely within an order of magnitude of the actual number of Twitter users in China, unlike GWI’s result, which is about 2,000 times greater than our calculations. The hard evidence backs up what our eyes are telling us.
If you’re interested in the technical details of how I performed this fairly rigorous (though certainly not at the level of an academic research paper) test, read on. (Apologies for the non-Weibo-related post; I hope it’s still relevant to those who read this blog.)
Data collection
According to the publicly available search results data from Twitter, nearly 44,000 users posted a message that Twitter classified as a Chinese language tweet during the 24 hour period between 12:38 AM EST Thursday, Jan 3rd and 12:38 AM EST Friday, Jan 4th. I arrived at this finding by utilizing Twitter’s search by language feature which you can access via their advanced search tool or simply using the search term operator “lang:zh”. Switch it over to realtime searches (if you’re more familiar with the Twitter API, essentially changing the result_type from “mixed” to “recent”) and you have a Twitter stream of all recently posted Chinese tweets—or at least what Twitter guesses is Chinese.
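As a rough illustration of the two settings described above, here is a minimal sketch that builds a query string combining the “lang:zh” operator with result_type set to “recent”. The base URL is a made-up placeholder, not a real Twitter endpoint, and the function name is my own; only the operator and the result_type values come from the description above.

```python
from urllib.parse import urlencode

# Hypothetical placeholder address, NOT a real Twitter endpoint.
BASE_URL = "https://example.invalid/search.json"


def build_search_url(language_code: str, result_type: str = "recent") -> str:
    """Return a search URL asking for recent tweets in one language."""
    params = {
        "q": f"lang:{language_code}",  # search operator, e.g. lang:zh
        "result_type": result_type,     # "recent" instead of the default "mixed"
    }
    return f"{BASE_URL}?{urlencode(params)}"


chinese_search_url = build_search_url("zh")
print(chinese_search_url)
```

The colon in the operator gets percent-encoded when the query string is built, which is why it appears as %3A in the printed URL.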
Twitter, like other folks (for instance, Google Chrome, which can detect if a webpage you are visiting is in a foreign language and will offer to translate it into your native language), utilizes an algorithm for guessing what language a tweet should be classified as. The algorithm is not infallible, and I noticed that a small percentage of tweets on Chinese Twitter users’ streams were being classified as Japanese. For instance, take someone who posts primarily in Chinese, like Michael Anti. If you examine his Twitter stream via the REST API [1] and look for the key “iso_language_code” you’ll see that the large majority of his posts are labeled as “zh”, which is the code for “zhongwen,” i.e. Chinese (中文), but as of right now, 7 of his last 100 posts are marked as Japanese (80 are marked as Chinese and 11 as English).
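The tally described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tweets below are invented stand-ins, and the only detail taken from the API is the name of the language key.

```python
from collections import Counter

# Invented stand-ins for tweets returned by the API. Only the key name
# "iso_language_code" reflects the real response; the rest is made up.
sample_tweets = [
    {"text": "今天天气不错", "iso_language_code": "zh"},
    {"text": "我也这么觉得", "iso_language_code": "zh"},
    {"text": "中国人民", "iso_language_code": "ja"},  # a plausible misclassification
    {"text": "Good morning", "iso_language_code": "en"},
]


def tally_languages(tweets):
    """Count how many tweets carry each language label."""
    return Counter(t.get("iso_language_code", "unknown") for t in tweets)


language_counts = tally_languages(sample_tweets)
for code, count in language_counts.most_common():
    share = count / len(sample_tweets)
    print(f"{code}: {count} ({share:.0%})")
```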
Obviously, because of the overlap in Chinese characters and Japanese kanji, this is bound to happen for just about any computer-based analyzer. [2] I thought about just doing a search for a whole host of common Chinese characters that were less commonly used in Japanese in order to get a more “pure” and inclusive list of Chinese language tweets, for instance 是, 的, 好, 不, 我, 有, 小, 他, 也, 你, etc, but what actually gets returned is a messy mix of Japanese and Chinese posts (and not even all Chinese posts since some don’t include these words) and for it to be useful you’d then have to develop your own tool for separating out the Japanese posts. Thus, for my purposes—getting something like 80+ percent of all the Chinese tweets—Twitter’s internal classification of what is Chinese is good enough (I’ll verify this in a moment).
Next was how to download these tweets that were marked as Chinese (the language—not as from China itself, that requires another step to be explained in a moment). Twitter has a wonderful API and a ton of developer documentation. If you have a question while creating a Twitter app, someone probably has already asked it and gotten a good answer. It’s a great community, but due to some very valid concerns (remember what-used-to-be the ever-so-common fail whale?…), there’s some fairly extreme rate limiting on accessing the search and timeline API. You can only hit Twitter’s server a certain number of times an hour before it cuts you off. Plus, I couldn’t figure out a way to have the REST search API return a list of all Chinese tweets without including a search term (I get the error “You must enter a query” when I drop the “q=”).[3] This caused me to use the public search widget mentioned above, which according to Twitter matches what you’d get from the REST version anyway.[4] The great thing about the search widget was that I didn’t experience a rate limit like I would have with the REST search API, allowing me to simply keep scrolling endlessly as long as I wished (until the browser crashed due to memory constraints). I put a paperweight on my keyboard’s page down button,[5] had lunch, and came back to copy the many thousands of Tweets now in my browser.
How many tweets exactly? 193,940. These 193,940 tweets were all the original Chinese-language tweets (native retweets[6] as well as, according to Twitter, messages detected as spam, were filtered out from this public search) posted between 12:38 AM EST Thursday, Jan 3rd and 12:38 AM EST Friday, Jan 4th and able to be found via the Twitter search API. Due to time limitations and a burning anxiety to get cracking, I only did a 24 hour period. If this were an academic paper or such, I would have captured a full week’s worth of tweets or possibly even more, but, well, I didn’t feel like waiting. According to @ooof’s graph, he used a whole month’s worth of tweets, which explains why his number of active users is more than mine.
An important note: these 193,940 tweets do not include every possible tweet that someone in China might have posted. Users who have made their tweets private obviously don’t have their posts show up in public search, nor did my method collect tweets from people posting in non-Chinese languages from China (thus, ex-pats in China, unless they write in Chinese, are not included in this data). But otherwise, it sure looks like everything: it even includes a Chinese-language tweet that I, a self-classified English-language user in an American timezone, sent to @ooof. But to more rigorously assess the public search’s performance, I again went back to Michael Anti’s timeline and looked at all 14 original tweets he made during my observation period. Of the 14, I found 11 in my downloaded data (and 1 more as an old-school retweet by someone else). I checked the 3 missing tweets and they are all listed as Chinese, so perhaps Twitter classified them as spam or simply didn’t capture them in the search; regardless, 11 out of 14 isn’t bad for my purposes, and, if I wanted, I could check other users’ timelines to see how many of their tweets were included in my download and adjust my numbers to account for those missing tweets. However, the takeaway is that the tweets I downloaded are, if not absolutely everything, then fairly close, and though any calculations I make might be off by some percentage, they are at least within the correct order of magnitude.
Analysis
Having the set of all tweets during this 24-hour period, it was then trivial to extract all the unique usernames (because some users posted multiple tweets during that time period), leaving us with 43,784 users who posted something in Chinese. We can then use Twitter’s GET statuses/user_timeline to look up a user’s timezone, language setting, self-described location, and geo-coordinates (here’s what mine looks like) and use a JSON parser to extract the information cleanly.
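For the curious, the de-duplication step can be sketched as follows. The usernames and times are invented, and the (username, time) pair layout is an assumption about how the copied tweets might be stored, not a description of my actual files.

```python
# Invented (username, time posted) pairs, newest first, standing in for
# the downloaded tweets. The real data had 193,940 rows.
downloaded_tweets = [
    ("user_c", "23:50"),
    ("user_a", "23:10"),
    ("user_c", "18:00"),
    ("user_b", "12:45"),
    ("user_a", "09:15"),
]


def unique_users_newest_first(tweets):
    """Return each username once, ordered by that user's most recent tweet."""
    # A dict keeps insertion order, so the first sighting of a username in a
    # newest-first list is also that user's most recent post.
    return list(dict.fromkeys(username for username, _ in tweets))


unique_users = unique_users_newest_first(downloaded_tweets)
print(unique_users)
print(len(unique_users), "unique users from", len(downloaded_tweets), "tweets")
```

Keeping the newest-first order, rather than using a plain set, preserves an ordering by most recent post that can be reused when sampling.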
Due to rate limiting, it’s not feasible to check all 43,784 users, so I took every 73rd user (ordered by when they most recently made a post) to come up with a sample of 608 users. 165 were missing any timezone classification (two of them because they had switched to private mode, thus taking away access to their timezone info), comprising 27% of the sample, and 110 were listed as located in Beijing’s timezone,[7] 18% of the sample, numbers which largely mirror @ooof’s conclusion (see below table).
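A minimal sketch of the sampling and the arithmetic behind these figures, using only the counts reported above. The toy user list and the choice to start counting at the first user are assumptions for illustration; the helper name is my own.

```python
def systematic_sample(items, step, start=0):
    """Take every step-th item, beginning at index start (assumed to be 0)."""
    return items[start::step]


# Toy population, used only to show the sampling step itself.
toy_users = [f"user_{i}" for i in range(10)]
toy_sample = systematic_sample(toy_users, step=3)
print(toy_sample)

# Counts reported in the post.
TOTAL_USERS = 43_784
SAMPLE_SIZE = 608
sample_counts = {"no timezone": 165, "Beijing": 110}

# Scale each sample share up to the full set of users.
estimates = {}
for label, count in sample_counts.items():
    share = count / SAMPLE_SIZE
    estimates[label] = round(share * TOTAL_USERS)
    print(f"{label}: {share:.0%} of sample, about {estimates[label]:,} users")
```

Scaling the sample shares this way gives about 11,882 users with no timezone and about 7,921 in Beijing’s timezone.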
If I extrapolate out those percentages to my total population of 43,784 users, I get roughly 12,000 missing and 8,000 in Beijing. Of course, this 8,000 is the least it could be; as mentioned, it doesn’t include those who set their accounts to private, doesn’t include folks who may have their timezone mistakenly set elsewhere, doesn’t include users who didn’t post in that 24 hour period (these 7,921, the unrounded figure, might be considered hardcore daily Tweeters), and may miss any users whose tweets accidentally were marked as spam or were not captured in Twitter’s search API.[8] All of those reasons explain why my number is likely an undercount of the total number of Chinese Twitter users, but as demonstrated previously, it likely isn’t off by a whole lot. The primary reason my number is so much lower than @ooof’s is that his data collection period appears to have lasted for a month, and thus he captured the more casual Chinese Tweeter; otherwise, my percentages largely confirm his.[9] Here’s the more detailed breakdown of which timezone users reported themselves as being in:
As for the other data I collected on this sample, location info was largely useless since it is user-specified. If folks decided to enter anything at all, it sometimes came in the form of fake locations like “In your HEAD” and “On your bed.” Of the 364 who did supply a location, 40 contained either “China” or 中国, and if I had time, I could sift through the rest and try and figure out if they might also be candidates to be China-based users.
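The location check can be sketched like this. The sample strings are invented apart from the two joke locations quoted above, and matching “China” case-insensitively is an assumption on my part rather than a record of exactly how I searched.

```python
# Markers to look for; "china" is compared against a lowercased location.
CHINA_MARKERS = ("china", "中国")


def mentions_china(location: str) -> bool:
    """True if a self-described location contains "China" or 中国."""
    lowered = location.lower()
    return any(marker in lowered for marker in CHINA_MARKERS)


# Invented examples, plus the two joke locations quoted in the post.
sample_locations = ["Beijing, China", "中国上海", "In your HEAD", "On your bed", ""]
china_matches = [loc for loc in sample_locations if mentions_china(loc)]
print(china_matches)
```

A plain substring test like this misses users who name only a city, such as “Shanghai”, which is why the remaining locations would still need to be sifted by hand.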
Finally, I looked at the primary language a user specified in their settings, which appears to suffer from a much greater than expected number of English language users, likely due to Twitter defaulting to English. I’m not certain how Twitter chooses your initial language, whether it’s always English unless you manually set it, or if it takes the language of the browser or perhaps your IP address (which perhaps redirects you to a location/language-specific signup page), but this data is flawed. Regardless, here’s a pie chart of the percentage of languages specified in the 608 person sample in case you’re curious.
Conclusion
I can’t conclusively say whether there are 10,000 or 18,000 Twitter users in China, but based on the data I pulled and the method I used to analyze it (and without knowing more, probably a method quite similar to what @ooof used), I can say conclusively that there are NOT 35 million Twitter users in China. If there were indeed that many, you’d see it in the quantity of Chinese-language tweets.[10] Looking at the Twitter stream, there just aren’t that many Chinese language tweets. Even with the various limitations in my data collection process mentioned above (only one day, doesn’t include private accounts, doesn’t include non-Chinese language posts from China), the number of active Twitter users in China is almost definitely between 10,000 and 100,000, several orders of magnitude less than what GlobalWebIndex calculated from their social media in China survey.
[1] Version 1, which is apparently on its way to being mothballed in favor of 1.1 which will require authentication, so this link may not work in a couple months. ^
[2] Though based on what I’ve seen, Twitter’s algorithm, though serviceable, could definitely be improved. ^
[3] If someone knows what value to set q= to, by all means let me know on Twitter or via the contact form. Apparently if you have Firehose access, you don’t have to deal with rate limits. Also, if I’m reading things correctly, Twitter’s new streaming API supposedly lets developers hook into the public stream and just suck up tweets that match certain criteria, with a much greater range than the simple search API that I relied on. The search API, as Twitter warns, is not exhaustive, and supposedly filters out spam messages and the like (a rather good side effect of having to use the search API rather than the streaming API). As I don’t have access to the former, which is apparently very hard to come by, and lacked the time to learn the latter, I went with the quick-and-dirty approach in this investigation. If this were for a research paper or something where I needed much more precision, certainly, the streaming API would be the way to go, but as I mention later in the post, my method was for the most part good enough. Someone with an extensive database of tweets, like the one the folks at Sysomos claim to have, could arrive at an even more precise number than we have. ^
[4] According to Twitter, this REST version of the search API is the exact same thing as what you’d get with the general search tool/widget: “The Search API (which also powers Twitter’s search widget) is an interface to this search engine.” ^
[5] I told you, not super scientific was I in this task, but this was by far the fastest way and didn’t sacrifice anything in the data collection. ^
[6] Native retweets are the ones where you just click the retweet button in Twitter and they appear instantly on your timeline with the other person’s profile photo. Old-school retweets, which are included in my set of downloaded tweets, are when you manually copy and paste a person’s tweet and put an RT in front. Excluding native retweets hopefully reduces the number of robot accounts which do nothing but aggressively retweet. ^
[7] My sample also had 3 users who selected Chongqing as their timezone. I grouped that into Beijing for the above pie chart, but broke it down in the table. ^
[8] So long as a user had even one tweet get listed in the Twitter search, they were included in my total of 43,784. If you wish to verify, check any user who made a Chinese post on Jan 3 and check to see if they are on this list. If not, do let me know. ^
[9] The only one where we differ greatly is Tokyo: his data concludes that under 1% reside there while mine puts it at over 3%. This could simply be a matter of our samples or something else; otherwise, everything else matches fairly well. ^
[10] If you search for all the English-language posts on Twitter the same way I did for Chinese, you’d have to scroll for a very, very long time before you even go back through a single minute’s worth of tweets. ^
Comments
If there were so many Twitter users, they could no longer be stopped. However a kind of passive resistance is already happening on Weibo, which is why there is no real need for something like twitter yet. My thoughts, http://www.thechinamogul.com