There are NOT millions of Twitter users in China. Here's proof

Originally posted by Jason Q. Ng at Blocked on Weibo, republished with permission.

The question of how many Chinese Twitter users there are made headlines a few months back when the market research company GlobalWebIndex published results from a survey which claimed that 35 million people in China used Twitter. Media outlets ran with the story of a huge secret upswell in “free” netizens in China who climbed the Great Firewall to access blocked sites like Twitter, with the seeming implication being that revolución! was just around the corner. Social/human rights progress may indeed still take place in China in the near future, but most smart social media watchers agree it won’t be because of Twitter: Chinese folks just aren’t on the service in the same numbers that they are on local social media sites like Sina Weibo, RenRen, and even upstart mobile apps like WeChat/Weixin. People (and even companies in advertisements) don’t pass around their Twitter handles nearly as often as they share their Weibo contact info.

Even if our eyes told us that Twitter seemed to have attracted an active but small group of activists in China (but not many others in the country), was there a possibility that we were all missing something? Was there really a secret group of Chinese Twitter users being overlooked? Fortunately, after this week, I hope we can finally dismiss GWI’s 35 million number once and for all. Inspired by an SCMP story detailing the findings of the Chinese Twitter user @ooof (h/t Steven Millward of Tech In Asia), who cleverly used data on the website Twiyia.com to conclude that roughly 18,000 people who posted a tweet in Chinese selected Beijing as their home timezone, this weekend I performed a similar test on publicly available tweets using Twitter’s API. According to the data I extracted, there are most likely tens of thousands of Twitter users in China, not millions as claimed by GWI, a result that confirms @ooof’s finding. The exact numbers @ooof and I come up with may differ, and only Twitter itself could reveal how many Chinese Twitter users there actually are, but our independent results are likely within an order of magnitude of the actual number of Twitter users in China, unlike GWI’s result, which is about 2000 times greater than our calculations. The hard evidence backs up what our eyes are telling us.

If you’re interested in the technical details of how I performed this fairly rigorous (though certainly not at the level of an academic research paper) test, read on. (Apologies for the non-Weibo-related post; I hope it’s still relevant to those who read this blog.)

Data collection

According to the publicly available search results data from Twitter, nearly 44,000 users posted a message that Twitter classified as a Chinese language tweet during the 24 hour period between 12:38 AM EST Thursday, Jan 3rd and 12:38 AM EST Friday, Jan 4th. I arrived at this finding by utilizing Twitter’s search by language feature which you can access via their advanced search tool or simply using the search term operator “lang:zh”. Switch it over to realtime searches (if you’re more familiar with the Twitter API, essentially changing the result_type from “mixed” to “recent”) and you have a Twitter stream of all recently posted Chinese tweets—or at least what Twitter guesses is Chinese.
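The search described above boils down to two parameters: the “lang:zh” operator as the query and “recent” as the result type. Here is a minimal sketch of how such a request URL could be assembled; the base URL is a hypothetical placeholder rather than Twitter's real endpoint, and the function name is my own.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; substitute whichever search endpoint you have access to.
SEARCH_BASE_URL = "https://example.invalid/search.json"


def build_chinese_search_url(base_url: str = SEARCH_BASE_URL) -> str:
    """Build a search URL for recent tweets that Twitter classified as Chinese."""
    params = {
        "q": "lang:zh",           # language operator mentioned in the post
        "result_type": "recent",  # realtime results instead of the default "mixed"
    }
    return f"{base_url}?{urlencode(params)}"


print(build_chinese_search_url())
```

Note that the colon in “lang:zh” gets percent-encoded when it is placed in a query string.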

Twitter, like other folks (for instance, Google Chrome, which can detect if a webpage you are visiting is in a foreign language and will offer to translate it into your native language), utilizes an algorithm for guessing what language a tweet should be classified as. The algorithm is not infallible, and I noticed that a small percentage of tweets on Chinese Twitter users’ streams were being classified as Japanese. For instance, take someone who posts primarily in Chinese, like Michael Anti. If you examine his Twitter stream via the REST API [1] and look for the key “iso_language_code” you’ll see that the large majority of his posts are labeled as “zh”, which is the code for “zhongwen,” i.e. Chinese (中文), but as of right now, 7 of his last 100 posts are marked as Japanese (80 are marked as Chinese and 11 as English).
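Tallying the language labels on a batch of tweets is a one-liner once the tweets are parsed. Here is a rough sketch, assuming each tweet is a dictionary carrying the “iso_language_code” key; the sample records are made up for illustration and are not real tweets.

```python
from collections import Counter


def language_breakdown(tweets):
    """Count tweets per detected language code, most common first."""
    counts = Counter(t.get("iso_language_code", "unknown") for t in tweets)
    return counts.most_common()


# Made-up records for illustration only.
sample_tweets = [
    {"text": "tweet 1", "iso_language_code": "zh"},
    {"text": "tweet 2", "iso_language_code": "zh"},
    {"text": "tweet 3", "iso_language_code": "en"},
    {"text": "tweet 4", "iso_language_code": "zh"},
    {"text": "tweet 5", "iso_language_code": "ja"},
    {"text": "tweet 6", "iso_language_code": "en"},
]

for code, count in language_breakdown(sample_tweets):
    print(code, count)
```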

image

Obviously, because of the overlap in Chinese characters and Japanese kanji, this is bound to happen for just about any computer-based analyzer. [2] I thought about just doing a search for a whole host of common Chinese characters that are less commonly used in Japanese in order to get a more “pure” and inclusive list of Chinese language tweets, but what actually gets returned is a messy mix of Japanese and Chinese posts (and not even all Chinese posts, since some don’t include these characters), and for it to be useful you’d then have to develop your own tool for separating out the Japanese posts. Thus, for my purposes, which only require capturing something like 80+ percent of all the Chinese tweets, Twitter’s internal classification of what is Chinese is good enough (I’ll verify this in a moment).

Next was how to download these tweets that were marked as Chinese (the language—not as from China itself, that requires another step to be explained in a moment). Twitter has a wonderful API and a ton of developer documentation. If you have a question while creating a Twitter app, someone probably has already asked it and gotten a good answer. It’s a great community, but due to some very valid concerns (remember what-used-to-be the ever-so-common fail whale?…), there’s some fairly extreme rate limiting on accessing the search and timeline API. You can only hit Twitter’s server a certain number of times an hour before it cuts you off. Plus, I couldn’t figure out a way to have the REST search API return a list of all Chinese tweets without including a search term (I get the error “You must enter a query” when I drop the “q=”).[3] This caused me to use the public search widget mentioned above, which according to Twitter matches what you’d get from the REST version anyway.[4] The great thing about the search widget was that I didn’t experience a rate limit like I would have with the REST search API, allowing me to simply keep scrolling endlessly as long as I wished (until the browser crashed due to memory constraints). I put a paperweight on my keyboard’s page down button,[5] had lunch, and came back to copy the many thousands of Tweets now in my browser.

How many tweets exactly? 193,940. These 193,940 tweets were all the original Chinese-language tweets (native retweets[6] as well as, according to Twitter, messages detected as spam, were filtered out from this public search) posted between 12:38 AM EST Thursday, Jan 3rd and 12:38 AM EST Friday, Jan 4th and able to be found via the Twitter search API. Due to time limitations and a burning anxiety to get cracking, I only did a 24 hour period. If this were an academic paper or such, I would have captured a full week’s worth of tweets or possibly even more, but, well, I didn’t feel like waiting. According to @ooof’s graph, he used a whole month’s worth of tweets, which explains why his number of active users is more than mine.

An important note: these 193,940 tweets do not include every possible tweet that someone in China might have posted. Users who have made their tweets private obviously don’t have their posts show up in public search, nor did my method collect tweets from people posting in non-Chinese languages from China (thus, ex-pats in China, unless they write in Chinese, are not included in this data). But otherwise, it sure looks like everything: it even includes a Chinese-language tweet that I, a self-classified English-language user in an American timezone, sent to @ooof. But to more rigorously assess the public search’s performance, I again went back to Michael Anti’s timeline and looked at all 14 original tweets he made during my observation period. Of the 14, I found 11 in my downloaded data (and 1 more as an old-school retweet by someone else). I checked the 3 missing tweets and they are all listed as Chinese, so perhaps Twitter classified them as spam or simply didn’t capture them in the search; regardless, 11 out of 14 isn’t bad for my purposes, and, if I wanted, I could check other users’ timelines to see how many of their tweets were included in my download and adjust my numbers accordingly to account for those missing tweets. However, the takeaway is that the tweets I downloaded are, if not absolutely everything, then fairly close, and though any calculations I make might be off by some percentage, they are at least within the correct order of magnitude.
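The spot check above amounts to comparing two sets of tweet IDs: what a user's timeline shows versus what the search download contains. A small sketch of that comparison follows; the IDs are invented stand-ins chosen to reproduce the 11-out-of-14 result, and the function name is my own.

```python
def search_coverage(timeline_ids, downloaded_ids):
    """Return (fraction of timeline tweets found in the download, missing IDs)."""
    timeline = set(timeline_ids)
    if not timeline:
        raise ValueError("timeline_ids must not be empty")
    found = timeline & set(downloaded_ids)
    missing = sorted(timeline - found)
    return len(found) / len(timeline), missing


# Invented IDs: 14 timeline tweets, 11 of which appear in the download.
timeline_ids = list(range(1, 15))
downloaded_ids = list(range(1, 12)) + [101, 102, 103]

coverage, missing = search_coverage(timeline_ids, downloaded_ids)
print(f"coverage: {coverage:.1%}, missing: {missing}")
```

Repeating this for several users and averaging the coverage would give a rough correction factor for the tweets that the search never returned.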

Analysis

Having the set of all tweets during this 24-hour period, it was then trivial to extract all the unique usernames (because some users posted multiple tweets during that time period), leaving us with 43,784 users who posted something in Chinese. We can then use Twitter’s GET statuses/user_timeline to look up a user’s timezone, language setting, self-described location, and geo-coordinates (here’s what mine looks like) and use a JSON parser to extract the information cleanly.
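Both steps above are short in practice. The sketch below dedupes usernames and pulls the profile fields out of a timeline response. The field names (“from_user”, “time_zone”, “lang”, “location”, “geo”) are my assumptions about the JSON layout, and the sample response is made up for illustration.

```python
import json


def unique_users(tweets):
    """Return each username once, in order of first appearance."""
    return list(dict.fromkeys(t["from_user"] for t in tweets))


def profile_fields(timeline_json):
    """Extract profile settings from a user_timeline-style JSON response."""
    statuses = json.loads(timeline_json)
    if not statuses:
        return None  # user has no visible tweets
    latest = statuses[0]
    user = latest["user"]
    return {
        "time_zone": user.get("time_zone"),
        "lang": user.get("lang"),
        "location": user.get("location"),
        "geo": latest.get("geo"),
    }


# Made-up data for illustration only.
tweets = [{"from_user": "a"}, {"from_user": "b"}, {"from_user": "a"}]
sample_timeline = json.dumps([
    {"text": "hello", "geo": None,
     "user": {"time_zone": "Beijing", "lang": "en", "location": "In your HEAD"}}
])

print(unique_users(tweets))
print(profile_fields(sample_timeline))
```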

Due to rate limiting, it’s not feasible to check all 43,784 users, so I took every 73rd user (ordered by when they most recently made a post) to come up with a sample of 608 users. 165 were missing any timezone classification (two of them because they had switched to private mode, thus taking away access to their timezone info), comprising 27% of the sample, and 110 were listed as located in Beijing’s timezone,[7] 18% of the sample, numbers which largely mirror @ooof’s conclusion (see below table).
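Taking every k-th user from an ordered list is known as systematic sampling. Here is a minimal sketch of the idea on a synthetic list of 1,000 users; the list, the helper names, and the fake timezone assignment are illustrative only and are not my actual data.

```python
def systematic_sample(items, step, start=0):
    """Take every `step`-th item from an ordered list, beginning at `start`."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return items[start::step]


def share(sample, predicate):
    """Fraction of the sample for which `predicate` is true."""
    if not sample:
        raise ValueError("sample must not be empty")
    return sum(1 for item in sample if predicate(item)) / len(sample)


# Synthetic users: every fifth one is given a Beijing timezone.
users = [
    {"name": f"user{i}", "time_zone": "Beijing" if i % 5 == 0 else None}
    for i in range(1000)
]

sample = systematic_sample(users, step=73)
beijing_share = share(sample, lambda u: u["time_zone"] == "Beijing")
print(len(sample), f"{beijing_share:.1%}")
```

Because the list is ordered by most recent post rather than by anything related to location, stepping through it at a fixed interval should behave much like a random sample.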

image
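Scaling the sample counts up to the full set of 43,784 users is simple proportional arithmetic. The sketch below uses the figures from this post (608 sampled, 110 in Beijing's timezone, 165 with no timezone); only the function name is my own invention.

```python
def extrapolate(count_in_sample, sample_size, population):
    """Scale a sample count up to the whole population, rounded to whole users."""
    return round(count_in_sample / sample_size * population)


TOTAL_USERS = 43_784   # unique users who posted in Chinese
SAMPLE_SIZE = 608      # users whose profiles were checked

beijing_estimate = extrapolate(110, SAMPLE_SIZE, TOTAL_USERS)
missing_estimate = extrapolate(165, SAMPLE_SIZE, TOTAL_USERS)

print("Beijing timezone:", beijing_estimate)
print("No timezone set:", missing_estimate)
```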

If I extrapolate out those percentages to my total population of 43,784 users, I get roughly 12,000 missing and 8,000 in Beijing. Of course, this 8,000 is the least it could be; as mentioned, it doesn’t include those who set their accounts to private, doesn’t include folks who may have their timezone mistakenly set elsewhere, doesn’t include users who didn’t post in that 24 hour period (the 7,921 users behind that rounded 8,000 figure might be considered hardcore daily Tweeters), and may miss out on any users whose tweets accidentally were marked as spam or were not captured in Twitter’s search API.[8] All of those reasons explain why my number is likely an undercount of the total number of Chinese Twitter users, but as demonstrated previously, it likely isn’t off by a whole lot. The primary reason why my number is so much lower than @ooof’s is that his data collection period appears to have lasted for a month, and thus he captured the more casual Chinese Tweeter; otherwise, my percentages largely confirm his.[9] Here’s the more detailed breakdown of which timezone users reported themselves as being in:

As for the other data I collected on this sample, location info was largely useless since it is user-specified. If folks decided to enter anything at all, it sometimes came in the form of fake locations like “In your HEAD” and “On your bed.” Of the 364 who did supply a location, 40 contained either “China” or 中国, and if I had time, I could sift through the rest and try and figure out if they might also be candidates to be China-based users.
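The location check described above is a simple substring match. Here is a rough sketch, with invented sample strings apart from the two fake locations quoted in the post; a more thorough pass would also need to recognize city and province names.

```python
def mentions_china(location):
    """True if a free-text location contains "China" (any case) or 中国."""
    if not location:
        return False
    return "china" in location.lower() or "中国" in location


# Sample strings for illustration only.
locations = ["Beijing, China", "中国上海", "In your HEAD", "On your bed", "", None]

matches = [loc for loc in locations if mentions_china(loc)]
print(matches)
```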

Finally, I looked at the primary language a user specified in their settings, which looks like it suffers from a much greater than expected number of English language users, likely due to Twitter defaulting to English. I’m not certain how Twitter chooses your initial language, whether it’s always English unless you manually set it, or if it takes the language of the browser or perhaps your IP address (which perhaps redirects you to a location/language-specific signup page), but this data is flawed. Regardless, here’s a pie chart of the percentage of languages specified in the 608 person sample in case you’re curious.

image

Conclusion

I can’t conclusively say whether there are 10,000 or 18,000 Twitter users in China, but based on the data I pulled and the method I used to analyze it (and without knowing more, probably a method quite similar to what @ooof used), I can say conclusively that there are NOT 35 million Twitter users in China. If there were indeed that many, you’d see it in the quantity of Chinese-language tweets.[10] Looking at the Twitter stream, there just aren’t that many Chinese language tweets. However, despite the various limitations mentioned above in my data collection process (only one day, doesn’t include private accounts, doesn’t include non-Chinese language posts from China), the number of active Twitter users in China is almost definitely between 10,000 and 100,000, several orders of magnitude less than what GlobalWebIndex calculated from their social media in China survey.


Notes

[1] Version 1, which is apparently on its way to being mothballed in favor of 1.1 which will require authentication, so this link may not work in a couple months. ^

[2] Though based on what I’ve seen, Twitter’s algorithm, though serviceable, could definitely be improved. ^

[3] If someone knows what value to set q= to, by all means let me know on Twitter or via the contact form. Apparently if you have Firehose access, you don’t have to deal with rate limits. Also, if I’m reading things correctly, Twitter’s new streaming API supposedly lets developers hook into the public stream and just suck up tweets that match certain criteria, with a much greater range than the simple search API that I relied on. The search API, as Twitter warns, is not exhaustive, supposedly because spam messages and the like are filtered out (a rather good side effect of having to use the search API rather than the streaming API). As I don’t have access to the Firehose, which is apparently very hard to come by, and lacked the time to learn the streaming API, I went with the quick-and-dirty approach in this investigation. If this were for a research paper or something where I needed much more precision, the streaming API would certainly be the way to go, but as I mention later in the post, my method was for the most part good enough. Someone who has an extensive database of tweets, like the folks at Sysomos claim to have, could arrive at an even more precise number than we have. ^

[4] According to Twitter, this REST version of the search API is the exact same thing as what you’d get with the general search tool/widget: “The Search API (which also powers Twitter’s search widget) is an interface to this search engine.” ^

[5] I told you, not super scientific was I in this task, but this was by far the fastest way and didn’t sacrifice anything in the data collection. ^

[6] Native retweets are the ones where you just click the retweet button in Twitter and they appear instantly on your timeline with the other person’s profile photo. Old-school retweets, which are included in my set of downloaded tweets, are when you manually copy and paste a person’s tweet and add an RT in front. Excluding native retweets hopefully reduces the number of robot accounts which do nothing but aggressively retweet. ^

[7] My sample also had 3 users who selected Chongqing as their timezone. I grouped that into Beijing for the above pie chart, but broke it down in the table. ^

[8] So long as a user had even one tweet get listed in the Twitter search, they were included in my total of 43,784. If you wish to verify, check any user who made a Chinese post on Jan 3 and check to see if they are on this list. If not, do let me know. ^

[9] The only one where we differ greatly is Tokyo: his data concludes that under 1% reside there while mine puts it at over 3%. This could simply be a matter of our samples or something else; otherwise, everything else matches fairly well. ^

[10] If you search for all the English-language posts on Twitter the same way I did for Chinese, you’d have to scroll for a very, very long time before you even go back through a single minute’s worth of tweets. ^

Comments

If there were so many Twitter users, they could no longer be stopped. However a kind of passive resistance is already happening on Weibo, which is why there is no real need for something like twitter yet. My thoughts, http://www.thechinamogul.com
