A patch for the Github centralization dilemma

Github 404

Github, with its 75,000,000 repositories, has become a central place for open source development and is well-known for having popularized Git among programmers and other code-hungry fellas. The irony is not lost on anyone that we are again relying on a centralized service for our decentralized Git workflow. And with any centralization comes the risk of putting too much power in the hands of just a few.

Of course, a central service such as Github has its benefits. We all know where to search for code. We all potentially know how the service works and can jump more quickly from one project to another. Third parties can even build upon this resource and push things in new directions, maybe attracting early adopters faster.

But… Centralized services can turn against you. They can censor and be censored. They can also disappear. Maybe Github will not disappear soon, but a user on Github could decide to delete all their repositories and there would not be much you could do about it. You don’t think that has happened? Check RGBDToolkit or Gravit, for example. (You’ll have to put those URLs in your preferred search engine to verify that I’m not bullshitting you and that these projects did exist on Github at some point.)

So, in order to restore balance in the force, I’ve decided to adopt a few habits that I want to share with you. They are not going to solve the centralization problem. But they can maybe provide some safeguards against the major risk exposed in the previous paragraph. These tricks apply to projects you have not created. For your own projects, it’s up to you to decide where you want to host them.

The solution I’m using is based on the mirror feature from Gitlab. Gitlab is an open source clone of Github. It provides the same functionalities, but you can install it on your own server. And many groups are running public instances across the web. Gitlab.com itself, as a company, develops the software and offers hosting of public and private repositories at the same address.

So now, every time I find a nice open source project on Github, especially one with few stars, forks or developers, I create a mirror of it in a public Gitlab repository. The advantage over just a git clone on my machine or elsewhere is that I’m not just creating a copy of the project at a certain time. The mirror feature keeps watching the original project and pulls all the changes that happen after I created the mirror. So I’m confident that, whatever happens to the original repository, all the history and changes will be saved elsewhere.
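For anyone without access to a Gitlab instance, roughly the same effect can be approximated by hand with plain Git. The sketch below is not Gitlab’s mirror feature. It is a stand-in that assumes `git` is installed, and the function names are made up for illustration. It builds a `git clone --mirror` command on the first run and a `git remote update --prune` command on every later run.

```python
import subprocess
from pathlib import Path


def mirror_commands(source_url: str, dest: Path) -> list[list[str]]:
    """Return the git commands needed to create or refresh a bare mirror."""
    if dest.exists():
        # Mirror already exists: fetch every ref and drop the ones deleted upstream.
        return [["git", "-C", str(dest), "remote", "update", "--prune"]]
    # First run: a bare clone that copies all branches, tags and other refs.
    return [["git", "clone", "--mirror", source_url, str(dest)]]


def sync_mirror(source_url: str, dest: Path) -> None:
    """Run the commands; raises CalledProcessError if git fails."""
    for command in mirror_commands(source_url, dest):
        subprocess.run(command, check=True)
```

Calling `sync_mirror(url, Path("project.git"))` from a scheduled job would keep the copy fresh, but unlike a hosted mirror it stays private until you push it somewhere public.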

Because those repositories are just backups, I also disable issue tracking, wikis and any other unnecessary feature that could mislead visitors. The point is not to divert development. There is also a clear mention that those are mirrors and a link back to the canonical repository.

So next time, instead of starring a project you like, mirror it. You’ll do everyone a favor. The ones I keep are here. But feel free to choose any other hosting service elsewhere. Let’s keep things distributed.

Git versioning and diff visualizing tools for designers

Git for Designers (1st slide)

Here is the video of my presentation at the Libre Graphics Meeting 2016, in London. For the most part, I describe my quest for a Git-based visualizing tool that could help designers integrate a version control workflow.

The slides are viewable from here. You can also download them from this Gitlab repo.

If you find this video interesting or want more in-depth information about the subject, please have a look at these detailed blog posts:

  1. Collaborative tools for designers – Part 1
  2. Collaborative tools for designers – Part 2 : Dropbox
  3. Collaborative tools for designers – Part 3: Pixelapse
  4. Collaborative tools for designers – Part 4
  5. Collaborative tools for designers – Part 5: Adobe Creative Cloud
  6. Collaborative tools for designers – Part 6: the Githosters
  7. Github, why u no show more media files?

 

Github, why u no show more media files?

Breakdown of media files on Github

Maybe you’ve noticed, it’s impossible to search for media files on Github. Searching Github is for code only. You might find references to media files in code, but no more. This is pretty annoying although understandable for two reasons:

  1. Github targets developers and, as such, focuses on tools that are relevant to them.
  2. The open source licenses that Github promotes for its public projects are maybe not always the most relevant or friendly ones when applied to media content. So, it’s just a supposition but, by preventing search for media files, Github avoids getting in trouble for actually hosting content that stands in the gray area of open source licensing.

Anyway, since I’m very interested in how designers are using Github for their projects, I conducted my own study and started indexing as many projects as I could, mainly storing references to the media files they contained. After more than 2 weeks of constantly querying their API (with a little help from my friend Olm–), I managed to store information from ~500,000 original public projects. That’s a little more than 1% of all the projects that exist on Github so far (44,444,444 at the time I’m writing this blog post).

1% is a pretty small number, but the API is limited to 5000 calls per hour. It would take me years to get the whole data, and certainly more as Github’s growth seems to be accelerating. But for the purpose of this study, it should be plenty. The goal is to get a sense of what’s popular. These 500,000 projects are also what I call “original”, which means they are not “forks” of other projects. So overall, it might represent more projects than this 1%.
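To put a rough number on “years”, here is a back-of-the-envelope sketch. It assumes the sample took exactly 14 days (the post only says “more than 2 weeks”), that the crawl speed stays constant, and that Github stops growing, so the result is a lower bound at best. The 546,574 figure is the exact project count quoted further down.

```python
HOURS_PER_DAY = 24

SAMPLED_PROJECTS = 546_574      # original projects indexed in the study
TOTAL_PROJECTS = 44_444_444     # projects on Github when the post was written
SAMPLE_DAYS = 14                # assumption: "more than 2 weeks" rounded down
RATE_LIMIT_PER_HOUR = 5_000     # API calls allowed per hour


def calls_available(days: float) -> float:
    """Upper bound on API calls that fit in the given number of days."""
    return days * HOURS_PER_DAY * RATE_LIMIT_PER_HOUR


def years_to_crawl_everything() -> float:
    """Scale the sampling time linearly to the whole of Github."""
    days = SAMPLE_DAYS * TOTAL_PROJECTS / SAMPLED_PROJECTS
    return days / 365


sample_share = SAMPLED_PROJECTS / TOTAL_PROJECTS
calls_per_project = calls_available(SAMPLE_DAYS) / SAMPLED_PROJECTS

print(f"sample share: {sample_share:.2%}")
print(f"calls per project (upper bound): {calls_per_project:.1f}")
print(f"years for a full crawl: {years_to_crawl_everything():.1f}")
```

Under those assumptions the sample is about 1.2% of Github, each project costs at most about three API calls, and a full crawl would take a bit over three years.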

Another disclaimer before getting into the data: when I say media files, I actually searched for files with certain extensions. I used a list of 210 popular and not so popular media file extensions, compiled with the help of Wikipedia and others. Again, a trade-off here due to time and space constraints. I could have missed some big ones that I never heard of. Although I hope it’s unlikely.
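Matching by extension can be sketched in a few lines. This is only an illustration of the idea, not the script used for the study: the extension set below is a short hypothetical excerpt rather than the real list of 210, and aliases such as `jpeg` and `jpg` are not merged.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical excerpt; the list used in the study had 210 extensions.
MEDIA_EXTENSIONS = {
    "png", "jpg", "gif", "svg", "pdf", "ttf", "woff",
    "psd", "ogg", "mp3", "wav", "stl", "obj",
}


def media_extension(path: str) -> Optional[str]:
    """Return the lower-cased media extension of a path, or None."""
    suffix = PurePosixPath(path).suffix.lstrip(".").lower()
    return suffix if suffix in MEDIA_EXTENSIONS else None


def count_media(paths: list[str]) -> Counter:
    """Count media files per extension in a list of repository paths."""
    counts: Counter = Counter()
    for path in paths:
        extension = media_extension(path)
        if extension is not None:
            counts[extension] += 1
    return counts
```

The obvious weakness is the one mentioned above: any format missing from the list is silently counted as “not media”.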

Ok, so with 1% of Github in my hands, it starts to be interesting to make assumptions about the big picture.

Out of the 546,574 projects, only 52,564 have been forked at least once. That’s barely 10%. But those 10% have produced 276,118 forks. So maybe overall 30% of Github is forks and 6% is original projects that have been forked. Yeah, open source is hard. The rest is empty projects (20% of the originals I downloaded), deleted ones and the occasional spam.
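The percentages above can be rechecked from the three raw counts. The only assumption in this sketch is that the sample’s “whole” is the originals plus the forks they produced, which ignores forks of forks and deleted projects.

```python
ORIGINALS = 546_574          # original (non-fork) projects in the sample
FORKED_ORIGINALS = 52_564    # originals forked at least once
FORKS = 276_118              # forks produced by those originals

# Assumption: every fork counts as one more project next to the originals.
total = ORIGINALS + FORKS

forked_among_originals = FORKED_ORIGINALS / ORIGINALS
forks_share = FORKS / total
forked_originals_share = FORKED_ORIGINALS / total
forks_per_forked_project = FORKS / FORKED_ORIGINALS

print(f"originals forked at least once: {forked_among_originals:.1%}")
print(f"forks among all projects: {forks_share:.1%}")
print(f"forked originals among all projects: {forked_originals_share:.1%}")
print(f"forks per forked project: {forks_per_forked_project:.1f}")
```

With these counts the fork share lands closer to a third than to 30%, and the forked originals at a bit over 6%, so the rough figures hold.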

Surprisingly, Github gets spammed, a little. And the not-super-smart spammers are just filling the description of projects with their trash content, which makes it easy for Github to spot, I guess. Why those spammy repositories are still available from the API is a wonder to me.

550,000 projects represent a total of 130,000,000 files, of which 12% are media files. Extrapolate this and Github might host more than 1.5 billion media files. Quite a resource if we could only search through it. Anyway, as expected, the most popular media files are PNG, JPG, GIF and SVG.
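The extrapolation is a single division, but the result depends a lot on which sample share goes in. The sketch below tries both the rounded 1% and the share computed from the project counts quoted earlier (546,574 out of 44,444,444). Treating that project ratio as the file ratio is an assumption, since forks and repository sizes are not accounted for.

```python
SAMPLE_FILES = 130_000_000        # files found in the sampled projects
MEDIA_PERCENT = 12                # share of those files that are media files
SAMPLED_PROJECTS = 546_574
TOTAL_PROJECTS = 44_444_444


def extrapolate(sample_count: float, sample_fraction: float) -> float:
    """Scale a count from the sample up to the whole population."""
    return sample_count / sample_fraction


# Integer math keeps the sample count free of floating-point noise.
media_in_sample = SAMPLE_FILES * MEDIA_PERCENT // 100

rounded_estimate = extrapolate(media_in_sample, 0.01)
exact_estimate = extrapolate(media_in_sample, SAMPLED_PROJECTS / TOTAL_PROJECTS)

print(f"media files in the sample: {media_in_sample:,}")
print(f"with a 1% sample: {rounded_estimate / 1e9:.2f} billion")
print(f"with the computed share: {exact_estimate / 1e9:.2f} billion")
```

The rounded 1% gives about 1.56 billion, while the computed share gives about 1.27 billion, so the order of magnitude is the part to trust.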

This is understandable as Github is the go-to place if you’re into web design, whether it’s JavaScript libraries, CSS frameworks or icon sets. Github also offers static website hosting that attracts a lot of people. But let’s have a deeper look at the “others”. What’s popular and how does it break down?

What’s interesting to see here is that after PDF, which Github allows you to view in the web interface, come two font formats (TTF and WOFF) that are also very popular with web designers but that, for some reason, Github does not display. Actually, the next format that Github offers a preview of comes in 11th position in this graph: the famous PSD. In between, we have many formats that could easily be previewed in a browser, but Github does not seem to care.

The little surprise here for me is the amount of OGG, MP3 and WAV files available. I certainly did not expect that. But seeing that the ASSET file type (a file format used in game design with Unity) is also quite popular, and considering that game development tools overlap with web development tools these days, all of this starts to make sense. Sound is an important part of any interactive experience, be it a web/app interface or a game. Again, these sound files could easily be previewed in a browser.

Lastly, let’s consider STL, the last file format displayed here (and 30th in position). It’s the common file format for exchanging object files used in 3D printing. Github has a preview for it and even shows some form of “3D diff” between commits. Great, but in 13th position, we have OBJ, also an open 3D format, that counts 5 times more files on Github than STL. To my knowledge, it’s not more complicated to display an OBJ file in the browser than an STL one. So what’s the logic here?

To wrap this up, Github could do so much more, with not so much effort, to allow previews in the browser of some important media file formats for designers. Maybe the “licensing” trouble described at the beginning is not a bad supposition after all. I’d certainly be happy to hear Github’s take on this. If you know anyone working there, thanks for forwarding these questions, and if anyone there is listening, I’d be pleased to dig more deeply into your data to better understand how designers (could) use your product.

Teasing: the most popular media file formats on Github

In my process of studying collaborative tools for designers, I took a deeper look at Github to find out how many media files are hosted there, of which type, etc. I’m just using the API provided by Github. No magic trick here. Although it’s a long process due to the API call limitations. There are 43,000,000 projects on Github. But I’m close to having gone over 1%, which is the lower limit I was aiming for before making any assumptions. So below is just a small infographic to tease you and make you impatient for the larger study I plan to release in a couple of days. Enjoy.

 

 

I also took the opportunity to test Infogr.am. Still not sure if I’ll use their service for the following article. Any suggestions or remarks?