Maybe you’ve noticed, it’s impossible to search for media files on Github. Searching Github is for code only. You might find references to media files in code, but no more. This is pretty annoying although understandable for two reasons:
- Github targets developers and, as such, focuses on tools that are relevant to them.
- The open source licenses that Github promotes for its public projects are maybe not always the most relevant or friendly ones when applied to media content. So, it’s just a supposition but, by preventing search for media files, Github avoids getting in trouble for actually hosting content that stands in a the gray area of open source licensing.
Anyway, since I’m very interested in how designers are using Github for their projects, I conducted my own study and started indexing as many projects as I could, mainly storing references to the media files they contained. And after more than 2 weeks of constant querying their API − with a little help of my friend Olm– − I managed to store information from ~500.000 original public projects. That’s a little more than 1% of all the projects that exists on Github so far (44,444,444 at the time I’m writing this blog post).
1% is a pretty small number, but the API is limited to 5000 calls per hour. It would take me years to get the whole data and certainly more as Github growth seems accelerating. But for the purpose of this study, it should be pretty enough. The goal is to get a sense of what’s popular. These 500.000 projects are also what I call “original”, which means they are not “forks” of other projects. So it overall might represents more projects than this 1%.
Another disclaimer before getting into the data, when I say media files, I actually searched for files with certain extensions. I used a list of 210 popular and not so popular media file extensions, compiled with the help of Wikipedia and others. Again, a trade off here due to time and space constraints. I could have missed some big ones that I never heard of. Although I hope its unlikely.
Ok, so with 1% of Github in my hands, it’s starts to be interesting to make assumptions about the big picture.
Out of the 546,574 projects, only 52,564 have been forked at least once. That’s barely 10%. But those 10% have produced 276,118 forks. So maybe overall 30% of Github is forks and 6% is original projects that have been forked. Yeah, open source is hard. The rest is empty projects (20% of the originals I downloaded), deleted ones and the occasional spam.
Surprisingly, Github gets spammed, a little. And the not-super-smart spammers are just filling the description of projects with their trash content, which makes it easy for Github to spot, I guess. Why are those spammy repositories still available from the API is a wonder to me.
550,000 projects represents a total of 130,000,000 files of which 12% are media files. Extrapolate this and Gihtub might host more than 1,5 billion media files. Quite a resource if we could only search through it. Anyway, as expected, the most popular media files are the PNG, JPG, GIF and SVG.
What’s interesting to see here is that after PDF, which Github allows you to view in the web interface, comes two font formats (TTF and WOFF) that are very popular with web designers also, but for some reason, Github is not displaying. Actually, the next format that Github offers a preview of comes on the 11th position in this graph, the famous PSD. In between, we have many formats that could easily be previewed in a browser, but Github does not seem to care.
The little surprise here for me is the amount of OGG, MP3 and WAV files available. I certainly did not expect that. Seeing also that the ASSET file type is quite popular (a file format used in game design with Unity) and considering that game development tools overlap web development tools these days, all of this starts to make sense. Sound is an important part of any interactive experience, being a web/app interface or a game. Again, these sound files could be easily previewed in a browser.
Lastly, let’s consider STL, the last file format displayed here (and 30th in position). It’s the common file format for exchanging object files used in 3D printing. Github has a preview for it and even shows some form of “3D diff” between commits. Great, but on 13th position, we have OBJ, also an open 3D format, that counts 5 times more files on Github than STL. To my knowledge, it’s not more complicated to display an OBJ file in the browser than a STL one. So what’s the logic here?
To wrap this up, Github could do so much more with not so much effort to allow previews in the browser of some important media file formats for designers. Maybe the “licensing” trouble described at the beginning is not a bad supposition after all. I’d be certainly happy to hear Github’s take on this. If you know anyone working there, thanks for forwarding these questions, and if anyone there is listening, I’d be pleased to dig more deeply into your data to understand more how designers (could) use your product.