The public availability of free/libre open source software (FLOSS) repositories (like SourceForge) has led to a substantial increase in quantitative empirical research in software engineering. Researchers mine those repositories to better understand (and thereby control) the complex and constantly evolving socio-technical interactions that take place during a long software development project. For example, researchers use historical data to predict which modules are most likely to be problematic and need more attention from developers, or to advise which developers are the most knowledgeable about a certain module.
However, such research is not carried out in the most efficient way: most research teams create their own mining infrastructure; the same FLOSS project data is collected and processed repeatedly; the mining processes and the obtained results are not shared in sufficient detail to replicate experiments on different data or to tailor them to different research questions.
Many tools have been built to extract the necessary information from historical software repositories, process and analyze it, and then visualize the results (e.g. Moose). Unfortunately, many of those tools are not publicly available, or they use their own internal representations and input/output formats, which makes them difficult to reuse for different purposes. Some researchers have advocated a common tool platform and a common repository representation format (see the TA-RE paper), but this largely remains an unaccomplished goal. A possible alternative is to ‘glue’ together existing tools into a chain (e.g. FETCH, which includes CrocoPat), via scripts that translate one tool’s output into the next tool’s input. Besides tools, some repositories that gather metadata about thousands of FLOSS projects have been built, like FLOSSMetrics and FLOSSmole.
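To illustrate the kind of ‘glue’ script mentioned above, here is a minimal sketch in Python. It assumes a hypothetical upstream tool that emits dependencies as two-column CSV rows (source, target), and translates them into whitespace-separated triples in the style of the Rigi Standard Format (relation, source, target), which a relational tool further down the chain could read. The function name `csv_to_rsf`, the relation name `depends_on` and the CSV layout are assumptions made for illustration; they are not taken from FETCH or any other tool cited here.

```python
import csv
import io


def csv_to_rsf(csv_text, relation="depends_on"):
    """Translate 'source,target' CSV rows into RSF-style triples.

    Assumption: entity names contain no whitespace, so no quoting is needed.
    """
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        source, target = (field.strip() for field in row)
        if not source or not target:
            continue  # skip rows with an empty entity name
        lines.append(f"{relation} {source} {target}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical output of an upstream fact extractor.
    sample = "parser.c,lexer.h\nmain.c,parser.h\n"
    print(csv_to_rsf(sample))
```

Every pair of tools in a chain needs its own translator of this kind, which is one reason a shared exchange format has been proposed as an alternative.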
The aim within this topic is to improve the existing mining infrastructure so that researchers can use it in a more efficient way to express more sophisticated research questions. Possible improvements include:
- Extend existing tool chains and environments
- Build middleware to coordinate independent tools and FLOSS repositories
- Apply techniques and tools from e-Science to mining software repositories
- Define ontologies for mining software repositories and wrap existing tools as web services
Skills and Background required
The candidate should have good programming skills. Depending on the actual research project the applicant wishes to pursue, some knowledge of relational databases, statistics or software visualisation may be required. Off-campus (i.e. part-time) students should have a fast PC with a large disk to store and analyze several large repositories.
References
- Du Bois et al. (2008) Supporting Reengineering Scenarios with FETCH: an Experience Report. Electronic Communications of the EASST, volume 8.
- Fernandez-Ramil et al. (2008) Empirical Studies of Open Source Evolution. In Software Evolution, pp. 263-288. Springer Verlag.
- Ghezzi and Gall (2008) Towards software analysis as a service. In Proc. of the Workshops of ASE’08.
- Howison and Wiggins (2008) eResearch Workflows for Studying Free and Open Source Software Development. In: Open Source Development, Communities and Quality. IFIP 275, pp. 405-411. Springer.
- Kim et al. (2006) TA-RE: An exchange language for mining software repositories. In Proc. 3rd Intl. Workshop on Mining Software Repositories. ACM.
- Van Antwerp and Madey (2008) Advances in the SourceForge Research Data Archive (SRDA). In: Electronic Proc. of the Workshops of the 4th Int’l Conf. on Open Source Systems. Presented at the Workshop on Public Data about Software Development.
- Tools: the FAMIX meta-model, the Rigi Standard Format, the CrocoPat relational calculator, the FETCH tool chain, the integrated reverse engineering environment Moose, the collaborative tool Churrasco, the Taverna workflow workbench