Journal articles on the topic 'Mining Source Code Repositories'

Consult the top 50 journal articles for your research on the topic 'Mining Source Code Repositories.'

1

Williams, C. C., and J. K. Hollingsworth. "Automatic mining of source code repositories to improve bug finding techniques." IEEE Transactions on Software Engineering 31, no. 6 (June 2005): 466–80. http://dx.doi.org/10.1109/tse.2005.63.

2

Kagdi, Huzefa, Michael L. Collard, and Jonathan I. Maletic. "Towards a taxonomy of approaches for mining of source code repositories." ACM SIGSOFT Software Engineering Notes 30, no. 4 (July 2005): 1–5. http://dx.doi.org/10.1145/1082983.1083159.

3

M. Ishag, Musa Ibrahim, Hyun Woo Park, Dingkun Li, and Keun Ho Ryu. "Highlighting Current Issues in API Usage Mining to Enhance Software Reusability." WSEAS TRANSACTIONS ON COMPUTER RESEARCH 10 (March 22, 2022): 29–34. http://dx.doi.org/10.37394/232018.2022.10.4.

Abstract:
The sheer amount of open source code made available in code repositories and code search engines, along with the rapidly increasing releases of Application Programming Interfaces (APIs), has made the code development process easier for programmers. However, learning how to use the elements of an API properly is challenging and requires a learning curve. Mining the available client and test code can help programmers identify best practices in using these APIs. In this paper, we investigate API usage mining to identify open issues for researchers. In particular, we make a theoretical comparison of API usage pattern mining approaches and highlight unresolved issues along with suggestions to address them.
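To make the idea of API usage pattern mining concrete, here is a minimal Python sketch (not the authors' method): it counts frequent two-call sequences over hypothetical client call traces, approximating the kind of "best practice" patterns the abstract describes.

# Minimal sketch of API usage pattern mining: count frequent
# call-sequence n-grams across client code. The call sequences below
# are hypothetical stand-ins for data that would normally be extracted
# from parsed client and test code.
from collections import Counter

def ngrams(seq, n=2):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

client_call_sequences = [
    ["File.open", "File.read", "File.close"],
    ["File.open", "File.write", "File.close"],
    ["Socket.connect", "Socket.send", "Socket.close"],
]

pattern_counts = Counter()
for calls in client_call_sequences:
    pattern_counts.update(ngrams(calls, 2))

# The most common pairs approximate recurring usage patterns.
for pattern, count in pattern_counts.most_common(3):
    print(pattern, count)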
4

Sun, Xiaobing, Bin Li, Yucong Duan, Wei Shi, and Xiangyue Liu. "Mining Software Repositories for Automatic Interface Recommendation." Scientific Programming 2016 (2016): 1–11. http://dx.doi.org/10.1155/2016/5475964.

Abstract:
There are a large number of open source projects in software repositories available for developers to reuse. During software development and maintenance, developers can leverage good interfaces in these open source projects and quickly establish the framework of a new project by reusing those interfaces. However, to reuse them, developers need to read many code files and learn which interfaces can be reused. To help developers better take advantage of the interfaces available in software repositories, we previously proposed an approach to automatically recommend interfaces by mining existing open source projects. We used the LDA (Latent Dirichlet Allocation) topic model to construct a Feature-Interface Graph for each software project and recommended interfaces based on this graph. In this paper, we improve our previous approach by clustering the recommended interfaces on the Feature-Interface Graph, which yields more accurate interface recommendations for developers. We evaluate the effectiveness of the improved approach, and the results show that it recommends more accurate interfaces for reuse, more efficiently, than our previous work.
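For readers unfamiliar with LDA, the following minimal Python sketch (an illustration under assumed inputs, not the authors' implementation) infers topics from interface description text with scikit-learn; in the paper's terms, each interface's dominant topic would link it to a feature in the Feature-Interface Graph.

# Minimal LDA sketch: infer topics ("features") from interface
# documentation text. The corpus below is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

interface_docs = [
    "parse xml document build tree nodes",
    "open network connection send receive bytes",
    "render widget layout draw screen",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(interface_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # rows: interfaces, cols: topic weights

# Each interface's dominant topic acts as its "feature" link in the graph.
print(doc_topics.argmax(axis=1))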
5

Pinzger, Martin, Emanuel Giger, and Harald C. Gall. "Comparing fine-grained source code changes and code churn for bug prediction - A retrospective." ACM SIGSOFT Software Engineering Notes 46, no. 3 (July 14, 2021): 21–23. http://dx.doi.org/10.1145/3468744.3468751.

Abstract:
More than two decades ago, researchers started to mine the data stored in software repositories to help software developers make informed decisions for developing and testing software systems. Bug prediction was one of the most promising and popular research directions that uses the data stored in software repositories to predict the bug-proneness or number of bugs in source files. On that topic, and as part of Emanuel's PhD studies, we submitted a paper with the title Comparing fine-grained source code changes and code churn for bug prediction [8] to the 8th Working Conference on Mining Software Repositories (MSR), held in 2011 in beautiful Honolulu, Hawaii. Ten years later, it was selected as one of the finalists for the MSR 2021 Most Influential Paper Award. In the following, we provide a retrospective on our work, describing the road to publishing this paper, its impact in the field of bug prediction, and the road ahead.
6

Nugroho, Yusuf Sulistyo, Hideaki Hata, and Kenichi Matsumoto. "How different are different diff algorithms in Git?" Empirical Software Engineering 25, no. 1 (September 11, 2019): 790–823. http://dx.doi.org/10.1007/s10664-019-09772-z.

Abstract:
Automatic identification of the differences between two versions of a file is a common and basic task in several applications of mining code repositories. Git, a version control system, has a diff utility, and users can select among diff algorithms, from the default Myers algorithm to the advanced Histogram algorithm. From our systematic mapping, we identified three popular applications of diff in recent studies. Regarding the impact on code churn metrics in 14 Java projects, we obtained different values in 1.7% to 8.2% of commits depending on the diff algorithm. Regarding bug-introducing change identification, we found that 6.0% and 13.3% of the identified bug-fix commits in 10 Java projects yielded different bug-introducing changes. For patch application, our manual analysis found that Histogram is more suitable than Myers for providing code changes. Thus, we strongly recommend using the Histogram algorithm when mining Git repositories to consider differences in source code.
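The diff algorithms discussed here can be selected in Git with the --diff-algorithm option. A minimal Python sketch (repository path and commit are placeholders) that compares Myers and Histogram output for the same commit:

# Compare Myers and Histogram diff output for one commit.
import subprocess

def diff(repo, commit, algorithm):
    return subprocess.run(
        ["git", "-C", repo, "show", commit, f"--diff-algorithm={algorithm}"],
        capture_output=True, text=True, check=True,
    ).stdout

myers = diff("/path/to/repo", "HEAD", "myers")
histogram = diff("/path/to/repo", "HEAD", "histogram")
print("outputs differ:", myers != histogram)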
7

SCOTTO, MARCO, ALBERTO SILLITTI, and GIANCARLO SUCCI. "AN EMPIRICAL ANALYSIS OF THE OPEN SOURCE DEVELOPMENT PROCESS BASED ON MINING OF SOURCE CODE REPOSITORIES." International Journal of Software Engineering and Knowledge Engineering 17, no. 02 (April 2007): 231–47. http://dx.doi.org/10.1142/s0218194007003215.

Abstract:
This paper presents an empirical analysis of the Open Source development process from the point of view of the involvement of the developers in the production process. The study focuses on how developers contribute to projects in terms of the involvement, size, and kind of their contributions. Data have been collected from 53 Open Source projects, and the target application domains include different areas: web and application servers, databases, operating systems, and window managers. Collected data include the number of developers, patterns of code modifications, and the evolution over time of size and complexity. The results of this study show evidence that there are recurrent patterns in Open Source software development and that these patterns are common to all the projects considered, even though no development process is imposed, the application domains differ, and contributions come from people spread across the world.
8

Saini, Munish, Sandeep Mehmi, and Kuljit Kaur Chahal. "Understanding Open Source Software Evolution Using Fuzzy Data Mining Algorithm for Time Series Data." Advances in Fuzzy Systems 2016 (2016): 1–13. http://dx.doi.org/10.1155/2016/1479692.

Abstract:
Source code management systems (such as the Concurrent Versions System (CVS), Subversion, and Git) record changes to the code repositories of open source software projects. This study explores a fuzzy data mining algorithm for time series data to generate association rules for evaluating the trend and regularity in the evolution of an open source software project. A fuzzy data mining algorithm for time series data was chosen because of the stochastic nature of the open source software development process. The commit activity of an open source project indicates the activeness of its development community, and an active development community is a strong contributor to the success of an open source project. Therefore, commit activity analysis, together with analysis of the trend and regularity of that activity, acts as an important indicator to project managers and analysts regarding the future evolutionary prospects of the project.
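Commit activity of the kind analyzed here can be extracted directly from a repository's history. A minimal Python sketch (the repository path is a placeholder) that counts commits per month, producing the sort of time series such a fuzzy mining step would consume:

# Count commits per month from `git log` author dates.
import subprocess
from collections import Counter

log = subprocess.run(
    ["git", "-C", "/path/to/repo", "log", "--date=format:%Y-%m",
     "--pretty=format:%ad"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

commits_per_month = Counter(log)
for month in sorted(commits_per_month):
    print(month, commits_per_month[month])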
9

Schreiber, Roland Robert. "Organizational Influencers in Open-Source Software Projects." International Journal of Open Source Software and Processes 14, no. 1 (February 16, 2023): 1–20. http://dx.doi.org/10.4018/ijossp.318400.

Abstract:
Traditional software development is shifting toward the open-source development model, particularly in the current environment of competitive challenges to develop software openly. The author employs a case study approach to investigate how organizations and their affiliated developers collaborate in the open-source software (OSS) ecosystem TensorFlow (TF). The analysis of the artificial intelligence OSS library TF combines social network analysis (SNA) with an examination of archival data obtained by mining software repositories. The study looks at the structure and evolution of code collaboration among developers and with the ecosystem's organizational networks over the TF lifespan; the organizations involved play a particularly critical role in development. The research also looks at productivity, homophily, development, and diversity among developers. The results deepen the understanding of collaborative developer and organization patterns in OSS communities. Furthermore, the study emphasizes the importance and evolution of social networks, diversity, and productivity in ecosystems.
10

Lu, Mingming, Yan Liu, Haifeng Li, Dingwu Tan, Xiaoxian He, Wenjie Bi, and Wendbo Li. "Hyperbolic Function Embedding: Learning Hierarchical Representation for Functions of Source Code in Hyperbolic Space." Symmetry 11, no. 2 (February 18, 2019): 254. http://dx.doi.org/10.3390/sym11020254.

Abstract:
Recently, source code mining has received increasing attention due to the rapid growth of open-sourced code repositories and the tremendous value implied in this large dataset, which can help us understand the organization of functions or classes in different software and analyze the impact of these organizational patterns on software behavior. Hence, learning an effective representation model for the functions of source code is a crucial problem. Considering the inherent hierarchy of functions, we propose a novel hyperbolic function embedding (HFE) method, which can learn a distributed and hierarchical representation for each function via the Poincaré ball model. To achieve this, a function call graph (FCG) is first constructed to model the call relationships among functions. To verify the underlying geometry of the FCG, the Ricci curvature model is used. Finally, an HFE model is built to learn representations that can capture the latent hierarchy of functions in hyperbolic space, instead of the Euclidean space usually used in state-of-the-art methods. Moreover, HFE is more compact, in terms of lower dimensionality, than existing graph embedding methods; thus, HFE is more effective in terms of computation and storage. To experimentally evaluate the performance of HFE, two application scenarios, namely function classification and link prediction, have been applied. HFE achieves up to 7.6% performance improvement compared to the chosen state-of-the-art methods, namely Node2vec and Struc2vec.
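As an illustration of embedding a function call graph in the Poincaré ball, here is a sketch using gensim's generic Poincaré implementation, not the paper's HFE model; the caller-to-callee edges are hypothetical.

# Embed a small function call graph in 2-D hyperbolic space.
from gensim.models.poincare import PoincareModel

call_graph_edges = [
    ("main", "parse_args"), ("main", "run"),
    ("run", "load_data"), ("run", "train"),
    ("train", "update_weights"),
]

model = PoincareModel(call_graph_edges, size=2, negative=2)
model.train(epochs=50)

# Functions nearer the hierarchy root tend to embed closer to the origin.
print(model.kv["main"], model.kv["update_weights"])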
11

Mitsyuk, Alexey Alexandrovich, and Nikolay Arsenovich Jamgaryan. "What Software Architecture Styles are Popular?" Proceedings of the Institute for System Programming of the RAS 33, no. 3 (2021): 7–26. http://dx.doi.org/10.15514/ispras-2021-33(3)-1.

Abstract:
The notion of a software architecture style appears throughout the software engineering literature and is treated as important in books on software architecture and in university sources. However, many software developers are not so optimistic about it, and it is unclear whether the notion is just an academic concept or is actually used in the software industry. In this paper, we measured industrial software developers' attitudes toward the concept of software architecture style. We also investigated the popularity of eleven concrete architecture styles. We applied two methods. A developer survey was used to estimate developers' overall attitude and to establish what the community thinks about automatic recognition of software architecture styles. Automatic crawlers were applied to mine open-source code from the GitHub platform; these crawlers identified style smells in repositories using the features we proposed for the architecture styles. We found that the notion of software architecture style is not just a concept of academics in universities: many software developers apply it in their work. We formulated features for the eleven concrete software architecture styles and developed crawlers based on these features. The results of repository mining using the features showed which styles are popular among developers of open-source projects from commercial companies and non-commercial communities. The automatic mining results were additionally validated by a survey of GitHub developers.
12

Herrmann, Frank, Robert Horkovics-Kovats M. Sc., and Eldar Dr. Sultanow. "Ein paralleler Algorithmus für API Mining von C# Code." Anwendungen und Konzepte der Wirtschaftsinformatik, no. 16 (December 24, 2022): 9. http://dx.doi.org/10.26034/lu.akwi.2022.3454.

Abstract:
Conformance analysis is a static code analysis (SCA) technique for software quality assurance. Its core problem is that tools do not automatically learn from errors that have already occurred. To address this, this work evaluated machine learning (ML) by applying a scientifically grounded and practically proven unsupervised learning approach and analyzing the result. It was found that, to support different programming languages, only a language-specific API mining tool is required. Such a tool scans lines of code in parallel and normalizes them for machine learning processes. The system was implemented for the C# programming language, since many industrial projects are developed in it. For functional validation, a case study showed that rules with a positive effect on software quality were learned. Concretely, the maintenance effort for a code smell in an example project was reduced by a factor of 30 by extracting a learned association into a shared method. The runtime of the algorithm was evaluated empirically on eight open-source repositories; through parallelization, an average runtime improvement of 45.16% can be expected. However, the application also revealed limits: many associations are useless, rule evaluation depends on a subjective factor, and the economic benefit of the tool is therefore not transparent. Nevertheless, this work demonstrates that an ML-based SCA tool is feasible as a complementary quality assurance measure in software engineering.
13

Siegenthaler, Michael, and Hakim Weatherspoon. "Cloudifying source code repositories." ACM SIGOPS Operating Systems Review 44, no. 2 (April 14, 2010): 24–28. http://dx.doi.org/10.1145/1773912.1773919.

14

Alomari, Firas, and Muhammed Harbi. "Scalable Source Code Similarity Detection in Large Code Repositories." ICST Transactions on Scalable Information Systems 6, no. 22 (July 29, 2019): 159353. http://dx.doi.org/10.4108/eai.13-7-2018.159353.

15

Muhammad Shumail Naveed. "Correlation Between GitHub Stars and Code Vulnerabilities." Journal of Computing & Biomedical Informatics 4, no. 01 (December 29, 2022): 141–51. http://dx.doi.org/10.56979/401/2022/111.

Abstract:
In the software industry, open-source repositories are widely utilized to speed up software development. GitHub is a big source of open-source repositories and allows users to star a code repository; stars are used on GitHub to represent appreciation and popularity. Studies have revealed that repositories may be of lower quality and may have vulnerabilities that hackers can exploit. It is not known whether the popularity of GitHub repositories in terms of stars confirms the security and invulnerability of the program code. This paper analyzed the correlation between the stars of GitHub code repositories and the vulnerabilities in their code by using a static code analyzer. The study examined the vulnerabilities in ten popular C++ source repositories on GitHub and discovered 3487 vulnerabilities in the dataset, which were split into four categories based on severity. Not a single repository in the dataset was free of flaws. On the detected vulnerabilities, a Kruskal-Wallis H test reveals a significant difference between the different code repositories of the dataset. A Spearman's rank correlation test found no correlation between repositories' stars and the frequency of vulnerabilities, implying that the popularity of code repositories on GitHub in terms of high star ratings does not guarantee their security. Overall, the findings suggest that code repositories should be thoroughly evaluated before being used in software development. The novelty of this paper resides in the development of new knowledge as well as a study design that may be applied to other investigations.
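The two statistical tests named in this abstract are straightforward to run; here is a minimal Python sketch with hypothetical star and vulnerability counts:

# Spearman rank correlation: do stars track vulnerability counts?
from scipy.stats import spearmanr, kruskal

stars = [82000, 64000, 31000, 15000, 9000]
vulnerabilities = [412, 127, 355, 98, 240]

rho, p = spearmanr(stars, vulnerabilities)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")

# Kruskal-Wallis H test: do vulnerability severities differ across
# repositories? (Each list holds per-file severity scores for one
# repository; the values are hypothetical.)
h, p = kruskal([1, 2, 2, 3], [3, 3, 4, 4], [1, 1, 2, 4])
print(f"Kruskal-Wallis H={h:.2f}, p={p:.3f}")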
16

Raducu, Razvan, Gonzalo Esteban, Francisco J. Rodríguez Lera, and Camino Fernández. "Collecting Vulnerable Source Code from Open-Source Repositories for Dataset Generation." Applied Sciences 10, no. 4 (February 13, 2020): 1270. http://dx.doi.org/10.3390/app10041270.

Abstract:
Different Machine Learning techniques to detect software vulnerabilities have emerged in scientific and industrial scenarios. Different actors in these scenarios aim to develop algorithms for predicting security threats without requiring human intervention. However, these algorithms require data-driven engines based on the processing of huge amounts of data, known as datasets. This paper introduces the SonarCloud Vulnerable Code Prospector for C (SVCP4C). This tool aims to collect vulnerable source code from open source repositories linked to SonarCloud, an online tool that performs static analysis and tags potentially vulnerable code. The tool provides a set of tagged files suitable for extracting features and creating training datasets for Machine Learning algorithms. This study presents a descriptive analysis of these files and overviews the current status of C vulnerabilities, specifically buffer overflow, in the reviewed public repositories.
17

Hu, Gang, Min Peng, Yihan Zhang, Qianqian Xie, Wang Gao, and Mengting Yuan. "Unsupervised software repositories mining and its application to code search." Software: Practice and Experience 50, no. 3 (November 21, 2019): 299–322. http://dx.doi.org/10.1002/spe.2760.

18

Xu, Aiqiao. "Software Engineering Code Workshop Based on B-RRT ∗ FND Algorithm for Deep Program Understanding Perspective." Journal of Sensors 2022 (September 26, 2022): 1–11. http://dx.doi.org/10.1155/2022/1564178.

Abstract:
Developers perform many search behaviors in their daily work, looking for reusable code fragments, solutions to specific problems, algorithm designs, software documentation, and software tools from public repositories (including open source communities and forum blogs) or private repositories (internal software repositories, source code platforms, communities, etc.) in order to make full use of existing software development resources and experience. This paper approaches the software development process from a deep program understanding perspective. First, it defines the software engineering code search task from this perspective. Second, it summarizes two research paradigms of deep software engineering code search and organizes the related research results. It also summarizes the common evaluation methods for software engineering code search tasks. Finally, it concludes with an outlook on future research.
19

Haber, Casey, and Robert Gove. "A Visualization Tool for Analyzing the Suitability of Software Libraries via Their Code Repositories." Electronic Imaging 2020, no. 1 (January 26, 2020): 387–1. http://dx.doi.org/10.2352/issn.2470-1173.2020.1.vda-387.

Abstract:
Code repositories are a common way to archive software source code files. Understanding code repository content and history is important but can be difficult due to the complexity of code repositories. Most available tools are designed for users who are actively maintaining a code repository. In contrast, external developers need to assess the suitability of using a software library, e.g. whether its code repository has a healthy level of maintenance, and how much risk the external developers face if they depend on that code in their own project. In this paper, we identify six risks associated with using a software library, we derive seven requirements for tools to assess these risks, and we contribute two dashboard designs derived from these requirements. The first dashboard is designed to assess a software library's usage suitability via its code repository, and the second dashboard visually compares usage suitability information about multiple software libraries' code repositories. Using four popular libraries' code repositories, we show that these dashboards are effective for understanding and comparing key aspects of software library usage suitability. We further compare our dashboard to a typical code repository user interface and show that our dashboard is more succinct and requires less work.
20

Khatoon, Shaheen, Guohui Li, and Rana Muhammad Ashfaq. "A Framework for Automatically Mining Source Code." Journal of Software Engineering 5, no. 2 (March 15, 2011): 64–77. http://dx.doi.org/10.3923/jse.2011.64.77.

21

Gonzalez-Barahona, Jesus M., Daniel Izquierdo-Cortazar, and Megan Squire. "Repositories with Public Data about Software Development." International Journal of Open Source Software and Processes 2, no. 2 (April 2010): 1–13. http://dx.doi.org/10.4018/jossp.2010040101.

Abstract:
Empirical research on software development based on data obtained from project repositories and code forges is increasingly gaining attention in the software engineering research community. The studies in this area typically start by retrieving or monitoring some subset of data found in the repository or forge, and this data is later analyzed to find interesting patterns. However, retrieving information from these locations can be a challenging task. Meta-repositories providing public information about software development are useful tools that can simplify and streamline the research process. Public data repositories that collect and clean the data from other project repositories or code forges can help ensure that research studies are based on good quality data. This paper provides some insight as to how these meta-repositories (sometimes called a “repository of repositories”, RoR) of data about open source projects should be used to help researchers. This paper describes in detail two of the most widely used collections of data about software development: FLOSSmole and FLOSSMetrics.
22

Xu, Rongze, Zhanyong Tang, Guixin Ye, Huanting Wang, Xin Ke, Dingyi Fang, and Zheng Wang. "Detecting code vulnerabilities by learning from large-scale open source repositories." Journal of Information Security and Applications 69 (September 2022): 103293. http://dx.doi.org/10.1016/j.jisa.2022.103293.

23

Zhang, Feng, Lulu Li, Cong Liu, and Qingtian Zeng. "Flow Chart Generation-Based Source Code Similarity Detection Using Process Mining." Scientific Programming 2020 (July 7, 2020): 1–15. http://dx.doi.org/10.1155/2020/8865413.

Abstract:
Source code similarity detection has extensive applications in computer programming teaching and software intellectual property protection. In the teaching of computer programming courses, students may utilize complex source code obfuscation techniques, e.g., opaque predicates, loop unrolling, and function inlining and outlining, to reduce the similarity between code fragments and avoid plagiarism detection. Existing source code similarity detection approaches consider only static features of source code, making it difficult to cope with more complex code obfuscation techniques. In this paper, we propose a novel source code similarity detection approach that considers the dynamic features of source code at runtime using process mining. More specifically, given two pieces of source code, their running logs are obtained by source code instrumentation and execution. Next, process mining is used to obtain the flow charts of the two pieces of source code by analyzing their collected running logs. Finally, the similarity of the two pieces of source code is measured by computing the similarity of these two flow charts. Experimental results show that the proposed approach can deal with more complex obfuscation techniques, including opaque predicates and loop unrolling as well as function inlining and outlining, which existing work cannot handle properly. Therefore, we argue that our approach defeats commonly used code obfuscation techniques more effectively than the existing state-of-the-art approaches for source code similarity detection.
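The abstract does not specify how flow-chart similarity is computed; as one plausible illustration (an assumption, not the paper's measure), the sketch below compares two mined flow charts, represented as directed edge sets, with Jaccard similarity:

# Jaccard similarity over directed edges of two mined flow charts.
def flow_chart_similarity(edges_a, edges_b):
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / len(a | b)

chart1 = [("start", "read"), ("read", "loop"), ("loop", "end")]
chart2 = [("start", "read"), ("read", "loop"), ("loop", "read"), ("loop", "end")]
print(flow_chart_similarity(chart1, chart2))  # 0.75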
24

YU, Xiu-mei, Bin LIANG, and Hong CHEN. "Survey on applications of software source code mining." Journal of Computer Applications 29, no. 9 (October 30, 2009): 2494–98. http://dx.doi.org/10.3724/sp.j.1087.2009.02494.

25

Ying, A. T. T., G. C. Murphy, R. Ng, and M. C. Chu-Carroll. "Predicting source code changes by mining change history." IEEE Transactions on Software Engineering 30, no. 9 (September 2004): 574–86. http://dx.doi.org/10.1109/tse.2004.52.

26

Rasekh, Amir Hossein, Seyed Mostafa Fakhrahmad, and Mohammad Hadi Sadreddini. "Mining traces between source code and textual documents." International Journal of Computer Applications in Technology 59, no. 1 (2019): 43. http://dx.doi.org/10.1504/ijcat.2019.097116.

27

Sadreddini, Mohammad Hadi, Seyed Mostafa Fakhrahmad, and Amir Hossein Rasekh. "Mining traces between source code and textual documents." International Journal of Computer Applications in Technology 59, no. 1 (2019): 43. http://dx.doi.org/10.1504/ijcat.2019.10018167.

28

Kacmajor, Magdalena, and John Kelleher. "Automatic Acquisition of Annotated Training Corpora for Test-Code Generation." Information 10, no. 2 (February 17, 2019): 66. http://dx.doi.org/10.3390/info10020066.

Abstract:
Open software repositories make large amounts of source code publicly available. Potentially, this source code could be used as training data to develop new, machine learning-based programming tools. For many applications, however, raw code scraped from online repositories does not constitute an adequate training dataset. Building on the recent and rapid improvements in machine translation (MT), one possibly very interesting application is code generation from natural language descriptions. One of the bottlenecks in developing these MT-inspired systems is the acquisition of the parallel text-code corpora required for training code-generative models. This paper addresses the problem of automatically synthesizing parallel text-code corpora in the software testing domain. Our approach is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular for unit testing. Therefore, we propose synthesizing parallel corpora comprised of parsed test function names serving as code descriptions, aligned with the corresponding function bodies. We present the results of applying one of the state-of-the-art MT methods on such a generated dataset. Our experiments show that a neural MT model trained on our dataset can generate syntactically correct and semantically relevant short Java functions from quasi-natural language descriptions of functionality.
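The core corpus-synthesis step, turning a descriptive test method name into a quasi-natural-language description, can be illustrated with a small Python sketch (the naming conventions handled here are assumptions, not the paper's exact parser):

# Split a camelCase test method name into a plain-language description.
import re

def test_name_to_description(name):
    name = re.sub(r"^test_?", "", name, flags=re.IGNORECASE)
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return " ".join(w.lower() for w in words)

print(test_name_to_description("testShouldReturnEmptyListWhenInputIsNull"))
# -> "should return empty list when input is null"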
29

Hammad, Muhammad, Önder Babur, Hamid Abdul Basit, and Mark van den Brand. "Clone-advisor: recommending code tokens and clone methods with deep learning and information retrieval." PeerJ Computer Science 7 (November 9, 2021): e737. http://dx.doi.org/10.7717/peerj-cs.737.

Abstract:
Software developers frequently reuse source code from repositories as it saves development time and effort. Code clones (similar code fragments) accumulated in these repositories represent often repeated functionalities and are candidates for reuse in an exploratory or rapid development. To facilitate code clone reuse, we previously presented DeepClone, a novel deep learning approach for modeling code clones along with non-cloned code to predict the next set of tokens (possibly a complete clone method body) based on the code written so far. The probabilistic nature of language modeling, however, can lead to code output with minor syntax or logic errors. To resolve this, we propose a novel approach called Clone-Advisor. We apply an information retrieval technique on top of DeepClone output to recommend real clone methods closely matching the predicted clone method, thus improving the original output by DeepClone. In this paper we have discussed and refined our previous work on DeepClone in much more detail. Moreover, we have quantitatively evaluated the performance and effectiveness of Clone-Advisor in clone method recommendation.
30

Canaparo, Marco, and Elisabetta Ronchieri. "Data Mining Techniques for Software Quality Prediction in Open Source Software." EPJ Web of Conferences 214 (2019): 05007. http://dx.doi.org/10.1051/epjconf/201921405007.

Abstract:
Software quality monitoring and analysis are among the most productive topics in software engineering research, and their results may be effectively employed by engineers during the software development life cycle. Open source software constitutes a valid test case for the assessment of software characteristics. The data mining approach has been proposed in the literature to extract software characteristics from software engineering data. This paper aims at comparing diverse data mining techniques (e.g., derived from machine learning) for developing effective software quality prediction models. To achieve this goal, we tackled various issues, such as the collection of software metrics from open source repositories, the assessment of prediction models to detect software issues, and the adoption of statistical methods to evaluate data mining techniques. The results of this study aspire to identify the data mining techniques that perform best, among all those used in this paper, for software quality prediction models.
31

Hamdy, Abeer, and Abdelrahman E. Arabi. "Locating Faulty Source Code Files to Fix Bug Reports." International Journal of Open Source Software and Processes 13, no. 1 (January 1, 2022): 1–15. http://dx.doi.org/10.4018/ijossp.308791.

Abstract:
Open source software is usually released while it still contains bugs. In order to fix a reported bug during the maintenance phase, the developer has to search the source code files to identify the faulty ones; this process is called bug localization (BL). Automating BL is a necessity to boost developer productivity and enhance software quality. The paper proposes an information retrieval based approach for retrieving and ranking a list of suspicious source files relevant to a submitted bug report (BR). The proposed approach leverages textual features of the BRs and source files, including part-of-speech tagging and the lexical and semantic similarity between source files and BRs, in addition to the source file change history. The effectiveness of the proposed approach was evaluated over three open-source software repositories. Experimental results showed the superiority of the proposed approach over eight previous approaches in terms of the top@N and MAP metrics.
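The information retrieval core of such bug localization can be illustrated with TF-IDF and cosine similarity, a minimal sketch with hypothetical file contents, not the paper's full model:

# Rank source files by lexical similarity to a bug report.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_files = {
    "Parser.java": "parse token stream syntax tree error recovery",
    "Renderer.java": "draw widget layout paint screen buffer",
    "Network.java": "socket connect timeout retry send packet",
}
bug_report = "crash when parsing malformed syntax, parser error"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(source_files.values()) + [bug_report])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

ranking = sorted(zip(source_files, scores), key=lambda x: -x[1])
print(ranking)  # Parser.java should rank first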
32

Thomson, Patrick, Rob Rix, Nicolas Wu, and Tom Schrijvers. "Fusing industry and academia at GitHub (experience report)." Proceedings of the ACM on Programming Languages 6, ICFP (August 29, 2022): 496–511. http://dx.doi.org/10.1145/3547639.

Abstract:
GitHub hosts hundreds of millions of code repositories written in hundreds of different programming languages. In addition to its hosting services, GitHub provides data and insights into code, such as vulnerability analysis and code navigation, with which users can improve and understand their software development process. GitHub has built Semantic, a program analysis tool capable of parsing and extracting detailed information from source code. The development of Semantic has relied extensively on the functional programming literature; this paper describes how connections to academic research inspired and informed the development of an industrial-scale program analysis toolkit.
33

Domenichelli, Daniele E., Silvio Traversaro, Luca Muratore, Alessio Rocchi, Francesco Nori, and Lorenzo Natale. "A Build System for Software Development in Robotic Academic Collaborative Environments." International Journal of Semantic Computing 13, no. 02 (June 2019): 185–205. http://dx.doi.org/10.1142/s1793351x19400087.

Abstract:
The software development cycle in the robotic research environment is hectic and heavily driven by project or paper deadlines. Developers have little time available for packaging the C/C++ code they write and for developing and maintaining the build system and continuous integration tools. Research projects are joint efforts of different groups working remotely and asynchronously. The typical solution is to rely on binary distributions and/or large repositories that compile all software and dependencies. This approach hinders code sharing and reuse and often leads to repositories whose inter-dependencies are difficult to manage. Following many years of experience leading software integration in research projects, we developed YCM, a tool that supports our best practices for addressing these issues. YCM is a set of CMake scripts that provides (1) build system support, to develop and package software libraries and components, and (2) superbuild deployment, to prepare and distribute sets of packages in source form as a single meta build. In this paper, we describe YCM and report on our experience adopting it as a tool for managing software repositories in large research projects.
34

Pritikin, Joshua N., and Carl F. Falk. "OpenMx: A Modular Research Environment for Item Response Theory Method Development." Applied Psychological Measurement 44, no. 7-8 (June 13, 2020): 561–62. http://dx.doi.org/10.1177/0146621620929431.

Abstract:
There are many item response theory software packages designed for users. Here, the authors introduce an environment tailored to method development and simulation. Implementations of a selection of classic algorithms are available as well as some recently developed methods. Source code is developed in public repositories on GitHub; your collaboration is welcome.
35

Khwaldeh, Ali, Amani Tahat, Jordi Marti, and Mofleh Tahat. "Atomic Data Mining Numerical Methods, Source Code SQlite with Python." Procedia - Social and Behavioral Sciences 73 (February 2013): 232–39. http://dx.doi.org/10.1016/j.sbspro.2013.02.046.

36

AlMarzouq, Mohammad, Abdullatif AlZaidan, and Jehad AlDallal. "Mining GitHub for research and education: challenges and opportunities." International Journal of Web Information Systems 16, no. 4 (June 29, 2020): 451–73. http://dx.doi.org/10.1108/ijwis-03-2020-0016.

Abstract:
Purpose: This study aims to highlight the challenges and opportunities of using GitHub as a data source in both research and programming education. Design/methodology/approach: This study provides a general overview of the challenges and opportunities faced while conducting empirical research using GitHub as a data source. The challenges and opportunities are framed using the input-process-output model of open-source software. Findings: GitHub data accessed from the application programming interface (API) can have several limitations, which can be overcome by Web scraping and by using external data repositories such as GHArchive and GHTorrent. There are also several idiosyncrasies about GitHub that researchers need to be aware of to use the data effectively, and these can represent an opportunity for research. The challenges and opportunities are summarized for the licenses, community, development process, and product of free/libre and open-source software communities hosted on GitHub. Originality/value: This study provides a summary of GitHub-related challenges and opportunities that researchers can leverage to improve their empirical research. Furthermore, this summary can be a valuable resource for instructors who plan to use GitHub as a data source in their data-focused programming courses.
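For readers new to mining GitHub, here is a minimal sketch of pulling repository metadata from the GitHub REST API, which also surfaces the rate-limit header behind the API limitations mentioned above (the example repository is arbitrary):

# Fetch public repository metadata; unauthenticated requests are
# rate-limited, which is one of the limitations discussed above.
import requests

resp = requests.get("https://api.github.com/repos/octocat/Hello-World")
resp.raise_for_status()
repo = resp.json()
print(repo["stargazers_count"], repo["forks_count"])
print(resp.headers.get("X-RateLimit-Remaining"))  # remaining API calls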
37

Busby, Ben, Matthew Lesko, and Lisa Federer. "Closing gaps between open software and public data in a hackathon setting: User-centered software prototyping." F1000Research 5 (April 13, 2016): 672. http://dx.doi.org/10.12688/f1000research.8382.1.

Abstract:
In genomics, bioinformatics and other areas of data science, gaps exist between extant public datasets and the open-source software tools built by the community to analyze similar data types. The purpose of biological data science hackathons is to assemble groups of genomics or bioinformatics professionals and software developers to rapidly prototype software to address these gaps. The only two rules for the NCBI-assisted hackathons run so far are that 1) data either must be housed in public data repositories or be deposited to such repositories shortly after the hackathon’s conclusion, and 2) all software comprising the final pipeline must be open-source or open-use. Proposed topics, as well as suggested tools and approaches, are distributed to participants at the beginning of each hackathon and refined during the event. Software, scripts, and pipelines are developed and published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development. The code resulting from each hackathon is published at https://github.com/NCBI-Hackathons/ with separate directories or repositories for each team.
38

Busby, Ben, Matthew Lesko, and Lisa Federer. "Closing gaps between open software and public data in a hackathon setting: User-centered software prototyping." F1000Research 5 (May 9, 2016): 672. http://dx.doi.org/10.12688/f1000research.8382.2.

Abstract:
In genomics, bioinformatics and other areas of data science, gaps exist between extant public datasets and the open-source software tools built by the community to analyze similar data types. The purpose of biological data science hackathons is to assemble groups of genomics or bioinformatics professionals and software developers to rapidly prototype software to address these gaps. The only two rules for the NCBI-assisted hackathons run so far are that 1) data either must be housed in public data repositories or be deposited to such repositories shortly after the hackathon’s conclusion, and 2) all software comprising the final pipeline must be open-source or open-use. Proposed topics, as well as suggested tools and approaches, are distributed to participants at the beginning of each hackathon and refined during the event. Software, scripts, and pipelines are developed and published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development. The code resulting from each hackathon is published at https://github.com/NCBI-Hackathons/ with separate directories or repositories for each team.
39

Rhmann, Wasiur, and Gufran Ahmad Ansari. "Ensemble Techniques-Based Software Fault Prediction in an Open-Source Project." International Journal of Open Source Software and Processes 11, no. 2 (April 2020): 33–48. http://dx.doi.org/10.4018/ijossp.2020040103.

Abstract:
Software engineering repositories have attracted researchers seeking to mine useful information about different quality attributes of software. These repositories have helped software professionals to efficiently allocate various resources in the software development life cycle. Software fault prediction is a quality assurance activity in which software faults are predicted before actual software testing. As exhaustive software testing is impossible, software fault prediction models can help with the proper allocation of testing resources. Various machine learning techniques have been applied to create software fault prediction models. In this study, ensemble models are used for software fault prediction. Change-metrics data were collected for an open-source Android project from its Git repository, code-based metrics data were obtained from the PROMISE data repository, and the datasets kc1, kc2, cm1, and pc1 were used for experimental purposes. Results showed that ensemble models performed better than machine learning and hybrid search-based algorithms, and bagging ensembles were found to be more effective for fault prediction than soft and hard voting.
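A minimal sketch of the bagging idea applied to fault prediction (the metric rows are hypothetical; real experiments would use datasets such as kc1):

# Bagging for fault prediction: each of the 25 base learners (by
# default a decision tree) sees a bootstrap sample, and predictions
# are combined by majority vote.
from sklearn.ensemble import BaggingClassifier

# rows: [lines_of_code, cyclomatic_complexity, churn]; label 1 = faulty
X = [[120, 14, 30], [45, 3, 2], [300, 25, 60],
     [60, 5, 4], [220, 18, 40], [35, 2, 1]]
y = [1, 0, 1, 0, 1, 0]

model = BaggingClassifier(n_estimators=25, random_state=0)
model.fit(X, y)
print(model.predict([[150, 16, 35]]))  # likely predicts faulty (1)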
40

Bibi, Nazia, Tauseef Rana, Ayesha Maqbool, Farkhanda Afzal, Ali Akgül, and Manuel De la Sen. "An Intelligent Platform for Software Component Mining and Retrieval." Sensors 23, no. 1 (January 3, 2023): 525. http://dx.doi.org/10.3390/s23010525.

Abstract:
The development of robotic applications necessitates the availability of useful, adaptable, and accessible programming frameworks. Robotic, IoT, and sensor-based systems open up new possibilities for the development of innovative applications, taking advantage of existing and new technologies. Despite much progress, the development of these applications remains a complex, time-consuming, and demanding activity that requires wide utilization of software components. In this paper, we propose a platform that efficiently searches for and recommends code components for reuse. To locate and rank source code snippets, our approach trains a ranking schema with machine learning; the platform then uses the trained schema to rank code snippets among the top-k results. The platform facilitates reuse by recommending suitable components for a given query, and it provides a user-friendly interface where developers can enter queries (specifications) for code search. The evaluation shows that our platform effectively ranks source code snippets and outperforms existing baselines. A survey was also conducted to affirm the viability of the proposed methodology.
41

Eiroa-Lledo, Elia, Rao Hamza Ali, Gabriela Pinto, Jillian Anderson, and Erik Linstead. "Large-Scale Identification and Analysis of Factors Impacting Simple Bug Resolution Times in Open Source Software Repositories." Applied Sciences 13, no. 5 (February 28, 2023): 3150. http://dx.doi.org/10.3390/app13053150.

Abstract:
One of the most prominent issues the ever-growing open-source software community faces is the abundance of buggy code. Well-established version control systems and repository hosting services such as GitHub and Maven provide a checks-and-balances structure to minimize the amount of buggy code introduced, and although these platforms mitigate the problem, it still remains. To further the efforts toward a more effective and quicker response to bugs, we must understand the factors that affect the time it takes to fix one. We apply a custom traversal algorithm to commits made in open source repositories to determine when "simple stupid bugs" were first introduced to projects, and we explore the factors that drive the time it takes to fix them. Using the commit history from the main development branch, we identify the commits that first introduced 13 different types of simple stupid bugs in 617 of the top Java projects on GitHub. Leveraging a statistical survival model and other non-parametric statistical tests, we found two main categories of variables that affect a bug's life: Time Factors and Author Factors. Bugs are fixed more quickly if they are introduced and resolved by the same developer. Further, we discuss how the day of the week and the time of day at which buggy code was written and fixed affect its resolution time. These findings provide vital insight to help the open-source community mitigate the abundance of buggy code, and they can be used in future research to aid bug-finding programs.
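The survival-analysis framing can be illustrated with a Kaplan-Meier estimator from the lifelines library (a sketch with hypothetical durations, not the paper's exact model; unresolved bugs are treated as censored observations):

# Model bug resolution time as a survival problem.
from lifelines import KaplanMeierFitter

durations = [3, 10, 14, 30, 45, 90, 120]  # days until fix (or last check)
fixed = [1, 1, 1, 1, 0, 1, 0]             # 1 = fixed, 0 = still open (censored)

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=fixed)
print(kmf.median_survival_time_)  # estimated median time-to-fix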
42

Alarcon, Gene M., Anthony M. Gibson, Charles Walter, Rose F. Gamble, Tyler J. Ryan, Sarah A. Jessup, Brian E. Boyd, and August Capiola. "Trust Perceptions of Metadata in Open-Source Software: The Role of Performance and Reputation." Systems 8, no. 3 (August 12, 2020): 28. http://dx.doi.org/10.3390/systems8030028.

Abstract:
Open-source software (OSS) is a key aspect of software creation. However, little is known about programmers’ decisions to trust software from OSS websites. The current study emulated OSS websites and manipulated reputation and performance factors in the stimuli according to the heuristic-systematic processing model. We sampled professional programmers—with a minimum experience of three years—from Amazon Mechanical Turk (N = 38). We used a 3 × 3 within-subjects design to investigate the relationship between OSS reputation and performance on users’ time spent on code, the number of interface clicks, trustworthiness perceptions, and willingness to use OSS code. We found that participants spent more time on and clicked the interface more often for code that was high in reputation. Meta-information included with OSS tools was found to affect the degree to which computer programmers interact with and perceive online code repositories. Furthermore, participants reported higher levels of perceived trustworthiness in and trust toward highly reputable OSS code. Notably, we observed fewer significant main effects for the performance manipulation, which may correspond to participants considering performance attributes mainly within the context of reputation-relevant information. That is, the degree to which programmers investigate and then trust OSS code may depend on the initial reputation ratings.
43

Escalada, Javier, Francisco Ortin, and Ted Scully. "An Efficient Platform for the Automatic Extraction of Patterns in Native Code." Scientific Programming 2017 (2017): 1–16. http://dx.doi.org/10.1155/2017/3273891.

Abstract:
Different software tools, such as decompilers, code quality analyzers, recognizers of packed executable files, authorship analyzers, and malware detectors, search for patterns in binary code. The use of machine learning algorithms, trained with programs taken from the huge number of applications in existing open source code repositories, allows finding patterns not detected with the manual approach. To this end, we have created a versatile platform for the automatic extraction of patterns from native code, capable of processing big binary files. Its implementation has been parallelized, providing important runtime performance benefits on multicore architectures. Compared to single-processor execution, the average performance improvement obtained with the best configuration is a factor of 3.5, against a maximum theoretical gain of a factor of 4.
44

Mishra, Ashutosh, and Vinayak Srivastava. "Extracting knowledge from source code comprehension using data mining methods." International Journal of Knowledge Engineering and Data Mining 2, no. 2/3 (2012): 174. http://dx.doi.org/10.1504/ijkedm.2012.051240.

45

Kanellopoulos, Y., C. Makris, and C. Tjortjis. "An improved methodology on information distillation by mining program source code." Data & Knowledge Engineering 61, no. 2 (May 2007): 359–83. http://dx.doi.org/10.1016/j.datak.2006.06.002.

46

Moulla, Donatien Koulla, Alain Abran, and Kolyang. "Duration Estimation Models for Open Source Software Projects." International Journal of Information Technology and Computer Science 13, no. 1 (February 8, 2021): 1–17. http://dx.doi.org/10.5815/ijitcs.2021.01.01.

Abstract:
For software organizations that rely on Open Source Software (OSS) to develop customer solutions and products, it is essential to accurately estimate how long it will take to deliver the expected functionalities. While OSS is supported by government policies around the world, most research on software project estimation has focused on conventional projects with commercial licenses. OSS effort estimation is challenging since OSS participants do not record effort data in OSS repositories; however, OSS data repositories do contain the dates of participants' contributions, and these can be used for duration estimation. This study analyses historical data on the WordPress and Swift projects to estimate OSS project duration, using either commits or lines of code (LOC) as the independent variable. The study first proposes an improved classification of contributors based on the number of active days for each contributor in the development period of a release. For the WordPress and Swift OSS project environments, the results indicate that duration estimation models using the number of commits as the independent variable perform better than those using LOC. The estimation model for full-time contributors gives an estimate of the total duration, while the models that include part-time and occasional contributors lead to better estimates of project duration for both the commits data and the LOC data.
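A minimal sketch of a commits-based duration model, here a simple linear regression (the training pairs are hypothetical, and the paper's actual model form may differ):

# Regress release duration on commit count.
from sklearn.linear_model import LinearRegression

commits = [[120], [340], [560], [90], [410]]  # commits in a release
duration_days = [60, 150, 240, 45, 180]

model = LinearRegression().fit(commits, duration_days)
print(model.predict([[300]]))  # estimated duration for a 300-commit release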
47

Haim, Mario, and Rodrigo Zamith. "Open-Source Trading Zones and Boundary Objects: Examining GitHub as a Space for Collaborating on “News”." Media and Communication 7, no. 4 (December 17, 2019): 80–91. http://dx.doi.org/10.17645/mac.v7i4.2249.

Abstract:
New actors, actants, and activities have entered journalism’s spaces in recent years. While this has raised the potential for the disruption of existing social orders, such heterogeneous assemblages also provide fruitful grounds for substantive innovation within “trading zones.” This article explores one such potential zone, the code-sharing platform GitHub, delineating the primary actors oriented around the boundary object of “news,” the objectives of their projects, the nature of their collaborations, and their use of software licenses. The analysis examines attributes of 88,776 news-oriented project repositories, with a smaller subsample subjected to a manual content analysis. Findings show that this trading zone consisted primarily of journalistic outsiders; repositories focused on technological solutions to distributional challenges and efforts that made journalism more transparent; that there was limited direct trade via the use of collaborative affordances on the platform; and that only a minority of repositories employed a permissive license favored by open-source advocates. This leads to a broader conclusion that while GitHub may be discursively important within journalism and certainly provides an avenue for actors to enter journalism’s periphery, it offers a limited pathway for those peripheral actors to move closer to the center of journalism. That, in turn, impacts the platform’s—and its users’—ability to reconfigure if not spur a reimagining of journalism’s meanings, conventions, and allocations of different forms of capital.
48

Spinellis, Diomidis, Panos Louridas, and Maria Kechagia. "Software evolution: the lifetime of fine-grained elements." PeerJ Computer Science 7 (February 9, 2021): e372. http://dx.doi.org/10.7717/peerj-cs.372.

Abstract:
A model of the lifetime of individual source code lines or tokens can estimate maintenance effort, guide preventive maintenance, and, more broadly, identify factors that can improve the efficiency of software development. We present methods and tools that allow tracking of each line's or token's birth and death. Through them, we analyze 3.3 billion source code element lifetime events in 89 revision control repositories. Statistical analysis shows that code lines are durable, with a median lifespan of about 2.4 years, and that young lines are more likely to be modified or deleted, following a Weibull distribution with the associated hazard rate decreasing over time. This behavior appears to be independent of specific characteristics of lines or tokens, as we could not determine factors that significantly influence their longevity across projects. The programming language, developer tenure, and developer experience were not found to be significantly correlated with line or token longevity, while project size and project age showed only a slight correlation.
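Line ages of this kind can be approximated from git blame metadata. A minimal Python sketch (repository and file path are placeholders) that computes a file's median line age:

# Parse `git blame --line-porcelain` to get each line's commit
# timestamp and report the median line age in years.
import statistics
import subprocess
import time

out = subprocess.run(
    ["git", "-C", "/path/to/repo", "blame", "--line-porcelain", "src/main.c"],
    capture_output=True, text=True, check=True,
).stdout

now = time.time()
ages_years = [
    (now - int(line.split()[1])) / (365.25 * 24 * 3600)
    for line in out.splitlines()
    if line.startswith("committer-time ")
]
print(statistics.median(ages_years))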
49

Csordás, Anita, Amin Shahrokhi, Gergely Tóth, and Tibor Kovács. "Radiological Atmospheric Risk Modelling of NORM Repositories in Hungary." Atmosphere 13, no. 8 (August 17, 2022): 1305. http://dx.doi.org/10.3390/atmos13081305.

Abstract:
The human population is continuously exposed to natural radionuclides in environmental elements. The concentration of these nuclides is usually low, but different technological processes and activities can concentrate them in products, by-products, or wastes. These activities include, for example, coal mining, fertilizer production, ore mining, and metal production. Such materials are labelled NORM (Naturally Occurring Radioactive Material). The most common method of disposal for NORM is deposition in different types of repositories, whose long-term effects on the environment and on human health are hard to estimate. The aim of the study is to assess the radiation risk from five selected NORM repositories (Ajka coal ash, Ajka red mud, Almásfüzitő red mud, Zalatárnok drilling mud, and Úrkút manganese residue) for members of the public and biota. The radionuclide concentrations were determined by HPGe gamma-spectrometry. The measured concentration was between 31 Bq/kg and 1997 Bq/kg for Ra-226, between 33 Bq/kg and 283 Bq/kg for Th-232, and between 48 Bq/kg and 607 Bq/kg for K-40. The dose estimation was carried out using RESRAD-ONSITE and RESRAD-BIOTA, computer codes developed by the Argonne National Laboratory (USA). RESRAD-ONSITE can estimate the radiation risk from the radionuclides in contaminated sites. The highest dose was observed for the Ajka coal ash repository without cover (12.38 mSv/y), and the lowest for Zalatárnok (0.53 mSv/y). The most significant contributors to the population dose are plant uptake and external pathways, which account for more than 80% of the total dose on average. The RESRAD-BIOTA code was used to estimate the radiation exposure of terrestrial organisms (plants and animals); during this work, the values of the sum ratio factor (SRF), biota concentration guide (BCG), external dose, internal dose, and total dose were determined.
50

DONG, JING, YAJING ZHAO, and TU PENG. "A REVIEW OF DESIGN PATTERN MINING TECHNIQUES." International Journal of Software Engineering and Knowledge Engineering 19, no. 06 (September 2009): 823–55. http://dx.doi.org/10.1142/s021819400900443x.

Abstract:
The quality of a software system depends highly on its architectural design. High quality software systems typically apply expert design experience that has been captured as design patterns. As demonstrated solutions to recurring problems, design patterns help to reuse expert experience in software system design, and they have been extensively applied in industry. Mining instances of design patterns from the source code of software systems can assist in understanding the systems and in the process of re-engineering them. More importantly, it also helps to trace back the original design decisions, which are typically missing in legacy systems. This paper presents a review of current techniques and tools for mining design patterns from the source code or design of software systems. We classify different approaches and analyze their results in a comparative study. We also examine the disparity of the discovery results of different approaches and analyze possible reasons, with some insight.