
Project

Scholarly Communications Data: Using Supervised and Unsupervised Machine Learning Techniques towards Fraud Detection and Enhancing Quality

The last two decades have seen several trends negatively impacting scientific integrity and the quality of scholarly communications. Most notable are so-called predatory journals combined with the rise of the open-access publication model, a pressure-to-publish research culture that rewards researchers for publication output, and the rise of China and India as large producers of research. So-called predatory publishers and conferences have been capitalizing on these trends, exploiting or coercing scholars in unprecedented ways. Other concerning trends include plagiarism, retracted research, misleading or falsified data, and so forth. A secondary implication of these problems is the phenomenon of citation pollution, where fake, retracted, or fraudulent research penetrates and distorts legitimate science. Discussing the problem of predatory journals and its impact, Roberts (2016) warns: “Entire fields of scientific research are now susceptible to a pollution of the literature by unverified research or even fake articles published in fake journals being incorporated into legitimate meta-analyses.”

The scientific community is also getting better at correcting itself, for example through vetted and/or accredited journal lists, journal blacklists, integrity inspectors, post-publication reviews, and so forth. However, to interrogate and mitigate the issues presented, and specifically to detect scholarly fraud, supervised and unsupervised machine learning techniques can play an important role. In the scholarly ecosystem, relevant data include, e.g., Big Data (real-time, scalable, etc.), citation rates, researcher data, altmetrics (www.altmetric.com), output ratios, and inter-system connectivity.
Data element identification and extraction can point to problems, issues, trends, and the need for context-related data, semantic data, and linked data in authorities, in order to prepare and contextualize results and make them useful within the scholarly ecosystem. Important questions to consider include: what are the various data sets within the scholarly ecosystem that can be collated to serve as data for this study; what data can be converted into linked data, for example; where to source the data; and how data can be useful in the scholarly ecosystem at different levels. The usefulness of data also means different things for supervised versus unsupervised machine learning techniques.

A further major question is how to implement human-type thinking and interpretation that will allow useful thematic groupings or analyses to emerge, specifically for human (or machine) intervention to flag possible issues, opportunities, trends, etc. Machine intervention implies automated integration with systems in the scholarly ecosystem, at the levels of both information and knowledge exchange. Machine learning, for example, excels at pattern recognition, while human intervention implies more creative and contextual interpretation and application of data, identifying non-pattern-type directions or opportunities. Machine learning and more human-type activities therefore complement each other for a more complete understanding of fraud and quality in the scholarly ecosystem.

A data perspective also considers how policy and regulation inform or dictate data collation, trends, and perceived usefulness, as key aspects of scholarly behaviour. Organizational, political, and regulatory factors all contribute to data evolution. Furthermore, the data work should include an environmental scan focused on current solutions.
This will help identify trends and opportunities to integrate and adequately position activities within the scholarly ecosystem.

At the machine learning level, the following problems are relevant. Over 90% of online fraud detection platforms use transaction rules to flag suspicious transactions. Surprisingly, this traditional approach of using rules or logic statements to query transactions is still used by banks, payment systems, insurance companies, and others. The rules in these platforms combine data with horizon-scanning, and the result is generally a binary label: each transaction is marked as either authentic or fraudulent. The major disadvantage of this traditional process is false positives: completely normal customers just looking to make a purchase are turned away, and false-positive rates climb if the system rejects every transaction above a certain risk threshold. Rule-based (automatic and/or manual) reviews should therefore be the last line of defence in a fraud detection strategy.

Why use machine learning for fraud detection? Machines are far better than humans at processing large datasets. They can detect and recognize thousands of user behaviours (patterns) and indicate whether a behaviour is normal or a deviation from learned rules of behaviour. Machine learning algorithms are used to prevent fraud for three reasons:
- Speed: In rule-based systems, people create ad hoc rules to determine which types of orders to accept or reject. This process is time-consuming and involves manual interaction. As the velocity of commerce increases, a quicker way to detect fraud is crucial.
- Scale: Machine learning algorithms and models become more effective as data sets grow.
In rule-based models, by contrast, the cost of maintaining the fraud detection system multiplies as the customer base increases.
- Efficiency: Machine learning can also help avoid false positives. Moreover, unsupervised ML models can continuously analyze and process new data and autonomously update their models to reflect the latest trends.

The first objective of this research is to design a fraud detection platform for predatory journals and other fraudulent activities within the scholarly communications ecosystem that can process massive amounts of data from internal and external sources, in real time or in batch. The platform will take advantage of a hybrid analytic approach across multiple techniques (automated business rules, predictive modelling, text mining, database lookups, exception reporting, network link analysis, etc.) to detect more fraud with greater accuracy. Advanced analytics with built-in AI and machine learning techniques, combined with traditional detection methods, can identify known and unknown patterns and evolve and adapt over time.

State-of-the-art methods and techniques that will be used include:
- Data Preparation: Data preparation is a self-service activity that converts disparate, raw, messy data into a clean and consistent view. The process includes searching, cleaning, transforming, organizing, and collecting data. Preparing data is critical but time-intensive; data teams spend up to 80% of their time converting raw data into high-quality, analysis-ready output.
- Anomaly Detection (Liang et al., 2017; Kumar et al., 2019; Hasani, 2019; Habeeba et al., 2019; Thudumu et al., 2020; Pourhabibi et al., 2020): Anomaly detection (or outlier detection) is the identification of rare items, events, or observations which raise suspicion by differing significantly from the majority of the data. Typically, anomalous data can be connected to some kind of problem or rare event such as
bank fraud, medical problems, structural defects, malfunctioning equipment, or predatory journals. Machine learning and statistical methods for anomaly detection include multivariate statistical analysis, deep learning, support vector machines, ensemble methods, and random forests.
- Time-series Analytics and Dynamic Limits: In general, clustering can be done on a static dataset, but expected behaviour changes over time. These behavioural patterns must also be accommodated to make sure the system performs well. The system should review abnormal transactions while adjusting the expected range dynamically to accommodate the evolving nature of transactions over time.

The second research objective is to contextualize and complement the findings from the first. The first step is to identify all data related to quality assurance, including fraud-related issues and trends; the second, to capture and organize/normalize these data with the assistance of both machine learning and human intervention, e.g. human flagging of data elements to be foregrounded; the third, pattern analysis and interpretation, focusing on what can be automated, what still needs human intervention, and why. This could highlight limitations of machine learning, limitations in data sourcing, or issues of systems design and integration.

Assessing the validity of the research findings will involve thorough testing and contextualization, both from an automated machine learning perspective and from a more qualitative, human-orientated view (e.g. interpreting emerging contexts), addressing questions such as: do the solutions work in terms of quality and accuracy, and what is their impact, relevance, and usefulness? Longitudinal studies will have to track trends and trajectories in all of these aspects, not least because fraud evolves.
A case in point is that predatory journals are only successful until they are caught out or flagged by the community. Available criteria only represent historical experience.
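To make the earlier discussion of rule-based screening concrete, a minimal sketch in Python follows. All rule names, record fields, and thresholds are hypothetical illustrations, not vetted criteria for identifying predatory venues.

```python
# Minimal sketch of a rule-based screen: each rule is a named predicate over
# a journal/submission record; any hit marks the record as suspicious.
# All field names and thresholds here are hypothetical illustrations.

RULES = [
    ("review_too_fast", lambda r: r["days_to_acceptance"] < 7),
    ("fee_before_review", lambda r: r["apc_demanded_before_review"]),
    ("no_editorial_board", lambda r: r["editorial_board_size"] == 0),
]

def screen(record):
    """Return the names of all rules the record triggers (empty list = passes)."""
    return [name for name, rule in RULES if rule(record)]

# A legitimate fast-track venue can still trip the speed rule -- exactly the
# false-positive problem described above.
fast_track = {"days_to_acceptance": 5,
              "apc_demanded_before_review": False,
              "editorial_board_size": 24}
print(screen(fast_track))  # -> ['review_too_fast']
```

The binary accept/reject outcome and the false positive on a legitimate record illustrate why such rules are best kept as a last line of defence rather than the primary detector.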
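The statistical end of the anomaly detection toolbox listed above can be illustrated with a simple univariate z-score detector; the acceptance-rate figures below are invented for the example.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean --
    a basic statistical anomaly detector for the univariate case."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]

# Hypothetical per-journal acceptance rates (%): one journal accepts almost
# everything submitted, which differs sharply from the majority of the data.
acceptance_rates = [22, 31, 28, 25, 97, 30, 27, 24, 26, 29]
print(zscore_outliers(acceptance_rates, threshold=2.0))  # -> [4]
```

Multivariate, ensemble, and deep-learning detectors generalize this idea to many features at once, but the principle is the same: suspicion attaches to observations that deviate significantly from the bulk of the data.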
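Because fraud evolves, static thresholds decay, which is the motivation for the dynamic limits mentioned under time-series analytics. A minimal sketch of a rolling-window detector, with the window size and tolerance `k` chosen arbitrarily for illustration:

```python
from collections import deque
from statistics import mean, stdev

class DynamicLimit:
    """Rolling-window threshold: the 'normal' range is re-estimated from the
    last `window` observations, so the limit adapts as behaviour drifts."""

    def __init__(self, window=30, k=3.0):
        self.history = deque(maxlen=window)  # old points age out automatically
        self.k = k

    def check(self, value):
        """Return True if `value` is anomalous relative to recent history,
        then fold it into the window either way."""
        anomalous = False
        if len(self.history) >= 5:  # need a few points before judging
            mu, sigma = mean(self.history), stdev(self.history)
            # max(...) guards against a zero-variance (perfectly flat) window
            anomalous = abs(value - mu) > self.k * max(sigma, 1e-9)
        self.history.append(value)
        return anomalous

limit = DynamicLimit(window=10, k=3.0)
weekly_submissions = [10, 11, 9, 10, 12, 10, 11, 60]  # sudden spike at the end
print([limit.check(v) for v in weekly_submissions])
```

Because the window slides, a gradual rise in the baseline is absorbed into the expected range, while an abrupt jump, such as the spike in the last value above, is still flagged.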
Date: 28 May 2021 → 12 Jul 2023
Keywords: Predatory Publishing, Scholarly ecosystem, Scholarly fraud, Machine learning
Disciplines: Machine learning and decision making
Project type: PhD project