Abstract - A key challenge in recommender system research is how to make recommendations to new users. Recently, the idea of solving this problem by learning user and item profiles has been proposed. Those methods construct a decision tree for an initial interview, enabling the recommender to query a user adaptively according to her prior responses. However, they overlook new users' personal attributes. In this paper, we present CCFBURP, a two-step method: in the first step we screen neighbors of the target user using her personal attributes, and in the second step we train the interview model on the dataset formed by those neighbors and the candidate items. The recommender system then predicts the target user's ratings of the candidate items. Experimental results on the MovieLens dataset demonstrate that the proposed CCFBURP algorithm significantly outperforms existing methods for cold-start recommendation.

Keywords - Collaborative filtering, Recommender systems, Cold-start problem, User Registration Process

I. INTRODUCTION

More and more people are inclined to purchase items of interest on the Internet, yet the explosion of information about customers, products and transactions has led to the information overload problem in E-Commerce[1]. Meanwhile, to supply customers with varied personal services, personalized recommender systems have been widely deployed and are now considered one of the most important means of personalization on websites[2]. The developers of Tapestry (one of the first recommender systems) coined the phrase "collaborative filtering (CF)," which has been widely adopted in practice[3].

A key challenge in building an effective CF recommender system is the well-known cold-start problem: how to provide recommendations to new users? New users are unlikely to receive good recommendations because they lack a rating or purchase history[4]. Pure collaborative filtering methods base their recommendations on community preferences (e.g., user ratings and purchase histories) and ignore user and item attributes (e.g., demographics and product descriptions) [5, 6, 7, 8, 9].

A vast number of methods have been introduced to solve the cold-start problem. Schein et al. proposed the aspect model latent variable method for cold-start recommendation, which combines both collaborative and content information in model fitting[10]. Kim and Li proposed a probabilistic model to address the cold-start problem, in which items are classified into groups and predictions are made for users considering the Gaussian distribution of user ratings[11]. On the other hand, much of the recent collaborative filtering literature has focused on factor models, for instance variants of the singular value decomposition (SVD)[12]. In code recommender systems, by contrast, the available data consists of calls to methods in contexts; this data is binary: given a context-method pair, the task is to predict whether or not the method is called within that context[13].

A natural approach to solving the cold-start problem is to elicit new user preferences by querying users' responses progressively through an initial interview process[14]. Specifically, when a new user visits, the recommender presents a seed item as a question and asks the user for her opinion; based on the response, the recommender gradually refines the user's characterization so that it can provide more satisfactory recommendations in the future. Ke Zhou et al. proposed the Functional Matrix Factorization (FMF) method based on this idea of query responses combined with functional matrix factorization[15]. Their experimental results demonstrated that the method is effective for cold-start recommendation.

A shortcoming of FMF is that it overlooks users' personal attributes, for instance their age, sex, vocation and location. In this paper, we present Cold-start Collaborative Filtering Based on User Registration Process (CCFBURP), a novel cold-start recommendation method that estimates users' preferences by collecting information during their registration process. The proposed method explores the correlation between users more effectively. CCFBURP is a two-step algorithm: in the first step we screen neighbors of the target user using her personal attributes, and in the second step we train the interview model on the dataset formed by those neighbors and the candidate items. The recommender system then predicts the target user's ratings of the candidate items. Experimental results on the MovieLens dataset demonstrate that the proposed CCFBURP algorithm significantly outperforms existing methods for cold-start recommendation.

The remainder of this paper is organized as follows. Section 2 first introduces matrix factorization for collaborative filtering, and then presents functional matrix factorization, which constructs the interview process for cold-start collaborative filtering by restricting user profiles to be a function in the form of a decision tree. Section 3 extends FMF to exploit users' personal attributes and constructs the CCFBURP procedure. Section 4 evaluates the proposed method on the MovieLens dataset and analyzes the results. Finally, Section 5 concludes and presents several future directions.

II. FUNCTIONAL MATRIX FACTORIZATION DECISION TREE

In this section, we describe the functional matrix factorization (FMF) method for cold-start collaborative filtering which explores the well-known matrix factorization methods for constructing the interview process. The key innovation is that we parameterize user profiles to be a function of the responses to the possible questions of the interview process and use matrix factorization to compute the profiles.

A. Low Rank Approximation of the Rating Matrix

Consider tabulated rating data organized in an observed matrix $R = [r_{ij}] \in \mathbb{R}^{N \times M}$, which we seek to approximate by a product of two low-rank matrices $U = [u_1, \dots, u_N] \in \mathbb{R}^{K \times N}$ and $V = [v_1, \dots, v_M] \in \mathbb{R}^{K \times M}$. So we get

$r_{ij} \approx u_i^\top v_j,$

where $r_{ij}$ corresponds to the rating of item $j$ by user $i$, and $u_i$ and $v_j$ are the latent profiles of user $i$ and item $j$, respectively.

Given the set $O$ of known ratings, the parameters $U$ and $V$ can be estimated by fitting the training data, i.e., by solving the following optimization problem:

$\min_{U, V} \sum_{(i,j) \in O} (r_{ij} - u_i^\top v_j)^2 + \lambda (\|U\|_F^2 + \|V\|_F^2),$

where $\lambda$ is a regularization parameter.

The problem can be solved by existing numerical optimization methods. In our implementation, we use the alternating optimization for its amenability for the cold-start settings. Specifically, the optimization process performs the following two updates alternatively.

First, for each user $i$, minimize the objective with respect to $u_i$ with all $v_j$ (and all other user profiles) fixed:

$\min_{u_i} \sum_{j \in O_i} (r_{ij} - u_i^\top v_j)^2 + \lambda \|u_i\|^2,$

where $O_i$ denotes the set of items rated by user $i$. This is a linear regression problem with squared loss, and its closed-form solution can be expressed as

$u_i = \Big( \sum_{j \in O_i} v_j v_j^\top + \lambda I \Big)^{-1} \sum_{j \in O_i} r_{ij} v_j.$

Similarly, for each item $j$, with all user profiles fixed, the closed-form solution can be expressed as

$v_j = \Big( \sum_{i \in O_j} u_i u_i^\top + \lambda I \Big)^{-1} \sum_{i \in O_j} r_{ij} u_i,$

where $O_j$ denotes the set of users who rated item $j$.
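The two alternating updates above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation; the function name, random initialization and default parameters are our own choices:

```python
import numpy as np

def als(R, mask, K=10, lam=0.1, iters=20):
    """Alternating least squares for R ~= U^T V on the observed entries.

    R    : (N, M) rating matrix (arbitrary values where unobserved)
    mask : (N, M) boolean array marking observed entries
    """
    N, M = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(K, N))   # column u_i = profile of user i
    V = rng.normal(scale=0.1, size=(K, M))   # column v_j = profile of item j
    reg = lam * np.eye(K)
    for _ in range(iters):
        for i in range(N):                   # u_i update, all v_j fixed
            j = mask[i]
            Vj = V[:, j]
            U[:, i] = np.linalg.solve(Vj @ Vj.T + reg, Vj @ R[i, j])
        for m in range(M):                   # v_j update, all u_i fixed
            i = mask[:, m]
            Ui = U[:, i]
            V[:, m] = np.linalg.solve(Ui @ Ui.T + reg, Ui @ R[i, m])
    return U, V
```

Each inner update is exactly the regularized least-squares solve from the closed-form expressions above, so the training objective decreases monotonically across iterations.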

B. Functional Matrix Factorization

Now we consider constructing the interview process for cold-start collaborative filtering. Assume that a new user registers at the recommendation system and nothing is known about her. To capture the preferences of the user, the system initiates several interview questions to query the responses from the user. Based on the responses, the system constructs a profile for the user and provides recommendations accordingly. We propose to parameterize the user profile in such a way that the profile is tied to user i’s responses in the form of a function.

Assume there are $P$ possible interview questions, and an answer to a question takes value in the finite set $\{-1, 1, 0\}$, representing "Dislike", "Like" and "Unknown", respectively. We introduce a $P$-dimensional vector $a_i \in \{-1, 1, 0\}^P$ representing the answers of user $i$ to the $P$ questions, and we tie the profile to the answers by assuming $u_i = T(a_i)$, where $T$ is a function that maps the responses to the user profile. So we get $r_{ij} \approx T(a_i)^\top v_j$.

Our goal is to learn both $T$ and $V$ from the observed ratings. To this end, substituting $u_i = T(a_i)$ into the low-rank matrix factorization model, we have the following optimization problem:

$\min_{T, V} \sum_{(i,j) \in O} (r_{ij} - T(a_i)^\top v_j)^2 + \lambda \|V\|_F^2.$    (1)

This objective function can be optimized through an alternating minimization process.

1. Given $T$, we can compute $V$ by regularized least squares regression. This problem has a closed-form solution given by

$v_j = \Big( \sum_{i \in O_j} T(a_i) T(a_i)^\top + \lambda I \Big)^{-1} \sum_{i \in O_j} r_{ij} T(a_i),$    (2)

where $I$ is the identity matrix of appropriate size.

2. Given $V$, we try to fit a decision tree $T$ such that

$T = \arg\min_T \sum_{(i,j) \in O} (r_{ij} - T(a_i)^\top v_j)^2.$    (3)

To reduce the implementation and computational complexity, we address this problem by proposing an efficient greedy algorithm for finding an approximate solution.

C. Decision Tree Construction

Starting from the root node, the set of users at the current node is partitioned into three disjoint subsets $L(p)$, $D(p)$ and $U(p)$, corresponding to the answers "Like", "Dislike" and "Unknown" to the interview question $p$:

$L(p) = \{ i : a_{ip} = 1 \}, \quad D(p) = \{ i : a_{ip} = -1 \}, \quad U(p) = \{ i : a_{ip} = 0 \}.$

To find the optimal question $p$ that leads to the best split, we minimize the following objective:

$\min_p \sum_{i \in L(p)} \sum_{j \in O_i} (r_{ij} - u_L^\top v_j)^2 + \sum_{i \in D(p)} \sum_{j \in O_i} (r_{ij} - u_D^\top v_j)^2 + \sum_{i \in U(p)} \sum_{j \in O_i} (r_{ij} - u_U^\top v_j)^2,$    (4)

where $u_L$, $u_D$ and $u_U$ are the optimal profiles for users in the child nodes corresponding to the answers "Like", "Dislike" and "Unknown", respectively. Each child profile is obtained by regularized least squares over its user group, e.g.

$u_L = \Big( \sum_{i \in L(p)} \sum_{j \in O_i} v_j v_j^\top + \lambda I \Big)^{-1} \sum_{i \in L(p)} \sum_{j \in O_i} r_{ij} v_j,$

and analogously for $u_D$ and $u_U$.

After the root node is constructed, its child nodes can be constructed recursively in the same way. Finally, a user's profile is obtained when she arrives at a leaf node of the decision tree. The estimated rating of user $i$ for item $j$ can then be expressed as $\hat r_{ij} = T(a_i)^\top v_j$.
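The greedy split selection behind Eq. (4) can be sketched as follows. This is an illustrative NumPy fragment with names of our own choosing; the `answers` matrix is assumed to be precomputed from users' responses:

```python
import numpy as np

def fit_profile(group, R, mask, V, lam=0.1):
    """Optimal regularized profile for a set of users, item matrix V fixed."""
    K = V.shape[0]
    A = lam * np.eye(K)
    b = np.zeros(K)
    for i in group:
        j = mask[i]
        A += V[:, j] @ V[:, j].T
        b += V[:, j] @ R[i, j]
    return np.linalg.solve(A, b)

def best_question(users, R, mask, V, answers, lam=0.1):
    """Greedy choice of the interview item p minimizing the split objective.

    answers[i, p] in {1, -1, 0} encodes Like / Dislike / Unknown.
    Returns the best question index and its squared-error score.
    """
    best_p, best_err = None, np.inf
    for p in range(answers.shape[1]):
        err = 0.0
        for a in (1, -1, 0):                 # Like / Dislike / Unknown children
            group = [i for i in users if answers[i, p] == a]
            if not group:
                continue
            u = fit_profile(group, R, mask, V, lam)
            for i in group:
                j = mask[i]
                err += np.sum((R[i, j] - u @ V[:, j]) ** 2)
        if err < best_err:
            best_p, best_err = p, err
    return best_p, best_err
```

Applying `best_question` recursively to each child's user group, down to a fixed depth, yields the interview tree.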

III. COLD-START COLLABORATIVE FILTERING BASED ON USER REGISTRATION PROCESS

A. User Dissimilarity Matrix

FMF assumes that a new user who registers at the recommendation system is a complete stranger until she answers the interview questions. However, thanks to her registration process, we do know something about her personal attributes, such as age, sex, vocation and location. Generally, people with similar attributes are likely to share interests in similar things. For example, most boys aged 15-20 like Pirates of the Caribbean and Harry Potter, while men aged 55-60 prefer Casablanca. Therefore, it is more accurate to recommend to new users the resources enjoyed by similar users, i.e., existing users whose personal attributes resemble the new users'.

We introduce a User Dissimilarity Matrix to measure the dissimilarity between two users. Assume there are $p$ kinds of personal attributes in the user dataset; the dissimilarity between users $x_i$ and $x_j$ can be expressed as:

$d(x_i, x_j) = \dfrac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}},$    (5)

where $d_{ij}^{(f)}$ is the contribution of attribute $f$ to the dissimilarity between $x_i$ and $x_j$, and $\delta_{ij}^{(f)}$ is an indicator: if $x_{if}$ or $x_{jf}$ is a null value, or $x_{if} = x_{jf} = 0$ (the convention for asymmetric binary attributes), then $\delta_{ij}^{(f)} = 0$; in all other cases, $\delta_{ij}^{(f)} = 1$.

1. When attribute $f$ is a binary (dyadic) or nominal variable: if $x_{if} = x_{jf}$, then $d_{ij}^{(f)} = 0$; otherwise $d_{ij}^{(f)} = 1$.

For binary variables, the overall dissimilarity can be calculated through the simple matching method:

$d(x_i, x_j) = \dfrac{p - m}{p},$

where $p$ corresponds to the number of all variables, and $m$ corresponds to the number of variables on which users $i$ and $j$ match.

2. When attribute $f$ is an interval variable:

$d_{ij}^{(f)} = \dfrac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}},$

where $h$ ranges over all objects with a non-null value of attribute $f$.

For interval variables, the dissimilarity is usually calculated by a distance between users; the Euclidean distance is most widely used:

$d(x_i, x_j) = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2}.$

In the user dissimilarity matrix, entry $d(i, j)$ corresponds to the dissimilarity between user $i$ and user $j$. Given a threshold $\varepsilon$, if $d(i, j) \le \varepsilon$, users $i$ and $j$ can be neighbors of each other.
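The mixed-attribute dissimilarity of Eq. (5) can be sketched as follows. This is an illustrative fragment under our own naming; it covers the nominal case (rule 1) and the range-normalized interval case (rule 2), and skips null attributes as $\delta_{ij}^{(f)} = 0$ prescribes:

```python
def gower_dissimilarity(x, y, kinds, ranges):
    """Mixed-attribute dissimilarity in the spirit of Eq. (5).

    x, y   : attribute value lists (None marks a null value)
    kinds  : 'nominal' or 'interval', per attribute
    ranges : value range (max - min over non-null objects), interval attrs only
    """
    num = den = 0.0
    for f, (xf, yf) in enumerate(zip(x, y)):
        if xf is None or yf is None:
            continue                          # delta = 0: attribute not counted
        if kinds[f] == 'nominal':
            d = 0.0 if xf == yf else 1.0      # match / mismatch
        else:
            d = abs(xf - yf) / ranges[f]      # range-normalized distance
        num += d
        den += 1.0
    return num / den if den else 0.0
```

For example, two users of ages 25 and 30 (age range 50), different genders, and the same occupation get dissimilarity $(5/50 + 1 + 0)/3 \approx 0.37$.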

B. Cold-start Collaborative Filtering Based on User Registration Process

To construct the Cold-start Collaborative Filtering Based on User Registration Process, the user dissimilarity matrix must first be built as a necessary preparation, which requires choosing appropriate measures for the users' attributes.

In the MovieLens dataset, users' attributes are stored in the file u.user, which holds a tab-separated list of user id, age, gender, occupation and zip code. We define the variables in Eq. (5) as follows:

1. $\delta_{ij}^{(f)} = 1$ for all attributes, since no user attribute is a null value.

2. If the age difference between the two users is no more than 5, $d_{ij}^{(\mathrm{age})} = 0$; otherwise $d_{ij}^{(\mathrm{age})} = 1$.

3. If the two users' genders are the same, $d_{ij}^{(\mathrm{gender})} = 0$; otherwise $d_{ij}^{(\mathrm{gender})} = 1$.

4. If the two users' occupations are the same, $d_{ij}^{(\mathrm{occupation})} = 0$; otherwise $d_{ij}^{(\mathrm{occupation})} = 1$.

5. The first digit of a US zip code corresponds to a group of neighboring states, and we suppose that people who live in neighboring states share some similarity. Hence, if the two users' zip codes start with the same digit, $d_{ij}^{(\mathrm{zip})} = 0$; otherwise $d_{ij}^{(\mathrm{zip})} = 1$.
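Rules 1-5 can be written out directly. This is an illustrative fragment; the dict keys are our own naming, not the actual u.user field layout:

```python
def movielens_dissimilarity(u, v):
    """Dissimilarity between two MovieLens users following rules 1-5.

    u, v: dicts with keys 'age', 'gender', 'occupation', 'zip'
    (keys are illustrative; u.user stores the fields tab-separated).
    All four deltas equal 1, so the result is a simple average.
    """
    d_age = 0.0 if abs(u['age'] - v['age']) <= 5 else 1.0
    d_gender = 0.0 if u['gender'] == v['gender'] else 1.0
    d_occupation = 0.0 if u['occupation'] == v['occupation'] else 1.0
    d_zip = 0.0 if u['zip'][0] == v['zip'][0] else 1.0   # same first digit
    return (d_age + d_gender + d_occupation + d_zip) / 4.0
```

Two users who agree on three of the four rules thus get dissimilarity 0.25.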

We suggest a registration process for new users, which consists of two parts. First, the new user registers an account and provides her personal information. We then compute the dissimilarities between the new user and the existing ones, and find her neighbors among the existing users whose dissimilarities with her are less than or equal to the given threshold $\varepsilon$. These neighbors' ratings of all items constitute her modelling data set, denoted $D$. The $n$ items that received the most ratings in $D$ are selected as possible interview questions. Second, we train the decision tree $T$ and the item matrix $V$ through the process described in Section 2.
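The neighbor screening and question-pool selection of the first part can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes dissimilarities to the new user have already been computed:

```python
def build_interview_pool(dissims, rated_items, eps, n):
    """Screen neighbors, then pick the n most-rated items as questions.

    dissims     : user_id -> dissimilarity between that user and the new user
    rated_items : user_id -> iterable of item_ids the user has rated
    """
    neighbors = [u for u, d in dissims.items() if d <= eps]
    counts = {}
    for u in neighbors:
        for item in rated_items.get(u, ()):
            counts[item] = counts.get(item, 0) + 1
    # most-rated first; ties broken by item id for determinism
    pool = sorted(counts, key=lambda it: (-counts[it], it))[:n]
    return neighbors, pool
```

Only the returned neighbors and the n pooled items then enter the decision-tree training of the second part.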

C. Computational Complexity

The computational complexity of constructing the FMF decision tree in Section 2 is

(6)

where D is the depth of the tree, $n_i$ is the number of ratings by user i, L is the number of nodes in the tree, M is the number of possible interview questions, and K is the dimension of the latent space. Among these variables, D, L and K can be set by the administrator, while M, without further restrictions, equals the total number of items, which can be a really huge number.

In Section 3, we choose as possible interview questions the top-N items that received the most ratings from the neighbors of the new user. The computational complexity then becomes

(7)

N is far smaller than M under ordinary circumstances. For instance, M is 1682 in our MovieLens dataset; if we set N to 150, the second part of the computational complexity is reduced to less than 10% of its original value, and the third part to less than 1%. As the number of users increases, the effect becomes more apparent.
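The quoted percentages are easy to sanity-check. Here we assume, as the "<10%" and "<1%" figures themselves suggest, that the second cost term is linear and the third quadratic in the number of candidate questions:

```python
# figures quoted in the text: M candidate items overall, N after screening
M, N = 1682, 150
assert N / M < 0.10          # a cost term linear in the question count: ~8.9%
assert (N / M) ** 2 < 0.01   # a quadratic term would fall below 1%: ~0.8%
```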

IV. EXPERIMENTS

A. Data Set

The MovieLens data set has been widely used in the field of CF. It contains 3900 movies, 6040 users and around 1 million ratings; the ratings are integers ranging from 1 (bad) to 5 (good). In our experiments, we use the reduced data set, which contains 1682 movies, 943 users and 100,000 ratings; each user in the reduced data has rated at least 20 movies.

We split the users into two disjoint subsets, the training set and the test set, containing 80% and 20% of the users, respectively. We then split the items rated by the test users into two disjoint subsets: the answer set, containing 80% of the items and used to generate the user responses in the interview process, and the evaluation set, containing the remaining 20% and used to evaluate performance after the interview. The data set is divided as in Figure 1, where A represents the training set, B the answer set, and C the evaluation set.
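The split described above can be sketched as follows. This is an illustrative fragment with names of our own choosing (the paper does not specify how the random partition is drawn):

```python
import random

def split_dataset(users, items, seed=0):
    """80/20 user split into training/test; 80/20 item split of the
    test side into answer set (drives the interview) and evaluation set."""
    rng = random.Random(seed)
    users, items = list(users), list(items)
    rng.shuffle(users)
    rng.shuffle(items)
    ucut, icut = int(0.8 * len(users)), int(0.8 * len(items))
    train_users, test_users = users[:ucut], users[ucut:]
    answer_items, eval_items = items[:icut], items[icut:]
    return train_users, test_users, answer_items, eval_items
```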